Overview

The Agent Triage Protocol defines standard error responses that enable consistent error handling across implementations. This page describes the error response format, standard error codes, and best practices for handling errors in ATP implementations.

Error Response Format

Error responses use a structured format that pairs a machine-readable code with a human-readable message. An example error object:

{
  "code": "NOTIFICATION_EXPIRED",
  "message": "The notification deadline has passed and no longer accepts responses",
  "details": {
    "notification_id": "550e8400-e29b-41d4-a716-446655440000",
    "expired_at": "2025-05-25T11:00:00Z"
  },
  "request_id": "req_abc123def456"
}

Field	Type	Description
`code`	string	Standardized error identifier for programmatic handling
`message`	string	Human-readable error description
`details`	object	Additional context specific to the error type
`request_id`	string	Unique identifier for request tracing

The `request_id` field is particularly important for troubleshooting: it lets you correlate error reports across system boundaries and log files.
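As an illustration of how a client might consume this format, the sketch below parses an error body into a typed object and builds a log line keyed by the request ID. The `ATPError` class and `parse_error` function are hypothetical helpers, not part of the protocol, and treating a missing `details` object as empty is an assumption.

```python
import json
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class ATPError:
    """In-memory form of an ATP error object (hypothetical helper type)."""
    code: str
    message: str
    request_id: str
    details: dict[str, Any] = field(default_factory=dict)


def parse_error(body: str) -> ATPError:
    """Parse a JSON error body; a missing `details` is treated as empty (assumption)."""
    raw = json.loads(body)
    return ATPError(
        code=raw["code"],
        message=raw["message"],
        request_id=raw["request_id"],
        details=raw.get("details") or {},
    )


body = json.dumps({
    "code": "NOTIFICATION_EXPIRED",
    "message": "Notification deadline has passed",
    "details": {"expired_at": "2025-05-25T11:00:00Z"},
    "request_id": "req_abc123def456",
})
error = parse_error(body)
# Put the request ID first so the line is easy to find when correlating logs.
log_line = f"[{error.request_id}] {error.code}: {error.message}"
print(log_line)
```

Branching on `error.code` rather than on `error.message` keeps client logic stable if the wording of a message changes.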

HTTP Status Codes

The protocol uses standard HTTP status codes to indicate the class of error:

Status Code	Description	When Used
`400` Bad Request	Invalid request format or parameters	Malformed JSON, missing required fields
`401` Unauthorized	Missing or invalid authentication	Invalid or expired API key/token
`403` Forbidden	Valid auth but insufficient permissions	Attempting to access another service’s notifications
`404` Not Found	Resource doesn’t exist	Notification ID not found
`409` Conflict	Request conflicts with current state	Responding to already-answered notification
`422` Unprocessable Entity	Request validation failed	Response data doesn’t match expected format
`429` Too Many Requests	Rate limit exceeded	Too many requests in time period
`500` Internal Server Error	Server-side failure	Unexpected errors in ATP server
`503` Service Unavailable	Temporary service issues	Server maintenance or overload

Error Codes

The protocol defines specific error codes that provide more detail than HTTP status alone. These codes allow client applications to implement specific handling logic for different error conditions.

Authentication Errors

Code	Description
`AUTH_INVALID_TOKEN`	The provided token is malformed or invalid
`AUTH_EXPIRED_TOKEN`	The authentication token has expired
`AUTH_INSUFFICIENT_PERMISSIONS`	Token lacks required permissions

Notification Errors

Code	Description
`NOTIFICATION_NOT_FOUND`	Notification doesn’t exist or is no longer accessible
`NOTIFICATION_EXPIRED`	Notification deadline has passed
`NOTIFICATION_ALREADY_RESPONDED`	Notification has already been answered
`NOTIFICATION_INVALIDATED`	Service marked notification as invalid

Validation Errors

Code	Description
`INVALID_ACTION_ID`	The specified `action_id` doesn’t exist for this notification
`INVALID_RESPONSE_DATA`	Response data doesn’t match expected format
`CONSTRAINT_VIOLATION`	Response violates defined constraints
`MISSING_REQUIRED_FIELD`	Required field is missing from request

Service Errors

Code	Description
`SERVICE_NOT_REGISTERED`	Service hasn’t been registered with ATP
`SERVICE_SUSPENDED`	Service has been temporarily suspended
`CALLBACK_FAILED`	Failed to deliver response to service callback

Rate Limiting

Code	Description
`RATE_LIMIT_EXCEEDED`	Too many requests from this client/service
`QUOTA_EXCEEDED`	Monthly/daily quota has been exceeded
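One way a client might organize these codes is a lookup table from code to category, so that handling logic can branch on the category first. This is a minimal sketch: the `ErrorCategory` enum and `categorize` function are hypothetical names, and returning `None` for an unrecognized code is an assumption about how a client would want to treat codes added later.

```python
from enum import Enum
from typing import Optional


class ErrorCategory(Enum):
    AUTHENTICATION = "authentication"
    NOTIFICATION = "notification"
    VALIDATION = "validation"
    SERVICE = "service"
    RATE_LIMITING = "rate_limiting"


# Every code from the tables above, grouped exactly as the tables group them.
CODE_CATEGORIES: dict[str, ErrorCategory] = {
    "AUTH_INVALID_TOKEN": ErrorCategory.AUTHENTICATION,
    "AUTH_EXPIRED_TOKEN": ErrorCategory.AUTHENTICATION,
    "AUTH_INSUFFICIENT_PERMISSIONS": ErrorCategory.AUTHENTICATION,
    "NOTIFICATION_NOT_FOUND": ErrorCategory.NOTIFICATION,
    "NOTIFICATION_EXPIRED": ErrorCategory.NOTIFICATION,
    "NOTIFICATION_ALREADY_RESPONDED": ErrorCategory.NOTIFICATION,
    "NOTIFICATION_INVALIDATED": ErrorCategory.NOTIFICATION,
    "INVALID_ACTION_ID": ErrorCategory.VALIDATION,
    "INVALID_RESPONSE_DATA": ErrorCategory.VALIDATION,
    "CONSTRAINT_VIOLATION": ErrorCategory.VALIDATION,
    "MISSING_REQUIRED_FIELD": ErrorCategory.VALIDATION,
    "SERVICE_NOT_REGISTERED": ErrorCategory.SERVICE,
    "SERVICE_SUSPENDED": ErrorCategory.SERVICE,
    "CALLBACK_FAILED": ErrorCategory.SERVICE,
    "RATE_LIMIT_EXCEEDED": ErrorCategory.RATE_LIMITING,
    "QUOTA_EXCEEDED": ErrorCategory.RATE_LIMITING,
}


def categorize(code: str) -> Optional[ErrorCategory]:
    """Return the category for a known code, or None for an unrecognized one."""
    return CODE_CATEGORIES.get(code)


print(categorize("QUOTA_EXCEEDED"))
```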

Client Error Handling

Client implementations need error handling that separates two cases: transient failures that warrant a retry, and permanent errors that require user intervention or an alternative action.

Transient vs. Permanent Errors

Transient errors are temporary issues that may resolve with time or retries:

All 5xx series errors
Rate limiting (429) responses
Network connectivity issues
Webhook delivery failures

Permanent errors indicate fundamental problems that won’t be resolved by retrying:

Authentication errors (except token expiration)
Resource not found errors
Validation errors
Business logic errors (e.g., notification already responded)
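The two lists above could be reduced to a small predicate, sketched below. The function name `is_transient` is hypothetical, passing `status=None` to represent a network failure with no HTTP response is a convention chosen for this example, and treating `AUTH_EXPIRED_TOKEN` as recoverable assumes the client can refresh its token before retrying.

```python
from typing import Optional


def is_transient(status: Optional[int], code: Optional[str] = None) -> bool:
    """Decide whether a failed request is worth retrying.

    `status` is the HTTP status code, or None when no response arrived at all.
    """
    if status is None:
        # Network connectivity issue: no response was received.
        return True
    if code == "AUTH_EXPIRED_TOKEN":
        # The one authentication error the list exempts from "permanent";
        # assumes the caller refreshes the token before retrying.
        return True
    # Rate limiting and all 5xx responses are transient; everything else is permanent.
    return status == 429 or 500 <= status <= 599


print(is_transient(503), is_transient(404, "NOTIFICATION_NOT_FOUND"))
```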

Retry Strategies

For transient failures, clients should retry with exponential backoff:

Start with an initial retry delay of one second
Double the delay with each subsequent attempt
Add a small random jitter to prevent thundering-herd problems
Cap the maximum delay at 60 seconds
Limit the total number of retry attempts (3 to 5 is typically reasonable)

function calculateRetryDelay(attempt) {
  // `attempt` is zero-based: 0 gives 1000 ms, 1 gives 2000 ms, and so on,
  // capped at 60000 ms
  const baseDelay = Math.min(1000 * 2 ** attempt, 60000);
  
  // Add jitter (±10% of base delay)
  const jitter = baseDelay * 0.1 * (Math.random() * 2 - 1);
  
  // Clamp again so positive jitter cannot push the delay past the 60-second cap
  return Math.min(baseDelay + jitter, 60000);
}
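To show how the delay calculation fits into a complete retry loop, here is a minimal Python sketch of the same policy. `TransientError`, `retry_delay`, and `call_with_retries` are hypothetical names introduced for this example, and the injectable `sleep` parameter exists only so the loop can be exercised without real waiting.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (5xx, 429, network)."""


def retry_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before retry number `attempt` (zero-based)."""
    delay = min(base * 2 ** attempt, cap)
    jitter = delay * 0.1 * random.uniform(-1.0, 1.0)  # +/-10% of the delay
    return min(delay + jitter, cap)  # jitter must not exceed the cap


def call_with_retries(
    operation: Callable[[], T],
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying on TransientError up to `max_retries` times."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last failure
            sleep(retry_delay(attempt))
    raise AssertionError("unreachable when max_retries >= 0")


# Usage: an operation that fails twice, then succeeds. Delays are recorded
# in a list instead of actually sleeping.
attempts: list[int] = []
waits: list[float] = []


def flaky_submit() -> str:
    attempts.append(1)
    if len(attempts) < 3:
        raise TransientError("simulated 503")
    return "accepted"


result = call_with_retries(flaky_submit, sleep=waits.append)
print(result, len(attempts), len(waits))
```

Only `TransientError` is retried, so permanent failures propagate immediately instead of consuming the retry budget.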

User Feedback

Client applications should provide appropriate feedback to users based on error types:

For transient errors, show a temporary “retrying” message
For permanent errors, show a clear explanation of the issue
For validation errors, highlight the specific fields with problems
For expired or invalidated notifications, remove them from the UI
For authentication issues, prompt for re-authentication
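The guidelines above might be encoded as a single decision function that the UI layer calls once per error. This is a sketch under stated assumptions: the `Feedback` enum and `choose_feedback` function are hypothetical, and limiting the re-authentication prompt to the two token codes (rather than also covering `AUTH_INSUFFICIENT_PERMISSIONS`) is a design choice made here, since signing in again would not grant missing permissions.

```python
from enum import Enum


class Feedback(Enum):
    SHOW_RETRYING = "show_retrying"
    SHOW_EXPLANATION = "show_explanation"
    HIGHLIGHT_FIELDS = "highlight_fields"
    REMOVE_NOTIFICATION = "remove_notification"
    PROMPT_REAUTH = "prompt_reauth"


REMOVED_CODES = frozenset({"NOTIFICATION_EXPIRED", "NOTIFICATION_INVALIDATED"})
# Assumption: only token problems can be fixed by signing in again.
REAUTH_CODES = frozenset({"AUTH_INVALID_TOKEN", "AUTH_EXPIRED_TOKEN"})
VALIDATION_CODES = frozenset({
    "INVALID_ACTION_ID",
    "INVALID_RESPONSE_DATA",
    "CONSTRAINT_VIOLATION",
    "MISSING_REQUIRED_FIELD",
})


def choose_feedback(status: int, code: str) -> Feedback:
    """Map an error to the kind of feedback the UI should show."""
    if code in REMOVED_CODES:
        return Feedback.REMOVE_NOTIFICATION
    if code in REAUTH_CODES:
        return Feedback.PROMPT_REAUTH
    if code in VALIDATION_CODES:
        return Feedback.HIGHLIGHT_FIELDS
    if status == 429 or 500 <= status <= 599:
        return Feedback.SHOW_RETRYING
    return Feedback.SHOW_EXPLANATION


print(choose_feedback(422, "CONSTRAINT_VIOLATION"))
```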

Service Callback Errors

When the ATP server delivers responses to service webhook endpoints, services may encounter processing errors that prevent successful handling of user decisions. Services should communicate these errors using a consistent format that enables appropriate ATP server behavior.

{
  "code": "RESOURCE_LOCKED",
  "message": "Cannot apply changes because resource is currently locked by another operation",
  "user_message": "The system is currently processing another change. Please try again in a few moments.",
  "retriable": true
}

Field	Type	Description
`code`	string	Service-specific error identifier
`message`	string	Technical error description for logging
`user_message`	string	Human-readable message for potential user display
`retriable`	boolean	Indicates whether retry attempts may succeed

The `retriable` field is particularly important because it tells the ATP server whether to attempt delivery again later. If it is set to false, the ATP server will not retry and may notify the user that their response could not be processed.
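To make the role of the `retriable` flag concrete, the sketch below shows how an ATP server might interpret a service's reply to a webhook delivery. It is not taken from a reference implementation: `CallbackOutcome` and `interpret_callback_reply` are hypothetical names, and falling back to "retry on 5xx" when the body is missing or malformed is an assumption.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CallbackOutcome:
    delivered: bool
    retry: bool
    user_message: Optional[str] = None


def interpret_callback_reply(status: int, body: str) -> CallbackOutcome:
    """Decide what to do after a service answers a webhook delivery."""
    if 200 <= status < 300:
        return CallbackOutcome(delivered=True, retry=False)

    try:
        error = json.loads(body)
    except json.JSONDecodeError:
        error = {}
    if not isinstance(error, dict):
        error = {}

    retriable = error.get("retriable")
    if not isinstance(retriable, bool):
        # Assumption: without an explicit flag, retry only on server errors.
        retriable = status >= 500

    return CallbackOutcome(
        delivered=False,
        retry=retriable,
        user_message=error.get("user_message"),
    )


locked = interpret_callback_reply(409, json.dumps({
    "code": "RESOURCE_LOCKED",
    "message": "Deployment resource is currently locked",
    "user_message": "Another deployment is in progress. Please try again later.",
    "retriable": True,
}))
print(locked)
```

Here the 409 reply is scheduled for redelivery because the service said so explicitly, even though a 4xx status on its own would not normally be retried.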

Logging and Monitoring

Robust ATP implementations should include comprehensive logging and monitoring for error conditions:

Log all errors with their request IDs
Include contextual information in logs (user ID, service ID, notification ID)
Monitor error rates by type and service
Set up alerts for unusual error patterns
Implement distributed tracing for complex deployments

When logging errors, be careful not to include sensitive information:

Never log authentication tokens
Redact personal information from error logs
Sanitize potentially sensitive fields in error details
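The logging guidance above can be sketched as a redaction step applied before anything reaches the log. The key names in `SENSITIVE_KEYS` are a hypothetical deny-list chosen for illustration, and `redact` and `log_error` are example helpers rather than part of ATP.

```python
import logging
from typing import Any

logger = logging.getLogger("atp.errors")

# Hypothetical deny-list; a real deployment would tailor this to its own payloads.
SENSITIVE_KEYS = frozenset(
    {"authorization", "token", "access_token", "api_key", "password", "email"}
)


def redact(value: Any) -> Any:
    """Return a copy of `value` with sensitive keys masked at any nesting depth."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if str(key).lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


def log_error(error: dict[str, Any], service_id: str) -> None:
    """Log an error with its request ID, context, and redacted details."""
    logger.error(
        "ATP error %s (request_id=%s, service_id=%s, details=%s)",
        error.get("code"),
        error.get("request_id"),
        service_id,
        redact(error.get("details", {})),
    )


sample = {
    "code": "AUTH_INVALID_TOKEN",
    "request_id": "req_abc123def456",
    "details": {
        "token": "secret-value",
        "notification_id": "550e8400-e29b-41d4-a716-446655440000",
        "recipients": [{"email": "user@example.com", "role": "approver"}],
    },
}
safe_details = redact(sample["details"])
log_error(sample, service_id="svc_example")
```

Because `redact` builds a new structure, the original error object stays intact for any code that still needs the real values.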

Error Handling Examples

Client-Side Error Handling (TypeScript)

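// Assumes userToken, delay, calculateRetryDelay, refreshUserToken, logError,
// and the UI helpers (removeNotification, showUserMessage, showValidationError)
// are defined elsewhere in the application.
async function submitResponse(
  notification: Notification,
  actionId: string,
  responseData: unknown
): Promise<void> {
  const MAX_RETRIES = 3;
  let attempt = 0;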
  
  while (attempt <= MAX_RETRIES) {
    try {
      const response = await fetch('/api/v1/client/respond', {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
          'Authorization': `Bearer ${userToken}`
        },
        body: JSON.stringify({
          notification_id: notification.id,
          action_id: actionId,
          response_data: responseData
        })
      });
      
      if (response.ok) {
        return; // Success!
      }
      
      const errorData = await response.json();
      
      // Handle specific error cases
      switch (errorData.code) {
        case 'NOTIFICATION_EXPIRED':
        case 'NOTIFICATION_INVALIDATED':
        case 'NOTIFICATION_ALREADY_RESPONDED':
          // Terminal state, remove from UI
          removeNotification(notification.id);
          showUserMessage(errorData.message);
          return;
          
        case 'INVALID_RESPONSE_DATA':
        case 'CONSTRAINT_VIOLATION':
          // Validation error, show specific feedback
          showValidationError(errorData.details);
          return;
          
        case 'AUTH_EXPIRED_TOKEN':
          // Refresh the token and retry; counting this attempt keeps a
          // repeatedly failing refresh from looping forever
          await refreshUserToken();
          attempt++;
          continue;
          
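        case 'RATE_LIMIT_EXCEEDED': {
          // Prefer the Retry-After header when it holds a number of seconds;
          // if it is absent or an HTTP date, fall back to computed backoff
          const retryAfter = response.headers.get('Retry-After');
          const retryAfterSeconds = retryAfter ? parseInt(retryAfter, 10) : NaN;
          const delayMs = Number.isNaN(retryAfterSeconds)
            ? calculateRetryDelay(attempt)
            : retryAfterSeconds * 1000;
          await delay(delayMs);
          attempt++;
          continue;
        }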
      }
      
      // Server errors (5xx) are retryable
      if (response.status >= 500) {
        await delay(calculateRetryDelay(attempt));
        attempt++;
        continue;
      }
      
      // Other errors are considered permanent
      showUserMessage(`Error: ${errorData.message}`);
      return;
      
    } catch (error) {
      // fetch() rejects with a TypeError on network failures, which are retryable
      if (error instanceof TypeError) {
        await delay(calculateRetryDelay(attempt));
        attempt++;
        continue;
      }
      
      // Other exceptions are unexpected and should be logged
      logError('Unexpected error during response submission', error);
      showUserMessage('An unexpected error occurred. Please try again later.');
      return;
    }
  }
  
  // Retries exhausted (network failures, rate limiting, or server errors)
  showUserMessage('Unable to submit your response after several attempts. Please try again later.');
}

Service-Side Webhook Error Handling (Python)

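import logging

from flask import Flask, jsonify, request

# verify_signature, webhook_secret, get_notification_context,
# process_deployment_approval, and the exception classes used below are
# application-specific and assumed to be defined elsewhere.
app = Flask(__name__)
logger = logging.getLogger(__name__)

@app.route('/atp/webhook', methods=['POST'])
def handle_atp_webhook():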
    # Verify webhook signature
    signature = request.headers.get('X-ATP-Signature')
    if not verify_signature(request.data, signature, webhook_secret):
        return jsonify({
            'code': 'INVALID_SIGNATURE',
            'message': 'Invalid webhook signature',
            'retriable': False
        }), 401
    
    data = request.json
    notification_id = data['notification_id']
    action_id = data['action_id']
    response_data = data.get('response_data')
    
    try:
        # Retrieve context for this notification
        context = get_notification_context(notification_id)
        if not context:
            return jsonify({
                'code': 'UNKNOWN_NOTIFICATION',
                'message': 'No context found for this notification',
                'retriable': False
            }), 404
        
        # Process the response based on action type
        if action_id == 'approve_deployment':
            try:
                result = process_deployment_approval(context, response_data)
                return jsonify({'status': 'success', 'result': result})
            except ResourceLockedException:
                return jsonify({
                    'code': 'RESOURCE_LOCKED',
                    'message': 'Deployment resource is currently locked',
                    'user_message': 'Another deployment is in progress. Please try again later.',
                    'retriable': True
                }), 409
        
        # Other action handlers...
        
    except TemporaryFailure as e:
        # Log the error with request ID for tracing
        logger.error(f"Temporary failure processing webhook: {str(e)}", 
                    extra={'request_id': request.headers.get('X-Request-ID')})
        
        # Return 503 to trigger retry with backoff
        return jsonify({
            'code': 'TEMPORARY_FAILURE',
            'message': str(e),
            'retriable': True
        }), 503
        
    except PermanentFailure as e:
        # Log the permanent error
        logger.error(f"Permanent failure processing webhook: {str(e)}",
                    extra={'request_id': request.headers.get('X-Request-ID')})
        
        # Return 422 to indicate the request was valid but couldn't be processed
        return jsonify({
            'code': e.code,
            'message': str(e),
            'user_message': e.user_message,
            'retriable': False
        }), 422
        
    except Exception as e:
        # Unexpected errors should be logged with full details
        logger.exception("Unexpected error processing webhook",
                       extra={'request_id': request.headers.get('X-Request-ID')})
        
        # Return 500 with minimal details to avoid leaking implementation details
        return jsonify({
            'code': 'INTERNAL_ERROR',
            'message': 'An unexpected error occurred',
            'retriable': True
        }), 500

By following these error handling patterns, ATP implementations can provide robust, user-friendly experiences even when things go wrong.
