Overview

The Agent Triage Protocol defines standard error responses that enable consistent error handling across implementations. This page describes the error response format, standard error codes, and best practices for handling errors in ATP implementations.

Error Response Format

Error responses use a structured format that pairs a machine-readable code with a human-readable message. A typical error object looks like this:
{
  "code": "NOTIFICATION_EXPIRED",
  "message": "The notification deadline has passed and no longer accepts responses",
  "details": {
    "notification_id": "550e8400-e29b-41d4-a716-446655440000",
    "expired_at": "2025-05-25T11:00:00Z"
  },
  "request_id": "req_abc123def456"
}
| Field | Type | Description |
| --- | --- | --- |
| code | string | Standardized error identifier for programmatic handling |
| message | string | Human-readable error description |
| details | object | Additional context specific to the error type |
| request_id | string | Unique identifier for request tracing |

The request_id field is particularly important for troubleshooting, as it allows correlation of error reports across system boundaries and log files.
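
As an illustration, a client might parse this error object into a small typed structure before acting on it. The sketch below is a minimal example, not part of the protocol: the `ATPError` class, the `parse_error` helper, and the `UNKNOWN_ERROR` fallback code are hypothetical names, and it assumes `details` and `request_id` may be absent.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class ATPError:
    """Parsed ATP error object (hypothetical helper type, not defined by the protocol)."""
    code: str
    message: str
    details: Dict[str, Any] = field(default_factory=dict)
    request_id: Optional[str] = None


def parse_error(body: str) -> ATPError:
    """Parse a JSON error body, tolerating missing optional fields."""
    try:
        raw = json.loads(body)
    except json.JSONDecodeError:
        raw = None
    if not isinstance(raw, dict) or "code" not in raw:
        # Assumption: bodies that are not ATP error objects (for example an
        # HTML page from a proxy) are mapped to a local fallback code.
        return ATPError(code="UNKNOWN_ERROR", message=body[:200])
    return ATPError(
        code=str(raw["code"]),
        message=str(raw.get("message", "")),
        details=raw.get("details") or {},
        request_id=raw.get("request_id"),
    )
```

Keeping `request_id` on the parsed object makes it easy to attach to every log line written while handling the error.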

HTTP Status Codes

The protocol uses standard HTTP status codes to indicate the class of error:
| Status Code | Description | When Used |
| --- | --- | --- |
| 400 Bad Request | Invalid request format or parameters | Malformed JSON, missing required fields |
| 401 Unauthorized | Missing or invalid authentication | Invalid or expired API key/token |
| 403 Forbidden | Valid auth but insufficient permissions | Attempting to access another service’s notifications |
| 404 Not Found | Resource doesn’t exist | Notification ID not found |
| 409 Conflict | Request conflicts with current state | Responding to already-answered notification |
| 422 Unprocessable Entity | Request validation failed | Response data doesn’t match expected format |
| 429 Too Many Requests | Rate limit exceeded | Too many requests in time period |
| 500 Internal Server Error | Server-side failure | Unexpected errors in ATP server |
| 503 Service Unavailable | Temporary service issues | Server maintenance or overload |

Error Codes

The protocol defines specific error codes that provide more detail than HTTP status alone. These codes allow client applications to implement specific handling logic for different error conditions.

Authentication Errors

| Code | Description |
| --- | --- |
| AUTH_INVALID_TOKEN | The provided token is malformed or invalid |
| AUTH_EXPIRED_TOKEN | The authentication token has expired |
| AUTH_INSUFFICIENT_PERMISSIONS | Token lacks required permissions |

Notification Errors

| Code | Description |
| --- | --- |
| NOTIFICATION_NOT_FOUND | Notification doesn’t exist or is no longer accessible |
| NOTIFICATION_EXPIRED | Notification deadline has passed |
| NOTIFICATION_ALREADY_RESPONDED | Notification has already been answered |
| NOTIFICATION_INVALIDATED | Service marked notification as invalid |

Validation Errors

| Code | Description |
| --- | --- |
| INVALID_ACTION_ID | The specified action_id doesn’t exist for this notification |
| INVALID_RESPONSE_DATA | Response data doesn’t match expected format |
| CONSTRAINT_VIOLATION | Response violates defined constraints |
| MISSING_REQUIRED_FIELD | Required field is missing from request |

Service Errors

| Code | Description |
| --- | --- |
| SERVICE_NOT_REGISTERED | Service hasn’t been registered with ATP |
| SERVICE_SUSPENDED | Service has been temporarily suspended |
| CALLBACK_FAILED | Failed to deliver response to service callback |

Rate Limiting

| Code | Description |
| --- | --- |
| RATE_LIMIT_EXCEEDED | Too many requests from this client/service |
| QUOTA_EXCEEDED | Monthly/daily quota has been exceeded |

Client Error Handling

Client implementations need deliberate error handling to operate reliably in production. The protocol distinguishes transient failures, which warrant retry attempts, from permanent errors, which require user intervention or an alternative action.

Transient vs. Permanent Errors

Transient errors are temporary issues that may resolve with time or retries:
  • All 5xx series errors
  • Rate limiting (429) responses
  • Network connectivity issues
  • Webhook delivery failures
Permanent errors indicate fundamental problems that won’t be resolved by retrying:
  • Authentication errors (except token expiration)
  • Resource not found errors
  • Validation errors
  • Business logic errors (e.g., notification already responded)
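
The distinction above can be expressed as a small classification function. The following is a sketch under stated assumptions rather than a normative rule: `ErrorClass` and `classify_error` are hypothetical names, a `status` of `None` stands for "no HTTP response received", and mapping `CALLBACK_FAILED` to the transient class is an assumption about how webhook delivery failures surface.

```python
from enum import Enum
from typing import Optional


class ErrorClass(Enum):
    TRANSIENT = "transient"            # retry with backoff
    REAUTHENTICATE = "reauthenticate"  # refresh the token, then retry
    PERMANENT = "permanent"            # do not retry


def classify_error(status: Optional[int], code: Optional[str] = None) -> ErrorClass:
    """Classify a failed request; status=None means no HTTP response arrived."""
    if status is None:
        return ErrorClass.TRANSIENT  # network connectivity issue
    if code == "AUTH_EXPIRED_TOKEN":
        # The one authentication error that is not permanent
        return ErrorClass.REAUTHENTICATE
    if status == 429 or 500 <= status <= 599:
        return ErrorClass.TRANSIENT
    if code == "CALLBACK_FAILED":
        # Assumption: webhook delivery failures are reported under this code
        return ErrorClass.TRANSIENT
    return ErrorClass.PERMANENT
```

Centralizing the decision in one function keeps retry loops and UI code from each re-deriving which errors are worth retrying.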

Retry Strategies

For transient failures, clients should retry with exponential backoff:
  1. Begin with an initial retry delay of one second
  2. Double the delay with each subsequent attempt
  3. Add small random jitter to prevent thundering herd problems
  4. Cap the maximum delay at 60 seconds
  5. Limit total retry attempts (3 to 5 is typically reasonable)
function calculateRetryDelay(attempt) {
  // attempt is zero-based: 1000ms, 2000ms, 4000ms, ... capped at 60000ms
  const baseDelay = Math.min(1000 * Math.pow(2, attempt), 60000);
  
  // Add jitter (±10% of base delay)
  const jitter = baseDelay * 0.1 * (Math.random() * 2 - 1);
  
  // Re-apply the cap so jitter cannot push the delay past 60 seconds
  return Math.min(baseDelay + jitter, 60000);
}
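
For services or clients written in Python, the same strategy might look like the sketch below, which also covers the retry limit from step 5. The names `TransientError`, `retry_delay`, and `call_with_retries` are hypothetical, and the injectable `sleep` parameter is an illustrative choice that makes the loop testable without waiting.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def retry_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Delay in seconds for a zero-based attempt: 1s, 2s, 4s, ... with ±10% jitter."""
    delay = min(base * (2 ** attempt), cap)
    jitter = delay * 0.1 * random.uniform(-1.0, 1.0)
    return min(delay + jitter, cap)  # jitter never exceeds the cap


def call_with_retries(
    operation: Callable[[], T],
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying on TransientError up to max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted; surface the last error
            sleep(retry_delay(attempt))
    raise ValueError("max_retries must be non-negative")
```

Only exceptions explicitly marked as transient are retried here, so permanent errors propagate immediately instead of being repeated.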

User Feedback

Client applications should provide appropriate feedback to users based on error types:
  1. For transient errors, show a temporary “retrying” message
  2. For permanent errors, show a clear explanation of the issue
  3. For validation errors, highlight the specific fields with problems
  4. For expired or invalidated notifications, remove them from the UI
  5. For authentication issues, prompt for re-authentication
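
One way to keep this mapping consistent across an application is to compute the feedback type in a single place. The sketch below is illustrative only: `Feedback` and `choose_feedback` are hypothetical names, and it assumes that re-authentication is offered for token problems (401) but not for insufficient permissions, which signing in again would not fix.

```python
from enum import Enum
from typing import Optional


class Feedback(Enum):
    SHOW_RETRYING = "show_retrying"
    HIGHLIGHT_FIELDS = "highlight_fields"
    REMOVE_NOTIFICATION = "remove_notification"
    PROMPT_REAUTH = "prompt_reauth"
    EXPLAIN_ERROR = "explain_error"


VALIDATION_CODES = {
    "INVALID_ACTION_ID",
    "INVALID_RESPONSE_DATA",
    "CONSTRAINT_VIOLATION",
    "MISSING_REQUIRED_FIELD",
}
STALE_CODES = {"NOTIFICATION_EXPIRED", "NOTIFICATION_INVALIDATED"}
TOKEN_CODES = {"AUTH_INVALID_TOKEN", "AUTH_EXPIRED_TOKEN"}


def choose_feedback(status: Optional[int], code: Optional[str]) -> Feedback:
    """Pick the user-facing feedback for a failed request (status=None: no response)."""
    if code in STALE_CODES:
        return Feedback.REMOVE_NOTIFICATION
    if code in VALIDATION_CODES:
        return Feedback.HIGHLIGHT_FIELDS
    if code in TOKEN_CODES or status == 401:
        return Feedback.PROMPT_REAUTH
    if status is None or status == 429 or status >= 500:
        return Feedback.SHOW_RETRYING
    return Feedback.EXPLAIN_ERROR
```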

Service Callback Errors

When the ATP server delivers responses to service webhook endpoints, services may encounter processing errors that prevent successful handling of user decisions. Services should communicate these errors using a consistent format that enables appropriate ATP server behavior.
{
  "code": "RESOURCE_LOCKED",
  "message": "Cannot apply changes because resource is currently locked by another operation",
  "user_message": "The system is currently processing another change. Please try again in a few moments.",
  "retriable": true
}
| Field | Type | Description |
| --- | --- | --- |
| code | string | Service-specific error identifier |
| message | string | Technical error description for logging |
| user_message | string | Human-readable message for potential user display |
| retriable | boolean | Indicates whether retry attempts may succeed |

The retriable field is particularly important as it tells the ATP server whether it should attempt to deliver the response again later. If set to false, the ATP server will not retry and may notify the user that their response could not be processed.
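
To make the role of `retriable` concrete, here is a sketch of how an ATP server implementation might decide whether to redeliver after a failed callback. The function name `should_redeliver` is hypothetical, and the fallback to the HTTP status class when the flag is missing is an assumption, since the text above does not specify that case.

```python
import json
from typing import Optional


def should_redeliver(status: int, body: Optional[str]) -> bool:
    """Decide whether a webhook delivery should be attempted again later."""
    if 200 <= status < 300:
        return False  # delivered and processed successfully
    try:
        payload = json.loads(body) if body else None
    except json.JSONDecodeError:
        payload = None
    if isinstance(payload, dict) and isinstance(payload.get("retriable"), bool):
        return payload["retriable"]  # the service's explicit signal wins
    # Assumption: without an explicit flag, only server-side failures are retried
    return status >= 500
```

With this rule, a `409` carrying `"retriable": true` (as in the `RESOURCE_LOCKED` example) is retried, while a `5xx` that explicitly sets `"retriable": false` is not.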

Logging and Monitoring

Robust ATP implementations should include comprehensive logging and monitoring for error conditions:
  1. Log all errors with their request IDs
  2. Include contextual information in logs (user ID, service ID, notification ID)
  3. Monitor error rates by type and service
  4. Set up alerts for unusual error patterns
  5. Implement distributed tracing for complex deployments
When logging errors, be careful not to include sensitive information:
  • Never log authentication tokens
  • Redact personal information from error logs
  • Sanitize potentially sensitive fields in error details
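
A redaction step applied before log entries are written is one way to follow these rules. The sketch below is a minimal example under stated assumptions: the `redact` helper, the `SENSITIVE_KEYS` set, and the bearer-token pattern are illustrative choices, and a real deployment would need its own list of sensitive fields.

```python
import re
from typing import Any

# Assumption: field names that should never reach the logs in clear text
SENSITIVE_KEYS = {
    "authorization", "token", "access_token", "api_key",
    "password", "secret", "email",
}
BEARER_RE = re.compile(r"Bearer\s+[A-Za-z0-9._~+/=-]+")


def redact(value: Any) -> Any:
    """Return a copy of value with sensitive fields and bearer tokens masked."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if str(key).lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    if isinstance(value, str):
        # Mask tokens embedded in free-form messages
        return BEARER_RE.sub("Bearer [REDACTED]", value)
    return value
```

Identifiers needed for tracing, such as `request_id` and `notification_id`, pass through unchanged so that logs remain useful for correlation.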

Error Handling Examples

Client-Side Error Handling (TypeScript)

async function submitResponse(
  notification: Notification,
  actionId: string,
  responseData: any
): Promise<void> {
  const MAX_RETRIES = 3;
  let attempt = 0;
  
  while (attempt <= MAX_RETRIES) {
    try {
      const response = await fetch('/api/v1/client/respond', {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
          'Authorization': `Bearer ${userToken}`
        },
        body: JSON.stringify({
          notification_id: notification.id,
          action_id: actionId,
          response_data: responseData
        })
      });
      
      if (response.ok) {
        return; // Success!
      }
      
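      // Error bodies are not always JSON (e.g. a proxy's 503 page); fall back to the status text
      const errorData = await response.json().catch(() => ({ message: response.statusText }));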
      
      // Handle specific error cases
      switch (errorData.code) {
        case 'NOTIFICATION_EXPIRED':
        case 'NOTIFICATION_INVALIDATED':
        case 'NOTIFICATION_ALREADY_RESPONDED':
          // Terminal state, remove from UI
          removeNotification(notification.id);
          showUserMessage(errorData.message);
          return;
          
        case 'INVALID_RESPONSE_DATA':
        case 'CONSTRAINT_VIOLATION':
          // Validation error, show specific feedback
          showValidationError(errorData.details);
          return;
          
        case 'AUTH_EXPIRED_TOKEN':
          // Refresh the token and retry. The attempt is counted so that a
          // token that keeps failing cannot cause an endless loop.
          await refreshUserToken();
          attempt++;
          continue;
          
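        case 'RATE_LIMIT_EXCEEDED': {
          // Honor Retry-After when it is a number of seconds; otherwise back off.
          // The braces scope the const declarations to this case.
          const retryAfter = Number(response.headers.get('Retry-After'));
          const delayMs = Number.isFinite(retryAfter) && retryAfter > 0
            ? retryAfter * 1000
            : calculateRetryDelay(attempt);
          await delay(delayMs);
          attempt++;
          continue;
        }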
      }
      
      // Server errors (5xx) are retryable
      if (response.status >= 500) {
        await delay(calculateRetryDelay(attempt));
        attempt++;
        continue;
      }
      
      // Other errors are considered permanent
      showUserMessage(`Error: ${errorData.message}`);
      return;
      
    } catch (error) {
      // fetch() rejects with a TypeError on network failure; these are retryable
      if (error instanceof TypeError) {
        await delay(calculateRetryDelay(attempt));
        attempt++;
        continue;
      }
      
      // Other exceptions are unexpected and should be logged
      logError('Unexpected error during response submission', error);
      showUserMessage('An unexpected error occurred. Please try again later.');
      return;
    }
  }
  
  // Retries exhausted (network failures, server errors, or rate limiting)
  showUserMessage('Unable to submit your response right now. Please try again later.');
}

Service-Side Webhook Error Handling (Python)

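import logging

from flask import Flask, jsonify, request

app = Flask(__name__)
logger = logging.getLogger(__name__)

# verify_signature, webhook_secret, get_notification_context,
# process_deployment_approval, and the exception classes used below are
# application-specific and must be provided by the service.

@app.route('/atp/webhook', methods=['POST'])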
def handle_atp_webhook():
    # Verify webhook signature
    signature = request.headers.get('X-ATP-Signature')
    if not verify_signature(request.data, signature, webhook_secret):
        return jsonify({
            'code': 'INVALID_SIGNATURE',
            'message': 'Invalid webhook signature',
            'retriable': False
        }), 401
    
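    # Parse defensively so a malformed payload yields a structured error
    # instead of an unhandled KeyError
    data = request.get_json(silent=True) or {}
    notification_id = data.get('notification_id')
    action_id = data.get('action_id')
    response_data = data.get('response_data')
    if not notification_id or not action_id:
        return jsonify({
            'code': 'MISSING_REQUIRED_FIELD',
            'message': 'notification_id and action_id are required',
            'retriable': False
        }), 400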
    
    try:
        # Retrieve context for this notification
        context = get_notification_context(notification_id)
        if not context:
            return jsonify({
                'code': 'UNKNOWN_NOTIFICATION',
                'message': 'No context found for this notification',
                'retriable': False
            }), 404
        
        # Process the response based on action type
        if action_id == 'approve_deployment':
            try:
                result = process_deployment_approval(context, response_data)
                return jsonify({'status': 'success', 'result': result})
            except ResourceLockedException:
                return jsonify({
                    'code': 'RESOURCE_LOCKED',
                    'message': 'Deployment resource is currently locked',
                    'user_message': 'Another deployment is in progress. Please try again later.',
                    'retriable': True
                }), 409
        
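        # Other action handlers...
        
        # No handler matched. Return an explicit error so the view never falls
        # through and returns None (UNSUPPORTED_ACTION is an illustrative code).
        return jsonify({
            'code': 'UNSUPPORTED_ACTION',
            'message': f'No handler for action {action_id}',
            'retriable': False
        }), 422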
        
    except TemporaryFailure as e:
        # Log the error with request ID for tracing
        logger.error(f"Temporary failure processing webhook: {str(e)}", 
                    extra={'request_id': request.headers.get('X-Request-ID')})
        
        # Return 503 to trigger retry with backoff
        return jsonify({
            'code': 'TEMPORARY_FAILURE',
            'message': str(e),
            'retriable': True
        }), 503
        
    except PermanentFailure as e:
        # Log the permanent error
        logger.error(f"Permanent failure processing webhook: {str(e)}",
                    extra={'request_id': request.headers.get('X-Request-ID')})
        
        # Return 422 to indicate the request was valid but couldn't be processed
        return jsonify({
            'code': e.code,
            'message': str(e),
            'user_message': e.user_message,
            'retriable': False
        }), 422
        
    except Exception as e:
        # Unexpected errors should be logged with full details
        logger.exception("Unexpected error processing webhook",
                         extra={'request_id': request.headers.get('X-Request-ID')})
        
        # Return 500 with minimal details to avoid leaking implementation details
        return jsonify({
            'code': 'INTERNAL_ERROR',
            'message': 'An unexpected error occurred',
            'retriable': True
        }), 500
By following these error handling patterns, ATP implementations can provide robust, user-friendly experiences even when things go wrong.