
Error Handling & Recovery Patterns

Production-grade error handling for MCP servers, FAI Engine, Azure SDK calls, and LLM API interactions.

Error Sources in AI Systems

| Source | Example | Frequency |
|---|---|---|
| LLM API | Rate limits, timeout, content filter | High |
| Azure SDK | Transient network, auth expiry | Medium |
| MCP transport | Connection drop, malformed JSON | Medium |
| User input | Prompt injection, invalid queries | High |
| Infrastructure | Cold start, memory pressure | Low |

Pattern 1: Retry with Exponential Backoff

Python

    import openai
    from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type


    class TransientError(Exception):
        """Failure that is worth retrying (429 or 5xx)."""


    @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=2, max=30),
        retry=retry_if_exception_type(TransientError),
    )
    async def call_openai(client, messages, max_tokens=500):
        try:
            response = await client.chat.completions.create(
                model="gpt-4o",
                messages=messages,
                max_tokens=max_tokens,
                timeout=30,
            )
            return response.choices[0].message.content
        except openai.APIStatusError as e:
            # The OpenAI SDK surfaces HTTP failures as APIStatusError,
            # not as httpx.HTTPStatusError.
            if e.status_code == 429:
                raise TransientError(f"Rate limited: {e}") from e
            if e.status_code >= 500:
                raise TransientError(f"Server error: {e}") from e
            raise  # Non-retryable

Node.js

    async function callOpenAI(client, messages, maxRetries = 3) {
      for (let attempt = 1; attempt <= maxRetries; attempt++) {
        try {
          const response = await client.chat.completions.create({
            model: 'gpt-4o',
            messages,
            max_tokens: 500,
          });
          return response.choices[0].message.content;
        } catch (error) {
          const status = error?.status;
          // Retry only rate limits and server errors, and only if attempts remain
          if ((status === 429 || status >= 500) && attempt < maxRetries) {
            const delay = Math.min(1000 * Math.pow(2, attempt), 30000);
            await new Promise(r => setTimeout(r, delay));
            continue;
          }
          throw error;
        }
      }
    }
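To make the wait times concrete, the sketch below reproduces the delay formula from the Node.js loop, `min(1000 * 2^attempt, 30000)`, in Python. The helper name `backoff_delays_ms` and its parameters are hypothetical and exist only for illustration.

```python
def backoff_delays_ms(max_retries: int = 3, base_ms: int = 1000, cap_ms: int = 30000) -> list[int]:
    """Delays slept between attempts, mirroring min(base * 2**attempt, cap).

    The loop sleeps only after a failed attempt that is not the last one,
    so max_retries attempts produce max_retries - 1 delays.
    """
    return [min(base_ms * 2 ** attempt, cap_ms) for attempt in range(1, max_retries)]


print(backoff_delays_ms())   # [2000, 4000]
print(backoff_delays_ms(7))  # [2000, 4000, 8000, 16000, 30000, 30000]
```

With the default of three attempts, the worst case adds about six seconds of waiting on top of the request time itself, and the cap only comes into play from the fifth retry onward.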

Pattern 2: Circuit Breaker

    import time


    class CircuitBreaker:
        def __init__(self, failure_threshold=5, recovery_timeout=60):
            self.failure_count = 0
            self.failure_threshold = failure_threshold
            self.recovery_timeout = recovery_timeout
            self.last_failure_time = 0.0
            self.state = "closed"  # closed | open | half-open

        def call(self, func, *args, **kwargs):
            if self.state == "open":
                # monotonic() is immune to wall-clock adjustments
                if time.monotonic() - self.last_failure_time > self.recovery_timeout:
                    self.state = "half-open"  # allow one trial call through
                else:
                    raise RuntimeError("Circuit breaker OPEN")
            try:
                result = func(*args, **kwargs)
            except Exception:
                self.failure_count += 1
                self.last_failure_time = time.monotonic()
                if self.failure_count >= self.failure_threshold:
                    self.state = "open"
                raise
            # Any success closes the circuit and clears the count, so only
            # consecutive failures can trip the breaker.
            self.state = "closed"
            self.failure_count = 0
            return result
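The `call` method above invokes `func` synchronously, so passing it a coroutine function such as `call_openai` would return an un-awaited coroutine and never observe the failure. A minimal async sketch of the same state machine follows. `AsyncCircuitBreaker`, `CircuitOpenError`, and the injectable `clock` parameter are hypothetical names introduced here, and this variant clears the failure count after every success.

```python
import asyncio
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class AsyncCircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=60, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock  # injectable so the recovery timeout can be simulated
        self.failure_count = 0
        self.last_failure_time = 0.0
        self.state = "closed"  # closed | open | half-open

    async def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.last_failure_time > self.recovery_timeout:
                self.state = "half-open"  # let one trial call through
            else:
                raise CircuitOpenError("Circuit breaker OPEN")
        try:
            result = await func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            self.last_failure_time = self.clock()
            if self.failure_count >= self.failure_threshold:
                self.state = "open"
            raise
        self.state = "closed"
        self.failure_count = 0
        return result


async def demo():
    now = [0.0]  # fake clock, advanced manually below
    breaker = AsyncCircuitBreaker(failure_threshold=2, recovery_timeout=60, clock=lambda: now[0])

    async def flaky():
        raise ConnectionError("upstream down")

    async def healthy():
        return "ok"

    states = []
    for _ in range(2):  # two failures reach the threshold
        try:
            await breaker.call(flaky)
        except ConnectionError:
            pass
    states.append(breaker.state)
    try:
        await breaker.call(healthy)  # rejected without calling healthy()
    except CircuitOpenError:
        states.append("rejected")
    now[0] = 61.0  # recovery timeout has elapsed
    result = await breaker.call(healthy)
    states.append(breaker.state)
    return states, result


print(asyncio.run(demo()))  # (['open', 'rejected', 'closed'], 'ok')
```

Injecting the clock keeps the demo instant and deterministic; in production the default `time.monotonic` would be used.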

Pattern 3: MCP Server Error Handling

    import json
    import logging

    logger = logging.getLogger(__name__)

    # `mcp` is the server instance created elsewhere in the module;
    # `perform_search` is the project's own search helper.
    @mcp.tool()
    async def search_knowledge(query: str, max_results: int = 5) -> str:
        """Search FROOT knowledge modules."""
        if not query or len(query) > 500:
            return json.dumps({"error": "Query must be 1-500 characters"})
        try:
            results = perform_search(query, max_results)
            return json.dumps({"results": results})
        except FileNotFoundError:
            return json.dumps({"error": "Knowledge base not found"})
        except Exception as e:
            # Log the details server-side; return only a generic message
            logger.error(f"Search failed: {e}", exc_info=True)
            return json.dumps({"error": "Search temporarily unavailable"})
⚠️ Never Raise in MCP Tools

MCP tools must return JSON errors, never propagate exceptions. The AI model can’t recover from a crashed tool.
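One way to apply this rule uniformly, rather than repeating `try`/`except` in every tool, is a decorator that converts exceptions into JSON error strings. The sketch below is an illustration under assumptions: `json_errors` and `demo_search` are hypothetical names, and treating `ValueError` as a safe-to-show validation error is a design choice made here, not something the MCP SDK prescribes.

```python
import asyncio
import functools
import json
import logging

logger = logging.getLogger(__name__)


def json_errors(public_message="Tool temporarily unavailable"):
    """Wrap an async tool so any exception becomes a JSON error string."""

    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            try:
                return await func(*args, **kwargs)
            except ValueError as e:
                # Validation problems are safe to show to the caller
                return json.dumps({"error": str(e)})
            except Exception:
                # Log the traceback, but return only a generic message
                logger.exception("Tool %s failed", func.__name__)
                return json.dumps({"error": public_message})

        return wrapper

    return decorator


@json_errors()
async def demo_search(query: str) -> str:
    if not query or len(query) > 500:
        raise ValueError("Query must be 1-500 characters")
    raise RuntimeError("index unavailable")  # simulate a backend failure


print(asyncio.run(demo_search("")))                # {"error": "Query must be 1-500 characters"}
print(asyncio.run(demo_search("retry patterns")))  # {"error": "Tool temporarily unavailable"}
```

In a real server the decorator would presumably sit beneath the tool registration decorator, so that the registered function is the wrapped one.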

Pattern 4: Timeout Wrapper

    function withTimeout(promise, ms, label = 'Operation') {
      let timer;
      const timeout = new Promise((_, reject) => {
        timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
      });
      // Whichever settles first wins; always clear the timer afterwards
      return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
    }

    const result = await withTimeout(callOpenAI(client, messages), 30000, 'Azure OpenAI');
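For Python callers, a comparable wrapper can be sketched with `asyncio.wait_for`. The helper name `with_timeout` is hypothetical. One difference from the `Promise.race` version is worth noting: `asyncio.wait_for` cancels the underlying awaitable when the timeout expires, whereas `Promise.race` leaves the losing promise running.

```python
import asyncio


async def with_timeout(awaitable, seconds, label="Operation"):
    """Await `awaitable`; raise TimeoutError with a labelled message on expiry."""
    try:
        return await asyncio.wait_for(awaitable, timeout=seconds)
    except asyncio.TimeoutError:
        # Re-raise as the builtin TimeoutError with a readable message
        raise TimeoutError(f"{label} timed out after {seconds}s") from None


async def demo():
    async def slow():
        await asyncio.sleep(1)
        return "done"

    async def fast():
        return "done"

    print(await with_timeout(fast(), 0.5))
    try:
        await with_timeout(slow(), 0.05, "Azure OpenAI")
    except TimeoutError as e:
        print(e)


asyncio.run(demo())
```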

Decision Matrix

| Error Type | Retry? | Fallback | User Message |
|---|---|---|---|
| 429 Rate Limit | ✅ backoff | Queue request | "Please wait a moment" |
| 500 Server Error | ✅ 3 attempts | Cached response | "Temporarily unavailable" |
| 401 Auth Expired | ❌ | Refresh token | "Please re-authenticate" |
| 400 Bad Request | ❌ | Fix request | "Invalid input: [details]" |
| Timeout | ✅ 1 retry | Cached response | "Request took too long" |
| Content Filter | ❌ | Rephrase | "Content could not be processed" |
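The matrix can also be encoded as data, so that retry logic and user messaging read from a single table. The following is a sketch under assumptions: `Policy`, `POLICIES`, and `classify` are hypothetical names, the attempt count for 429 is assumed to be 3 (the matrix only says "backoff"), and rows without a retry marker are treated as non-retryable.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Policy:
    retry: bool
    max_attempts: int  # total attempts, including the first call
    fallback: str
    user_message: str


# One entry per row of the decision matrix
POLICIES = {
    "rate_limit": Policy(True, 3, "queue request", "Please wait a moment"),
    "server_error": Policy(True, 3, "cached response", "Temporarily unavailable"),
    "auth_expired": Policy(False, 1, "refresh token", "Please re-authenticate"),
    "bad_request": Policy(False, 1, "fix request", "Invalid input"),
    "timeout": Policy(True, 2, "cached response", "Request took too long"),
    "content_filter": Policy(False, 1, "rephrase", "Content could not be processed"),
}


def classify(status: Optional[int], timed_out: bool = False, filtered: bool = False) -> str:
    """Map an HTTP status (or a timeout / content-filter flag) to a matrix row."""
    if timed_out:
        return "timeout"
    if filtered:
        return "content_filter"
    if status == 429:
        return "rate_limit"
    if status == 401:
        return "auth_expired"
    if status is not None and status >= 500:
        return "server_error"
    return "bad_request"  # remaining cases are treated as permanent request errors


policy = POLICIES[classify(503)]
print(policy.retry, policy.max_attempts, policy.fallback)  # True 3 cached response
```

Keeping the policy in one table means a change such as allowing a second timeout retry touches one line instead of several `if` branches.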

Best Practices

  1. Always set max_tokens — prevent token budget overruns
  2. Always set timeouts — no call should wait forever
  3. Retry only transient errors — 429, 500+, network timeouts
  4. Never retry 400/401/403 — these are permanent failures
  5. Log structured JSON — not console.log strings
  6. Include correlation IDs — trace errors across services
  7. Validate at boundaries — MCP tool inputs, API params, user queries
  8. Degrade gracefully — cached response > simpler model > error message
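Practices 5, 6, and 8 combine naturally: a fallback chain that logs one JSON object per failed tier, each tagged with the same correlation ID. The sketch below uses only the standard library. The names `log_event` and `answer_with_fallbacks`, the tier functions, and the final error text are all hypothetical.

```python
import json
import logging
import sys
import uuid

logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")
logger = logging.getLogger("ai.errors")


def log_event(event, correlation_id, **fields):
    """Emit one JSON object per line so log tooling can filter by field."""
    logger.info(json.dumps({"event": event, "correlation_id": correlation_id, **fields}))


def answer_with_fallbacks(question, tiers, correlation_id=None):
    """Try each (name, handler) tier in order and return the first success."""
    correlation_id = correlation_id or str(uuid.uuid4())
    for name, handler in tiers:
        try:
            result = handler(question)
            log_event("tier_succeeded", correlation_id, tier=name)
            return result
        except Exception as e:
            log_event("tier_failed", correlation_id, tier=name, error=type(e).__name__)
    # Last resort: a plain error message
    return "Sorry, this request could not be completed. Please try again later."


def primary_model(question):
    raise TimeoutError("primary model timed out")  # simulated outage


def cached_response(question):
    raise KeyError(question)  # simulated cache miss


def simpler_model(question):
    return f"(short answer) {question}"


tiers = [("primary", primary_model), ("cache", cached_response), ("simpler_model", simpler_model)]
print(answer_with_fallbacks("What is a circuit breaker?", tiers, correlation_id="req-123"))
```

Every log line for this request carries `"correlation_id": "req-123"`, so the two failures and the eventual success can be traced as a single request across services.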
