Agentic AI Self-Healing: Error Recovery and Retry Pattern Development
Build fault-tolerant agentic AI with circuit breakers, exponential backoff, fallback routing, error classification, and self-correction loops.
Why Agent Systems Must Self-Heal
Agent systems depend on a chain of external services: LLM APIs, databases, third-party APIs, vector stores, speech services. Any link in the chain can fail, and in production it will. The question is not whether your agent will encounter errors, but how gracefully it handles them.
A self-healing agent system detects failures, classifies them, applies the appropriate recovery strategy, and continues operating with minimal user impact. This is fundamentally different from traditional error handling (catch and log) because agents must reason about errors, adapt their behavior, and find alternative paths to accomplish the user's goal.
This guide covers circuit breakers for LLM APIs, exponential backoff, fallback agent routing, error classification, self-correction loops, and graceful degradation.
Circuit Breaker Pattern for LLM APIs
The circuit breaker pattern prevents cascading failures by stopping requests to a failing service before they pile up. When an LLM API starts returning errors, continuing to send requests wastes time and money, and builds a queue of users waiting for responses that will never come.
Implementation
import time
from enum import Enum

class CircuitState(str, Enum):
    CLOSED = "closed"        # Normal operation
    OPEN = "open"            # Blocking requests
    HALF_OPEN = "half_open"  # Testing recovery

class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: float = 30.0,
        half_open_max_calls: int = 3,
    ):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_max_calls = half_open_max_calls
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.last_failure_time = 0.0
        self.half_open_calls = 0

    def can_execute(self) -> bool:
        if self.state == CircuitState.CLOSED:
            return True
        if self.state == CircuitState.OPEN:
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = CircuitState.HALF_OPEN
                self.half_open_calls = 0
                return True
            return False
        if self.state == CircuitState.HALF_OPEN:
            return self.half_open_calls < self.half_open_max_calls
        return False

    def record_success(self):
        if self.state == CircuitState.HALF_OPEN:
            self.half_open_calls += 1
            if self.half_open_calls >= self.half_open_max_calls:
                # Recovery confirmed
                self.state = CircuitState.CLOSED
                self.failure_count = 0
        elif self.state == CircuitState.CLOSED:
            self.failure_count = 0

    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.state == CircuitState.HALF_OPEN:
            # Recovery failed
            self.state = CircuitState.OPEN
        elif self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN
# Usage with an LLM API. CircuitOpenError is defined here; APIError is
# expected to come from the provider SDK.
class CircuitOpenError(Exception):
    """Raised when a request is rejected because the circuit is open."""

class ResilientLLMClient:
    def __init__(self):
        self.breakers = {
            "openai": CircuitBreaker(),
            "anthropic": CircuitBreaker(),
        }

    async def chat(self, provider: str, **kwargs):
        breaker = self.breakers[provider]
        if not breaker.can_execute():
            raise CircuitOpenError(
                f"{provider} circuit is open. "
                f"Recovery in {breaker.recovery_timeout}s."
            )
        try:
            result = await self._call_provider(provider, **kwargs)
            breaker.record_success()
            return result
        except (APIError, TimeoutError):
            breaker.record_failure()
            raise
Per-Model Circuit Breakers
In multi-model setups, maintain separate circuit breakers per model. A failure in GPT-4o should not affect Claude requests. This isolation prevents a single provider outage from taking down the entire system.
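A minimal sketch of that isolation, with breakers keyed by (provider, model) pair. `BreakerRegistry` and the pared-down `_Breaker` below are illustrative names for this sketch, not part of the client above:

```python
class _Breaker:
    """Minimal stand-in: opens after `threshold` consecutive failures."""
    def __init__(self, threshold: int = 5):
        self.threshold = threshold
        self.failures = 0

    @property
    def is_open(self) -> bool:
        return self.failures >= self.threshold

    def record_failure(self):
        self.failures += 1

    def record_success(self):
        self.failures = 0

class BreakerRegistry:
    """One breaker per (provider, model) pair, created on first use."""
    def __init__(self):
        self._breakers: dict[tuple[str, str], _Breaker] = {}

    def get(self, provider: str, model: str) -> _Breaker:
        key = (provider, model)
        if key not in self._breakers:
            self._breakers[key] = _Breaker()
        return self._breakers[key]

registry = BreakerRegistry()
for _ in range(5):
    registry.get("openai", "gpt-4o").record_failure()

# GPT-4o failures do not affect the Claude breaker
print(registry.get("openai", "gpt-4o").is_open)       # True
print(registry.get("anthropic", "claude").is_open)    # False
```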
Exponential Backoff with Jitter
When retrying failed requests, exponential backoff prevents overwhelming a recovering service. Jitter prevents the thundering herd problem where many clients retry simultaneously.
import random
import asyncio

class RetryableError(Exception):
    """Base class for errors that are safe to retry."""

class MaxRetriesExceeded(Exception):
    """Raised when all retry attempts are exhausted."""

class ExponentialBackoff:
    def __init__(
        self,
        base_delay: float = 1.0,
        max_delay: float = 60.0,
        max_retries: int = 5,
        jitter: bool = True,
    ):
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.max_retries = max_retries
        self.jitter = jitter

    async def execute(self, func, *args, **kwargs):
        last_exception = None
        for attempt in range(self.max_retries):
            try:
                return await func(*args, **kwargs)
            except RetryableError as e:
                last_exception = e
                if attempt < self.max_retries - 1:
                    delay = min(
                        self.base_delay * (2 ** attempt),
                        self.max_delay,
                    )
                    if self.jitter:
                        delay = delay * (0.5 + random.random())
                    await asyncio.sleep(delay)
        raise MaxRetriesExceeded(
            f"Failed after {self.max_retries} attempts"
        ) from last_exception
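For intuition, the schedule this produces with the defaults (base_delay=1.0, max_delay=60.0) can be computed directly. The helper names `backoff_delays` and `with_jitter` below are illustrative, not part of the class above:

```python
import random

def backoff_delays(base: float, max_delay: float, retries: int) -> list[float]:
    """Raw (jitter-free) delay per attempt: base * 2^attempt, capped."""
    return [min(base * (2 ** attempt), max_delay) for attempt in range(retries)]

def with_jitter(delay: float) -> float:
    """Spread a delay over [0.5x, 1.5x) so clients do not retry in lockstep."""
    return delay * (0.5 + random.random())

# With the defaults, attempt 7 would be 64s but is capped at 60s
print(backoff_delays(1.0, 60.0, 7))  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0]
```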
Fallback Agent Routing
When the primary LLM provider fails, route to a fallback. This requires maintaining compatible agent configurations across multiple providers.
class AllProvidersFailedError(Exception):
    """Raised when every configured provider fails or is circuit-open."""

class FallbackRouter:
    def __init__(self):
        # openai_client, anthropic_client, and google_client are assumed
        # to be initialized elsewhere in your application.
        self.providers = [
            {
                "name": "openai",
                "model": "gpt-4o",
                "client": openai_client,
                "breaker": CircuitBreaker(),
                "priority": 1,
            },
            {
                "name": "anthropic",
                "model": "claude-sonnet-4-20250514",
                "client": anthropic_client,
                "breaker": CircuitBreaker(),
                "priority": 2,
            },
            {
                "name": "google",
                "model": "gemini-1.5-pro",
                "client": google_client,
                "breaker": CircuitBreaker(),
                "priority": 3,
            },
        ]

    async def execute(self, messages: list[dict], **kwargs) -> str:
        errors = []
        for provider in sorted(self.providers, key=lambda p: p["priority"]):
            if not provider["breaker"].can_execute():
                continue
            try:
                result = await self._call_provider(provider, messages, **kwargs)
                provider["breaker"].record_success()
                return result
            except Exception as e:
                provider["breaker"].record_failure()
                errors.append(f"{provider['name']}: {e}")
        raise AllProvidersFailedError(
            f"All LLM providers failed: {'; '.join(errors)}"
        )
Provider-Specific Prompt Adaptation
Different models respond differently to the same prompt. When falling back to an alternative provider, you may need to adapt the prompt. Maintain provider-specific prompt variations for critical differences (such as system message handling, tool calling format, or output formatting preferences).
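One way to sketch such overrides, using hypothetical names (`PROMPT_OVERRIDES`, `adapt_messages`) and a made-up formatting difference:

```python
BASE_SYSTEM_PROMPT = "You are a support agent. Answer concisely."

# Per-provider overrides for known behavioral differences; the google
# entry here is an invented example, not a documented model quirk.
PROMPT_OVERRIDES = {
    "google": BASE_SYSTEM_PROMPT + " Respond in plain text, no markdown.",
}

def adapt_messages(provider: str, messages: list[dict]) -> list[dict]:
    """Swap in the provider-specific system prompt when one exists."""
    system = PROMPT_OVERRIDES.get(provider, BASE_SYSTEM_PROMPT)
    return [{"role": "system", "content": system}] + [
        m for m in messages if m["role"] != "system"
    ]

adapted = adapt_messages("google", [{"role": "user", "content": "hi"}])
print(adapted[0]["content"])  # the google-specific system prompt
```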
Error Classification: Retryable vs Fatal
Not all errors should trigger retries. Classifying errors correctly prevents wasting time retrying errors that will never succeed.
class ErrorClassifier:
    RETRYABLE_ERRORS = {
        429: "rate_limit",           # Rate limited — retry after backoff
        500: "server_error",         # Server error — may recover
        502: "bad_gateway",          # Infrastructure — usually transient
        503: "service_unavailable",  # Service down — may recover
        504: "gateway_timeout",      # Timeout — may recover
    }

    FATAL_ERRORS = {
        400: "bad_request",       # Our request is malformed
        401: "unauthorized",      # Invalid API key
        403: "forbidden",         # Permission denied
        404: "not_found",         # Resource does not exist
        422: "validation_error",  # Input validation failed
    }

    @classmethod
    def classify(cls, error) -> dict:
        if hasattr(error, "status_code"):
            code = error.status_code
            if code in cls.RETRYABLE_ERRORS:
                return {
                    "retryable": True,
                    "category": cls.RETRYABLE_ERRORS[code],
                    "strategy": "exponential_backoff",
                }
            if code in cls.FATAL_ERRORS:
                return {
                    "retryable": False,
                    "category": cls.FATAL_ERRORS[code],
                    "strategy": "fail_fast",
                }
        if isinstance(error, TimeoutError):
            return {
                "retryable": True,
                "category": "timeout",
                "strategy": "retry_with_longer_timeout",
            }
        if isinstance(error, ConnectionError):
            return {
                "retryable": True,
                "category": "connection",
                "strategy": "exponential_backoff",
            }
        return {
            "retryable": False,
            "category": "unknown",
            "strategy": "fail_fast",
        }
Self-Correction Loops
Beyond retrying external failures, agents can self-correct when their own output is wrong. This is distinct from retry — the request succeeded but the result was incorrect.
Output Validation and Correction
from typing import Callable

class SelfCorrectingAgent:
    def __init__(self, llm_client, max_corrections: int = 3):
        self.llm = llm_client
        self.max_corrections = max_corrections

    async def execute_with_correction(
        self,
        messages: list[dict],
        validators: list[Callable],
    ) -> str:
        response = await self.llm.chat(messages=messages)
        for correction_attempt in range(self.max_corrections):
            validation_errors = []
            for validator in validators:
                error = validator(response)
                if error:
                    validation_errors.append(error)
            if not validation_errors:
                return response
            # Ask the agent to self-correct
            error_description = "\n".join(
                f"- {e}" for e in validation_errors
            )
            messages.append({"role": "assistant", "content": response})
            messages.append({
                "role": "user",
                "content": (
                    f"Your response has the following issues:\n"
                    f"{error_description}\n\n"
                    f"Please correct your response."
                ),
            })
            response = await self.llm.chat(messages=messages)
        # Return the last response even if it is not perfect
        return response
Validators for Common Error Types
import re

def validate_no_hallucinated_urls(response: str) -> str | None:
    """Check that URLs in the response are from approved domains."""
    url_pattern = r"https?://[\w\-\.]+\.[a-z]{2,}"
    urls = re.findall(url_pattern, response)
    approved = {"docs.example.com", "api.example.com", "example.com"}
    bad_urls = [u for u in urls if not any(d in u for d in approved)]
    if bad_urls:
        return f"Response contains unapproved URLs: {bad_urls}"
    return None

def validate_sql_safety(response: str) -> str | None:
    """Check that generated SQL does not contain dangerous operations."""
    dangerous = ["DROP", "DELETE", "TRUNCATE", "ALTER", "GRANT"]
    upper_response = response.upper()
    # Match whole words so e.g. "DROPDOWN" is not flagged
    found = [d for d in dangerous if re.search(rf"\b{d}\b", upper_response)]
    if found:
        return f"Response contains dangerous SQL: {found}"
    return None
Graceful Degradation Strategies
When components fail and cannot be recovered, the system should degrade gracefully rather than crash entirely.
Feature-Level Degradation
import logging

logger = logging.getLogger(__name__)

class AgentCapabilities:
    def __init__(self):
        self.capabilities = {
            "rag_search": True,
            "tool_execution": True,
            "voice_output": True,
            "image_analysis": True,
        }

    def disable(self, capability: str, reason: str):
        self.capabilities[capability] = False
        logger.warning(f"Capability '{capability}' disabled: {reason}")

    def get_system_prompt_suffix(self) -> str:
        disabled = [k for k, v in self.capabilities.items() if not v]
        if not disabled:
            return ""
        return (
            "\n\nNOTE: The following capabilities are currently "
            "unavailable due to temporary issues: "
            + ", ".join(disabled)
            + ". Inform the user if they request something that "
            "requires these capabilities."
        )
When the vector database is down, disable RAG search and inform the agent. The agent can still answer questions from its training knowledge and tell users that knowledge base search is temporarily unavailable. This is vastly better than the entire agent going offline.
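One way to trigger that degradation is a health probe that disables the capability instead of letting the failure propagate. The probe and the minimal `Caps` stub below are illustrative, standing in for AgentCapabilities:

```python
import asyncio

class Caps:
    """Minimal stand-in for AgentCapabilities."""
    def __init__(self):
        self.capabilities = {"rag_search": True}
        self.reasons: dict[str, str] = {}

    def disable(self, capability: str, reason: str):
        self.capabilities[capability] = False
        self.reasons[capability] = reason

async def probe_capability(caps: Caps, name: str, probe) -> None:
    """Run a dependency health probe; on failure, degrade instead of crashing."""
    try:
        await probe()
    except Exception as e:
        caps.disable(name, str(e))

async def failing_vector_db():
    # Simulates an unreachable vector store
    raise ConnectionError("vector DB unreachable")

caps = Caps()
asyncio.run(probe_capability(caps, "rag_search", failing_vector_db))
print(caps.capabilities["rag_search"])  # False
```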
Timeout-Based Degradation
Set timeouts for each external dependency. When a timeout fires, proceed without that dependency's input rather than blocking indefinitely.
import asyncio

async def get_enriched_context(
    user_query: str,
    timeout: float = 3.0,
) -> dict:
    context = {"user_query": user_query}
    # Try to get RAG context, but don't block on it
    # (search_knowledge_base is your application's retrieval function)
    try:
        rag_results = await asyncio.wait_for(
            search_knowledge_base(user_query),
            timeout=timeout,
        )
        context["rag_results"] = rag_results
    except asyncio.TimeoutError:
        context["rag_results"] = None
        context["degraded"] = True
        logger.warning("RAG search timed out, proceeding without context")
    return context
Monitoring Self-Healing Behavior
Track how often self-healing mechanisms activate to understand system health and identify recurring issues.
Key metrics to monitor include circuit breaker state transitions (how often breakers open, how long they stay open), retry rates per provider and error type, fallback activation frequency, self-correction attempts and success rates, and degraded capability duration.
from datetime import datetime, timezone

class ResilienceMetrics:
    # self.emit is assumed to forward events to your metrics backend
    async def record_circuit_event(
        self, provider: str, from_state: str, to_state: str
    ):
        await self.emit("circuit_breaker_transition", {
            "provider": provider,
            "from": from_state,
            "to": to_state,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })

    async def record_retry(
        self, operation: str, attempt: int, error_category: str
    ):
        await self.emit("retry_attempt", {
            "operation": operation,
            "attempt": attempt,
            "error_category": error_category,
        })

    async def record_fallback(
        self, from_provider: str, to_provider: str, reason: str
    ):
        await self.emit("fallback_activated", {
            "from": from_provider,
            "to": to_provider,
            "reason": reason,
        })
Frequently Asked Questions
What is the difference between retry and self-healing in agent systems?
Retry addresses transient external failures — the same request is sent again hoping the service has recovered. Self-healing is broader: it includes retry but also self-correction (the agent fixes its own output), fallback routing (switching to alternative providers), graceful degradation (operating with reduced capabilities), and circuit breaking (proactively stopping requests to failing services). Self-healing agents adapt their behavior based on the error, not just repeat the same action.
How do you decide when to use a circuit breaker versus simple retry?
Use simple retry for isolated errors that are likely transient (one timeout in an otherwise healthy service). Use circuit breakers when errors indicate systemic issues (multiple consecutive failures suggesting the service is down). A good rule of thumb: if you have retried 3-5 times and the service is still failing, the circuit breaker should open to prevent further wasted requests and give the service time to recover.
Should fallback LLM providers use the same prompts?
Ideally yes, but in practice, different models may need prompt adjustments. Maintain a primary prompt and a set of provider-specific overrides for known behavioral differences. Test your prompts on all fallback providers proactively — do not discover prompt incompatibilities during an outage when the fallback is actually needed.
How do you prevent self-correction loops from running indefinitely?
Always set a maximum correction attempt limit (typically 2-3 attempts). After the limit, accept the best available response or escalate to a human. Track self-correction rates: if an agent frequently needs corrections, the underlying prompt or tool configuration likely needs improvement rather than relying on self-correction as a crutch.
What is graceful degradation in the context of agentic AI?
Graceful degradation means the agent continues to function with reduced capabilities when components fail, rather than failing entirely. If the knowledge base is down, the agent still answers from its training data. If TTS fails, the agent provides text responses instead of voice. If the primary LLM is unavailable, it falls back to an alternative model. The key principle is that partial service is always better than no service.
CallSphere Team
Expert insights on AI voice agents and customer communication automation.