
Agentic AI Self-Healing: Error Recovery and Retry Pattern Development

Build fault-tolerant agentic AI with circuit breakers, exponential backoff, fallback routing, error classification, and self-correction loops.

Why Agent Systems Must Self-Heal

Agent systems depend on a chain of external services: LLM APIs, databases, third-party APIs, vector stores, speech services. Any link in the chain can fail — and in production, they will fail. The question is not whether your agent will encounter errors, but how gracefully it handles them.

A self-healing agent system detects failures, classifies them, applies the appropriate recovery strategy, and continues operating with minimal user impact. This is fundamentally different from traditional error handling (catch and log) because agents must reason about errors, adapt their behavior, and find alternative paths to accomplish the user's goal.

This guide covers circuit breakers for LLM APIs, exponential backoff, fallback agent routing, error classification, self-correction loops, and graceful degradation.

Circuit Breaker Pattern for LLM APIs

The circuit breaker pattern prevents cascading failures by stopping requests to a failing service before they pile up. When an LLM API starts returning errors, continuing to send requests wastes time and money, and creates a queue of users waiting for responses that will never come.

Implementation

import time
from enum import Enum

class CircuitState(str, Enum):
    CLOSED = "closed"      # Normal operation
    OPEN = "open"          # Blocking requests
    HALF_OPEN = "half_open"  # Testing recovery

class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: float = 30.0,
        half_open_max_calls: int = 3,
    ):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_max_calls = half_open_max_calls

        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.last_failure_time = 0.0
        self.half_open_calls = 0

    def can_execute(self) -> bool:
        if self.state == CircuitState.CLOSED:
            return True

        if self.state == CircuitState.OPEN:
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = CircuitState.HALF_OPEN
                self.half_open_calls = 0
                return True
            return False

        if self.state == CircuitState.HALF_OPEN:
            return self.half_open_calls < self.half_open_max_calls

        return False

    def record_success(self):
        if self.state == CircuitState.HALF_OPEN:
            self.half_open_calls += 1
            if self.half_open_calls >= self.half_open_max_calls:
                # Recovery confirmed
                self.state = CircuitState.CLOSED
                self.failure_count = 0
        elif self.state == CircuitState.CLOSED:
            self.failure_count = 0

    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()

        if self.state == CircuitState.HALF_OPEN:
            # Recovery failed
            self.state = CircuitState.OPEN
        elif self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN

# Usage with an LLM API. CircuitOpenError and APIError stand in for
# your SDK's exception types.
class CircuitOpenError(Exception):
    pass

class APIError(Exception):
    pass

class ResilientLLMClient:
    def __init__(self):
        self.breakers = {
            "openai": CircuitBreaker(),
            "anthropic": CircuitBreaker(),
        }

    async def chat(self, provider: str, **kwargs):
        breaker = self.breakers[provider]

        if not breaker.can_execute():
            raise CircuitOpenError(
                f"{provider} circuit is open. "
                f"Recovery in {breaker.recovery_timeout}s."
            )

        try:
            result = await self._call_provider(provider, **kwargs)
            breaker.record_success()
            return result
        except (APIError, TimeoutError) as e:
            breaker.record_failure()
            raise

Per-Model Circuit Breakers

In multi-model setups, maintain separate circuit breakers per model. A failure in GPT-4o should not affect Claude requests. This isolation prevents a single provider outage from taking down the entire system.
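One way to sketch this isolation is a registry keyed by (provider, model). The `CircuitBreaker` here is a deliberately simplified stand-in for the fuller class above (no explicit half-open state), and the names are illustrative:

```python
import time

class CircuitBreaker:
    """Minimal stand-in for the fuller CircuitBreaker above."""
    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failure_count = 0
        self.last_failure_time = 0.0
        self.open = False

    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.failure_threshold:
            self.open = True

    def can_execute(self) -> bool:
        # Simplified recovery: reopen traffic after the timeout elapses
        if self.open and time.time() - self.last_failure_time > self.recovery_timeout:
            self.open = False
            self.failure_count = 0
        return not self.open

class BreakerRegistry:
    """One breaker per (provider, model) so a GPT-4o outage
    never blocks Claude traffic."""
    def __init__(self):
        self._breakers: dict[tuple[str, str], CircuitBreaker] = {}

    def get(self, provider: str, model: str) -> CircuitBreaker:
        key = (provider, model)
        if key not in self._breakers:
            self._breakers[key] = CircuitBreaker()
        return self._breakers[key]

registry = BreakerRegistry()
for _ in range(5):  # five failures trip only the GPT-4o breaker
    registry.get("openai", "gpt-4o").record_failure()

print(registry.get("openai", "gpt-4o").can_execute())             # False
print(registry.get("anthropic", "claude-sonnet-4").can_execute())  # True
```

Lazily creating breakers on first use means new models get isolation automatically, without registering them up front.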

Exponential Backoff with Jitter

When retrying failed requests, exponential backoff prevents overwhelming a recovering service. Jitter prevents the thundering herd problem where many clients retry simultaneously.

import random
import asyncio

class RetryableError(Exception):
    """Raised for transient failures worth retrying (429s, 5xx, timeouts)."""

class MaxRetriesExceeded(Exception):
    pass

class ExponentialBackoff:
    def __init__(
        self,
        base_delay: float = 1.0,
        max_delay: float = 60.0,
        max_retries: int = 5,
        jitter: bool = True,
    ):
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.max_retries = max_retries
        self.jitter = jitter

    async def execute(self, func, *args, **kwargs):
        last_exception = None

        for attempt in range(self.max_retries):
            try:
                return await func(*args, **kwargs)
            except RetryableError as e:
                last_exception = e
                if attempt < self.max_retries - 1:
                    delay = min(
                        self.base_delay * (2 ** attempt),
                        self.max_delay
                    )
                    if self.jitter:
                        delay = delay * (0.5 + random.random())
                    await asyncio.sleep(delay)

        raise MaxRetriesExceeded(
            f"Failed after {self.max_retries} attempts"
        ) from last_exception
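Here is the class in action against a hypothetical flaky call that fails twice before succeeding. The delays are shrunk so the demo runs fast; everything else mirrors the implementation above:

```python
import asyncio
import random

class RetryableError(Exception):
    pass

class MaxRetriesExceeded(Exception):
    pass

class ExponentialBackoff:
    """Compact copy of the class above, with tiny delays for the demo."""
    def __init__(self, base_delay=0.01, max_delay=0.1, max_retries=5, jitter=True):
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.max_retries = max_retries
        self.jitter = jitter

    async def execute(self, func, *args, **kwargs):
        last_exception = None
        for attempt in range(self.max_retries):
            try:
                return await func(*args, **kwargs)
            except RetryableError as e:
                last_exception = e
                if attempt < self.max_retries - 1:
                    delay = min(self.base_delay * (2 ** attempt), self.max_delay)
                    if self.jitter:
                        delay *= 0.5 + random.random()
                    await asyncio.sleep(delay)
        raise MaxRetriesExceeded(
            f"Failed after {self.max_retries} attempts"
        ) from last_exception

attempts = {"count": 0}

async def flaky_llm_call() -> str:
    # Simulates a service that recovers on the third attempt
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RetryableError("503 Service Unavailable")
    return "ok"

result = asyncio.run(ExponentialBackoff().execute(flaky_llm_call))
print(result, attempts["count"])  # ok 3
```

Note that only `RetryableError` is caught: a fatal error (bad API key, malformed request) propagates immediately instead of burning retry budget.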

Fallback Agent Routing

When the primary LLM provider fails, route to a fallback. This requires maintaining compatible agent configurations across multiple providers.

class AllProvidersFailedError(Exception):
    pass

class FallbackRouter:
    def __init__(self):
        # openai_client, anthropic_client, and google_client are assumed
        # to be initialized SDK clients
        self.providers = [
            {
                "name": "openai",
                "model": "gpt-4o",
                "client": openai_client,
                "breaker": CircuitBreaker(),
                "priority": 1,
            },
            {
                "name": "anthropic",
                "model": "claude-sonnet-4-20250514",
                "client": anthropic_client,
                "breaker": CircuitBreaker(),
                "priority": 2,
            },
            {
                "name": "google",
                "model": "gemini-1.5-pro",
                "client": google_client,
                "breaker": CircuitBreaker(),
                "priority": 3,
            },
        ]

    async def execute(self, messages: list[dict], **kwargs) -> str:
        errors = []

        for provider in sorted(self.providers, key=lambda p: p["priority"]):
            if not provider["breaker"].can_execute():
                continue

            try:
                result = await self._call_provider(provider, messages, **kwargs)
                provider["breaker"].record_success()
                return result
            except Exception as e:
                provider["breaker"].record_failure()
                errors.append(f"{provider['name']}: {e}")

        raise AllProvidersFailedError(
            f"All LLM providers failed: {'; '.join(errors)}"
        )

Provider-Specific Prompt Adaptation

Different models respond differently to the same prompt. When falling back to an alternative provider, you may need to adapt the prompt. Maintain provider-specific prompt variations for critical differences (such as system message handling, tool calling format, or output formatting preferences).
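One lightweight way to manage this is a shared base prompt plus per-provider overrides, resolved at call time. The provider names and override text below are illustrative:

```python
# Base prompt shared by all providers; overrides patch known behavioral
# differences. All strings here are example content.
BASE_SYSTEM_PROMPT = (
    "You are a support agent. Answer concisely and cite the knowledge "
    "base when you use it."
)

PROVIDER_OVERRIDES = {
    # e.g. one model tends to over-explain; another needs explicit
    # JSON framing for tool arguments
    "anthropic": BASE_SYSTEM_PROMPT + " Keep answers under 80 words.",
    "google": BASE_SYSTEM_PROMPT + " Return tool arguments as strict JSON.",
}

def system_prompt_for(provider: str) -> str:
    """Fall back to the shared base prompt when no override exists."""
    return PROVIDER_OVERRIDES.get(provider, BASE_SYSTEM_PROMPT)

print(system_prompt_for("openai") == BASE_SYSTEM_PROMPT)  # True
```

Keeping overrides as deltas on one base prompt means a wording fix lands everywhere at once, instead of drifting across per-provider copies.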

Error Classification: Retryable vs Fatal

Not all errors should trigger retries. Classifying errors correctly prevents wasting time retrying errors that will never succeed.


class ErrorClassifier:
    RETRYABLE_ERRORS = {
        429: "rate_limit",          # Rate limited — retry after backoff
        500: "server_error",        # Server error — may recover
        502: "bad_gateway",         # Infrastructure — usually transient
        503: "service_unavailable", # Service down — may recover
        504: "gateway_timeout",     # Timeout — may recover
    }

    FATAL_ERRORS = {
        400: "bad_request",         # Our request is malformed
        401: "unauthorized",        # Invalid API key
        403: "forbidden",           # Permission denied
        404: "not_found",           # Resource does not exist
        422: "validation_error",    # Input validation failed
    }

    @classmethod
    def classify(cls, error) -> dict:
        if hasattr(error, "status_code"):
            code = error.status_code
            if code in cls.RETRYABLE_ERRORS:
                return {
                    "retryable": True,
                    "category": cls.RETRYABLE_ERRORS[code],
                    "strategy": "exponential_backoff",
                }
            if code in cls.FATAL_ERRORS:
                return {
                    "retryable": False,
                    "category": cls.FATAL_ERRORS[code],
                    "strategy": "fail_fast",
                }

        if isinstance(error, TimeoutError):
            return {
                "retryable": True,
                "category": "timeout",
                "strategy": "retry_with_longer_timeout",
            }

        if isinstance(error, ConnectionError):
            return {
                "retryable": True,
                "category": "connection",
                "strategy": "exponential_backoff",
            }

        return {
            "retryable": False,
            "category": "unknown",
            "strategy": "fail_fast",
        }
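In a retry loop, the classifier's verdict reduces to a single predicate consulted before sleeping. This sketch condenses the decision table above; `FakeAPIError` is a stand-in for a provider exception that carries a `status_code`:

```python
class FakeAPIError(Exception):
    """Stand-in for an SDK exception with an HTTP status code."""
    def __init__(self, status_code: int):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code

RETRYABLE_CODES = {429, 500, 502, 503, 504}
FATAL_CODES = {400, 401, 403, 404, 422}

def is_retryable(error: Exception) -> bool:
    code = getattr(error, "status_code", None)
    if code in RETRYABLE_CODES:
        return True
    if code in FATAL_CODES:
        return False
    # Network-level failures are worth retrying; anything unknown fails fast
    return isinstance(error, (TimeoutError, ConnectionError))

print(is_retryable(FakeAPIError(429)))  # True
print(is_retryable(FakeAPIError(401)))  # False
print(is_retryable(TimeoutError()))     # True
print(is_retryable(ValueError("bad")))  # False
```

Defaulting unknown errors to fail-fast is the safe choice: retrying a bug in your own request logic only delays the inevitable failure.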

Self-Correction Loops

Beyond retrying external failures, agents can self-correct when their own output is wrong. This is distinct from retry — the request succeeded but the result was incorrect.

Output Validation and Correction

from typing import Callable

class SelfCorrectingAgent:
    def __init__(self, llm_client, max_corrections: int = 3):
        self.llm = llm_client
        self.max_corrections = max_corrections

    async def execute_with_correction(
        self,
        messages: list[dict],
        validators: list[Callable[[str], str | None]],
    ) -> str:
        # Note: correction turns are appended to `messages` in place
        response = await self.llm.chat(messages=messages)

        for correction_attempt in range(self.max_corrections):
            validation_errors = []
            for validator in validators:
                error = validator(response)
                if error:
                    validation_errors.append(error)

            if not validation_errors:
                return response

            # Ask agent to self-correct
            error_description = "\n".join(
                f"- {e}" for e in validation_errors
            )
            messages.append({"role": "assistant", "content": response})
            messages.append({
                "role": "user",
                "content": (
                    f"Your response has the following issues:\n"
                    f"{error_description}\n\n"
                    f"Please correct your response."
                ),
            })

            response = await self.llm.chat(messages=messages)

        # Return last response even if not perfect
        return response

Validators for Common Error Types

import re

def validate_no_hallucinated_urls(response: str) -> str | None:
    """Check that URLs in the response are from approved domains."""
    url_pattern = r"https?://[\w\-\.]+\.[a-z]{2,}"
    urls = re.findall(url_pattern, response)
    approved = {"docs.example.com", "api.example.com", "example.com"}
    bad_urls = [u for u in urls if not any(d in u for d in approved)]
    if bad_urls:
        return f"Response contains unapproved URLs: {bad_urls}"
    return None

def validate_sql_safety(response: str) -> str | None:
    """Check that generated SQL does not contain dangerous operations."""
    dangerous = ["DROP", "DELETE", "TRUNCATE", "ALTER", "GRANT"]
    upper_response = response.upper()
    # Match whole words so "DELETED" or "ALTERNATIVE" don't false-positive
    found = [d for d in dangerous if re.search(rf"\b{d}\b", upper_response)]
    if found:
        return f"Response contains dangerous SQL: {found}"
    return None

Graceful Degradation Strategies

When components fail and cannot be recovered, the system should degrade gracefully rather than crash entirely.

Feature-Level Degradation

import logging

logger = logging.getLogger(__name__)

class AgentCapabilities:
    def __init__(self):
        self.capabilities = {
            "rag_search": True,
            "tool_execution": True,
            "voice_output": True,
            "image_analysis": True,
        }

    def disable(self, capability: str, reason: str):
        self.capabilities[capability] = False
        logger.warning(f"Capability '{capability}' disabled: {reason}")

    def get_system_prompt_suffix(self) -> str:
        disabled = [k for k, v in self.capabilities.items() if not v]
        if not disabled:
            return ""
        return (
            "\n\nNOTE: The following capabilities are currently "
            "unavailable due to temporary issues: "
            + ", ".join(disabled)
            + ". Inform the user if they request something that "
            "requires these capabilities."
        )

When the vector database is down, disable RAG search and inform the agent. The agent can still answer questions from its training knowledge and tell users that knowledge base search is temporarily unavailable. This is vastly better than the entire agent going offline.
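A health check run at startup (or on a timer) is one place to wire this up. The sketch below uses a trimmed `AgentCapabilities` and a hypothetical `check_vector_db` probe that simulates an outage; substitute real dependency pings:

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

class AgentCapabilities:
    """Trimmed version of the class above."""
    def __init__(self):
        self.capabilities = {"rag_search": True, "voice_output": True}

    def disable(self, capability: str, reason: str):
        self.capabilities[capability] = False
        logger.warning("Capability '%s' disabled: %s", capability, reason)

async def check_vector_db() -> bool:
    # Hypothetical probe; simulate an outage for the demo
    return False

async def run_health_checks(caps: AgentCapabilities):
    if not await check_vector_db():
        caps.disable("rag_search", "vector DB unreachable")

caps = AgentCapabilities()
asyncio.run(run_health_checks(caps))
print(caps.capabilities)  # {'rag_search': False, 'voice_output': True}
```

Because only the failing capability is flipped, the rest of the agent keeps running untouched, and re-enabling is a matter of the next health check passing.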

Timeout-Based Degradation

Set timeouts for each external dependency. When a timeout fires, proceed without that dependency's input rather than blocking indefinitely.

async def get_enriched_context(
    user_query: str,
    timeout: float = 3.0,
) -> dict:
    # search_knowledge_base is your RAG retrieval call
    context = {"user_query": user_query}

    # Try to get RAG context, but don't block on it
    try:
        rag_results = await asyncio.wait_for(
            search_knowledge_base(user_query),
            timeout=timeout,
        )
        context["rag_results"] = rag_results
    except asyncio.TimeoutError:
        context["rag_results"] = None
        context["degraded"] = True
        logger.warning("RAG search timed out, proceeding without context")

    return context
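The same idea extends to several dependencies at once: fetch each with its own timeout via `asyncio.gather(return_exceptions=True)` so one hung service never blocks the others. The fetchers below are stand-ins, with a deliberately hung RAG call and a short demo timeout:

```python
import asyncio

async def fetch_rag(query: str) -> str:
    await asyncio.sleep(5)  # simulate a hung vector store
    return "rag context"

async def fetch_user_profile(query: str) -> str:
    return "profile context"

async def gather_context(query: str, timeout: float = 0.1) -> dict:
    tasks = {
        "rag": asyncio.wait_for(fetch_rag(query), timeout),
        "profile": asyncio.wait_for(fetch_user_profile(query), timeout),
    }
    # return_exceptions=True turns timeouts into values instead of raising
    results = await asyncio.gather(*tasks.values(), return_exceptions=True)
    context = {}
    for name, result in zip(tasks, results):
        context[name] = None if isinstance(result, Exception) else result
    context["degraded"] = any(v is None for v in context.values())
    return context

ctx = asyncio.run(gather_context("billing question"))
print(ctx)  # {'rag': None, 'profile': 'profile context', 'degraded': True}
```

The `degraded` flag can feed straight into the capability-aware system prompt from the previous section, so the agent knows to caveat its answer.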

Monitoring Self-Healing Behavior

Track how often self-healing mechanisms activate to understand system health and identify recurring issues.

Key metrics to monitor include circuit breaker state transitions (how often breakers open, how long they stay open), retry rates per provider and error type, fallback activation frequency, self-correction attempts and success rates, and degraded capability duration.

from datetime import datetime, timezone

class ResilienceMetrics:
    """Assumes an `emit` coroutine that ships events to your telemetry
    backend (StatsD, OTLP, a message queue, etc.)."""

    async def emit(self, event: str, payload: dict):
        ...  # wire to your metrics pipeline

    async def record_circuit_event(
        self, provider: str, from_state: str, to_state: str
    ):
        await self.emit("circuit_breaker_transition", {
            "provider": provider,
            "from": from_state,
            "to": to_state,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })

    async def record_retry(
        self, operation: str, attempt: int, error_category: str
    ):
        await self.emit("retry_attempt", {
            "operation": operation,
            "attempt": attempt,
            "error_category": error_category,
        })

    async def record_fallback(
        self, from_provider: str, to_provider: str, reason: str
    ):
        await self.emit("fallback_activated", {
            "from": from_provider,
            "to": to_provider,
            "reason": reason,
        })

Frequently Asked Questions

What is the difference between retry and self-healing in agent systems?

Retry addresses transient external failures — the same request is sent again hoping the service has recovered. Self-healing is broader: it includes retry but also self-correction (the agent fixes its own output), fallback routing (switching to alternative providers), graceful degradation (operating with reduced capabilities), and circuit breaking (proactively stopping requests to failing services). Self-healing agents adapt their behavior based on the error, not just repeat the same action.

How do you decide when to use a circuit breaker versus simple retry?

Use simple retry for isolated errors that are likely transient (one timeout in an otherwise healthy service). Use circuit breakers when errors indicate systemic issues (multiple consecutive failures suggesting the service is down). A good rule of thumb: if you have retried 3-5 times and the service is still failing, the circuit breaker should open to prevent further wasted requests and give the service time to recover.

Should fallback LLM providers use the same prompts?

Ideally yes, but in practice, different models may need prompt adjustments. Maintain a primary prompt and a set of provider-specific overrides for known behavioral differences. Test your prompts on all fallback providers proactively — do not discover prompt incompatibilities during an outage when the fallback is actually needed.

How do you prevent self-correction loops from running indefinitely?

Always set a maximum correction attempt limit (typically 2-3 attempts). After the limit, accept the best available response or escalate to a human. Track self-correction rates: if an agent frequently needs corrections, the underlying prompt or tool configuration likely needs improvement rather than relying on self-correction as a crutch.

What is graceful degradation in the context of agentic AI?

Graceful degradation means the agent continues to function with reduced capabilities when components fail, rather than failing entirely. If the knowledge base is down, the agent still answers from its training data. If TTS fails, the agent provides text responses instead of voice. If the primary LLM is unavailable, it falls back to an alternative model. The key principle is that partial service is always better than no service.


CallSphere Team

Expert insights on AI voice agents and customer communication automation.
