Error Recovery Patterns: Self-Healing Agents That Fix Their Own Mistakes
Build AI agents that detect their own errors, apply correction strategies, and learn from failures through feedback loops. Covers error detection, self-correction, escalation paths, and continuous improvement.
Beyond Crash and Retry: Agents That Correct Themselves
Traditional error handling stops at retry and abort. But LLM-powered agents have a capability that conventional software lacks: they can reason about their own failures. When a tool call returns an error, the agent can read the error message, understand what went wrong, and try a different approach. This self-healing capability is what separates fragile demos from production-grade agents.
The challenge is building structured self-healing that is reliable, bounded, and observable.
The Self-Healing Loop
A self-healing agent wraps its execution in a loop that detects errors, diagnoses the cause, and applies a correction strategy.
```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional
import logging

logger = logging.getLogger("agent.self_heal")


class RecoveryAction(Enum):
    RETRY_SAME = "retry_same"
    RETRY_MODIFIED = "retry_modified"
    USE_ALTERNATIVE = "use_alternative"
    ASK_USER = "ask_user"
    ESCALATE = "escalate"
    ABORT = "abort"


@dataclass
class ErrorDiagnosis:
    error_type: str
    root_cause: str
    recovery_action: RecoveryAction
    modified_args: Optional[dict] = None
    alternative_tool: Optional[str] = None
    user_message: Optional[str] = None


@dataclass
class HealingAttempt:
    diagnosis: ErrorDiagnosis
    success: bool
    result: Optional[dict] = None


class SelfHealingAgent:
    def __init__(self, llm_client, tool_registry: dict, max_healing_attempts: int = 3):
        self.llm = llm_client
        self.tools = tool_registry
        self.max_healing_attempts = max_healing_attempts
        self.healing_history: list[HealingAttempt] = []

    async def execute_with_healing(
        self, tool_name: str, args: dict, context: str = "",
    ) -> dict:
        """Execute a tool call with self-healing on failure."""
        # First attempt
        try:
            return await self._call_tool(tool_name, args)
        except Exception as first_error:
            logger.warning(f"Tool {tool_name} failed: {first_error}")
            # Bind the error before leaving the except block: Python
            # deletes the `as` name when the block exits.
            last_error = first_error

        # Self-healing loop
        for attempt in range(self.max_healing_attempts):
            diagnosis = await self._diagnose_error(
                tool_name, args, last_error, context,
            )
            logger.info(
                f"Healing attempt {attempt + 1}: {diagnosis.recovery_action.value}"
            )
            if diagnosis.recovery_action == RecoveryAction.ABORT:
                raise RuntimeError(f"Unrecoverable: {diagnosis.root_cause}")
            if diagnosis.recovery_action == RecoveryAction.ASK_USER:
                return {"needs_input": True, "message": diagnosis.user_message}
            if diagnosis.recovery_action == RecoveryAction.ESCALATE:
                return {"escalated": True, "reason": diagnosis.root_cause}
            try:
                result = await self._apply_recovery(diagnosis, tool_name, args)
                self.healing_history.append(
                    HealingAttempt(diagnosis=diagnosis, success=True, result=result)
                )
                return result
            except Exception as exc:
                last_error = exc
                self.healing_history.append(
                    HealingAttempt(diagnosis=diagnosis, success=False)
                )
        raise RuntimeError(
            f"Failed after {self.max_healing_attempts} healing attempts"
        )
```
LLM-Powered Error Diagnosis
The agent uses its LLM to analyze the error and determine the best recovery strategy.
```python
# Method of SelfHealingAgent (continued)
async def _diagnose_error(
    self, tool_name: str, args: dict, error: Exception, context: str,
) -> ErrorDiagnosis:
    """Use the LLM to diagnose the error and recommend recovery."""
    diagnosis_prompt = f"""A tool call failed. Diagnose the error and recommend a recovery action.

Tool: {tool_name}
Arguments: {args}
Error: {type(error).__name__}: {error}
Context: {context}

Previous healing attempts for this request:
{self._format_history()}

Choose ONE recovery action:
- RETRY_SAME: Retry unchanged (for transient failures)
- RETRY_MODIFIED: Fix the arguments and retry (provide corrected args)
- USE_ALTERNATIVE: Use a different tool (specify which)
- ASK_USER: Need clarification from the user (provide a question)
- ESCALATE: This needs human operator intervention
- ABORT: This cannot be recovered

Respond in this exact format:
ACTION: <action>
ROOT_CAUSE: <brief explanation>
MODIFIED_ARGS: <JSON if RETRY_MODIFIED, else null>
ALTERNATIVE_TOOL: <tool name if USE_ALTERNATIVE, else null>
USER_MESSAGE: <question if ASK_USER, else null>"""
    response = await self.llm.complete(diagnosis_prompt)
    return self._parse_diagnosis(response)
```
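The article leaves `_parse_diagnosis` undefined. One way to implement it is a line-by-line parse of the ACTION/ROOT_CAUSE format that falls back to ABORT on anything unrecognized; the sketch below repeats the `RecoveryAction` and `ErrorDiagnosis` definitions from above so it runs standalone, and takes `error_type` as a parameter since the response itself does not carry it:

```python
import json
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class RecoveryAction(Enum):  # repeated from above
    RETRY_SAME = "retry_same"
    RETRY_MODIFIED = "retry_modified"
    USE_ALTERNATIVE = "use_alternative"
    ASK_USER = "ask_user"
    ESCALATE = "escalate"
    ABORT = "abort"


@dataclass
class ErrorDiagnosis:  # repeated from above
    error_type: str
    root_cause: str
    recovery_action: RecoveryAction
    modified_args: Optional[dict] = None
    alternative_tool: Optional[str] = None
    user_message: Optional[str] = None


def parse_diagnosis(response: str, error_type: str = "unknown") -> ErrorDiagnosis:
    """Parse the KEY: value response format into an ErrorDiagnosis."""
    fields = {}
    for line in response.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip().upper()] = value.strip()

    try:
        action = RecoveryAction[fields.get("ACTION", "ABORT").upper()]
    except KeyError:
        action = RecoveryAction.ABORT  # unknown action -> fail safe

    def _opt(key: str) -> Optional[str]:
        value = fields.get(key)
        return None if value in (None, "", "null") else value

    modified_args = None
    raw_args = _opt("MODIFIED_ARGS")
    if raw_args:
        try:
            modified_args = json.loads(raw_args)
        except json.JSONDecodeError:
            modified_args = None  # unparseable args -> treat as absent

    return ErrorDiagnosis(
        error_type=error_type,
        root_cause=fields.get("ROOT_CAUSE", "unknown"),
        recovery_action=action,
        modified_args=modified_args,
        alternative_tool=_opt("ALTERNATIVE_TOOL"),
        user_message=_opt("USER_MESSAGE"),
    )


sample = """ACTION: RETRY_MODIFIED
ROOT_CAUSE: date was in the wrong format
MODIFIED_ARGS: {"date": "2024-06-01"}
ALTERNATIVE_TOOL: null
USER_MESSAGE: null"""
d = parse_diagnosis(sample, error_type="ValueError")
print(d.recovery_action, d.modified_args)
```

Falling back to ABORT on an unknown action keeps the agent fail-safe when the LLM drifts from the requested format.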
Structured Recovery Strategies
Each recovery action maps to a concrete execution path.
```python
# Methods of SelfHealingAgent (continued)
async def _apply_recovery(
    self, diagnosis: ErrorDiagnosis, original_tool: str, original_args: dict,
) -> dict:
    if diagnosis.recovery_action == RecoveryAction.RETRY_SAME:
        return await self._call_tool(original_tool, original_args)
    elif diagnosis.recovery_action == RecoveryAction.RETRY_MODIFIED:
        modified = {**original_args, **(diagnosis.modified_args or {})}
        return await self._call_tool(original_tool, modified)
    elif diagnosis.recovery_action == RecoveryAction.USE_ALTERNATIVE:
        alt_tool = diagnosis.alternative_tool
        if alt_tool not in self.tools:
            raise ValueError(f"Alternative tool '{alt_tool}' not found")
        return await self._call_tool(alt_tool, original_args)
    raise ValueError(f"Unhandled recovery: {diagnosis.recovery_action}")

async def _call_tool(self, tool_name: str, args: dict) -> dict:
    tool_fn = self.tools.get(tool_name)
    if not tool_fn:
        raise ValueError(f"Tool '{tool_name}' not registered")
    return await tool_fn(args)

def _format_history(self) -> str:
    if not self.healing_history:
        return "None"
    lines = []
    for h in self.healing_history:
        lines.append(
            f"- {h.diagnosis.recovery_action.value}: "
            f"{'succeeded' if h.success else 'failed'} "
            f"(cause: {h.diagnosis.root_cause})"
        )
    return "\n".join(lines)
```
Feedback Loop for Continuous Improvement
Track which error patterns the agent encounters and how successfully it recovers. This data informs prompt improvements and tool hardening.
```python
from collections import defaultdict


class HealingMetrics:
    def __init__(self):
        self.error_counts: dict[str, int] = defaultdict(int)
        self.recovery_success: dict[str, list[bool]] = defaultdict(list)

    def record(self, error_type: str, recovery_action: str, success: bool):
        key = f"{error_type}:{recovery_action}"
        self.error_counts[error_type] += 1
        self.recovery_success[key].append(success)

    def success_rate(self, error_type: str, recovery_action: str) -> float:
        key = f"{error_type}:{recovery_action}"
        results = self.recovery_success.get(key, [])
        if not results:
            return 0.0
        return sum(results) / len(results)

    def report(self) -> dict:
        report = {}
        for key, results in self.recovery_success.items():
            rate = sum(results) / len(results) if results else 0
            report[key] = {
                "attempts": len(results),
                "success_rate": round(rate, 2),
            }
        return report
```
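A quick usage sketch (error and action names are illustrative; the class definition is repeated from above so the example is self-contained):

```python
from collections import defaultdict


class HealingMetrics:  # repeated from above
    def __init__(self):
        self.error_counts: dict[str, int] = defaultdict(int)
        self.recovery_success: dict[str, list[bool]] = defaultdict(list)

    def record(self, error_type: str, recovery_action: str, success: bool):
        key = f"{error_type}:{recovery_action}"
        self.error_counts[error_type] += 1
        self.recovery_success[key].append(success)

    def report(self) -> dict:
        report = {}
        for key, results in self.recovery_success.items():
            rate = sum(results) / len(results) if results else 0
            report[key] = {
                "attempts": len(results),
                "success_rate": round(rate, 2),
            }
        return report


metrics = HealingMetrics()
metrics.record("ValidationError", "retry_modified", True)
metrics.record("ValidationError", "retry_modified", True)
metrics.record("TimeoutError", "retry_same", False)
print(metrics.report())
# {'ValidationError:retry_modified': {'attempts': 2, 'success_rate': 1.0},
#  'TimeoutError:retry_same': {'attempts': 1, 'success_rate': 0.0}}
```

A `retry_same` success rate near zero for timeouts, for example, is a signal to steer the diagnosis prompt toward other strategies for that error type.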
Guardrails: Preventing Infinite Healing Loops
Always cap the number of healing attempts, track token spend during recovery, and prevent the agent from trying the same failed strategy twice.
```python
class HealingGuardrails:
    def __init__(self, max_attempts: int = 3, max_token_budget: int = 5000):
        self.max_attempts = max_attempts
        self.max_token_budget = max_token_budget
        self.tokens_used = 0
        self.tried_strategies: set[str] = set()

    def can_continue(self, attempt: int, proposed_action: str) -> bool:
        if attempt >= self.max_attempts:
            return False
        if self.tokens_used >= self.max_token_budget:
            return False
        if proposed_action in self.tried_strategies:
            return False
        return True

    def record_attempt(self, action: str, tokens: int):
        self.tried_strategies.add(action)
        self.tokens_used += tokens
```
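Wired into the healing loop, the guardrails reject a repeated strategy immediately. A usage sketch (the class definition is repeated from above so the example is self-contained):

```python
class HealingGuardrails:  # repeated from above
    def __init__(self, max_attempts: int = 3, max_token_budget: int = 5000):
        self.max_attempts = max_attempts
        self.max_token_budget = max_token_budget
        self.tokens_used = 0
        self.tried_strategies: set[str] = set()

    def can_continue(self, attempt: int, proposed_action: str) -> bool:
        if attempt >= self.max_attempts:
            return False
        if self.tokens_used >= self.max_token_budget:
            return False
        if proposed_action in self.tried_strategies:
            return False
        return True

    def record_attempt(self, action: str, tokens: int):
        self.tried_strategies.add(action)
        self.tokens_used += tokens


guard = HealingGuardrails(max_attempts=3, max_token_budget=5000)
assert guard.can_continue(attempt=0, proposed_action="retry_modified")
guard.record_attempt("retry_modified", tokens=800)
# The same strategy is rejected on the next pass...
assert not guard.can_continue(attempt=1, proposed_action="retry_modified")
# ...but a different one is still allowed.
assert guard.can_continue(attempt=1, proposed_action="use_alternative")
```

Tracking strategies by action name is a coarse key; a stricter variant could hash the action together with its arguments so a genuinely different retry is not blocked.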
FAQ
Is it safe to let the LLM decide how to fix its own errors?
Yes, with guardrails. The LLM's diagnosis should be constrained to a fixed set of recovery actions (the RecoveryAction enum). The agent code validates the proposed action and prevents unsafe operations like modifying arguments in ways that bypass business rules. The LLM provides intelligence; the code provides safety boundaries.
How do I prevent the agent from looping between two failing strategies?
Track all attempted strategies in a set and reject any strategy that has already been tried. The HealingGuardrails class above implements this. Additionally, include the full healing history in the diagnosis prompt so the LLM knows which approaches have already failed and can choose a different path.
When should self-healing escalate to a human?
Escalate when the error involves ambiguous user intent (the agent is unsure what the user wants), when the failure involves financial or irreversible actions, or when the maximum healing attempts are exhausted. The escalation path should capture the full context — original request, error, all healing attempts — so the human reviewer can resolve the issue without asking the user to repeat themselves.
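One way to package that context is a single serializable payload handed to the operator. The `EscalationTicket` shape below is a hypothetical sketch, not part of the agent above:

```python
from dataclasses import asdict, dataclass, field


@dataclass
class EscalationTicket:
    # Hypothetical payload; field names are illustrative.
    original_request: str
    tool_name: str
    error: str
    healing_attempts: list[dict] = field(default_factory=list)


ticket = EscalationTicket(
    original_request="Refund order A-123",
    tool_name="issue_refund",
    error="PermissionError: refunds above $500 require approval",
    healing_attempts=[{"action": "retry_modified", "success": False}],
)
print(asdict(ticket))  # full context in one serializable dict
```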
CallSphere Team
Expert insights on AI voice agents and customer communication automation.