Comprehensive Error Handling for AI Agents: A Taxonomy of Failure Modes

Why AI Agents Fail Differently Than Traditional Software

Traditional software fails in predictable ways: null pointers, type mismatches, refused connections. AI agents add failure modes of their own because they rely on probabilistic models, external APIs with variable latency, and tool integrations that can break in subtle ways. A robust agent needs a structured error taxonomy so that each failure can be caught, categorized, and handled appropriately.

Without a taxonomy, teams end up with a patchwork of try/except blocks that swallow important errors and let destructive ones pass through silently.

The Four Categories of Agent Failure

Every error in an AI agent system falls into one of four categories, each demanding a different response strategy.

Category 1: LLM Errors

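These originate from the language model itself: rate limits, context length exceeded, malformed output, or hallucinated tool calls. The shared types below describe an error from any of the four categories; an LLM error is an AgentError whose category is ErrorCategory.LLM.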

from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class ErrorCategory(Enum):
    LLM = "llm"
    TOOL = "tool"
    NETWORK = "network"
    BUSINESS_LOGIC = "business_logic"

class ErrorSeverity(Enum):
    RECOVERABLE = "recoverable"  # safe to retry
    DEGRADED = "degraded"        # continue with a fallback
    FATAL = "fatal"              # stop the current action

@dataclass
class AgentError:
    category: ErrorCategory
    severity: ErrorSeverity
    message: str
    original_exception: Optional[Exception] = None
    retry_eligible: bool = True
    # default_factory gives each error its own dict instead of a shared default.
    context: dict = field(default_factory=dict)
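
The four LLM failures named above can be mapped onto these types with a small lookup table. The following is a minimal sketch, not a definitive policy: the failure labels (`rate_limit`, `context_length_exceeded`, and so on) and the `classify_llm_failure` helper are hypothetical names, and the severity chosen for each label is an assumption. In particular, treating an exceeded context window as degraded (because resending the same prompt cannot succeed) is a judgment call. The types are repeated in compact form only so the snippet runs on its own.

```python
from dataclasses import dataclass, field
from enum import Enum


# Compact copies of the article's types so this snippet is self-contained.
class ErrorCategory(Enum):
    LLM = "llm"


class ErrorSeverity(Enum):
    RECOVERABLE = "recoverable"
    DEGRADED = "degraded"
    FATAL = "fatal"


@dataclass
class AgentError:
    category: ErrorCategory
    severity: ErrorSeverity
    message: str
    retry_eligible: bool = True
    context: dict = field(default_factory=dict)


# Hypothetical labels mapped to (severity, retry_eligible). A real client
# would derive the label from the provider's exceptions or status codes.
LLM_POLICY = {
    "rate_limit": (ErrorSeverity.RECOVERABLE, True),
    "malformed_output": (ErrorSeverity.RECOVERABLE, True),
    "hallucinated_tool_call": (ErrorSeverity.RECOVERABLE, True),
    # Assumption: the same prompt will fail again, so fall back instead.
    "context_length_exceeded": (ErrorSeverity.DEGRADED, False),
}


def classify_llm_failure(kind: str, detail: str = "") -> AgentError:
    """Build an AgentError for an LLM failure; unknown labels are fatal."""
    severity, retry = LLM_POLICY.get(kind, (ErrorSeverity.FATAL, False))
    return AgentError(
        category=ErrorCategory.LLM,
        severity=severity,
        message=f"LLM failure '{kind}': {detail}",
        retry_eligible=retry,
        context={"kind": kind},
    )
```

Keeping the policy in a table rather than in branching code makes it easy to review and adjust per provider.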

Category 2: Tool Execution Errors

Tools are the hands of your agent. When a database query fails, an API returns unexpected data, or a file system operation is denied, the agent must distinguish between a tool that is temporarily down and one that received bad input.

class ToolErrorClassifier:
    """Classifies tool errors to determine the correct recovery strategy."""

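    # OSError is deliberately excluded: it is the parent of PermissionError
    # and FileNotFoundError, which do not go away on retry.
    TRANSIENT_EXCEPTIONS = (
        ConnectionError,
        TimeoutError,
    )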

    @staticmethod
    def classify(tool_name: str, exc: Exception) -> AgentError:
        if isinstance(exc, ToolErrorClassifier.TRANSIENT_EXCEPTIONS):
            return AgentError(
                category=ErrorCategory.TOOL,
                severity=ErrorSeverity.RECOVERABLE,
                message=f"Tool '{tool_name}' hit a transient error: {exc}",
                original_exception=exc,
                retry_eligible=True,
                context={"tool": tool_name},
            )

        if isinstance(exc, ValueError):
            return AgentError(
                category=ErrorCategory.TOOL,
                severity=ErrorSeverity.DEGRADED,
                message=f"Tool '{tool_name}' received invalid input: {exc}",
                original_exception=exc,
                retry_eligible=False,
                context={"tool": tool_name},
            )

        return AgentError(
            category=ErrorCategory.TOOL,
            severity=ErrorSeverity.FATAL,
            message=f"Tool '{tool_name}' failed unexpectedly: {exc}",
            original_exception=exc,
            retry_eligible=False,
            context={"tool": tool_name},
        )

Category 3: Network Errors

Network errors are usually the most common source of transient failures. They include DNS resolution failures, TLS handshake timeouts, connection resets, and HTTP 5xx responses from upstream providers.
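
One way to recognize these failures is to inspect the exception type and, for HTTP responses, the status code. The sketch below uses only the standard library and assumes the agent talks to providers through `urllib`; other HTTP clients raise their own exception types and would need their own mapping. The helper name `is_transient_network_error` is hypothetical. A `True` result would typically become an AgentError with category NETWORK and recoverable severity.

```python
import socket
import urllib.error


def is_transient_network_error(exc: BaseException) -> bool:
    """Return True for failures that are usually worth retrying."""
    # HTTPError is a subclass of URLError, so it must be checked first.
    if isinstance(exc, urllib.error.HTTPError):
        return 500 <= exc.code < 600  # upstream 5xx only; 4xx is the caller's fault
    if isinstance(exc, urllib.error.URLError):
        # urllib wraps the underlying socket error in .reason (it can also be a str).
        reason = exc.reason
        return isinstance(reason, BaseException) and is_transient_network_error(reason)
    return isinstance(
        exc,
        (
            socket.gaierror,       # DNS resolution failure
            socket.timeout,        # includes TLS handshake timeouts
            TimeoutError,
            ConnectionResetError,  # connection reset by peer
        ),
    )
```

Note that a DNS failure caused by a mistyped hostname is not transient; a retry limit keeps that case from looping forever.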

Category 4: Business Logic Errors

These are the most dangerous because they look like success. The LLM returns valid JSON, the tool executes without exception, but the result violates a business rule — for example, booking an appointment in the past or transferring funds exceeding an account balance.

class BusinessRuleValidator:
    """Validates agent outputs against business rules before execution."""

    def __init__(self):
        self.rules = []

    def add_rule(self, name: str, check_fn, error_msg: str):
        self.rules.append({"name": name, "check": check_fn, "msg": error_msg})

    def validate(self, action: dict) -> list[AgentError]:
        errors = []
        for rule in self.rules:
            if not rule["check"](action):
                errors.append(AgentError(
                    category=ErrorCategory.BUSINESS_LOGIC,
                    severity=ErrorSeverity.FATAL,
                    message=rule["msg"],
                    retry_eligible=False,
                    context={"action": action, "rule": rule["name"]},
                ))
        return errors

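# Usage
from datetime import date

validator = BusinessRuleValidator()
validator.add_rule(
    "future_date",
    # Compare against today's date rather than a hard-coded one. ISO 8601
    # strings (YYYY-MM-DD) sort chronologically, so string comparison works.
    lambda a: a.get("date") and a["date"] > date.today().isoformat(),
    "Cannot schedule appointments in the past.",
)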

Building a Unified Error Handler

The key insight is routing every error through a single handler that decides the response based on category and severity.

class AgentErrorHandler:
    def handle(self, error: AgentError) -> str:
        if error.severity == ErrorSeverity.RECOVERABLE and error.retry_eligible:
            return "retry"
        elif error.severity == ErrorSeverity.DEGRADED:
            return "fallback"
        else:
            return "abort"
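
The handler only returns a decision string; something still has to act on it. The following is a minimal sketch of such a driver, assuming exponential backoff between retries and an optional fallback callable. `run_with_recovery`, `decide`, `max_attempts`, and `base_delay` are hypothetical names, not part of the handler above. `decide` is any function that maps an exception to "retry", "fallback", or "abort".

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def run_with_recovery(
    operation: Callable[[], T],
    decide: Callable[[Exception], str],
    fallback: Optional[Callable[[], T]] = None,
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation and act on the decision returned by decide(exc)."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:
            decision = decide(exc)
            if decision == "retry" and attempt < max_attempts:
                # Exponential backoff: base_delay, 2x, 4x, ...
                sleep(base_delay * 2 ** (attempt - 1))
                continue
            if decision == "fallback" and fallback is not None:
                return fallback()
            # "abort", retries exhausted, or no fallback available.
            raise
    raise AssertionError("unreachable")
```

With the classes from this article, `decide` could be `lambda exc: AgentErrorHandler().handle(ToolErrorClassifier.classify("search", exc))`, where "search" is a placeholder tool name. The `sleep` parameter exists so tests can replace the real delay.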

This taxonomy becomes the foundation for every resilience pattern covered in the remaining posts of this series.

FAQ

Why not just use a generic try/except around the entire agent loop?

A blanket try/except hides the root cause and makes it impossible to choose the right recovery strategy. Retrying a business logic error wastes tokens and time, while aborting on a transient network glitch fails a request that a single retry would have saved. Categorization enables targeted responses.

Should business logic validation happen before or after tool execution?

Always before. Once a tool has executed a destructive action — sending an email, charging a card — you cannot undo it. Validate the planned action against business rules before calling the tool, and only allow execution if all checks pass.
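
A minimal sketch of that ordering, assuming each rule is a predicate paired with an error message; `execute_if_valid` and the `Check` alias are hypothetical names used only for illustration. The tool is called only when the list of violations is empty.

```python
from typing import Any, Callable

# (predicate, message reported when the predicate fails)
Check = tuple[Callable[[dict], bool], str]


def execute_if_valid(
    action: dict,
    checks: list[Check],
    execute: Callable[[dict], Any],
) -> dict:
    """Call execute(action) only if every check passes."""
    violations = [msg for check, msg in checks if not check(action)]
    if violations:
        # Nothing has run yet, so there is nothing to undo.
        return {"executed": False, "violations": violations}
    return {"executed": True, "result": execute(action)}
```

For example, a check such as `lambda a: a["amount"] <= a["balance"]` would block a transfer that exceeds the account balance before the payment tool is ever invoked.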

How do I handle errors from the LLM itself, like hallucinated function calls?

Parse the LLM output with a strict schema validator such as Pydantic. If the model returns a tool call that does not match any registered tool name or produces arguments that fail validation, classify it as an LLM error with recoverable severity. Re-prompt the model with the validation error and let it self-correct, up to a maximum retry count.
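
The re-prompt loop can be sketched as follows. To stay dependency-free, this version swaps Pydantic for a hand-written check built on the standard `json` module. It also assumes the model replies with a JSON object of the form `{"name": ..., "arguments": {...}}` and that `call_model` is a callable taking a prompt string and returning the raw reply. The names `validate_tool_call`, `get_valid_tool_call`, and `registry` (tool name mapped to required argument names) are all hypothetical.

```python
import json
from typing import Callable, Optional


def validate_tool_call(raw: str, registry: dict[str, set[str]]) -> Optional[str]:
    """Return None if the call is valid, else a message to send back to the model."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return f"Output is not valid JSON: {exc.msg}"
    if not isinstance(call, dict):
        return "Output must be a JSON object."
    name = call.get("name")
    if not isinstance(name, str) or name not in registry:
        return f"Unknown tool {name!r}. Available tools: {sorted(registry)}"
    args = call.get("arguments")
    if not isinstance(args, dict):
        return "'arguments' must be a JSON object."
    missing = registry[name] - set(args)
    if missing:
        return f"Missing arguments for {name!r}: {sorted(missing)}"
    return None


def get_valid_tool_call(
    call_model: Callable[[str], str],
    prompt: str,
    registry: dict[str, set[str]],
    max_retries: int = 2,
) -> dict:
    """Ask the model for a tool call, re-prompting with the validation error."""
    problem = "model was never called"
    current = prompt
    for _ in range(max_retries + 1):
        raw = call_model(current)
        problem = validate_tool_call(raw, registry)
        if problem is None:
            return json.loads(raw)
        # Feed the specific error back so the model can self-correct.
        current = (
            f"{prompt}\n\nYour previous reply was rejected: {problem}\n"
            "Reply with a corrected tool call."
        )
    raise RuntimeError(f"No valid tool call after {max_retries + 1} attempts: {problem}")
```

When the retry budget runs out, the raised error is the point to escalate from recoverable to a more severe classification.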
