
Error Tracking in AI Agent Systems: Sentry, PagerDuty, and Custom Alerting

Implement comprehensive error tracking for AI agent systems with error classification, severity-based alert routing to Sentry and PagerDuty, and incident response workflows tailored to LLM failure modes.

Agent Error Modes Are Different

Traditional applications have well-understood failure modes: null pointer exceptions, connection timeouts, authentication failures. AI agents add an entirely new category of errors that are harder to detect and classify. An LLM might return a syntactically valid response that calls a nonexistent tool. A tool call might succeed with HTTP 200 but return data the agent misinterprets. The agent might enter an infinite loop of tool calls without ever producing a final answer.

These failure modes require error tracking that goes beyond exception monitoring. You need to classify errors by type, route alerts based on severity and impact, and build incident response workflows that account for the probabilistic nature of LLM behavior.

Classifying Agent Errors

Define a taxonomy of error types so your alerting can be granular. Not all agent errors deserve the same response.

from enum import Enum

class AgentErrorType(Enum):
    # Infrastructure errors - immediate attention
    LLM_API_UNREACHABLE = "llm_api_unreachable"
    DATABASE_CONNECTION_FAILED = "database_connection_failed"
    TOOL_SERVER_DOWN = "tool_server_down"

    # LLM behavior errors - investigate if frequent
    LLM_INVALID_TOOL_CALL = "llm_invalid_tool_call"
    LLM_REFUSED_REQUEST = "llm_refused_request"
    LLM_INFINITE_LOOP = "llm_infinite_loop"
    LLM_CONTEXT_OVERFLOW = "llm_context_overflow"

    # Tool execution errors - may need tool-specific fixes
    TOOL_EXECUTION_FAILED = "tool_execution_failed"
    TOOL_TIMEOUT = "tool_timeout"
    TOOL_INVALID_RESPONSE = "tool_invalid_response"

    # Validation errors - usually indicates prompt issues
    OUTPUT_VALIDATION_FAILED = "output_validation_failed"
    GUARDRAIL_TRIGGERED = "guardrail_triggered"

class AgentError(Exception):
    def __init__(
        self,
        error_type: AgentErrorType,
        message: str,
        severity: str = "error",
        context: dict | None = None,
    ):
        super().__init__(message)
        self.error_type = error_type
        self.severity = severity
        self.context = context or {}

Integrating Sentry for Error Tracking

Sentry captures exceptions with full stack traces, groups them by root cause, and tracks their frequency over time. Configure it to enrich agent errors with custom context.

import sentry_sdk

sentry_sdk.init(
    dsn="https://your-key@sentry.io/project-id",
    traces_sample_rate=0.1,
    environment="production",
    release="agent-service@1.2.0",
)

async def handle_agent_error(error: AgentError, conversation_id: str, user_id: str):
    """Report agent errors to Sentry with rich context."""
    # Use an isolated scope so tags from one conversation do not leak
    # into events captured concurrently for other conversations.
    with sentry_sdk.push_scope() as scope:
        scope.set_tag("error_type", error.error_type.value)
        scope.set_tag("severity", error.severity)
        scope.set_tag("agent_name", error.context.get("agent_name", "unknown"))

        scope.set_context("agent", {
            "conversation_id": conversation_id,
            "user_id": user_id,
            "error_type": error.error_type.value,
            "model": error.context.get("model"),
            "tool_name": error.context.get("tool_name"),
            "step": error.context.get("step"),
        })

        sentry_sdk.capture_exception(error)
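Agent context often carries user text. Before events leave your infrastructure, a `before_send` hook can redact fields you never want stored in a third-party service. A minimal sketch — the specific field names scrubbed here are assumptions, not part of the handler above:

```python
# Fields we assume may hold user content; adjust to your own context keys.
SENSITIVE_KEYS = {"user_message", "tool_output"}

def scrub_agent_context(event: dict, hint=None) -> dict:
    """Redact sensitive agent-context fields from a Sentry event.
    Pass as `before_send=scrub_agent_context` to sentry_sdk.init."""
    agent_ctx = (event.get("contexts") or {}).get("agent") or {}
    for key in SENSITIVE_KEYS & set(agent_ctx):
        agent_ctx[key] = "[redacted]"
    return event

event = {"contexts": {"agent": {"user_message": "contains my card number", "model": "gpt-4o"}}}
scrubbed = scrub_agent_context(event)
```

The hook mutates only the `agent` context, so tags and fingerprinting used for grouping are unaffected.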

Building a Custom Alert Router

Different error types warrant different responses. Infrastructure errors need PagerDuty pages. LLM behavior errors need Slack notifications. Validation errors need logging for later analysis.


import logging
from dataclasses import dataclass

import httpx

logger = logging.getLogger(__name__)

@dataclass
class AlertConfig:
    pagerduty_key: str
    slack_webhook: str
    email_endpoint: str

class AlertRouter:
    def __init__(self, config: AlertConfig):
        self.config = config
        # Bounded timeout so alert delivery itself cannot hang the agent.
        self.client = httpx.AsyncClient(timeout=10.0)

    async def route_alert(self, error: AgentError, conversation_id: str):
        error_type = error.error_type

        # Critical infrastructure errors -> PagerDuty
        if error_type in (
            AgentErrorType.LLM_API_UNREACHABLE,
            AgentErrorType.DATABASE_CONNECTION_FAILED,
            AgentErrorType.TOOL_SERVER_DOWN,
        ):
            await self._page_oncall(error, conversation_id)
            await self._notify_slack(error, conversation_id, channel="#incidents")

        # LLM behavior errors -> Slack warning
        elif error_type in (
            AgentErrorType.LLM_INFINITE_LOOP,
            AgentErrorType.LLM_CONTEXT_OVERFLOW,
        ):
            await self._notify_slack(error, conversation_id, channel="#agent-alerts")

        # Tool errors -> Slack only if frequent
        elif error_type in (
            AgentErrorType.TOOL_EXECUTION_FAILED,
            AgentErrorType.TOOL_TIMEOUT,
        ):
            if await self._error_rate_exceeds_threshold(error_type, threshold=10, window_minutes=5):
                await self._notify_slack(error, conversation_id, channel="#agent-alerts")

        # Validation, guardrail, and remaining errors -> log for later analysis
        else:
            logger.warning(
                "Agent error %s in conversation %s: %s",
                error_type.value, conversation_id, error,
            )

    async def _page_oncall(self, error: AgentError, conversation_id: str):
        await self.client.post(
            "https://events.pagerduty.com/v2/enqueue",
            json={
                "routing_key": self.config.pagerduty_key,
                "event_action": "trigger",
                "payload": {
                    "summary": f"Agent error: {error.error_type.value} - {str(error)}",
                    "severity": "critical",
                    "source": "agent-service",
                    "custom_details": {
                        "conversation_id": conversation_id,
                        **error.context,
                    },
                },
            },
        )

    async def _notify_slack(self, error: AgentError, conversation_id: str, channel: str):
        # Note: the "channel" override only works with legacy Slack incoming
        # webhooks; newer Slack apps need one webhook URL per channel.
        await self.client.post(
            self.config.slack_webhook,
            json={
                "channel": channel,
                "text": f"*Agent Error*: {error.error_type.value}\n"
                        f"Message: {str(error)}\n"
                        f"Conversation: {conversation_id}",
            },
        )

Detecting Agent-Specific Failure Patterns

Some agent failures do not raise exceptions. Detect them with runtime checks.

MAX_TOOL_CALLS_PER_TURN = 10
MAX_AGENT_TURNS = 25

class AgentLoopGuard:
    def __init__(self):
        self.tool_call_count = 0
        self.turn_count = 0
        self.seen_tool_calls = []

    def check_tool_call(self, tool_name: str, arguments: dict):
        self.tool_call_count += 1
        call_signature = f"{tool_name}:{hash(str(sorted(arguments.items())))}"

        # Detect a loop: the same tool call signature repeated within
        # the last three calls.
        if call_signature in self.seen_tool_calls[-3:]:
            raise AgentError(
                AgentErrorType.LLM_INFINITE_LOOP,
                f"Repeated tool call detected: {tool_name}",
                severity="critical",
                context={"tool_name": tool_name, "window": 3},
            )

        if self.tool_call_count > MAX_TOOL_CALLS_PER_TURN:
            raise AgentError(
                AgentErrorType.LLM_INFINITE_LOOP,
                f"Tool call limit exceeded: {self.tool_call_count}",
                severity="error",
            )

        self.seen_tool_calls.append(call_signature)

    def check_turn(self):
        self.turn_count += 1
        self.tool_call_count = 0  # the per-turn tool call budget resets each turn
        if self.turn_count > MAX_AGENT_TURNS:
            raise AgentError(
                AgentErrorType.LLM_INFINITE_LOOP,
                f"Agent turn limit exceeded: {self.turn_count}",
                severity="error",
            )

FAQ

How do I avoid alert fatigue with AI agents?

Use rate-based alerting instead of per-error alerting. A single tool failure is normal — tools can be temporarily unavailable. Page oncall only when the error rate for a given type exceeds a threshold within a time window. For LLM behavior errors, alert on percentage of conversations affected rather than raw count. Review and tune thresholds weekly during the first month of deployment.

Should I retry LLM calls automatically before raising an error?

Yes, but with limits. Retry transient errors like rate limits (HTTP 429) and server errors (HTTP 500-503) with exponential backoff, up to 3 attempts. Do not retry content policy violations (HTTP 400) or context length errors — these will fail again with the same input. Track retry counts in your error metadata so you can monitor retry rates.
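A retry wrapper along those lines might look like this. The `status_code` attribute on exceptions is an assumption for the sketch; real SDKs expose their own typed errors, and you would match on those instead:

```python
import asyncio
import random

# Transient statuses worth retrying; 400-level content errors are not.
RETRYABLE_STATUS = {429, 500, 502, 503}

async def call_llm_with_retries(call, max_attempts: int = 3, base_delay: float = 1.0):
    """Retry transient LLM API failures with exponential backoff and jitter.
    `call` is an async callable whose exceptions carry a `status_code`
    attribute on HTTP errors (an assumption for this sketch)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await call()
        except Exception as exc:
            status = getattr(exc, "status_code", None)
            # Only retry transient statuses, and never past the last attempt.
            if status not in RETRYABLE_STATUS or attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.25 * base_delay)
            await asyncio.sleep(delay)
```

Record the attempt number in your error metadata before re-raising so retry rates show up in your dashboards.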

How do I handle errors gracefully so the user gets a useful response?

Implement a fallback chain. If the primary model fails, try a fallback model. If all LLM calls fail, return a static message like "I am having trouble processing your request. Please try again in a moment." Never expose raw error messages or stack traces to users. Log the full error details for your engineering team and return a user-friendly message with a reference ID they can share with support.
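A minimal fallback chain, with stand-in async model callables and an illustrative reference ID format:

```python
import asyncio

FALLBACK_MESSAGE = (
    "I am having trouble processing your request. "
    "Please try again in a moment. (ref: {ref_id})"
)

async def answer_with_fallbacks(prompt: str, models, ref_id: str) -> str:
    """Try each model client in order; if every one fails, return a
    static, user-safe message carrying a support reference ID."""
    for model in models:
        try:
            return await model(prompt)
        except Exception:
            # Log full details internally; never surface raw errors to users.
            continue
    return FALLBACK_MESSAGE.format(ref_id=ref_id)

# Stand-in clients: the primary always fails, the backup answers.
async def primary(prompt: str) -> str:
    raise RuntimeError("primary model unavailable")

async def backup(prompt: str) -> str:
    return f"backup answer to: {prompt}"
```

The same `ref_id` should be attached to the logged error so support can correlate a user report with the full trace.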


#ErrorTracking #Sentry #PagerDuty #Alerting #IncidentResponse #AgenticAI #LearnAI #AIEngineering

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
