Error Tracking in AI Agent Systems: Sentry, PagerDuty, and Custom Alerting
Implement comprehensive error tracking for AI agent systems with error classification, severity-based alert routing to Sentry and PagerDuty, and incident response workflows tailored to LLM failure modes.
Agent Error Modes Are Different
Traditional applications have well-understood failure modes: null pointer exceptions, connection timeouts, authentication failures. AI agents add an entirely new category of errors that are harder to detect and classify. An LLM might return a syntactically valid response that calls a nonexistent tool. A tool call might succeed with HTTP 200 but return data the agent misinterprets. The agent might enter an infinite loop of tool calls without ever producing a final answer.
These failure modes require error tracking that goes beyond exception monitoring. You need to classify errors by type, route alerts based on severity and impact, and build incident response workflows that account for the probabilistic nature of LLM behavior.
Classifying Agent Errors
Define a taxonomy of error types so your alerting can be granular. Not all agent errors deserve the same response.
from enum import Enum

class AgentErrorType(Enum):
    # Infrastructure errors - immediate attention
    LLM_API_UNREACHABLE = "llm_api_unreachable"
    DATABASE_CONNECTION_FAILED = "database_connection_failed"
    TOOL_SERVER_DOWN = "tool_server_down"

    # LLM behavior errors - investigate if frequent
    LLM_INVALID_TOOL_CALL = "llm_invalid_tool_call"
    LLM_REFUSED_REQUEST = "llm_refused_request"
    LLM_INFINITE_LOOP = "llm_infinite_loop"
    LLM_CONTEXT_OVERFLOW = "llm_context_overflow"

    # Tool execution errors - may need tool-specific fixes
    TOOL_EXECUTION_FAILED = "tool_execution_failed"
    TOOL_TIMEOUT = "tool_timeout"
    TOOL_INVALID_RESPONSE = "tool_invalid_response"

    # Validation errors - usually indicate prompt issues
    OUTPUT_VALIDATION_FAILED = "output_validation_failed"
    GUARDRAIL_TRIGGERED = "guardrail_triggered"

class AgentError(Exception):
    def __init__(
        self,
        error_type: AgentErrorType,
        message: str,
        severity: str = "error",
        context: dict | None = None,
    ):
        super().__init__(message)
        self.error_type = error_type
        self.severity = severity
        self.context = context or {}
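As a usage sketch, wrap raw exceptions from tool execution into classified errors at the boundary where they occur. The `run_tool` helper and the tool names below are illustrative, and minimal stand-ins for the taxonomy classes are repeated so the snippet runs on its own:

```python
# Minimal stand-ins for the taxonomy above, repeated so this sketch runs
# on its own; in the real service you would import them instead.
from enum import Enum

class AgentErrorType(Enum):
    TOOL_TIMEOUT = "tool_timeout"

class AgentError(Exception):
    def __init__(self, error_type, message, severity="error", context=None):
        super().__init__(message)
        self.error_type = error_type
        self.severity = severity
        self.context = context or {}

def run_tool(tool_name: str) -> str:
    # Hypothetical tool executor; here it always times out
    raise TimeoutError(f"{tool_name} did not respond within 5s")

def call_tool_classified(tool_name: str) -> str:
    try:
        return run_tool(tool_name)
    except TimeoutError as exc:
        # Re-raise as a classified error so the alert router can act on it
        raise AgentError(
            AgentErrorType.TOOL_TIMEOUT,
            str(exc),
            severity="warning",
            context={"tool_name": tool_name},
        ) from exc
```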
Integrating Sentry for Error Tracking
Sentry captures exceptions with full stack traces, groups them by root cause, and tracks their frequency over time. Configure it to enrich agent errors with custom context.
import sentry_sdk

sentry_sdk.init(
    dsn="https://your-key@sentry.io/project-id",
    traces_sample_rate=0.1,
    environment="production",
    release="agent-service@1.2.0",
)

async def handle_agent_error(error: AgentError, conversation_id: str, user_id: str):
    """Report agent errors to Sentry with rich context."""
    # Scope tags and context to this event so concurrent requests do not
    # bleed tags into each other's reports.
    with sentry_sdk.push_scope() as scope:
        scope.set_tag("error_type", error.error_type.value)
        scope.set_tag("severity", error.severity)
        scope.set_tag("agent_name", error.context.get("agent_name", "unknown"))
        scope.set_context("agent", {
            "conversation_id": conversation_id,
            "user_id": user_id,
            "error_type": error.error_type.value,
            "model": error.context.get("model"),
            "tool_name": error.context.get("tool_name"),
            "step": error.context.get("step"),
        })
        sentry_sdk.capture_exception(error)
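One detail worth configuring is grouping: Sentry groups by stack trace by default, which can split a single logical agent failure across many issues. A hedged sketch of a `before_send` hook that groups by the `error_type` tag instead (the hook name and fingerprint components are our choice, and the tags-as-dict event shape is an assumption about the Python SDK payload):

```python
# Sketch: group agent errors in Sentry by error_type rather than by
# stack trace, via a before_send hook.
def fingerprint_agent_errors(event, hint):
    error_type = (event.get("tags") or {}).get("error_type")
    if error_type:
        # "{{ default }}" keeps Sentry's normal grouping as one component
        event["fingerprint"] = ["agent-error", error_type, "{{ default }}"]
    return event

# Wire it up at init time:
# sentry_sdk.init(..., before_send=fingerprint_agent_errors)
```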
Building a Custom Alert Router
Different error types warrant different responses. Infrastructure errors need PagerDuty pages. LLM behavior errors need Slack notifications. Validation errors need logging for later analysis.
from collections import defaultdict
from dataclasses import dataclass
import time

import httpx

@dataclass
class AlertConfig:
    pagerduty_key: str
    slack_webhook: str
    email_endpoint: str

class AlertRouter:
    def __init__(self, config: AlertConfig):
        self.config = config
        self.client = httpx.AsyncClient()
        # Sliding-window timestamps per error type, for rate-based alerting
        self._error_times: dict[AgentErrorType, list[float]] = defaultdict(list)

    async def route_alert(self, error: AgentError, conversation_id: str):
        error_type = error.error_type
        # Critical infrastructure errors -> PagerDuty
        if error_type in (
            AgentErrorType.LLM_API_UNREACHABLE,
            AgentErrorType.DATABASE_CONNECTION_FAILED,
            AgentErrorType.TOOL_SERVER_DOWN,
        ):
            await self._page_oncall(error, conversation_id)
            await self._notify_slack(error, conversation_id, channel="#incidents")
        # LLM behavior errors -> Slack warning
        elif error_type in (
            AgentErrorType.LLM_INFINITE_LOOP,
            AgentErrorType.LLM_CONTEXT_OVERFLOW,
        ):
            await self._notify_slack(error, conversation_id, channel="#agent-alerts")
        # Tool errors -> Slack only when frequent
        elif error_type in (
            AgentErrorType.TOOL_EXECUTION_FAILED,
            AgentErrorType.TOOL_TIMEOUT,
        ):
            if self._error_rate_exceeds_threshold(error_type, threshold=10, window_minutes=5):
                await self._notify_slack(error, conversation_id, channel="#agent-alerts")

    def _error_rate_exceeds_threshold(
        self, error_type: AgentErrorType, threshold: int, window_minutes: int
    ) -> bool:
        """Record this error and report whether the windowed rate is too high."""
        now = time.monotonic()
        window = window_minutes * 60
        times = [t for t in self._error_times[error_type] if now - t <= window]
        times.append(now)
        self._error_times[error_type] = times
        return len(times) > threshold

    async def _page_oncall(self, error: AgentError, conversation_id: str):
        await self.client.post(
            "https://events.pagerduty.com/v2/enqueue",
            json={
                "routing_key": self.config.pagerduty_key,
                "event_action": "trigger",
                "payload": {
                    "summary": f"Agent error: {error.error_type.value} - {str(error)}",
                    "severity": "critical",
                    "source": "agent-service",
                    "custom_details": {
                        "conversation_id": conversation_id,
                        **error.context,
                    },
                },
            },
        )

    async def _notify_slack(self, error: AgentError, conversation_id: str, channel: str):
        await self.client.post(
            self.config.slack_webhook,
            json={
                "channel": channel,
                "text": f"*Agent Error*: {error.error_type.value}\n"
                        f"Message: {str(error)}\n"
                        f"Conversation: {conversation_id}",
            },
        )
Detecting Agent-Specific Failure Patterns
Some agent failures do not raise exceptions. Detect them with runtime checks.
MAX_TOOL_CALLS_PER_TURN = 10
MAX_AGENT_TURNS = 25

class AgentLoopGuard:
    def __init__(self):
        self.tool_call_count = 0
        self.turn_count = 0
        self.seen_tool_calls = []

    def check_tool_call(self, tool_name: str, arguments: dict):
        self.tool_call_count += 1
        call_signature = f"{tool_name}:{hash(str(sorted(arguments.items())))}"
        # Detect a loop: the same call signature reappearing in the last few calls
        if call_signature in self.seen_tool_calls[-3:]:
            repeats = self.seen_tool_calls[-3:].count(call_signature) + 1
            raise AgentError(
                AgentErrorType.LLM_INFINITE_LOOP,
                f"Repeated tool call detected: {tool_name}",
                severity="critical",
                context={"tool_name": tool_name, "repeat_count": repeats},
            )
        if self.tool_call_count > MAX_TOOL_CALLS_PER_TURN:
            raise AgentError(
                AgentErrorType.LLM_INFINITE_LOOP,
                f"Tool call limit exceeded: {self.tool_call_count}",
                severity="error",
            )
        self.seen_tool_calls.append(call_signature)

    def check_turn(self):
        self.turn_count += 1
        if self.turn_count > MAX_AGENT_TURNS:
            raise AgentError(
                AgentErrorType.LLM_INFINITE_LOOP,
                f"Agent turn limit exceeded: {self.turn_count}",
                severity="error",
            )
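The introduction mentioned one more exception-free failure: a syntactically valid call to a tool that does not exist. A pre-dispatch check turns that into a classifiable error instead of a `KeyError` deep in the executor. The tool registry and function name below are hypothetical:

```python
# Sketch: validate a model-emitted tool call before dispatching it.
# KNOWN_TOOLS stands in for your real tool registry.
KNOWN_TOOLS = {"search_orders", "issue_refund"}

def validate_tool_call(tool_name: str, arguments) -> list[str]:
    """Return a list of problems; empty means the call is dispatchable."""
    problems = []
    if tool_name not in KNOWN_TOOLS:
        problems.append(f"unknown tool: {tool_name}")
    if not isinstance(arguments, dict):
        problems.append("arguments must be an object")
    return problems
```

An empty result lets the call proceed; a non-empty one can be raised as an `LLM_INVALID_TOOL_CALL` agent error.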
FAQ
How do I avoid alert fatigue with AI agents?
Use rate-based alerting instead of per-error alerting. A single tool failure is normal — tools can be temporarily unavailable. Page oncall only when the error rate for a given type exceeds a threshold within a time window. For LLM behavior errors, alert on percentage of conversations affected rather than raw count. Review and tune thresholds weekly during the first month of deployment.
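The percentage rule above can be sketched in a few lines; the thresholds here are placeholders to tune, not recommendations:

```python
# Alert on the share of conversations affected, not raw counts, and
# require a minimum sample so low traffic cannot trip the percentage.
def should_alert(affected: int, total: int,
                 min_total: int = 50, max_share: float = 0.05) -> bool:
    if total < min_total:
        return False  # too few conversations for a stable rate
    return affected / total > max_share
```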
Should I retry LLM calls automatically before raising an error?
Yes, but with limits. Retry transient errors like rate limits (HTTP 429) and server errors (HTTP 500-503) with exponential backoff, up to 3 attempts. Do not retry content policy violations (HTTP 400) or context length errors — these will fail again with the same input. Track retry counts in your error metadata so you can monitor retry rates.
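That policy can be sketched as follows. `LLMHTTPError` is a stand-in for whatever status-carrying exception your client raises (for example `httpx.HTTPStatusError`), and the delay values are examples:

```python
import time

# Retry 429 and 500-503; everything else fails fast.
RETRYABLE = {429} | set(range(500, 504))

class LLMHTTPError(Exception):
    """Stand-in for an HTTP error carrying a status code."""
    def __init__(self, status_code: int):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code

def call_with_retries(call, max_attempts: int = 3,
                      base_delay: float = 0.5, sleep=time.sleep):
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except LLMHTTPError as exc:
            if exc.status_code not in RETRYABLE or attempt == max_attempts:
                raise  # non-retryable, or out of attempts
            sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
```

The injectable `sleep` keeps the backoff testable; in production you would also record `attempt` in your error metadata.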
How do I handle errors gracefully so the user gets a useful response?
Implement a fallback chain. If the primary model fails, try a fallback model. If all LLM calls fail, return a static message like "I am having trouble processing your request. Please try again in a moment." Never expose raw error messages or stack traces to users. Log the full error details for your engineering team and return a user-friendly message with a reference ID they can share with support.
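A minimal sketch of that chain, assuming the model-calling functions are supplied by your own clients (the function name and return shape here are illustrative):

```python
import uuid

STATIC_FALLBACK = (
    "I am having trouble processing your request. "
    "Please try again in a moment."
)

def answer_with_fallbacks(prompt: str, model_calls: list) -> dict:
    """Try each model call in order; fall back to a static message."""
    for call in model_calls:
        try:
            return {"text": call(prompt), "degraded": False}
        except Exception:
            continue  # log full details server-side; never surface to users
    ref = uuid.uuid4().hex[:8]  # reference ID the user can give support
    return {"text": f"{STATIC_FALLBACK} (ref: {ref})",
            "degraded": True, "ref": ref}
```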
CallSphere Team
Expert insights on AI voice agents and customer communication automation.