Building an AI Agent Debugger: An Agent That Debugs Other Agents
Learn how to build a meta-agent that analyzes execution traces from other agents, diagnoses failures in tool calls and reasoning chains, suggests fixes, and can even apply automated remediation.
The Debugging Problem in Agent Systems
When a traditional program fails, you get a stack trace pointing to the exact line. When an agent fails, you get a plausible-sounding wrong answer with no obvious error. The agent might have called the wrong tool, misinterpreted a tool's output, lost context mid-conversation, or hallucinated a fact that derailed downstream reasoning. Debugging requires analyzing the full execution trace — every LLM call, tool invocation, and decision point.
A meta-agent — an agent specifically designed to debug other agents — can automate this analysis. It ingests execution traces, identifies failure patterns, and suggests (or applies) fixes.
Capturing Structured Execution Traces
First, you need a tracing layer that records every step of agent execution in a structured format.
```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any
import json


@dataclass
class TraceEvent:
    timestamp: str
    event_type: str  # "llm_call", "tool_call", "tool_result", "handoff", "error"
    agent_name: str
    data: dict[str, Any]
    duration_ms: float = 0


@dataclass
class ExecutionTrace:
    trace_id: str
    started_at: str
    events: list[TraceEvent] = field(default_factory=list)
    final_output: str | None = None
    success: bool = True
    error: str | None = None

    def add_event(self, event: TraceEvent):
        self.events.append(event)

    def to_json(self) -> str:
        return json.dumps({
            "trace_id": self.trace_id,
            "started_at": self.started_at,
            "events": [
                {
                    "timestamp": e.timestamp,
                    "type": e.event_type,
                    "agent": e.agent_name,
                    "data": e.data,
                    "duration_ms": e.duration_ms,
                }
                for e in self.events
            ],
            "final_output": self.final_output,
            "success": self.success,
            "error": self.error,
        }, indent=2)
```
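For reference, `to_json` produces output of roughly this shape for a run with a single tool call (all values here are illustrative):

```json
{
  "trace_id": "3f2b9c1e-...",
  "started_at": "2025-01-15T10:32:01.004562",
  "events": [
    {
      "timestamp": "2025-01-15T10:32:02.118430",
      "type": "tool_call",
      "agent": "Research Agent",
      "data": {"content": "search(query='q3 revenue')"},
      "duration_ms": 1114.2
    }
  ],
  "final_output": "Q3 revenue was ...",
  "success": true,
  "error": null
}
```

This flat, JSON-serializable shape is what the analysis tools below consume.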
Building the Trace Collector
Wrap your agent runner to automatically capture traces.
```python
import time
import uuid

from agents import Agent, Runner


class TracedRunner:
    """Wraps the Agent SDK runner to capture execution traces."""

    def __init__(self):
        self.traces: list[ExecutionTrace] = []

    async def run(self, agent: Agent, input_text: str) -> tuple[Any, ExecutionTrace]:
        trace = ExecutionTrace(
            trace_id=str(uuid.uuid4()),
            started_at=datetime.utcnow().isoformat(),
        )
        result = None  # stays None if the run raises
        start = time.perf_counter()
        try:
            result = await Runner.run(agent, input_text)
            trace.final_output = result.final_output
            trace.success = True
            # Extract events from the result's run items.
            # Per-item timing isn't available post-hoc, so duration_ms
            # records elapsed time since the run started.
            for item in result.new_items:
                trace.add_event(TraceEvent(
                    timestamp=datetime.utcnow().isoformat(),
                    event_type=self._classify_item(item),
                    agent_name=agent.name,
                    data={"content": str(item)},
                    duration_ms=(time.perf_counter() - start) * 1000,
                ))
        except Exception as e:
            trace.success = False
            trace.error = str(e)
        self.traces.append(trace)
        return result, trace

    def _classify_item(self, item) -> str:
        type_name = type(item).__name__.lower()
        if "tool" in type_name:
            return "tool_call"
        if "handoff" in type_name:
            return "handoff"
        return "llm_call"
```
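The name-based classification is a heuristic, so it is worth sanity-checking against the item types you actually see. Here is a standalone sketch of the same logic; the item classes are stand-ins for illustration, not guaranteed SDK type names:

```python
def classify_item(item) -> str:
    """Mirror of TracedRunner._classify_item: bucket run items by class name."""
    type_name = type(item).__name__.lower()
    if "tool" in type_name:
        return "tool_call"
    if "handoff" in type_name:
        return "handoff"
    return "llm_call"


# Stand-in classes imitating run-item type names (hypothetical, for illustration)
class ToolCallItem: pass
class HandoffOutputItem: pass
class MessageOutputItem: pass


print(classify_item(ToolCallItem()))       # tool_call
print(classify_item(HandoffOutputItem()))  # handoff
print(classify_item(MessageOutputItem()))  # llm_call
```

If your SDK version introduces new item types, anything unmatched silently falls into the `llm_call` bucket, so it pays to assert on the classification in tests.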
The Debugger Agent
Now build the agent that analyzes traces. It has specialized tools for different types of analysis.
```python
from agents import Agent, function_tool


@function_tool
def analyze_tool_calls(trace_json: str) -> str:
    """Analyze all tool calls in a trace for errors and anomalies."""
    trace = json.loads(trace_json)
    tool_events = [e for e in trace["events"] if e["type"] == "tool_call"]
    issues = []
    for i, event in enumerate(tool_events):
        # Check for slow tool calls
        if event["duration_ms"] > 5000:
            issues.append(f"Tool call #{i} took {event['duration_ms']}ms (slow)")
        # Check for repeated identical calls (wasted compute)
        for j, other in enumerate(tool_events[i+1:], i+1):
            if event["data"] == other["data"]:
                issues.append(f"Tool calls #{i} and #{j} are identical (duplicate)")
    if not issues:
        return "No tool call issues detected."
    return "Issues found:\n" + "\n".join(f"- {issue}" for issue in issues)


@function_tool
def detect_reasoning_loops(trace_json: str) -> str:
    """Detect if the agent got stuck in a reasoning loop."""
    trace = json.loads(trace_json)
    llm_events = [e for e in trace["events"] if e["type"] == "llm_call"]
    # Check for repetitive outputs
    outputs = [e["data"].get("content", "") for e in llm_events]
    for i in range(len(outputs) - 2):
        if outputs[i] == outputs[i+1] == outputs[i+2]:
            return (f"Reasoning loop detected: LLM produced identical "
                    f"output 3 times starting at step {i}.")
    return "No reasoning loops detected."


@function_tool
def check_context_degradation(trace_json: str) -> str:
    """Check if important context was lost during agent execution."""
    trace = json.loads(trace_json)
    events = trace["events"]
    # Track context size across LLM calls
    llm_events = [e for e in events if e["type"] == "llm_call"]
    if len(llm_events) > 10:
        return (f"Warning: Agent made {len(llm_events)} LLM calls. "
                "Context window may be near capacity. "
                "Early context could be truncated or compressed.")
    return "Context appears stable across all LLM calls."


debugger_agent = Agent(
    name="Agent Debugger",
    instructions="""You are an expert at debugging AI agent systems.

When given an execution trace, systematically:
1. Check for tool call issues (failures, duplicates, slow calls)
2. Look for reasoning loops
3. Check for context degradation
4. Identify the root cause of any failure
5. Suggest specific fixes with code examples

Be precise and actionable in your diagnosis.""",
    tools=[analyze_tool_calls, detect_reasoning_loops, check_context_degradation],
)
```
Running the Debugger on a Failed Trace
```python
async def debug_failed_agent(trace: ExecutionTrace):
    """Hand a failed trace to the debugger agent for analysis."""
    debug_prompt = f"""Analyze this failed agent execution trace and identify
the root cause of failure:

{trace.to_json()}

The agent was expected to produce a correct result but either failed
with an error or produced incorrect output. Diagnose the issue and
suggest a fix."""
    result = await Runner.run(debugger_agent, debug_prompt)
    return result.final_output
```
Automated Remediation
The debugger can go beyond diagnosis and apply fixes. Common remediations include retrying with adjusted parameters, rewriting the system prompt, or modifying tool configurations.
```python
@function_tool
def apply_remediation(
    fix_type: str,
    agent_name: str,
    parameters: str,
) -> str:
    """Apply an automated fix to a failing agent.

    fix_type: "retry_with_temp", "add_instruction", "disable_tool"
    """
    params = json.loads(parameters)
    if fix_type == "retry_with_temp":
        new_temp = params.get("temperature", 0.3)
        return f"Scheduled retry of {agent_name} with temperature={new_temp}"
    elif fix_type == "add_instruction":
        instruction = params.get("instruction", "")
        return f"Added instruction to {agent_name}: '{instruction}'"
    elif fix_type == "disable_tool":
        tool_name = params.get("tool_name", "")
        return f"Disabled tool '{tool_name}' on {agent_name} due to repeated failures"
    return f"Unknown fix type: {fix_type}"
```
Failure Pattern Database
Store diagnosed failures to build institutional knowledge. When the same pattern appears again, the debugger can reference past fixes.
```python
import sqlite3


class FailurePatternDB:
    def __init__(self, db_path: str = "failures.db"):
        self.db = sqlite3.connect(db_path)
        self.db.execute("""
            CREATE TABLE IF NOT EXISTS failure_patterns (
                id INTEGER PRIMARY KEY,
                pattern_signature TEXT UNIQUE,
                description TEXT,
                root_cause TEXT,
                fix_applied TEXT,
                occurrences INTEGER DEFAULT 1,
                last_seen TEXT
            )
        """)

    def record_failure(self, signature: str, description: str,
                       root_cause: str, fix: str):
        self.db.execute("""
            INSERT INTO failure_patterns (pattern_signature, description,
                root_cause, fix_applied, last_seen)
            VALUES (?, ?, ?, ?, datetime('now'))
            ON CONFLICT(pattern_signature) DO UPDATE SET
                occurrences = occurrences + 1,
                last_seen = datetime('now')
        """, (signature, description, root_cause, fix))
        self.db.commit()
```
FAQ
Can the debugger agent itself fail, and how do you handle that?
Yes, and this is a genuine concern. The key mitigation is making the debugger simpler than the agents it debugs. The debugger uses deterministic analysis tools (pattern matching, counting, comparisons) rather than complex reasoning. If the debugger fails, fall back to logging the raw trace for manual human review. Never create a recursive debugging chain — one level of meta-debugging is the practical maximum.
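That fallback can be a thin wrapper: run the analyzer, and if it raises for any reason, persist the raw trace for a human instead of recursing to another meta-level. A minimal sketch, where the analyzer is any callable that takes trace JSON (for example, a synchronous wrapper around the debugger call):

```python
import json
import os


def analyze_with_fallback(analyzer, trace_json: str,
                          fallback_dir: str = "failed_traces") -> str:
    """Run the debugger; if it fails, save the raw trace for manual review."""
    try:
        return analyzer(trace_json)
    except Exception as exc:
        os.makedirs(fallback_dir, exist_ok=True)
        trace_id = json.loads(trace_json).get("trace_id", "unknown")
        path = os.path.join(fallback_dir, f"{trace_id}.json")
        with open(path, "w") as f:
            f.write(trace_json)
        return f"Debugger failed ({exc}); raw trace saved to {path} for manual review"


def broken_analyzer(trace_json: str) -> str:
    # Simulate the debugger itself crashing
    raise RuntimeError("debugger crashed")


result = analyze_with_fallback(
    broken_analyzer,
    json.dumps({"trace_id": "t-001", "events": []}),
)
print(result)  # falls through to the fallback message with the saved path
```

Because the wrapper catches everything and degrades to plain file I/O, a debugger failure can never take down the system it is supposed to diagnose.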
How do you generate the "signature" for failure patterns?
Hash the combination of: the failing agent name, the tool that failed (if any), the error type, and a normalized version of the error message. This groups similar failures together. For reasoning failures where there is no explicit error, use the sequence of tool calls as the signature — two failures with the same tool-call pattern likely share a root cause.
What is the difference between this and traditional observability?
Traditional observability (logging, metrics, distributed tracing) captures raw data. A debugger agent adds an interpretation layer: it understands what the data means in the context of agent behavior. It knows that three identical tool calls in a row signal a loop, or that a tool returning null right before a hallucinated response points to lost context. It transforms data into diagnosis.
#AgentDebugging #MetaAgent #TraceAnalysis #AIObservability #FailureDiagnosis #ProductionAI #AgenticAI #Debugging
CallSphere Team