
Building an AI Agent Debugger: An Agent That Debugs Other Agents

Learn how to build a meta-agent that analyzes execution traces from other agents, diagnoses failures in tool calls and reasoning chains, suggests fixes, and can even apply automated remediation.

The Debugging Problem in Agent Systems

When a traditional program fails, you get a stack trace pointing to the exact line. When an agent fails, you get a plausible-sounding wrong answer with no obvious error. The agent might have called the wrong tool, misinterpreted a tool's output, lost context mid-conversation, or hallucinated a fact that derailed downstream reasoning. Debugging requires analyzing the full execution trace — every LLM call, tool invocation, and decision point.

A meta-agent — an agent specifically designed to debug other agents — can automate this analysis. It ingests execution traces, identifies failure patterns, and suggests (or applies) fixes.

Capturing Structured Execution Traces

First, you need a tracing layer that records every step of agent execution in a structured format.

from dataclasses import dataclass, field
from datetime import datetime
from typing import Any
import json

@dataclass
class TraceEvent:
    timestamp: str
    event_type: str  # "llm_call", "tool_call", "tool_result", "handoff", "error"
    agent_name: str
    data: dict[str, Any]
    duration_ms: float = 0

@dataclass
class ExecutionTrace:
    trace_id: str
    started_at: str
    events: list[TraceEvent] = field(default_factory=list)
    final_output: str | None = None
    success: bool = True
    error: str | None = None

    def add_event(self, event: TraceEvent):
        self.events.append(event)

    def to_json(self) -> str:
        return json.dumps({
            "trace_id": self.trace_id,
            "started_at": self.started_at,
            "events": [
                {
                    "timestamp": e.timestamp,
                    "type": e.event_type,
                    "agent": e.agent_name,
                    "data": e.data,
                    "duration_ms": e.duration_ms,
                }
                for e in self.events
            ],
            "final_output": self.final_output,
            "success": self.success,
            "error": self.error,
        }, indent=2)

Building the Trace Collector

Wrap your agent runner to automatically capture traces.

import time
import uuid
from agents import Agent, Runner

class TracedRunner:
    """Wraps the Agent SDK runner to capture execution traces."""

    def __init__(self):
        self.traces: list[ExecutionTrace] = []

    async def run(self, agent: Agent, input_text: str) -> tuple[Any, ExecutionTrace]:
        trace = ExecutionTrace(
            trace_id=str(uuid.uuid4()),
            started_at=datetime.utcnow().isoformat(),
        )

        result = None
        start = time.perf_counter()
        try:
            result = await Runner.run(agent, input_text)
            trace.final_output = result.final_output
            trace.success = True

            # Extract events from the result's run items
            for item in result.new_items:
                trace.add_event(TraceEvent(
                    timestamp=datetime.utcnow().isoformat(),
                    event_type=self._classify_item(item),
                    agent_name=agent.name,
                    data={"content": str(item)},
                    # Elapsed time since the run started; the SDK does not
                    # expose per-item timings, so this value is cumulative.
                    duration_ms=(time.perf_counter() - start) * 1000,
                ))
        except Exception as e:
            trace.success = False
            trace.error = str(e)

        self.traces.append(trace)
        # result stays None when the run raised, so callers get (None, trace)
        return result, trace

    def _classify_item(self, item) -> str:
        type_name = type(item).__name__.lower()
        if "tool" in type_name:
            return "tool_call"
        if "handoff" in type_name:
            return "handoff"
        return "llm_call"
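The classification heuristic is simple enough to sanity-check without the SDK. Here is the same logic as a plain function, exercised against stand-in classes (the class names below are illustrative; real run items come from the Agent SDK):

```python
# Stand-ins for the SDK's run item types, named so the heuristic can match.
class ToolCallItem: ...
class HandoffItem: ...
class MessageOutputItem: ...

def classify_item(item) -> str:
    """Classify a run item by substring-matching its type name."""
    type_name = type(item).__name__.lower()
    if "tool" in type_name:
        return "tool_call"
    if "handoff" in type_name:
        return "handoff"
    return "llm_call"

print(classify_item(ToolCallItem()))       # tool_call
print(classify_item(HandoffItem()))        # handoff
print(classify_item(MessageOutputItem()))  # llm_call
```

Matching on type names is brittle if the SDK renames its item classes, but it keeps the collector decoupled from any particular SDK version.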

The Debugger Agent

Now build the agent that analyzes traces. It has specialized tools for different types of analysis.


from agents import Agent, function_tool

@function_tool
def analyze_tool_calls(trace_json: str) -> str:
    """Analyze all tool calls in a trace for errors and anomalies."""
    trace = json.loads(trace_json)
    tool_events = [e for e in trace["events"] if e["type"] == "tool_call"]

    issues = []
    for i, event in enumerate(tool_events):
        # Check for slow tool calls
        if event["duration_ms"] > 5000:
            issues.append(f"Tool call #{i} took {event['duration_ms']}ms (slow)")

        # Check for repeated identical calls (wasted compute)
        for j, other in enumerate(tool_events[i+1:], i+1):
            if event["data"] == other["data"]:
                issues.append(f"Tool calls #{i} and #{j} are identical (duplicate)")

    if not issues:
        return "No tool call issues detected."
    return "Issues found:\n" + "\n".join(f"- {issue}" for issue in issues)

@function_tool
def detect_reasoning_loops(trace_json: str) -> str:
    """Detect if the agent got stuck in a reasoning loop."""
    trace = json.loads(trace_json)
    llm_events = [e for e in trace["events"] if e["type"] == "llm_call"]

    # Check for repetitive outputs
    outputs = [e["data"].get("content", "") for e in llm_events]
    for i in range(len(outputs) - 2):
        if outputs[i] == outputs[i+1] == outputs[i+2]:
            return f"Reasoning loop detected: LLM produced identical output 3 times starting at step {i}."

    return "No reasoning loops detected."
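Because the tool body is deterministic, it is easy to unit-test. Here is the same windowed check as a plain function (unwrapped from the `@function_tool` decorator), run against a synthetic trace:

```python
import json

def detect_loops(trace_json: str) -> str:
    """Flag three consecutive identical LLM outputs as a loop."""
    trace = json.loads(trace_json)
    outputs = [e["data"].get("content", "")
               for e in trace["events"] if e["type"] == "llm_call"]
    for i in range(len(outputs) - 2):
        if outputs[i] == outputs[i+1] == outputs[i+2]:
            return f"Reasoning loop detected at step {i}."
    return "No reasoning loops detected."

# A synthetic trace where the LLM repeats itself three times
trace = {"events": [
    {"type": "llm_call", "data": {"content": "I should call the search tool."}}
] * 3}
print(detect_loops(json.dumps(trace)))  # Reasoning loop detected at step 0.
```

Testing the analysis tools as plain functions, before handing them to the debugger agent, keeps the deterministic layer trustworthy.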

@function_tool
def check_context_degradation(trace_json: str) -> str:
    """Check if important context was lost during agent execution."""
    trace = json.loads(trace_json)
    events = trace["events"]

    # Count LLM calls as a rough proxy for context growth; the trace
    # does not record token counts directly.
    llm_events = [e for e in events if e["type"] == "llm_call"]

    if len(llm_events) > 10:
        return (f"Warning: Agent made {len(llm_events)} LLM calls. "
                "Context window may be near capacity. "
                "Early context could be truncated or compressed.")

    return "Context appears stable across all LLM calls."

debugger_agent = Agent(
    name="Agent Debugger",
    instructions="""You are an expert at debugging AI agent systems.
    When given an execution trace, systematically:
    1. Check for tool call issues (failures, duplicates, slow calls)
    2. Look for reasoning loops
    3. Check for context degradation
    4. Identify the root cause of any failure
    5. Suggest specific fixes with code examples

    Be precise and actionable in your diagnosis.""",
    tools=[analyze_tool_calls, detect_reasoning_loops, check_context_degradation],
)

Running the Debugger on a Failed Trace

async def debug_failed_agent(trace: ExecutionTrace):
    """Hand a failed trace to the debugger agent for analysis."""
    debug_prompt = f"""Analyze this failed agent execution trace and identify
    the root cause of failure:

    {trace.to_json()}

    The agent was expected to produce a correct result but either failed
    with an error or produced incorrect output. Diagnose the issue and
    suggest a fix."""

    result = await Runner.run(debugger_agent, debug_prompt)
    return result.final_output

Automated Remediation

The debugger can go beyond diagnosis and apply fixes. Common remediations include retrying with adjusted parameters, rewriting the system prompt, or modifying tool configurations.

@function_tool
def apply_remediation(
    fix_type: str,
    agent_name: str,
    parameters: str,
) -> str:
    """Apply an automated fix to a failing agent.

    fix_type: "retry_with_temp", "add_instruction", "disable_tool"
    """
    params = json.loads(parameters)

    if fix_type == "retry_with_temp":
        new_temp = params.get("temperature", 0.3)
        return f"Scheduled retry of {agent_name} with temperature={new_temp}"

    elif fix_type == "add_instruction":
        instruction = params.get("instruction", "")
        return f"Added instruction to {agent_name}: '{instruction}'"

    elif fix_type == "disable_tool":
        tool_name = params.get("tool_name", "")
        return f"Disabled tool '{tool_name}' on {agent_name} due to repeated failures"

    return f"Unknown fix type: {fix_type}"

Failure Pattern Database

Store diagnosed failures to build institutional knowledge. When the same pattern appears again, the debugger can reference past fixes.

import sqlite3

class FailurePatternDB:
    def __init__(self, db_path: str = "failures.db"):
        self.db = sqlite3.connect(db_path)
        self.db.execute("""
            CREATE TABLE IF NOT EXISTS failure_patterns (
                id INTEGER PRIMARY KEY,
                pattern_signature TEXT UNIQUE,
                description TEXT,
                root_cause TEXT,
                fix_applied TEXT,
                occurrences INTEGER DEFAULT 1,
                last_seen TEXT
            )
        """)

    def record_failure(self, signature: str, description: str,
                       root_cause: str, fix: str):
        self.db.execute("""
            INSERT INTO failure_patterns (pattern_signature, description,
                root_cause, fix_applied, last_seen)
            VALUES (?, ?, ?, ?, datetime('now'))
            ON CONFLICT(pattern_signature) DO UPDATE SET
                occurrences = occurrences + 1,
                last_seen = datetime('now')
        """, (signature, description, root_cause, fix))
        self.db.commit()
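The `ON CONFLICT` upsert is the part worth verifying: a repeated signature should bump `occurrences` rather than insert a second row. A minimal check against an in-memory database with a trimmed-down schema:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE failure_patterns (
        id INTEGER PRIMARY KEY,
        pattern_signature TEXT UNIQUE,
        occurrences INTEGER DEFAULT 1,
        last_seen TEXT
    )
""")

def record(sig: str):
    # Same upsert shape as FailurePatternDB.record_failure
    db.execute("""
        INSERT INTO failure_patterns (pattern_signature, last_seen)
        VALUES (?, datetime('now'))
        ON CONFLICT(pattern_signature) DO UPDATE SET
            occurrences = occurrences + 1,
            last_seen = datetime('now')
    """, (sig,))

record("timeout:search_tool")
record("timeout:search_tool")
row = db.execute(
    "SELECT occurrences FROM failure_patterns WHERE pattern_signature = ?",
    ("timeout:search_tool",)
).fetchone()
print(row[0])  # 2
```

Note that `ON CONFLICT ... DO UPDATE` requires SQLite 3.24 or newer.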

FAQ

Can the debugger agent itself fail, and how do you handle that?

Yes, and this is a genuine concern. The key mitigation is making the debugger simpler than the agents it debugs. The debugger uses deterministic analysis tools (pattern matching, counting, comparisons) rather than complex reasoning. If the debugger fails, fall back to logging the raw trace for manual human review. Never create a recursive debugging chain — one level of meta-debugging is the practical maximum.
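That fallback can be a thin wrapper. A sketch, with the debugger invocation passed in as a callable so the error path is testable in isolation (the names and file path are illustrative):

```python
import os
import tempfile

def debug_with_fallback(trace_json: str, run_debugger) -> str:
    """Run the debugger; if it fails, persist the raw trace for a human."""
    try:
        return run_debugger(trace_json)
    except Exception as e:
        # No recursive meta-debugging: just save the trace and surface the error.
        path = os.path.join(tempfile.gettempdir(), "failed_trace.json")
        with open(path, "w") as f:
            f.write(trace_json)
        return f"Debugger failed ({e}); raw trace saved to {path} for manual review."

def broken_debugger(_trace_json):
    raise RuntimeError("debugger crashed")

print(debug_with_fallback('{"events": []}', broken_debugger))
```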

How do you generate the "signature" for failure patterns?

Hash the combination of: the failing agent name, the tool that failed (if any), the error type, and a normalized version of the error message. This groups similar failures together. For reasoning failures where there is no explicit error, use the sequence of tool calls as the signature — two failures with the same tool-call pattern likely share a root cause.

What is the difference between this and traditional observability?

Traditional observability (logging, metrics, distributed tracing) captures raw data. A debugger agent adds an interpretation layer: it understands what the data means in the context of agent behavior. It knows that three identical tool calls in a row signal a loop, or that a tool returning null just before a hallucinated response points to lost context. It transforms data into diagnosis.


#AgentDebugging #MetaAgent #TraceAnalysis #AIObservability #FailureDiagnosis #ProductionAI #AgenticAI #Debugging

CallSphere Team