Debugging Multi-Agent Workflows: Tracing Conversations Across Agent Boundaries
Learn systematic approaches for debugging multi-agent systems, including structured logging, trace visualization, identifying bottlenecks in agent chains, and replay testing to reproduce and fix failures.
Why Multi-Agent Debugging Is Hard
Debugging a single agent is like debugging a single function — you check the input, trace the logic, and inspect the output. Debugging a multi-agent system is like debugging a distributed microservice architecture. The request flows through multiple agents, each making independent decisions, and the final failure might be caused by a decision made three agents ago.
Consider a customer support system where the user says "refund my last order" and gets a FAQ article about password resets. Was it the router that misclassified the intent? Did the FAQ agent receive the right handoff but interpret it wrong? Was the handoff description misleading? Without systematic debugging tools, you are guessing.
Layer 1: Structured Logging with RunContext
The first layer of multi-agent debugging is comprehensive logging. Build logging into your shared context so every agent action is captured:
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class DebugContext:
    logs: list[dict] = field(default_factory=list)
    agent_transitions: list[dict] = field(default_factory=list)
    tool_calls: list[dict] = field(default_factory=list)

    def log_event(self, agent: str, event_type: str, details: str):
        self.logs.append({
            "timestamp": datetime.now().isoformat(),
            "agent": agent,
            "type": event_type,
            "details": details,
        })

    def log_transition(self, from_agent: str, to_agent: str, reason: str):
        self.agent_transitions.append({
            "timestamp": datetime.now().isoformat(),
            "from": from_agent,
            "to": to_agent,
            "reason": reason,
        })

    def log_tool_call(self, agent: str, tool: str, args: dict, result: str):
        self.tool_calls.append({
            "timestamp": datetime.now().isoformat(),
            "agent": agent,
            "tool": tool,
            "args": args,
            "result": result[:200],  # truncate long results to keep logs readable
        })
Instrument every tool to log its invocation:
from agents import RunContextWrapper, function_tool

@function_tool
def search_knowledge_base(
    ctx: RunContextWrapper[DebugContext],
    query: str,
) -> str:
    """Search the knowledge base."""
    result = f"Found 3 articles for: {query}"
    ctx.context.log_tool_call(
        agent="FAQ Agent",
        tool="search_knowledge_base",
        args={"query": query},
        result=result,
    )
    return result
After a run completes, you have a complete record of every event, every agent transition, and every tool call — regardless of how many agents were involved.
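To see what that record looks like, here is a quick standalone smoke test. It re-declares a trimmed-down version of DebugContext inline (so it runs without the SDK) and logs a couple of hypothetical events standing in for a real agent run:

```python
from dataclasses import dataclass, field
from datetime import datetime

# Trimmed-down re-declaration of DebugContext, enough to demo the record.
@dataclass
class DebugContext:
    logs: list = field(default_factory=list)
    agent_transitions: list = field(default_factory=list)

    def log_event(self, agent, event_type, details):
        self.logs.append({"timestamp": datetime.now().isoformat(),
                          "agent": agent, "type": event_type, "details": details})

    def log_transition(self, from_agent, to_agent, reason):
        self.agent_transitions.append({"from": from_agent, "to": to_agent, "reason": reason})

ctx = DebugContext()
ctx.log_event("Support Router", "received", "refund my last order")
ctx.log_transition("Support Router", "Billing Agent", "refund intent detected")
print(len(ctx.logs), len(ctx.agent_transitions))  # prints: 1 1
```

In a real run you would construct the context once, pass it to the runner, and inspect these lists afterward.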
Layer 2: Using the SDK's Built-in Tracing
The OpenAI Agents SDK automatically creates traces for every Runner.run() call. These traces capture the full hierarchy of agent spans, generation spans, and function spans:
from agents import Agent, Runner, RunConfig

result = Runner.run_sync(
    router_agent,
    "I need a refund",
    run_config=RunConfig(
        workflow_name="customer-support-debug",
        trace_include_sensitive_data=True,
    ),
)
Setting workflow_name lets you filter traces in the OpenAI dashboard. The trace_include_sensitive_data flag (use only in development) includes full message content in traces, which is essential for debugging but should be disabled in production for privacy.
The trace reveals the execution hierarchy:
Trace: customer-support-debug
+-- agent_span: Support Router (0.8s)
|   +-- generation_span: gpt-4o-mini (0.6s) — routing decision
|   +-- handoff: FAQ Agent
+-- agent_span: FAQ Agent (1.2s)
    +-- generation_span: gpt-4o-mini (0.4s) — initial reasoning
    +-- function_span: search_knowledge_base (0.1s)
    +-- generation_span: gpt-4o-mini (0.7s) — response
This immediately shows a problem: the router sent a refund request to the FAQ Agent instead of the Billing Agent. The bug is in the routing decision, not in the FAQ Agent.
Layer 3: Debugging Common Multi-Agent Failures
Misrouting
The most common failure. The triage/router agent sends the request to the wrong specialist. Debug by examining:
- The router's system prompt — are the routing criteria clear?
- The handoff descriptions — does the correct agent's description match the request?
- The generation span — what did the model reason before choosing the handoff?
Fix by improving handoff descriptions:
from agents import handoff

# Bad: vague description
handoff(billing_agent)  # Description auto-generated from agent name

# Good: explicit criteria
handoff(
    billing_agent,
    tool_description_override="Transfer to Billing Agent for refunds, invoices, payments, charges, subscription changes, and billing disputes",
)
Lost Context During Handoffs
Sometimes the target agent behaves as if it did not receive the context from the source agent. This happens when the conversation history is too long and important details get pushed out of the context window, or when the handoff strips context.
Debug by logging the conversation length at each handoff:
@function_tool
def log_context_size(
    ctx: RunContextWrapper[DebugContext],
) -> str:
    """Log the approximate context size for debugging."""
    log_count = len(ctx.context.logs)
    transition_count = len(ctx.context.agent_transitions)
    ctx.context.log_event(
        "debug",
        "context_check",
        f"Logs: {log_count}, Transitions: {transition_count}",
    )
    return f"Context has {log_count} log entries and {transition_count} transitions"
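As a rough first check you can also estimate how much of the model's context window the conversation history occupies. The four-characters-per-token heuristic below is an approximation, not a real tokenizer, and the 128k budget is illustrative:

```python
def estimate_tokens(messages: list[dict]) -> int:
    """Rough token estimate: ~4 characters per token (heuristic, not a tokenizer)."""
    return sum(len(m.get("content", "")) for m in messages) // 4

def check_context_budget(messages: list[dict], budget: int = 128_000) -> str:
    """Report approximate context-window usage for a conversation history."""
    used = estimate_tokens(messages)
    pct = 100 * used / budget
    return f"~{used} tokens (~{pct:.0f}% of {budget}-token budget)"

history = [
    {"role": "user", "content": "refund my last order"},
    {"role": "assistant", "content": "Transferring you to billing."},
]
print(check_context_budget(history))
```

If the estimate approaches the budget at the point of a handoff, truncation is the likely culprit rather than the handoff mechanism itself.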
Infinite Handoff Loops
Agent A hands off to Agent B, which hands back to Agent A, which hands back to Agent B. The trace shows an endlessly growing chain of agent spans.
Prevent with max_turns and detect in logs:
from agents import Runner

result = Runner.run_sync(
    router_agent,
    user_message,
    max_turns=10,
)
After the run, check for loops:
def detect_handoff_loops(context: DebugContext) -> list[str]:
    """Detect circular handoff patterns in the log."""
    transitions = context.agent_transitions
    warnings = []
    for i in range(len(transitions) - 1):
        current = transitions[i]
        next_t = transitions[i + 1]
        if current["from"] == next_t["to"] and current["to"] == next_t["from"]:
            warnings.append(
                f"Loop detected: {current['from']} <-> {current['to']} "
                f"at {current['timestamp']}"
            )
    return warnings
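The ping-pong check above only catches immediate A-to-B-to-A bounces. A complementary heuristic, sketched here with an arbitrary threshold, is to count how often each agent is re-entered; longer cycles (A to B to C back to A) show up as inflated visit counts:

```python
from collections import Counter

def count_agent_visits(transitions: list[dict]) -> Counter:
    """Count how many times each agent is handed control."""
    return Counter(t["to"] for t in transitions)

def suspicious_agents(transitions: list[dict], max_visits: int = 2) -> list[str]:
    """Flag agents entered more than max_visits times (a likely cycle)."""
    return [agent for agent, n in count_agent_visits(transitions).items() if n > max_visits]

transitions = [
    {"from": "Router", "to": "FAQ Agent"},
    {"from": "FAQ Agent", "to": "Billing Agent"},
    {"from": "Billing Agent", "to": "FAQ Agent"},
    {"from": "FAQ Agent", "to": "Billing Agent"},
    {"from": "Billing Agent", "to": "FAQ Agent"},
]
print(suspicious_agents(transitions))  # FAQ Agent was entered 3 times
```

Tune max_visits to your workflow: some designs legitimately revisit a coordinator agent several times.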
Slow Agent Chains
When the overall response is too slow, you need to find the bottleneck. The SDK trace timeline shows per-span durations, so it tells you which agent or tool call consumed the most time. Your debug logs complement it by surfacing excessive tool call counts, a common cause of slow chains:

def analyze_performance(context: DebugContext) -> str:
    """Summarize tool call counts from debug logs (per-call durations come from trace spans)."""
    tool_counts: dict[str, int] = {}
    for call in context.tool_calls:
        tool_counts[call["tool"]] = tool_counts.get(call["tool"], 0) + 1
    report = "Tool Call Summary:\n"
    for tool, count in sorted(tool_counts.items(), key=lambda x: x[1], reverse=True):
        report += f"  {tool}: {count} calls\n"
    return report
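Because every log entry carries an ISO timestamp, you can also approximate where time went by measuring the gaps between consecutive tool calls. This is a rough proxy, since each gap includes the model's generation time between calls, but it reliably points at the slow stretch:

```python
from datetime import datetime

def slowest_gaps(tool_calls: list[dict], top_n: int = 3) -> list[tuple[str, float]]:
    """Return the largest gaps (in seconds) between consecutive tool-call timestamps."""
    gaps = []
    for prev, cur in zip(tool_calls, tool_calls[1:]):
        delta = (datetime.fromisoformat(cur["timestamp"])
                 - datetime.fromisoformat(prev["timestamp"])).total_seconds()
        gaps.append((f"{prev['tool']} -> {cur['tool']}", delta))
    return sorted(gaps, key=lambda g: g[1], reverse=True)[:top_n]

calls = [
    {"tool": "search_knowledge_base", "timestamp": "2024-01-01T10:00:00"},
    {"tool": "lookup_order", "timestamp": "2024-01-01T10:00:04"},
    {"tool": "process_refund", "timestamp": "2024-01-01T10:00:05"},
]
print(slowest_gaps(calls))  # the 4-second gap before lookup_order ranks first
```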
Layer 4: Replay Testing
The most powerful debugging technique is the ability to replay a failed conversation. Serialize the input, context, and agent configuration, then replay it in a controlled environment:
import json
from datetime import datetime

def capture_replay_data(
    user_message: str,
    context: DebugContext,
    agent_name: str,
) -> str:
    """Capture everything needed to replay a conversation."""
    return json.dumps({
        "user_message": user_message,
        "agent": agent_name,
        "context_snapshot": {
            "logs": context.logs,
            "transitions": context.agent_transitions,
            "tool_calls": context.tool_calls,
        },
        "captured_at": datetime.now().isoformat(),
    }, indent=2)
def replay_from_capture(capture_json: str):
    """Replay a captured conversation for debugging."""
    data = json.loads(capture_json)
    print(f"Replaying message: {data['user_message']}")
    print(f"Original agent: {data['agent']}")
    print(f"Captured at: {data['captured_at']}")
    # Recreate a fresh context and re-run
    context = DebugContext()
    result = Runner.run_sync(
        router_agent,  # or whichever agent handled the original conversation first
        data["user_message"],
        context=context,
    )
    # Compare new transitions with original
    new_count = len(context.agent_transitions)
    orig_count = len(data["context_snapshot"]["transitions"])
    print(f"Original transitions: {orig_count}, Replay transitions: {new_count}")
    return result
Replay testing lets you reproduce bugs deterministically (or as close as possible with LLMs). Run the replay after each fix to confirm the issue is resolved.
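A replay is only useful if you can tell how it diverged from the original run. One way, sketched here on plain transition lists, is to diff the original and replayed handoff paths and report the first point of divergence:

```python
def diff_transitions(original: list[dict], replay: list[dict]) -> list[str]:
    """Report the first point where the replayed handoff path diverges from the original."""
    orig_path = [(t["from"], t["to"]) for t in original]
    new_path = [(t["from"], t["to"]) for t in replay]
    for i, (a, b) in enumerate(zip(orig_path, new_path)):
        if a != b:
            return [f"Diverged at step {i}: original {a[0]}->{a[1]}, replay {b[0]}->{b[1]}"]
    if len(orig_path) != len(new_path):
        return [f"Same prefix, but lengths differ: {len(orig_path)} vs {len(new_path)}"]
    return []  # identical paths

orig = [{"from": "Router", "to": "Billing Agent"}]
new = [{"from": "Router", "to": "FAQ Agent"}]
print(diff_transitions(orig, new))
```

An empty result after a fix is your signal that the replayed path now matches the expected one.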
Building a Debug Dashboard
For production systems, aggregate your debug data into a queryable format:
def generate_debug_report(context: DebugContext) -> str:
    """Generate a human-readable debug report."""
    report = []
    report.append("=== AGENT TRANSITIONS ===")
    for t in context.agent_transitions:
        report.append(f"  {t['from']} -> {t['to']}: {t['reason']}")
    report.append("\n=== TOOL CALLS ===")
    for tc in context.tool_calls:
        report.append(f"  [{tc['agent']}] {tc['tool']}({tc['args']}) -> {tc['result']}")
    report.append("\n=== EVENT LOG ===")
    for log in context.logs:
        report.append(f"  [{log['timestamp']}] {log['agent']}.{log['type']}: {log['details']}")
    loops = detect_handoff_loops(context)
    if loops:
        report.append("\n=== WARNINGS ===")
        for warning in loops:
            report.append(f"  {warning}")
    return "\n".join(report)
Run this after every conversation in development. In production, store the structured data in your observability platform and build alerts for handoff loops, high tool call counts, and slow response times.
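The alerting can start as simple threshold checks over the same structured data. The thresholds below are illustrative, not recommendations; tune them to your workload:

```python
def check_alerts(transition_count: int, tool_call_count: int, duration_s: float) -> list[str]:
    """Return alert strings for any metric over its (illustrative) threshold."""
    alerts = []
    if transition_count > 5:
        alerts.append(f"High handoff count: {transition_count} (possible loop)")
    if tool_call_count > 20:
        alerts.append(f"High tool call count: {tool_call_count}")
    if duration_s > 30:
        alerts.append(f"Slow conversation: {duration_s:.1f}s")
    return alerts

print(check_alerts(transition_count=7, tool_call_count=3, duration_s=45.0))
```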
FAQ
How do I debug a multi-agent system without access to the OpenAI dashboard?
Use the RunContext-based logging approach described above. Every tool call, transition, and event is captured in your own data structure. You can write these to a file, a database, or your own observability platform. The SDK tracing is a convenience, not a requirement.
Should I log full message content in production?
No. Full message content may contain personal data, payment information, or other sensitive content. Log metadata only — agent names, tool names, argument keys (not values), timestamps, and duration. For debugging specific incidents, enable full logging temporarily with appropriate access controls.
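A production-safe variant of log_tool_call might apply a redaction step like this sketch, keeping argument names but dropping their values:

```python
def redact_args(args: dict) -> dict:
    """Keep argument names, drop their values, for privacy-safe logging."""
    return {key: "<redacted>" for key in args}

print(redact_args({"query": "refund order #1234", "customer_email": "a@b.com"}))
# {'query': '<redacted>', 'customer_email': '<redacted>'}
```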
How do I write automated tests for multi-agent workflows?
Mock the LLM calls to return deterministic responses, then assert on the agent transition sequence and tool call sequence. For example, assert that a refund request always transitions Router -> Billing Agent and calls process_refund. Run these tests in CI to catch regressions when you change agent instructions or handoff configurations.
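A CI test along these lines shows the assertion shape. Here fake_route is a hypothetical stand-in for running the agents with mocked, deterministic LLM responses; in a real suite it would invoke your workflow and return the recorded transitions:

```python
def fake_route(message: str) -> list[tuple[str, str]]:
    """Stand-in for a real run with mocked LLM calls: deterministic keyword routing."""
    if "refund" in message.lower():
        return [("Support Router", "Billing Agent")]
    return [("Support Router", "FAQ Agent")]

def test_refund_routes_to_billing():
    transitions = fake_route("I need a refund for my last order")
    assert transitions == [("Support Router", "Billing Agent")]

test_refund_routes_to_billing()
print("routing test passed")
```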
#Debugging #MultiAgentSystems #Tracing #OpenAIAgentsSDK #Observability #AgenticAI #LearnAI #AIEngineering