
Multi-Agent System Evaluation: Measuring Coordination Quality and Handoff Success

Learn how to evaluate multi-agent AI systems by measuring handoff accuracy, information retention across agents, routing correctness, and end-to-end coordination quality.

The Unique Challenge of Multi-Agent Evaluation

Evaluating a single agent is hard enough. Evaluating a system of agents that coordinate, delegate, and hand off to each other introduces entirely new failure modes. The individual agents might each perform well in isolation, yet the system fails because information gets lost during handoffs, the wrong agent receives a task, or two agents give contradictory answers to the same user.

Multi-agent evaluation requires metrics that span agent boundaries: handoff accuracy, information retention, routing correctness, and end-to-end coherence. You cannot get these by evaluating each agent independently.

Modeling Multi-Agent Conversations

Start by structuring how you represent multi-agent interactions for evaluation.

from dataclasses import dataclass, field
from typing import Optional
from enum import Enum

class HandoffReason(Enum):
    CAPABILITY_MATCH = "capability_match"
    ESCALATION = "escalation"
    SPECIALIZATION = "specialization"
    FALLBACK = "fallback"

@dataclass
class AgentTurn:
    agent_id: str
    agent_role: str
    message: str
    tool_calls: list[dict] = field(default_factory=list)
    turn_index: int = 0

@dataclass
class Handoff:
    from_agent: str
    to_agent: str
    reason: HandoffReason
    context_passed: dict = field(default_factory=dict)
    turn_index: int = 0

@dataclass
class MultiAgentTrace:
    conversation_id: str
    turns: list[AgentTurn] = field(default_factory=list)
    handoffs: list[Handoff] = field(default_factory=list)
    user_messages: list[dict] = field(default_factory=list)

    def agents_involved(self) -> list[str]:
        seen = []
        for turn in self.turns:
            if turn.agent_id not in seen:
                seen.append(turn.agent_id)
        return seen

    def handoff_count(self) -> int:
        return len(self.handoffs)

    def turns_per_agent(self) -> dict[str, int]:
        counts = {}
        for turn in self.turns:
            counts[turn.agent_id] = (
                counts.get(turn.agent_id, 0) + 1
            )
        return counts

This trace captures the full conversation timeline: which agent spoke when, every handoff event, and the context passed during each transition.
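The helper logic above can be exercised standalone. Here is a minimal sketch using plain (agent_id, message) tuples in place of AgentTurn objects; the agent names and messages are illustrative.

```python
# Standalone sketch of the trace helpers, using plain
# (agent_id, message) tuples instead of AgentTurn objects.
turns = [
    ("triage", "Routing you to the billing specialist."),
    ("billing", "I can see the duplicate charge on your account."),
    ("billing", "A refund has been issued."),
]

# Agents in order of first appearance (same logic as agents_involved).
agents_involved = list(dict.fromkeys(agent for agent, _ in turns))

# Turn counts per agent (same logic as turns_per_agent).
turns_per_agent: dict[str, int] = {}
for agent, _ in turns:
    turns_per_agent[agent] = turns_per_agent.get(agent, 0) + 1

print(agents_involved)   # ['triage', 'billing']
print(turns_per_agent)   # {'triage': 1, 'billing': 2}
```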

Measuring Handoff Accuracy

A handoff is accurate when the right agent receives the task for the right reason with the right context.

@dataclass
class HandoffExpectation:
    expected_target: str
    expected_reason: HandoffReason
    required_context_keys: list[str] = field(
        default_factory=list
    )

def score_handoff_accuracy(
    actual_handoffs: list[Handoff],
    expected: list[HandoffExpectation],
) -> dict:
    if not expected:
        # No handoffs expected: any actual handoff is an error.
        # Use the same keys as the main path so downstream consumers
        # (e.g. a composite scorer) read consistent fields.
        return {
            "target_accuracy": 1.0 if not actual_handoffs else 0.0,
            "context_completeness": 1.0 if not actual_handoffs else 0.0,
            "unexpected_handoffs": len(actual_handoffs),
        }

    results = []
    for i, exp in enumerate(expected):
        if i >= len(actual_handoffs):
            results.append({
                "index": i,
                "target_correct": False,
                "reason_correct": False,
                "context_complete": False,
                "status": "missing",
            })
            continue

        actual = actual_handoffs[i]
        target_ok = actual.to_agent == exp.expected_target
        reason_ok = actual.reason == exp.expected_reason
        context_ok = all(
            key in actual.context_passed
            for key in exp.required_context_keys
        )

        results.append({
            "index": i,
            "target_correct": target_ok,
            "reason_correct": reason_ok,
            "context_complete": context_ok,
            "actual_target": actual.to_agent,
            "expected_target": exp.expected_target,
        })

    target_accuracy = sum(
        1 for r in results if r["target_correct"]
    ) / len(results)
    reason_accuracy = sum(
        1 for r in results if r["reason_correct"]
    ) / len(results)
    context_completeness = sum(
        1 for r in results if r.get("context_complete", False)
    ) / len(results)

    return {
        "target_accuracy": round(target_accuracy, 3),
        "reason_accuracy": round(reason_accuracy, 3),
        "context_completeness": round(context_completeness, 3),
        "handoff_details": results,
        "unexpected_handoffs": max(
            0, len(actual_handoffs) - len(expected)
        ),
    }

Context completeness is the most frequently overlooked metric. An agent might route to the correct specialist, but if it drops the customer's account number during the handoff, the specialist has to ask for it again — creating a frustrating user experience.
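The context-completeness check itself is simple to reason about in isolation. A minimal sketch, with illustrative field names:

```python
# Sketch of the context-completeness check on its own.
# Field names are illustrative, not a fixed schema.
required_keys = ["account_id", "issue_summary", "verification_status"]
context_passed = {"account_id": "A-4821", "issue_summary": "double charge"}

missing = [k for k in required_keys if k not in context_passed]
completeness = 1 - len(missing) / len(required_keys)

print(missing)                 # ['verification_status']
print(round(completeness, 3))  # 0.667
```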

Information Retention Across Handoffs

Measure whether information mentioned before a handoff is available and used after it.


async def score_information_retention(
    llm_client,
    pre_handoff_messages: list[str],
    post_handoff_messages: list[str],
    key_facts: list[str],
) -> dict:
    facts_text = "\n".join(
        f"- {fact}" for fact in key_facts
    )
    post_text = "\n".join(post_handoff_messages[:5])

    prompt = f"""Evaluate whether key information from before
the agent handoff is retained and used after the handoff.

## Key Facts (established before handoff)
{facts_text}

## Post-Handoff Agent Messages
{post_text}

For each fact, determine:
- "retained": the agent demonstrates awareness of this fact
- "lost": the agent ignores or re-asks for this information
- "contradicted": the agent states something conflicting

Return JSON:
{{
  "facts": [
    {{"fact": "...", "status": "retained|lost|contradicted"}}
  ]
}}"""

    response = await llm_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0.0,
    )
    import json
    result = json.loads(response.choices[0].message.content)

    statuses = [f["status"] for f in result["facts"]]
    retained = statuses.count("retained")
    return {
        "retention_rate": round(
            retained / len(statuses), 3
        ) if statuses else 1.0,
        "retained": retained,
        "lost": statuses.count("lost"),
        "contradicted": statuses.count("contradicted"),
        "details": result["facts"],
    }

A contradicted fact is worse than a lost one. If the first agent says "Your appointment is on Tuesday" and the second agent says "Your appointment is on Thursday," the user loses trust in the entire system.
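One way to reflect this asymmetry in scoring is to weight each status by severity rather than counting only retentions. A sketch with illustrative, uncalibrated weights (contradictions score negatively):

```python
# Sketch: severity-weighted retention. The weights are illustrative.
STATUS_WEIGHTS = {"retained": 1.0, "lost": 0.0, "contradicted": -0.5}

statuses = ["retained", "retained", "contradicted", "lost"]
weighted_retention = sum(STATUS_WEIGHTS[s] for s in statuses) / len(statuses)
print(round(weighted_retention, 3))  # 0.375
```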

Routing Correctness Evaluation

In systems with a triage or router agent, measure whether user intents get sent to the right specialist.

@dataclass
class RoutingTestCase:
    user_input: str
    correct_agent: str
    acceptable_agents: list[str] = field(
        default_factory=list
    )

def score_routing(
    test_cases: list[RoutingTestCase],
    actual_routes: list[str],
) -> dict:
    exact_matches = 0
    acceptable_matches = 0

    for case, actual in zip(test_cases, actual_routes):
        if actual == case.correct_agent:
            exact_matches += 1
            acceptable_matches += 1
        elif actual in case.acceptable_agents:
            acceptable_matches += 1

    n = len(test_cases)
    return {
        "exact_routing_accuracy": round(
            exact_matches / n, 3
        ) if n else 0.0,
        "acceptable_routing_accuracy": round(
            acceptable_matches / n, 3
        ) if n else 0.0,
        "total_cases": n,
        "misrouted": n - acceptable_matches,
    }

The distinction between exact and acceptable routing matters. If a billing question goes to the general support agent instead of the billing specialist, that is suboptimal but acceptable. If it goes to the technical troubleshooting agent, that is a misroute.
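The exact-versus-acceptable split can be computed standalone. A sketch using plain (user_input, correct_agent, acceptable_agents) tuples, with made-up agent names:

```python
# Sketch of exact vs acceptable routing accuracy on plain tuples.
# Agent names and inputs are illustrative.
cases = [
    ("Why was I charged twice?", "billing", ["general_support"]),
    ("The app crashes on login", "tech_support", []),
    ("Cancel my subscription", "billing", ["retention"]),
]
actual_routes = ["general_support", "tech_support", "retention"]

exact = sum(r == correct for (_, correct, _), r in zip(cases, actual_routes))
acceptable = sum(
    r == correct or r in ok
    for (_, correct, ok), r in zip(cases, actual_routes)
)
print(round(exact / len(cases), 3))       # 0.333
print(round(acceptable / len(cases), 3))  # 1.0
```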

End-to-End Coordination Score

Combine all multi-agent metrics into a single coordination quality score.

def coordination_score(
    handoff_report: dict,
    retention_report: dict,
    routing_report: dict,
) -> dict:
    handoff_score = handoff_report.get(
        "target_accuracy", 0
    ) * 0.5 + handoff_report.get(
        "context_completeness", 0
    ) * 0.5

    retention_score = retention_report.get(
        "retention_rate", 0
    )
    # Penalize contradictions heavily
    contradictions = retention_report.get("contradicted", 0)
    retention_score = max(
        0, retention_score - contradictions * 0.2
    )

    routing_score = routing_report.get(
        "acceptable_routing_accuracy", 0
    )

    composite = (
        handoff_score * 0.3
        + retention_score * 0.4
        + routing_score * 0.3
    )

    return {
        "handoff_quality": round(handoff_score, 3),
        "information_retention": round(retention_score, 3),
        "routing_quality": round(routing_score, 3),
        "composite_coordination": round(composite, 3),
    }

Information retention gets the highest weight because it has the strongest correlation with user satisfaction. Users can tolerate a brief misroute that gets corrected. They cannot tolerate repeating themselves after every handoff.
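A quick worked example of the composite formula, using illustrative report values rather than real measurements:

```python
# Worked example: target_accuracy 0.9, context_completeness 0.8,
# retention_rate 0.85 with one contradiction, acceptable routing 0.95.
handoff_score = 0.9 * 0.5 + 0.8 * 0.5     # 0.85
retention_score = max(0, 0.85 - 1 * 0.2)  # contradiction penalty -> 0.65
routing_score = 0.95

composite = (
    handoff_score * 0.3
    + retention_score * 0.4
    + routing_score * 0.3
)
print(round(composite, 3))  # 0.8
```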

FAQ

How do I test handoffs when agents are developed by different teams?

Define a handoff contract — a schema that specifies exactly what context fields must be passed during each type of handoff. Each team tests that their agent produces the correct output contract and correctly consumes the input contract. Then run end-to-end integration tests that verify the contracts work together. This is analogous to API contract testing in microservices.
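A handoff contract can be as simple as a mapping from handoff type to the fields and types the receiving agent requires. The sketch below is a hypothetical validator with illustrative field names, not a library API:

```python
# Hypothetical handoff contract: handoff type -> required fields and types.
HANDOFF_CONTRACTS = {
    "triage_to_billing": {
        "account_id": str,
        "issue_summary": str,
        "verified": bool,
    },
}

def validate_handoff(handoff_type: str, context: dict) -> list[str]:
    """Return contract violations; an empty list means the handoff is valid."""
    errors = []
    for field_name, field_type in HANDOFF_CONTRACTS[handoff_type].items():
        if field_name not in context:
            errors.append(f"missing field: {field_name}")
        elif not isinstance(context[field_name], field_type):
            errors.append(f"wrong type for field: {field_name}")
    return errors

print(validate_handoff(
    "triage_to_billing",
    {"account_id": "A-4821", "issue_summary": "double charge"},
))  # ['missing field: verified']
```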

What is a good routing accuracy target for a triage agent?

Target 90 percent or higher acceptable routing accuracy. Below 85 percent, users will notice frequent misroutes. For systems with only two or three specialist agents, you should aim for 95 percent because the routing task is simpler. As the number of specialists grows, acceptable accuracy naturally drops — consider hierarchical routing (triage to category, then category to specialist) to maintain high accuracy.
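Hierarchical routing can also be evaluated per stage. A sketch where stage-2 accuracy is measured only on conversations that stage 1 categorized correctly; the records are illustrative:

```python
# Sketch: per-stage accuracy for hierarchical routing. Each record is
# (expected_category, actual_category, expected_specialist, actual_specialist).
records = [
    ("billing", "billing", "refunds", "refunds"),
    ("billing", "billing", "invoices", "refunds"),
    ("technical", "billing", "crashes", "invoices"),
]

stage1_acc = sum(exp == act for exp, act, _, _ in records) / len(records)

# Stage 2 is only meaningful when stage 1 chose the right category.
in_category = [(es, as_) for ec, ac, es, as_ in records if ec == ac]
stage2_acc = sum(es == as_ for es, as_ in in_category) / len(in_category)

print(round(stage1_acc, 3))  # 0.667
print(round(stage2_acc, 3))  # 0.5
```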

How do I handle circular handoffs where agents keep passing the user back and forth?

Detect circular handoffs by tracking the agent sequence. If the same pair of agents hand off to each other more than once in a conversation, flag it as a coordination failure. Set a maximum handoff count per conversation (typically three to five) and escalate to a human when the limit is reached. Log circular patterns to identify systemic gaps in agent capabilities.
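A minimal sketch of both checks, ping-pong detection and a handoff cap (the default limit of four is illustrative):

```python
from collections import Counter

def detect_circular_handoffs(
    handoff_pairs: list[tuple[str, str]],
    max_handoffs: int = 4,
) -> dict:
    """Flag ping-pong handoffs and conversations over the handoff limit.

    handoff_pairs is the ordered list of (from_agent, to_agent) events.
    """
    # Count each unordered agent pair; A->B followed by B->A counts twice.
    pair_counts = Counter(frozenset(p) for p in handoff_pairs)
    circular = [sorted(pair) for pair, n in pair_counts.items() if n > 1]
    return {
        "circular_pairs": circular,
        "handoff_count": len(handoff_pairs),
        "over_limit": len(handoff_pairs) > max_handoffs,
    }

report = detect_circular_handoffs(
    [("triage", "billing"), ("billing", "triage"), ("triage", "billing")]
)
print(report["circular_pairs"])  # [['billing', 'triage']]
print(report["over_limit"])      # False
```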


#MultiAgent #AgentHandoff #Evaluation #Orchestration #Python #AgenticAI #LearnAI #AIEngineering


CallSphere Team

Expert insights on AI voice agents and customer communication automation.
