Agent Effectiveness Metrics: Resolution Rate, Containment, and First-Contact Resolution
Learn how to define, calculate, and benchmark the core effectiveness metrics for AI agents (resolution rate, containment rate, and first-contact resolution), along with strategies for systematic improvement.
The Metrics That Actually Matter
Deploying an AI agent is the easy part. Knowing whether it works well is hard. Teams that track vanity metrics like total conversations or average response time miss the real picture. The three metrics that define agent effectiveness are resolution rate, containment rate, and first-contact resolution.
These metrics answer the questions that stakeholders actually care about: Does the agent solve problems? Does it prevent escalations? Does it solve problems on the first try?
Metric Definitions
Understanding what each metric measures and how it differs from the others is essential before writing any calculation code.
Resolution Rate measures the percentage of conversations where the user's issue was actually solved. A conversation is resolved if the user confirms the solution worked or if the agent successfully completed the requested action.
Containment Rate measures the percentage of conversations handled entirely by the agent without human escalation. A contained conversation may or may not be resolved — the user might give up and leave, which counts as contained but unresolved.
First-Contact Resolution (FCR) measures the percentage of issues resolved in a single conversation, without the user needing to come back and ask again about the same problem.
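A quick worked example (all numbers hypothetical) shows how containment and resolution can diverge:

```python
total = 100      # conversations in the period (hypothetical)
contained = 85   # never escalated to a human
resolved = 60    # user's issue actually solved

containment_rate = contained / total * 100  # 85.0
resolution_rate = resolved / total * 100    # 60.0

# The 25-point gap is users who neither escalated nor got help:
# contained but unresolved -- they simply gave up.
gap = containment_rate - resolution_rate
print(containment_rate, resolution_rate, gap)  # 85.0 60.0 25.0
```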
```python
from dataclasses import dataclass
from enum import Enum


class ConversationOutcome(Enum):
    RESOLVED = "resolved"
    UNRESOLVED = "unresolved"
    ESCALATED = "escalated"
    ABANDONED = "abandoned"


@dataclass
class ConversationRecord:
    conversation_id: str
    user_id: str
    outcome: ConversationOutcome
    escalated_to_human: bool
    topic: str
    message_count: int
    duration_seconds: float
    followup_conversation_id: str | None = None
```
Calculating the Core Metrics
With structured conversation records, the calculations themselves are straightforward. The challenge is getting accurate outcome labels, not doing the math.
```python
class EffectivenessCalculator:
    def __init__(self, records: list[ConversationRecord]):
        self.records = records

    def resolution_rate(self) -> float:
        """Percentage of all conversations that ended resolved."""
        if not self.records:
            return 0.0
        resolved = sum(
            1 for r in self.records
            if r.outcome == ConversationOutcome.RESOLVED
        )
        return resolved / len(self.records) * 100

    def containment_rate(self) -> float:
        """Percentage of all conversations never escalated to a human."""
        if not self.records:
            return 0.0
        contained = sum(
            1 for r in self.records
            if not r.escalated_to_human
        )
        return contained / len(self.records) * 100

    def first_contact_resolution(self) -> float:
        """Percentage of resolved conversations with no follow-up.
        Note the denominator is resolved conversations, not all of them."""
        if not self.records:
            return 0.0
        resolved_no_followup = sum(
            1 for r in self.records
            if r.outcome == ConversationOutcome.RESOLVED
            and r.followup_conversation_id is None
        )
        total_resolved = sum(
            1 for r in self.records
            if r.outcome == ConversationOutcome.RESOLVED
        )
        if total_resolved == 0:
            return 0.0
        return resolved_no_followup / total_resolved * 100

    def summary(self) -> dict:
        return {
            "total_conversations": len(self.records),
            "resolution_rate": round(self.resolution_rate(), 1),
            "containment_rate": round(self.containment_rate(), 1),
            "first_contact_resolution": round(
                self.first_contact_resolution(), 1
            ),
        }
```
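As a hand check on the formulas, here is the same arithmetic applied to four hypothetical records, written as plain tuples of (outcome, escalated flag, follow-up id). Note that FCR divides by resolved conversations, not by the total:

```python
# Four hypothetical records: (outcome, escalated_to_human, followup_id)
records = [
    ("resolved", False, None),
    ("escalated", True, None),
    ("resolved", False, "c4"),   # resolved, but the user came back
    ("abandoned", False, None),
]

n = len(records)
resolved = [r for r in records if r[0] == "resolved"]
contained = [r for r in records if not r[1]]
first_contact = [r for r in resolved if r[2] is None]

print(len(resolved) / n * 100)                   # resolution: 50.0
print(len(contained) / n * 100)                  # containment: 75.0
print(len(first_contact) / len(resolved) * 100)  # FCR: 50.0
```

Running `EffectivenessCalculator.summary()` over the equivalent `ConversationRecord` objects should report the same three percentages.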
Outcome Labeling
The hardest part of effectiveness measurement is determining the conversation outcome. There are three approaches: explicit user feedback, implicit signal detection, and LLM-based classification.
```python
from openai import OpenAI
import json

client = OpenAI()


def classify_outcome(messages: list[dict]) -> ConversationOutcome:
    formatted = "\n".join(
        f"{m['role']}: {m['content']}" for m in messages
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": (
                "Classify this support conversation outcome. "
                "Return JSON: {\"outcome\": \"resolved\" | "
                "\"unresolved\" | \"escalated\" | \"abandoned\"}\n"
                "resolved = user's issue was solved\n"
                "unresolved = conversation ended without solving the issue\n"
                "escalated = transferred to a human agent\n"
                "abandoned = user stopped responding"
            )},
            {"role": "user", "content": formatted},
        ],
        response_format={"type": "json_object"},
    )
    result = json.loads(response.choices[0].message.content)
    return ConversationOutcome(result["outcome"])
```
Benchmarking and Improvement
Industry benchmarks give you a target to aim for. For customer support agents, a resolution rate above 70% is good and above 85% is excellent. Containment rates above 80% are typical for well-tuned agents. An FCR above 65% is good; above 80%, the agent is reliably solving issues on the first pass.
```python
BENCHMARKS = {
    "resolution_rate": {"poor": 50, "good": 70, "excellent": 85},
    "containment_rate": {"poor": 60, "good": 80, "excellent": 90},
    "first_contact_resolution": {"poor": 50, "good": 65, "excellent": 80},
}


def benchmark_report(metrics: dict) -> list[dict]:
    report = []
    for metric, value in metrics.items():
        if metric in BENCHMARKS:
            thresholds = BENCHMARKS[metric]
            if value >= thresholds["excellent"]:
                rating = "excellent"
            elif value >= thresholds["good"]:
                rating = "good"
            else:
                rating = "needs improvement"
            report.append({
                "metric": metric,
                "value": value,
                "rating": rating,
                "target": thresholds["excellent"],
                "gap": round(thresholds["excellent"] - value, 1),
            })
    return report
```
```python
from collections import defaultdict


def topic_breakdown(records: list[ConversationRecord]) -> dict:
    topics: dict[str, list] = defaultdict(list)
    for r in records:
        topics[r.topic].append(r)
    breakdown = {}
    for topic, topic_records in topics.items():
        calc = EffectivenessCalculator(topic_records)
        breakdown[topic] = calc.summary()
    return breakdown
```
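The breakdown is most useful for prioritization: fix the topic with the worst resolution rate first, ignoring topics too small to trust. A sketch over a hypothetical dict shaped like `topic_breakdown()` output (topic names and numbers are made up):

```python
breakdown = {  # hypothetical per-topic summaries
    "billing":  {"total_conversations": 120, "resolution_rate": 81.7},
    "shipping": {"total_conversations": 95,  "resolution_rate": 58.9},
    "returns":  {"total_conversations": 40,  "resolution_rate": 70.0},
}

MIN_VOLUME = 50  # skip low-volume topics where rates are noisy

candidates = {
    t: s for t, s in breakdown.items()
    if s["total_conversations"] >= MIN_VOLUME
}
worst_topic = min(candidates, key=lambda t: candidates[t]["resolution_rate"])
print(worst_topic)  # shipping
```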
FAQ
How do I handle conversations where the user never confirms resolution?
Use a combination of implicit signals and LLM classification. Implicit signals include the user saying "thanks" or "that worked," closing the chat window after receiving an answer, or not returning with the same issue within a defined window (e.g., 48 hours). LLM-based classification can catch subtler positive signals. Default to "unresolved" when uncertain — it is better to undercount resolutions than overcount them.
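The "same issue within a defined window" rule can be implemented by linking a user's conversations on the same topic. A sketch assuming each conversation carries a `started_at` timestamp (a field not present in the `ConversationRecord` above):

```python
from datetime import datetime, timedelta

FOLLOWUP_WINDOW = timedelta(hours=48)


def link_followups(convs: list[dict]) -> list[dict]:
    """Mark a conversation as a follow-up when the same user raised the
    same topic within the window. Each dict needs: id, user_id, topic,
    started_at."""
    by_key: dict[tuple, list[dict]] = {}
    for c in sorted(convs, key=lambda c: c["started_at"]):
        key = (c["user_id"], c["topic"])
        prior = by_key.setdefault(key, [])
        c["followup_of"] = None
        if prior and c["started_at"] - prior[-1]["started_at"] <= FOLLOWUP_WINDOW:
            c["followup_of"] = prior[-1]["id"]
        prior.append(c)
    return convs
```

Once linked, set `followup_conversation_id` on the earlier record so the FCR calculation above can see it.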
What is the relationship between containment rate and resolution rate?
They measure different things and can diverge significantly. A high containment rate with a low resolution rate means the agent keeps conversations but fails to solve problems — users give up rather than escalate. The ideal is high containment and high resolution together. If you must prioritize, resolution rate is more important because an unresolved contained conversation is a frustrated user.
How often should I recalculate these metrics?
Calculate daily aggregates and expose rolling 7-day and 30-day averages. Daily numbers are noisy, especially at lower volumes. The 7-day rolling average smooths out day-of-week effects while still showing trends. Set up alerts when the 7-day average drops more than 5 percentage points from its 30-day baseline.
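A minimal sketch of that alert rule, assuming you already have a list of daily resolution-rate values with the most recent day last:

```python
def rolling_avg(values: list[float], window: int) -> float:
    """Mean of the last `window` values."""
    tail = values[-window:]
    return sum(tail) / len(tail)


def should_alert(daily: list[float], threshold_pts: float = 5.0) -> bool:
    """Alert when the 7-day average drops more than `threshold_pts`
    percentage points below the 30-day baseline."""
    if len(daily) < 30:
        return False  # not enough history for a stable baseline
    return rolling_avg(daily, 30) - rolling_avg(daily, 7) > threshold_pts
```

With a flat history the rule stays quiet; a sustained week-long drop (say, 75% falling to 60% for seven days) pushes the gap past the threshold and fires.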
CallSphere Team
Expert insights on AI voice agents and customer communication automation.