Agent Effectiveness Metrics: Resolution Rate, Containment, and First-Contact Resolution
Learn how to define, calculate, and benchmark the core effectiveness metrics for AI agents (resolution rate, containment rate, and first-contact resolution), along with strategies for systematic improvement.
The Metrics That Actually Matter
Deploying an AI agent is the easy part. Knowing whether it works well is hard. Teams that track vanity metrics like total conversations or average response time miss the real picture. The three metrics that define agent effectiveness are resolution rate, containment rate, and first-contact resolution.
These metrics answer the questions that stakeholders actually care about: Does the agent solve problems? Does it prevent escalations? Does it solve problems on the first try?
Metric Definitions
Understanding what each metric measures and how it differs from the others is essential before writing any calculation code.
Resolution Rate measures the percentage of conversations where the user's issue was actually solved. A conversation is resolved if the user confirms the solution worked or if the agent successfully completed the requested action.
Containment Rate measures the percentage of conversations handled entirely by the agent without human escalation. A contained conversation may or may not be resolved — the user might give up and leave, which counts as contained but unresolved.
First-Contact Resolution (FCR) measures the percentage of issues resolved in a single conversation, without the user needing to come back and ask again about the same problem.
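A quick worked example (all numbers hypothetical) shows how containment and resolution can diverge:

```python
total = 100      # conversations in the period (hypothetical)
contained = 85   # never escalated to a human
resolved = 60    # user's issue actually solved

containment_rate = contained / total * 100  # 85.0
resolution_rate = resolved / total * 100    # 60.0

# The 25-point gap is users who neither escalated nor got help:
# contained but unresolved -- they simply gave up.
gap = containment_rate - resolution_rate
print(containment_rate, resolution_rate, gap)  # 85.0 60.0 25.0
```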
```python
from dataclasses import dataclass
from enum import Enum


class ConversationOutcome(Enum):
    RESOLVED = "resolved"
    UNRESOLVED = "unresolved"
    ESCALATED = "escalated"
    ABANDONED = "abandoned"


@dataclass
class ConversationRecord:
    conversation_id: str
    user_id: str
    outcome: ConversationOutcome
    escalated_to_human: bool
    topic: str
    message_count: int
    duration_seconds: float
    followup_conversation_id: str | None = None
```
Calculating the Core Metrics
With structured conversation records, the calculations themselves are straightforward. The challenge is getting accurate outcome labels, not doing the math.
```python
class EffectivenessCalculator:
    def __init__(self, records: list[ConversationRecord]):
        self.records = records

    def resolution_rate(self) -> float:
        """Percentage of all conversations that ended resolved."""
        if not self.records:
            return 0.0
        resolved = sum(
            1 for r in self.records
            if r.outcome == ConversationOutcome.RESOLVED
        )
        return resolved / len(self.records) * 100

    def containment_rate(self) -> float:
        """Percentage of all conversations never escalated to a human."""
        if not self.records:
            return 0.0
        contained = sum(
            1 for r in self.records
            if not r.escalated_to_human
        )
        return contained / len(self.records) * 100

    def first_contact_resolution(self) -> float:
        """Percentage of resolved conversations with no follow-up.
        Note the denominator is resolved conversations, not all of them."""
        if not self.records:
            return 0.0
        resolved_no_followup = sum(
            1 for r in self.records
            if r.outcome == ConversationOutcome.RESOLVED
            and r.followup_conversation_id is None
        )
        total_resolved = sum(
            1 for r in self.records
            if r.outcome == ConversationOutcome.RESOLVED
        )
        if total_resolved == 0:
            return 0.0
        return resolved_no_followup / total_resolved * 100

    def summary(self) -> dict:
        return {
            "total_conversations": len(self.records),
            "resolution_rate": round(self.resolution_rate(), 1),
            "containment_rate": round(self.containment_rate(), 1),
            "first_contact_resolution": round(
                self.first_contact_resolution(), 1
            ),
        }
```
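As a hand check on the formulas, here is the same arithmetic applied to four hypothetical records, written as plain tuples of (outcome, escalated flag, follow-up id). Note that FCR divides by resolved conversations, not by the total:

```python
# Four hypothetical records: (outcome, escalated_to_human, followup_id)
records = [
    ("resolved", False, None),
    ("escalated", True, None),
    ("resolved", False, "c4"),   # resolved, but the user came back
    ("abandoned", False, None),
]

n = len(records)
resolved = [r for r in records if r[0] == "resolved"]
contained = [r for r in records if not r[1]]
first_contact = [r for r in resolved if r[2] is None]

print(len(resolved) / n * 100)                   # resolution: 50.0
print(len(contained) / n * 100)                  # containment: 75.0
print(len(first_contact) / len(resolved) * 100)  # FCR: 50.0
```

Running `EffectivenessCalculator.summary()` over the equivalent `ConversationRecord` objects should report the same three percentages.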
Outcome Labeling
The hardest part of effectiveness measurement is determining the conversation outcome. There are three approaches: explicit user feedback, implicit signal detection, and LLM-based classification.
```python
from openai import OpenAI
import json

client = OpenAI()


def classify_outcome(messages: list[dict]) -> ConversationOutcome:
    formatted = "\n".join(
        f"{m['role']}: {m['content']}" for m in messages
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": (
                "Classify this support conversation outcome. "
                "Return JSON: {\"outcome\": \"resolved\" | "
                "\"unresolved\" | \"escalated\" | \"abandoned\"}\n"
                "resolved = user's issue was solved\n"
                "unresolved = conversation ended without solving the issue\n"
                "escalated = transferred to a human agent\n"
                "abandoned = user stopped responding"
            )},
            {"role": "user", "content": formatted},
        ],
        response_format={"type": "json_object"},
    )
    result = json.loads(response.choices[0].message.content)
    return ConversationOutcome(result["outcome"])
```
Benchmarking and Improvement
Industry benchmarks give you a target to aim for. For customer support agents, a resolution rate above 70% is good and above 85% is excellent. Containment rates above 80% are typical for well-tuned agents. An FCR above 65% is good; above 80%, the agent is reliably solving issues on the first pass.
```python
BENCHMARKS = {
    "resolution_rate": {"poor": 50, "good": 70, "excellent": 85},
    "containment_rate": {"poor": 60, "good": 80, "excellent": 90},
    "first_contact_resolution": {"poor": 50, "good": 65, "excellent": 80},
}


def benchmark_report(metrics: dict) -> list[dict]:
    report = []
    for metric, value in metrics.items():
        if metric in BENCHMARKS:
            thresholds = BENCHMARKS[metric]
            if value >= thresholds["excellent"]:
                rating = "excellent"
            elif value >= thresholds["good"]:
                rating = "good"
            else:
                rating = "needs improvement"
            report.append({
                "metric": metric,
                "value": value,
                "rating": rating,
                "target": thresholds["excellent"],
                "gap": round(thresholds["excellent"] - value, 1),
            })
    return report
```
```python
from collections import defaultdict


def topic_breakdown(records: list[ConversationRecord]) -> dict:
    topics: dict[str, list] = defaultdict(list)
    for r in records:
        topics[r.topic].append(r)
    breakdown = {}
    for topic, topic_records in topics.items():
        calc = EffectivenessCalculator(topic_records)
        breakdown[topic] = calc.summary()
    return breakdown
```
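The breakdown is most useful for prioritization: fix the topic with the worst resolution rate first, ignoring topics too small to trust. A sketch over a hypothetical dict shaped like `topic_breakdown()` output (topic names and numbers are made up):

```python
breakdown = {  # hypothetical per-topic summaries
    "billing":  {"total_conversations": 120, "resolution_rate": 81.7},
    "shipping": {"total_conversations": 95,  "resolution_rate": 58.9},
    "returns":  {"total_conversations": 40,  "resolution_rate": 70.0},
}

MIN_VOLUME = 50  # skip low-volume topics where rates are noisy

candidates = {
    t: s for t, s in breakdown.items()
    if s["total_conversations"] >= MIN_VOLUME
}
worst_topic = min(candidates, key=lambda t: candidates[t]["resolution_rate"])
print(worst_topic)  # shipping
```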
FAQ
How do I handle conversations where the user never confirms resolution?
Use a combination of implicit signals and LLM classification. Implicit signals include the user saying "thanks" or "that worked," closing the chat window after receiving an answer, or not returning with the same issue within a defined window (e.g., 48 hours). LLM-based classification can catch subtler positive signals. Default to "unresolved" when uncertain — it is better to undercount resolutions than overcount them.
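The "same issue within a defined window" rule can be implemented by linking a user's conversations on the same topic. A sketch assuming each conversation carries a `started_at` timestamp (a field not present in the `ConversationRecord` above):

```python
from datetime import datetime, timedelta

FOLLOWUP_WINDOW = timedelta(hours=48)


def link_followups(convs: list[dict]) -> list[dict]:
    """Mark a conversation as a follow-up when the same user raised the
    same topic within the window. Each dict needs: id, user_id, topic,
    started_at."""
    by_key: dict[tuple, list[dict]] = {}
    for c in sorted(convs, key=lambda c: c["started_at"]):
        key = (c["user_id"], c["topic"])
        prior = by_key.setdefault(key, [])
        c["followup_of"] = None
        if prior and c["started_at"] - prior[-1]["started_at"] <= FOLLOWUP_WINDOW:
            c["followup_of"] = prior[-1]["id"]
        prior.append(c)
    return convs
```

Once linked, set `followup_conversation_id` on the earlier record so the FCR calculation above can see it.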
What is the relationship between containment rate and resolution rate?
They measure different things and can diverge significantly. A high containment rate with a low resolution rate means the agent keeps conversations but fails to solve problems — users give up rather than escalate. The ideal is high containment and high resolution together. If you must prioritize, resolution rate is more important because an unresolved contained conversation is a frustrated user.
How often should I recalculate these metrics?
Calculate daily aggregates and expose rolling 7-day and 30-day averages. Daily numbers are noisy, especially at lower volumes. The 7-day rolling average smooths out day-of-week effects while still showing trends. Set up alerts when the 7-day average drops more than 5 percentage points from its 30-day baseline.
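A minimal sketch of that alert rule, assuming you already have a list of daily resolution-rate values with the most recent day last:

```python
def rolling_avg(values: list[float], window: int) -> float:
    """Mean of the last `window` values."""
    tail = values[-window:]
    return sum(tail) / len(tail)


def should_alert(daily: list[float], threshold_pts: float = 5.0) -> bool:
    """Alert when the 7-day average drops more than `threshold_pts`
    percentage points below the 30-day baseline."""
    if len(daily) < 30:
        return False  # not enough history for a stable baseline
    return rolling_avg(daily, 30) - rolling_avg(daily, 7) > threshold_pts
```

With a flat history the rule stays quiet; a sustained week-long drop (say, 75% falling to 60% for seven days) pushes the gap past the threshold and fires.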
CallSphere Team
Expert insights on AI voice agents and customer communication automation.