Task Completion Rate: Measuring Whether AI Agents Actually Solve User Problems
A practical guide to defining, measuring, and improving task completion rate for AI agents, including handling partial completions, multi-step tasks, and ambiguous success criteria.
What Task Completion Rate Really Means
Task completion rate (TCR) answers the most fundamental question about an AI agent: did it actually solve the user's problem? A beautiful response that misses the point scores zero. A terse response that nails the answer scores one. TCR is the single metric that correlates most strongly with user satisfaction.
But measuring TCR is harder than it sounds. Real tasks are not binary. Users abandon conversations halfway through. Some tasks have multiple valid solutions. The agent might partially complete a task and leave the user to finish the rest. Your measurement framework must handle all of these cases.
Defining Task Success Criteria
Every task type needs explicit success criteria defined before you can measure completion. Here is a system for codifying those criteria.
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Optional


class CompletionStatus(Enum):
    COMPLETE = "complete"
    PARTIAL = "partial"
    FAILED = "failed"
    ABANDONED = "abandoned"


@dataclass
class SuccessCriterion:
    name: str
    check: Callable[[dict], bool]  # Takes the conversation, returns pass/fail
    required: bool = True
    weight: float = 1.0


@dataclass
class TaskDefinition:
    task_type: str
    criteria: list[SuccessCriterion] = field(default_factory=list)

    def evaluate(self, conversation: dict) -> tuple[CompletionStatus, float]:
        if conversation.get("abandoned", False):
            return CompletionStatus.ABANDONED, 0.0

        required_results = []
        optional_scores = []
        for criterion in self.criteria:
            passed = criterion.check(conversation)
            if criterion.required:
                required_results.append(passed)
            else:
                optional_scores.append(
                    criterion.weight if passed else 0.0
                )

        if all(required_results):
            base_score = 1.0
        elif any(required_results):
            base_score = sum(required_results) / len(required_results)
        else:
            return CompletionStatus.FAILED, 0.0

        # Optional criteria add up to a 20 percent bonus, capped at 1.0
        if optional_scores:
            max_bonus = sum(
                c.weight for c in self.criteria if not c.required
            )
            bonus = sum(optional_scores) / max_bonus * 0.2
            base_score = min(1.0, base_score + bonus)

        status = (
            CompletionStatus.COMPLETE
            if base_score >= 0.95
            else CompletionStatus.PARTIAL
        )
        return status, round(base_score, 3)
This model separates required criteria (must all pass for full completion) from optional criteria (bonus points for going above and beyond). A booking agent might require confirming the date and sending a confirmation, but get bonus points for proactively suggesting parking instructions.
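To make the scoring concrete, here is a minimal sketch of how `evaluate` plays out for the hypothetical booking agent above; the criterion names and weights are illustrative assumptions:

```python
# Hypothetical criteria for a booking agent:
# (name, passed, required, weight)
results = [
    ("date_confirmed", True, True, 1.0),
    ("confirmation_sent", False, True, 1.0),
    ("parking_info_offered", True, False, 0.5),
    ("upsell_mentioned", False, False, 0.5),
]

required = [passed for _, passed, req, _ in results if req]
optional = [(passed, w) for _, passed, req, w in results if not req]

# Mirrors the partial-credit branch: required criteria set the base score,
# optional criteria contribute up to a 20 percent bonus
base = 1.0 if all(required) else sum(required) / len(required)
max_bonus = sum(w for _, w in optional)
bonus = sum(w for p, w in optional if p) / max_bonus * 0.2
score = round(min(1.0, base + bonus), 3)
print(score)  # 0.6 — one of two required criteria passed, plus a 0.1 bonus
```

Because one required criterion failed, the task lands in PARTIAL territory no matter how many optional criteria pass.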
Measuring Partial Completion
Binary pass/fail misses too much information. Partial completion tracking tells you exactly where the agent gets stuck.
@dataclass
class StepResult:
    step_name: str
    completed: bool
    duration_ms: Optional[int] = None
    error: Optional[str] = None


class MultiStepTracker:
    def __init__(self, task_type: str, steps: list[str]):
        self.task_type = task_type
        self.expected_steps = steps
        self.results: list[StepResult] = []

    def record_step(
        self, step_name: str, completed: bool,
        duration_ms: Optional[int] = None, error: Optional[str] = None,
    ):
        self.results.append(StepResult(
            step_name=step_name,
            completed=completed,
            duration_ms=duration_ms,
            error=error,
        ))

    def completion_ratio(self) -> float:
        if not self.expected_steps:
            return 0.0
        completed = sum(1 for r in self.results if r.completed)
        return completed / len(self.expected_steps)

    def first_failure_point(self) -> Optional[str]:
        # Walk the expected steps in order; the first step that was never
        # recorded, or was recorded as failed, is where the agent broke down
        for step in self.expected_steps:
            result = next(
                (r for r in self.results if r.step_name == step),
                None,
            )
            if result is None or not result.completed:
                return step
        return None
The first_failure_point method is particularly valuable. When you aggregate across hundreds of conversations, it reveals the exact step where your agent most frequently breaks down. That is where you focus your improvement effort.
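Aggregation can be as simple as counting failure points. A minimal sketch, assuming each conversation has already been reduced to its `first_failure_point` output (the step names are hypothetical):

```python
from collections import Counter

# Hypothetical first_failure_point() outputs across 8 conversations;
# None means the conversation completed every step
failure_points = [
    "collect_date", None, "send_confirmation", "collect_date",
    None, "collect_date", None, "send_confirmation",
]

breakdown = Counter(p for p in failure_points if p is not None)
for step, count in breakdown.most_common():
    print(f"{step}: {count}")
# collect_date tops the list, so that step is the first place to look
```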
Handling Ambiguous and Open-Ended Tasks
Not every task has a clear right answer. For open-ended questions like "Help me plan a marketing strategy," you need a rubric-based approach.
import json


async def rubric_score_completion(
    llm_client,
    user_request: str,
    agent_response: str,
    rubric: list[str],
) -> dict:
    rubric_text = "\n".join(
        f"{i + 1}. {item}" for i, item in enumerate(rubric)
    )
    prompt = f"""Evaluate whether the agent response adequately
addresses the user request. Score each rubric item 0 or 1.

User request: {user_request}
Agent response: {agent_response}

Rubric:
{rubric_text}

Return JSON: {{"scores": [0 or 1 for each item], "reasoning": "..."}}"""

    response = await llm_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    result = json.loads(response.choices[0].message.content)
    result["completion_rate"] = (
        sum(result["scores"]) / len(result["scores"])
    )
    return result
The rubric converts subjective quality into a structured checklist that an LLM judge can score consistently. Always validate your rubric scoring against human judgments on a sample set before trusting it at scale.
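That validation step can start with something as basic as raw agreement on a labeled sample; the labels below are hypothetical, and for imbalanced samples a chance-corrected statistic such as Cohen's kappa is worth computing as well:

```python
# Hypothetical paired labels on a validation sample:
# 1 = task judged complete, 0 = not complete
human_labels = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
judge_labels = [1, 1, 0, 1, 1, 1, 0, 0, 1, 1]

agreement = sum(
    h == j for h, j in zip(human_labels, judge_labels)
) / len(human_labels)
print(f"raw agreement: {agreement:.0%}")  # raw agreement: 80%
```

If agreement is low, refine the rubric wording (or the judge prompt) before trusting the scores at scale.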
Tracking TCR Over Time
Store completion data in a format that supports trend analysis and slicing by task type, user segment, or agent version.
from datetime import datetime, timezone
from collections import defaultdict


class TCRTracker:
    def __init__(self):
        self.records: list[dict] = []

    def record(
        self, task_type: str, status: CompletionStatus,
        score: float, agent_version: str,
        metadata: Optional[dict] = None,
    ):
        self.records.append({
            "task_type": task_type,
            "status": status.value,
            "score": score,
            "agent_version": agent_version,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "metadata": metadata or {},
        })

    def tcr_by_type(self) -> dict[str, float]:
        grouped = defaultdict(list)
        for r in self.records:
            grouped[r["task_type"]].append(
                1.0 if r["status"] == "complete" else 0.0
            )
        return {
            task: sum(scores) / len(scores)
            for task, scores in grouped.items()
        }

    def trend(self, task_type: str, window: int = 100) -> list[float]:
        # Strict TCR per non-overlapping window of `window` records,
        # oldest first; a single bucket if there are too few records
        filtered = [
            r for r in self.records
            if r["task_type"] == task_type
        ]
        if len(filtered) <= window:
            return [sum(
                1 for r in filtered if r["status"] == "complete"
            ) / max(len(filtered), 1)]
        return [
            sum(1 for r in filtered[i:i + window]
                if r["status"] == "complete") / window
            for i in range(0, len(filtered) - window + 1, window)
        ]
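Because each record carries the agent version, the same data supports regression checks across releases. A minimal sketch, assuming a handful of hypothetical records in the shape the tracker stores:

```python
from collections import defaultdict

# Hypothetical records in the shape TCRTracker stores
records = [
    {"task_type": "booking", "status": "complete", "agent_version": "v1"},
    {"task_type": "booking", "status": "failed", "agent_version": "v1"},
    {"task_type": "booking", "status": "complete", "agent_version": "v2"},
    {"task_type": "booking", "status": "complete", "agent_version": "v2"},
    {"task_type": "booking", "status": "partial", "agent_version": "v2"},
]

by_version = defaultdict(list)
for r in records:
    by_version[r["agent_version"]].append(r["status"] == "complete")

# Strict TCR per agent version
tcr = {
    v: round(sum(flags) / len(flags), 2)
    for v, flags in by_version.items()
}
print(tcr)  # {'v1': 0.5, 'v2': 0.67}
```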
FAQ
What is a good task completion rate for a production AI agent?
It depends heavily on task complexity. Simple FAQ agents should target 90 percent or higher. Multi-step workflow agents typically land between 70 and 85 percent. Anything below 60 percent means the agent is creating more work than it saves. Track TCR by task type rather than averaging across all tasks — a single aggregate number hides critical weaknesses.
How do I handle tasks where the user abandons the conversation?
Track abandonment separately from failure. A user who leaves after getting the information they needed is different from one who leaves in frustration. Use signals like the last message content, time spent, and whether the agent asked a clarifying question right before the drop-off. Classify ambiguous abandonments as "unknown" rather than forcing them into success or failure.
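Those signals can be combined into a simple heuristic classifier; the phrases, threshold, and labels here are illustrative assumptions, not a fixed rule set:

```python
def classify_abandonment(
    last_user_message: str,
    seconds_in_conversation: float,
    agent_asked_clarifying_question: bool,
) -> str:
    # Closing phrases suggest the user got what they needed before leaving
    closing_phrases = ("thanks", "thank you", "got it", "perfect")
    if any(p in last_user_message.lower() for p in closing_phrases):
        return "likely_success"
    # A quick exit right after a clarifying question suggests frustration
    if agent_asked_clarifying_question and seconds_in_conversation < 30:
        return "likely_failure"
    # Everything else stays unknown rather than being forced into a bucket
    return "unknown"

print(classify_abandonment("Thanks, that covers it", 140, False))
```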
Should partial completions count toward TCR?
Report both strict TCR (only full completions count) and weighted TCR (partial completions get proportional credit). Strict TCR sets the bar for customer experience. Weighted TCR gives your engineering team credit for incremental improvements and helps prioritize which remaining steps to fix first.
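The two numbers fall out of the same data. A quick sketch over hypothetical (status, score) pairs from one evaluation batch:

```python
# Hypothetical (status, score) pairs from one evaluation batch
outcomes = [
    ("complete", 1.0), ("partial", 0.6), ("partial", 0.4),
    ("failed", 0.0), ("complete", 1.0),
]

# Strict: only full completions count; weighted: partial credit included
strict_tcr = sum(s == "complete" for s, _ in outcomes) / len(outcomes)
weighted_tcr = sum(score for _, score in outcomes) / len(outcomes)
print(strict_tcr, round(weighted_tcr, 2))  # 0.4 0.6
```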
#TaskCompletion #AgentEvaluation #Metrics #Python #QualityAssurance #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.