
Task Completion Rate: Measuring Whether AI Agents Actually Solve User Problems

A practical guide to defining, measuring, and improving task completion rate for AI agents, including handling partial completions, multi-step tasks, and ambiguous success criteria.

What Task Completion Rate Really Means

Task completion rate (TCR) answers the most fundamental question about an AI agent: did it actually solve the user's problem? A beautiful response that misses the point scores zero; a terse response that nails the answer scores one. In practice, TCR correlates more strongly with user satisfaction than any fluency or style metric.

But measuring TCR is harder than it sounds. Real tasks are not binary. Users abandon conversations halfway through. Some tasks have multiple valid solutions. The agent might partially complete a task and leave the user to finish the rest. Your measurement framework must handle all of these cases.

Defining Task Success Criteria

Every task type needs explicit success criteria defined before you can measure completion. Here is a system for codifying those criteria.

from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Optional

class CompletionStatus(Enum):
    COMPLETE = "complete"
    PARTIAL = "partial"
    FAILED = "failed"
    ABANDONED = "abandoned"

@dataclass
class SuccessCriterion:
    name: str
    check: Callable[[dict], bool]  # returns True when the criterion passes
    required: bool = True
    weight: float = 1.0

@dataclass
class TaskDefinition:
    task_type: str
    criteria: list[SuccessCriterion] = field(default_factory=list)

    def evaluate(self, conversation: dict) -> tuple[CompletionStatus, float]:
        if conversation.get("abandoned", False):
            return CompletionStatus.ABANDONED, 0.0

        required_results = []
        optional_scores = []

        for criterion in self.criteria:
            passed = criterion.check(conversation)
            if criterion.required:
                required_results.append(passed)
            else:
                optional_scores.append(
                    criterion.weight if passed else 0.0
                )

        if all(required_results):
            base_score = 1.0
        elif any(required_results):
            base_score = sum(required_results) / len(required_results)
        else:
            return CompletionStatus.FAILED, 0.0

        # Add optional criteria bonus, worth up to 0.2 of the score
        if optional_scores:
            max_bonus = sum(
                c.weight for c in self.criteria if not c.required
            )
            if max_bonus > 0:
                bonus = sum(optional_scores) / max_bonus * 0.2
                base_score = min(1.0, base_score + bonus)

        status = (
            CompletionStatus.COMPLETE
            if base_score >= 0.95
            else CompletionStatus.PARTIAL
        )
        return status, round(base_score, 3)

This model separates required criteria (must all pass for full completion) from optional criteria (bonus points for going above and beyond). A booking agent might require confirming the date and sending a confirmation, but get bonus points for proactively suggesting parking instructions.
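To make the arithmetic concrete, here is a standalone trace of the scoring path for a partial completion. The values are illustrative: one of two required criteria passes, and a single optional criterion of weight 1.0 passes.

```python
required = [True, False]   # required criteria results
optional = [1.0]           # weight earned by passed optional criteria
max_bonus = 1.0            # total weight of all optional criteria

base = 1.0 if all(required) else sum(required) / len(required)
bonus = (sum(optional) / max_bonus) * 0.2 if max_bonus else 0.0
score = min(1.0, base + bonus)
# base 0.5 plus capped bonus 0.2 -> 0.7, the PARTIAL branch of evaluate()
```

Note that the bonus is capped at 0.2, so optional criteria can never mask a failed required criterion.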

Measuring Partial Completion

Binary pass/fail misses too much information. Partial completion tracking tells you exactly where the agent gets stuck.

@dataclass
class StepResult:
    step_name: str
    completed: bool
    duration_ms: Optional[int] = None
    error: Optional[str] = None

class MultiStepTracker:
    def __init__(self, task_type: str, steps: list[str]):
        self.task_type = task_type
        self.expected_steps = steps
        self.results: list[StepResult] = []

    def record_step(
        self, step_name: str, completed: bool,
        duration_ms: Optional[int] = None,
        error: Optional[str] = None,
    ):
        self.results.append(StepResult(
            step_name=step_name,
            completed=completed,
            duration_ms=duration_ms,
            error=error,
        ))

    def completion_ratio(self) -> float:
        if not self.expected_steps:
            return 0.0
        # Count unique completed step names so a step recorded
        # twice cannot push the ratio above 1.0
        completed = {
            r.step_name for r in self.results if r.completed
        }
        return len(completed) / len(self.expected_steps)

    def first_failure_point(self) -> Optional[str]:
        for step in self.expected_steps:
            result = next(
                (r for r in self.results if r.step_name == step),
                None,
            )
            if result is None or not result.completed:
                return step
        return None

The first_failure_point method is particularly valuable. When you aggregate across hundreds of conversations, it reveals the exact step where your agent most frequently breaks down. That is where you focus your improvement effort.
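The aggregation itself is a one-liner with a Counter. The step names below are hypothetical outputs of first_failure_point() across tracked conversations; None means every expected step completed.

```python
from collections import Counter

failure_points = [
    "collect_payment", "confirm_date", "collect_payment",
    None, "collect_payment", None, "send_confirmation",
]

# Drop fully-completed conversations, then rank failure steps
step_failures = Counter(p for p in failure_points if p is not None)
worst_step, failures = step_failures.most_common(1)[0]
```

The top entry of the counter is your highest-leverage fix.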


Handling Ambiguous and Open-Ended Tasks

Not every task has a clear right answer. For open-ended questions like "Help me plan a marketing strategy," you need a rubric-based approach.

import json

async def rubric_score_completion(
    llm_client,
    user_request: str,
    agent_response: str,
    rubric: list[str],
) -> dict:
    rubric_text = "\n".join(
        f"{i+1}. {item}" for i, item in enumerate(rubric)
    )
    prompt = f"""Evaluate whether the agent response adequately
addresses the user request. Score each rubric item 0 or 1.

User request: {user_request}

Agent response: {agent_response}

Rubric:
{rubric_text}

Return JSON: {{"scores": [0 or 1 for each item], "reasoning": "..."}}"""

    response = await llm_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    result = json.loads(response.choices[0].message.content)
    result["completion_rate"] = (
        sum(result["scores"]) / len(result["scores"])
    )
    return result

The rubric converts subjective quality into a structured checklist that an LLM judge can score consistently. Always validate your rubric scoring against human judgments on a sample set before trusting it at scale.
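The simplest validation is raw agreement between the LLM judge and human labels on the same items. The scores below are illustrative, and the threshold at which you trust the judge is a judgment call for your domain.

```python
llm_scores = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
human_scores = [1, 1, 0, 0, 0, 1, 1, 0, 1, 0]

# Fraction of rubric items where the LLM judge matches the human label
agreement = sum(
    int(a == b) for a, b in zip(llm_scores, human_scores)
) / len(llm_scores)
```

If agreement is low, rewrite the ambiguous rubric items before scaling up, not after.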

Tracking TCR Over Time

Store completion data in a format that supports trend analysis and slicing by task type, user segment, or agent version.

from datetime import datetime, timezone
from collections import defaultdict

class TCRTracker:
    def __init__(self):
        self.records: list[dict] = []

    def record(
        self, task_type: str, status: CompletionStatus,
        score: float, agent_version: str,
        metadata: Optional[dict] = None,
    ):
        self.records.append({
            "task_type": task_type,
            "status": status.value,
            "score": score,
            "agent_version": agent_version,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "metadata": metadata or {},
        })

    def tcr_by_type(self) -> dict[str, float]:
        grouped = defaultdict(list)
        for r in self.records:
            grouped[r["task_type"]].append(
                1.0 if r["status"] == "complete" else 0.0
            )
        return {
            task: sum(scores) / len(scores)
            for task, scores in grouped.items()
        }

    def trend(self, task_type: str, window: int = 100) -> list[float]:
        filtered = [
            r for r in self.records
            if r["task_type"] == task_type
        ]
        if len(filtered) <= window:
            return [sum(
                1 for r in filtered if r["status"] == "complete"
            ) / max(len(filtered), 1)]
        return [
            sum(1 for r in filtered[i:i+window]
                if r["status"] == "complete") / window
            for i in range(0, len(filtered) - window + 1, window)
        ]

FAQ

What is a good task completion rate for a production AI agent?

It depends heavily on task complexity. Simple FAQ agents should target 90 percent or higher. Multi-step workflow agents typically land between 70 and 85 percent. Anything below 60 percent means the agent is creating more work than it saves. Track TCR by task type rather than averaging across all tasks — a single aggregate number hides critical weaknesses.

How do I handle tasks where the user abandons the conversation?

Track abandonment separately from failure. A user who leaves after getting the information they needed is different from one who leaves in frustration. Use signals like the last message content, time spent, and whether the agent asked a clarifying question right before the drop-off. Classify ambiguous abandonments as "unknown" rather than forcing them into success or failure.
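A triage function over those signals might look like the sketch below. The field names and the 30-second threshold are illustrative assumptions, not a fixed recipe.

```python
def classify_abandonment(last_user_message: str, seconds_active: int,
                         agent_asked_clarifier: bool) -> str:
    text = last_user_message.lower()
    # Closing pleasantries suggest the user got what they needed
    if any(w in text for w in ("thanks", "thank you", "got it", "perfect")):
        return "likely_satisfied"
    # Bailing right after a clarifying question suggests frustration
    if agent_asked_clarifier and seconds_active < 30:
        return "likely_frustrated"
    return "unknown"  # do not force into success or failure
```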

Should partial completions count toward TCR?

Report both strict TCR (only full completions count) and weighted TCR (partial completions get proportional credit). Strict TCR sets the bar for customer experience. Weighted TCR gives your engineering team credit for incremental improvements and helps prioritize which remaining steps to fix first.
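The two numbers fall out of the same data. Over hypothetical (status, score) outcomes:

```python
outcomes = [
    ("complete", 1.0),
    ("partial", 0.6),
    ("partial", 0.3),
    ("failed", 0.0),
]

# Strict: only full completions count; weighted: proportional credit
strict_tcr = sum(1 for s, _ in outcomes if s == "complete") / len(outcomes)
weighted_tcr = sum(score for _, score in outcomes) / len(outcomes)
```

The gap between the two is exactly the partial-credit signal your engineering team should mine for the next fix.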


#TaskCompletion #AgentEvaluation #Metrics #Python #QualityAssurance #AgenticAI #LearnAI #AIEngineering


CallSphere Team

Expert insights on AI voice agents and customer communication automation.
