
Measuring Agent User Experience: CSAT, SUS, and Custom UX Metrics for AI Products

Build a comprehensive UX measurement framework for AI agents using CSAT surveys, System Usability Scale, custom behavioral metrics, A/B testing strategies, and analytics pipelines.

You Cannot Improve What You Cannot Measure

Building a great AI agent UX requires continuous measurement. Intuition and user complaints are not enough — you need quantitative metrics that track experience quality over time, surface regressions quickly, and provide actionable data for improvement.

AI agents present unique measurement challenges. Traditional web analytics (page views, click-through rates) do not capture conversational quality. You need a layered approach combining survey-based metrics, behavioral signals, and AI-specific quality indicators.

CSAT: Customer Satisfaction Score

CSAT is the most straightforward UX metric. Ask users to rate their experience on a 1-5 scale at the end of an interaction:

from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class SurveyTrigger(Enum):
    TASK_COMPLETED = "task_completed"
    HUMAN_ESCALATION = "human_escalation"
    SESSION_END = "session_end"
    ERROR_RECOVERY = "error_recovery"


@dataclass
class CSATSurvey:
    conversation_id: str
    trigger: SurveyTrigger
    rating: int | None           # 1-5
    comment: str | None
    timestamp: datetime
    task_type: str
    turns_in_conversation: int


class CSATCollector:
    """Collect and analyze CSAT scores for agent interactions."""

    SURVEY_MESSAGES = {
        SurveyTrigger.TASK_COMPLETED: (
            "I'm glad I could help! On a scale of 1-5, "
            "how would you rate your experience today?"
        ),
        SurveyTrigger.HUMAN_ESCALATION: (
            "Before I transfer you, could you rate your experience "
            "with me so far? (1-5, 5 being excellent)"
        ),
        SurveyTrigger.ERROR_RECOVERY: (
            "I know we hit a bump earlier. Now that it's resolved, "
            "how would you rate the overall experience? (1-5)"
        ),
        SurveyTrigger.SESSION_END: (
            "Before you go, how would you rate your experience "
            "today? (1-5)"
        ),
    }

    def should_survey(
        self,
        conversation_id: str,
        trigger: SurveyTrigger,
        recent_survey_count: int,
    ) -> bool:
        """Avoid survey fatigue — limit frequency."""
        if recent_survey_count >= 1:
            return False  # Max one survey per session
        # Only survey at triggers we have a prompt for.
        return trigger in self.SURVEY_MESSAGES

    def calculate_csat_score(self, surveys: list[CSATSurvey]) -> dict:
        """Calculate CSAT percentage (% of 4 and 5 ratings)."""
        rated = [s for s in surveys if s.rating is not None]
        if not rated:
            return {"score": None, "sample_size": 0}

        satisfied = sum(1 for s in rated if s.rating >= 4)
        return {
            "score": round((satisfied / len(rated)) * 100, 1),
            "sample_size": len(rated),
            "average_rating": round(
                sum(s.rating for s in rated) / len(rated), 2
            ),
        }

Target a CSAT score of 80% or higher. Below 70% indicates a systemic UX problem.
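As a quick check of the scoring math, here is the same CSAT formula recomputed standalone from a hypothetical list of raw ratings, without the survey objects:

```python
# CSAT = (# of 4-5 ratings) / (# of rated surveys) * 100
ratings = [5, 4, 3, 5, 2, 4, 4, 5, 1, 4]  # hypothetical sample

satisfied = sum(1 for r in ratings if r >= 4)
csat = round(satisfied / len(ratings) * 100, 1)
average = round(sum(ratings) / len(ratings), 2)

print(csat)     # 70.0 — below the 80% target, worth investigating
print(average)  # 3.7
```

Note that CSAT and the average rating move independently: a cluster of 3s drags the average down without touching CSAT, which is why both values are reported.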

System Usability Scale (SUS)

SUS is a standardized 10-question survey that produces a score from 0-100. It is ideal for periodic deep-dive assessments of your agent's usability:

SUS_QUESTIONS = [
    "I think I would like to use this AI assistant frequently.",
    "I found the AI assistant unnecessarily complex.",
    "I thought the AI assistant was easy to use.",
    "I think I would need technical support to use this assistant.",
    "I found the various capabilities were well integrated.",
    "I thought there was too much inconsistency in this assistant.",
    "I imagine most people would learn to use this assistant quickly.",
    "I found the assistant very cumbersome to use.",
    "I felt very confident using the assistant.",
    "I needed to learn a lot before I could use this assistant.",
]

# Questions alternate between positive and negative framing
POSITIVE_QUESTIONS = {0, 2, 4, 6, 8}  # 0-indexed


def calculate_sus_score(responses: list[int]) -> float:
    """
    Calculate SUS score from 10 responses (each 1-5).
    Score ranges from 0 to 100. Above 68 is above average.
    Above 80 is excellent.
    """
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 responses")

    adjusted = []
    for i, response in enumerate(responses):
        if i in POSITIVE_QUESTIONS:
            adjusted.append(response - 1)      # Positive: score - 1
        else:
            adjusted.append(5 - response)      # Negative: 5 - score

    return sum(adjusted) * 2.5


def interpret_sus_score(score: float) -> str:
    if score >= 80.3:
        return "Excellent (Grade A)"
    elif score >= 68:
        return "Good (Grade C) — above average"
    elif score >= 51:
        return "OK (Grade D) — below average, needs improvement"
    else:
        return "Poor (Grade F) — significant usability issues"

Custom Behavioral Metrics for AI Agents

Survey metrics capture stated satisfaction. Behavioral metrics capture actual usage patterns:


@dataclass
class ConversationMetrics:
    conversation_id: str
    started_at: datetime
    ended_at: datetime
    total_turns: int
    user_turns: int
    agent_turns: int
    task_completed: bool
    escalated_to_human: bool
    errors_encountered: int
    errors_recovered: int
    clarification_questions_asked: int
    follow_up_prompts_clicked: int
    user_rephrased_count: int     # Times user had to rephrase
    time_to_first_value: float    # Seconds to first useful response
    idle_gaps: list[float]        # Seconds between user messages


def calculate_behavioral_health(
    metrics: list[ConversationMetrics],
) -> dict:
    """Calculate aggregate behavioral health indicators."""

    total = len(metrics)
    if total == 0:
        return {}

    task_completion_rate = (
        sum(1 for m in metrics if m.task_completed) / total * 100
    )

    escalation_rate = (
        sum(1 for m in metrics if m.escalated_to_human) / total * 100
    )

    avg_turns_to_completion = (
        sum(m.total_turns for m in metrics if m.task_completed)
        / max(sum(1 for m in metrics if m.task_completed), 1)
    )

    avg_rephrase_rate = (
        sum(m.user_rephrased_count for m in metrics) / total
    )

    avg_time_to_value = (
        sum(m.time_to_first_value for m in metrics) / total
    )

    error_recovery_rate = (
        sum(m.errors_recovered for m in metrics)
        / max(sum(m.errors_encountered for m in metrics), 1)
        * 100
    )

    return {
        "task_completion_rate": round(task_completion_rate, 1),
        "escalation_rate": round(escalation_rate, 1),
        "avg_turns_to_completion": round(avg_turns_to_completion, 1),
        "avg_rephrase_rate": round(avg_rephrase_rate, 2),
        "avg_time_to_value_seconds": round(avg_time_to_value, 1),
        "error_recovery_rate": round(error_recovery_rate, 1),
    }

Key thresholds to watch: task completion rate below 70% means the agent is failing its core job. Rephrase rate above 1.5 per conversation means the agent is not understanding users. Time to first value above 30 seconds means the onboarding or first response is too slow.
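These thresholds can be encoded directly as a check over the dict that calculate_behavioral_health returns. The threshold values below come from the paragraph above; the wiring is a sketch.

```python
# (limit, direction, diagnosis) per metric key from calculate_behavioral_health.
THRESHOLDS = {
    "task_completion_rate": (70.0, "below", "agent failing its core job"),
    "avg_rephrase_rate": (1.5, "above", "agent not understanding users"),
    "avg_time_to_value_seconds": (30.0, "above", "first response too slow"),
}

def flag_problems(health: dict) -> list[str]:
    """Return a diagnosis string for every metric that crosses its threshold."""
    problems = []
    for key, (limit, direction, reason) in THRESHOLDS.items():
        value = health.get(key)
        if value is None:
            continue
        breached = value < limit if direction == "below" else value > limit
        if breached:
            problems.append(f"{key}={value}: {reason}")
    return problems

print(flag_problems({
    "task_completion_rate": 62.0,
    "avg_rephrase_rate": 0.8,
    "avg_time_to_value_seconds": 41.0,
}))
# ['task_completion_rate=62.0: agent failing its core job',
#  'avg_time_to_value_seconds=41.0: first response too slow']
```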

A/B Testing UX Changes

Test UX changes rigorously before rolling them out:

import hashlib
from dataclasses import dataclass


@dataclass
class ABTestConfig:
    test_id: str
    variants: dict[str, dict]   # variant_name -> config
    traffic_split: dict[str, float]  # variant_name -> fraction of traffic (0-1)
    primary_metric: str
    minimum_sample_size: int


def assign_variant(user_id: str, test: ABTestConfig) -> str:
    """Deterministically assign a user to a test variant."""
    hash_input = f"{test.test_id}:{user_id}"
    hash_value = int(hashlib.sha256(hash_input.encode()).hexdigest(), 16)
    bucket = (hash_value % 1000) / 1000.0

    cumulative = 0.0
    for variant, split in test.traffic_split.items():
        cumulative += split
        if bucket < cumulative:
            return variant

    return list(test.traffic_split.keys())[-1]


# Example: Testing a new greeting format
greeting_test = ABTestConfig(
    test_id="greeting_v2_2026_03",
    variants={
        "control": {
            "greeting_style": "list_capabilities",
            "max_greeting_length": 200,
        },
        "treatment": {
            "greeting_style": "single_question",
            "max_greeting_length": 50,
        },
    },
    traffic_split={"control": 0.5, "treatment": 0.5},
    primary_metric="task_completion_rate",
    minimum_sample_size=500,
)
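A quick sanity check of the bucketing in assign_variant, reimplemented standalone: the same user always lands in the same variant, and a 50/50 split holds approximately once enough users flow through.

```python
import hashlib

def assign(user_id: str, test_id: str, splits: dict[str, float]) -> str:
    # Same hash-bucket scheme as assign_variant above.
    h = int(hashlib.sha256(f"{test_id}:{user_id}".encode()).hexdigest(), 16)
    bucket = (h % 1000) / 1000.0
    cumulative = 0.0
    for variant, split in splits.items():
        cumulative += split
        if bucket < cumulative:
            return variant
    return list(splits)[-1]

splits = {"control": 0.5, "treatment": 0.5}

# Deterministic: repeated calls for the same user agree.
assert assign("user_42", "greeting_v2_2026_03", splits) == \
       assign("user_42", "greeting_v2_2026_03", splits)

# Roughly balanced over 10,000 simulated users.
counts = {"control": 0, "treatment": 0}
for i in range(10_000):
    counts[assign(f"user_{i}", "greeting_v2_2026_03", splits)] += 1
print(counts)  # roughly 5000/5000
```

Hashing on test_id plus user_id, rather than user_id alone, also means assignments in different tests are independent, so one experiment does not systematically shadow another.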

Building an Analytics Dashboard

Aggregate all metrics into a single view that surfaces problems early:

@dataclass
class AgentHealthDashboard:
    """Daily snapshot of agent UX health."""
    date: str
    csat_score: float
    task_completion_rate: float
    avg_turns_to_completion: float
    escalation_rate: float
    error_rate: float
    error_recovery_rate: float
    avg_time_to_value: float
    avg_rephrase_rate: float
    active_ab_tests: list[str]
    alerts: list[str]


def generate_daily_alerts(dashboard: AgentHealthDashboard) -> list[str]:
    """Generate alerts when metrics cross thresholds."""
    alerts = []

    if dashboard.csat_score < 70:
        alerts.append(
            f"CSAT dropped to {dashboard.csat_score}% — "
            "investigate recent changes"
        )

    if dashboard.task_completion_rate < 65:
        alerts.append(
            f"Task completion at {dashboard.task_completion_rate}% — "
            "check for broken flows"
        )

    if dashboard.escalation_rate > 30:
        alerts.append(
            f"Escalation rate at {dashboard.escalation_rate}% — "
            "agent may be failing common intents"
        )

    if dashboard.avg_rephrase_rate > 2.0:
        alerts.append(
            f"Users rephrasing {dashboard.avg_rephrase_rate}x on average — "
            "NLU needs tuning"
        )

    if dashboard.avg_time_to_value > 45:
        alerts.append(
            f"Time to value at {dashboard.avg_time_to_value}s — "
            "first response too slow"
        )

    return alerts

Wire these alerts into your team's notification system (Slack, PagerDuty) so regressions are caught the same day they happen.

FAQ

How often should I collect CSAT surveys without causing survey fatigue?

Limit surveys to one per user session and no more than once per week for the same user. Rotate between end-of-task surveys and periodic in-depth surveys (like SUS). A 10-15% survey response rate is normal for in-product surveys — do not try to survey everyone. If your response rate drops below 5%, your survey prompt is too intrusive or too frequent.

What is the most important single metric for agent UX?

Task completion rate. If users cannot complete the task they came for, no amount of personality, formatting, or speed matters. Track it by task type (order lookup, returns, FAQ) so you can identify which specific flows are broken. A high overall completion rate can mask a 20% completion rate on a specific task that affects thousands of users.
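Tracking completion per task type is a small aggregation. The sketch below operates on plain dicts to stay standalone; the field names mirror the ConversationMetrics dataclass from earlier.

```python
from collections import defaultdict

def completion_by_task_type(conversations: list[dict]) -> dict[str, float]:
    """Per-task-type completion rate (%), so one broken flow can't hide."""
    totals: dict[str, int] = defaultdict(int)
    completed: dict[str, int] = defaultdict(int)
    for c in conversations:
        totals[c["task_type"]] += 1
        if c["task_completed"]:
            completed[c["task_type"]] += 1
    return {t: round(completed[t] / totals[t] * 100, 1) for t in totals}

print(completion_by_task_type([
    {"task_type": "order_lookup", "task_completed": True},
    {"task_type": "order_lookup", "task_completed": True},
    {"task_type": "returns", "task_completed": False},
    {"task_type": "returns", "task_completed": True},
]))
# {'order_lookup': 100.0, 'returns': 50.0}
```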

How do I isolate whether a UX change or a model change caused a metric shift?

Never ship a UX change and a model change simultaneously. If your A/B test changes the greeting format at the same time you update the underlying model, you cannot attribute the metric movement. Use staged rollouts: ship the model change first, let metrics stabilize for a week, then launch the UX A/B test. If you must do both, use a 2x2 factorial design (old model + old UX, old model + new UX, new model + old UX, new model + new UX) but this requires 4x the sample size.
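A 2x2 factorial assignment can reuse the same hash-bucket scheme as assign_variant, with a distinct test ID per dimension so the two assignments are statistically independent. The test IDs below are made up for illustration.

```python
import hashlib

def bucket(user_id: str, test_id: str) -> str:
    """50/50 split via the same hash-bucket scheme as assign_variant."""
    h = int(hashlib.sha256(f"{test_id}:{user_id}".encode()).hexdigest(), 16)
    return "a" if (h % 1000) / 1000.0 < 0.5 else "b"

def factorial_cell(user_id: str) -> tuple[str, str]:
    # Distinct test IDs decorrelate the two dimensions.
    model = "new_model" if bucket(user_id, "model_v3") == "b" else "old_model"
    ux = "new_ux" if bucket(user_id, "greeting_v2") == "b" else "old_ux"
    return model, ux
```

Each user lands in one of four cells, letting you estimate the model effect, the UX effect, and their interaction separately; hence the 4x sample-size cost.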


#UXMetrics #CSAT #Analytics #AIAgents #ABTesting #AgenticAI #LearnAI #AIEngineering

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
