Measuring Agent User Experience: CSAT, SUS, and Custom UX Metrics for AI Products
Build a comprehensive UX measurement framework for AI agents using CSAT surveys, System Usability Scale, custom behavioral metrics, A/B testing strategies, and analytics pipelines.
You Cannot Improve What You Cannot Measure
Building a great AI agent UX requires continuous measurement. Intuition and user complaints are not enough — you need quantitative metrics that track experience quality over time, surface regressions quickly, and provide actionable data for improvement.
AI agents present unique measurement challenges. Traditional web analytics (page views, click-through rates) do not capture conversational quality. You need a layered approach combining survey-based metrics, behavioral signals, and AI-specific quality indicators.
CSAT: Customer Satisfaction Score
CSAT is the most straightforward UX metric. Ask users to rate their experience on a 1-5 scale at the end of an interaction:
from dataclasses import dataclass
from datetime import datetime
from enum import Enum

class SurveyTrigger(Enum):
    TASK_COMPLETED = "task_completed"
    HUMAN_ESCALATION = "human_escalation"
    SESSION_END = "session_end"
    ERROR_RECOVERY = "error_recovery"

@dataclass
class CSATSurvey:
    conversation_id: str
    trigger: SurveyTrigger
    rating: int | None  # 1-5
    comment: str | None
    timestamp: datetime
    task_type: str
    turns_in_conversation: int
class CSATCollector:
    """Collect and analyze CSAT scores for agent interactions."""

    SURVEY_MESSAGES = {
        SurveyTrigger.TASK_COMPLETED: (
            "I'm glad I could help! On a scale of 1-5, "
            "how would you rate your experience today?"
        ),
        SurveyTrigger.HUMAN_ESCALATION: (
            "Before I transfer you, could you rate your experience "
            "with me so far? (1-5, 5 being excellent)"
        ),
        SurveyTrigger.ERROR_RECOVERY: (
            "I know we hit a bump earlier. Now that it's resolved, "
            "how would you rate the overall experience? (1-5)"
        ),
    }

    def should_survey(
        self,
        conversation_id: str,
        trigger: SurveyTrigger,
        recent_survey_count: int,
    ) -> bool:
        """Avoid survey fatigue — limit frequency."""
        if recent_survey_count >= 1:
            return False  # Max one survey per session
        # Survey at session end or on any trigger that has a defined prompt
        return (
            trigger == SurveyTrigger.SESSION_END
            or trigger in self.SURVEY_MESSAGES
        )

    def calculate_csat_score(self, surveys: list[CSATSurvey]) -> dict:
        """Calculate CSAT percentage (% of 4 and 5 ratings)."""
        rated = [s for s in surveys if s.rating is not None]
        if not rated:
            return {"score": None, "sample_size": 0}
        satisfied = sum(1 for s in rated if s.rating >= 4)
        return {
            "score": round((satisfied / len(rated)) * 100, 1),
            "sample_size": len(rated),
            "average_rating": round(
                sum(s.rating for s in rated) / len(rated), 2
            ),
        }
Target a CSAT score of 80% or higher. Below 70% indicates a systemic UX problem.
System Usability Scale (SUS)
SUS is a standardized 10-question survey that produces a score from 0-100. It is ideal for periodic deep-dive assessments of your agent's usability:
SUS_QUESTIONS = [
    "I think I would like to use this AI assistant frequently.",
    "I found the AI assistant unnecessarily complex.",
    "I thought the AI assistant was easy to use.",
    "I think I would need technical support to use this assistant.",
    "I found the various capabilities were well integrated.",
    "I thought there was too much inconsistency in this assistant.",
    "I imagine most people would learn to use this assistant quickly.",
    "I found the assistant very cumbersome to use.",
    "I felt very confident using the assistant.",
    "I needed to learn a lot before I could use this assistant.",
]

# Questions alternate between positive and negative framing
POSITIVE_QUESTIONS = {0, 2, 4, 6, 8}  # 0-indexed

def calculate_sus_score(responses: list[int]) -> float:
    """
    Calculate SUS score from 10 responses (each 1-5).

    Score ranges from 0 to 100. Above 68 is above average.
    Above 80 is excellent.
    """
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 responses")
    adjusted = []
    for i, response in enumerate(responses):
        if i in POSITIVE_QUESTIONS:
            adjusted.append(response - 1)  # Positive: score - 1
        else:
            adjusted.append(5 - response)  # Negative: 5 - score
    return sum(adjusted) * 2.5

def interpret_sus_score(score: float) -> str:
    if score >= 80.3:
        return "Excellent (Grade A)"
    elif score >= 68:
        return "Good (Grade C) — above average"
    elif score >= 51:
        return "OK (Grade D) — below average, needs improvement"
    else:
        return "Poor (Grade F) — significant usability issues"
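To make the alternating-framing arithmetic concrete, here is a compact, self-contained version of the scoring and a worked example:

```python
POSITIVE = {0, 2, 4, 6, 8}  # 0-indexed positively framed questions

def sus_score(responses: list[int]) -> float:
    # Positive items contribute (response - 1), negative items (5 - response);
    # the adjusted sum (0-40) is scaled by 2.5 onto the 0-100 SUS range.
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 responses")
    adjusted = [
        (r - 1) if i in POSITIVE else (5 - r)
        for i, r in enumerate(responses)
    ]
    return sum(adjusted) * 2.5

# Agree (4) with every positive item, disagree (2) with every negative item:
# each item contributes 3 adjusted points, 30 total, times 2.5 = 75.0
responses = [4, 2, 4, 2, 4, 2, 4, 2, 4, 2]
print(sus_score(responses))  # 75.0 — above the 68 average, short of "excellent"
```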
Custom Behavioral Metrics for AI Agents
Survey metrics capture stated satisfaction. Behavioral metrics capture actual usage patterns:
@dataclass
class ConversationMetrics:
    conversation_id: str
    started_at: datetime
    ended_at: datetime
    total_turns: int
    user_turns: int
    agent_turns: int
    task_completed: bool
    escalated_to_human: bool
    errors_encountered: int
    errors_recovered: int
    clarification_questions_asked: int
    follow_up_prompts_clicked: int
    user_rephrased_count: int  # Times user had to rephrase
    time_to_first_value: float  # Seconds to first useful response
    idle_gaps: list[float]  # Seconds between user messages
def calculate_behavioral_health(
    metrics: list[ConversationMetrics],
) -> dict:
    """Calculate aggregate behavioral health indicators."""
    total = len(metrics)
    if total == 0:
        return {}
    task_completion_rate = (
        sum(1 for m in metrics if m.task_completed) / total * 100
    )
    escalation_rate = (
        sum(1 for m in metrics if m.escalated_to_human) / total * 100
    )
    avg_turns_to_completion = (
        sum(m.total_turns for m in metrics if m.task_completed)
        / max(sum(1 for m in metrics if m.task_completed), 1)
    )
    avg_rephrase_rate = (
        sum(m.user_rephrased_count for m in metrics) / total
    )
    avg_time_to_value = (
        sum(m.time_to_first_value for m in metrics) / total
    )
    error_recovery_rate = (
        sum(m.errors_recovered for m in metrics)
        / max(sum(m.errors_encountered for m in metrics), 1)
        * 100
    )
    return {
        "task_completion_rate": round(task_completion_rate, 1),
        "escalation_rate": round(escalation_rate, 1),
        "avg_turns_to_completion": round(avg_turns_to_completion, 1),
        "avg_rephrase_rate": round(avg_rephrase_rate, 2),
        "avg_time_to_value_seconds": round(avg_time_to_value, 1),
        "error_recovery_rate": round(error_recovery_rate, 1),
    }
Key thresholds to watch: task completion rate below 70% means the agent is failing its core job. Rephrase rate above 1.5 per conversation means the agent is not understanding users. Time to first value above 30 seconds means the onboarding or first response is too slow.
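These thresholds are easy to encode as a table-driven check over the aggregate dict that `calculate_behavioral_health` returns. A sketch (the helper name and threshold table are illustrative, not part of the code above):

```python
# Hypothetical helper: flag the thresholds described above against the
# aggregate dict produced by calculate_behavioral_health().
THRESHOLDS = {
    "task_completion_rate": ("min", 70.0),       # below 70% -> failing core job
    "avg_rephrase_rate": ("max", 1.5),           # above 1.5 -> comprehension issues
    "avg_time_to_value_seconds": ("max", 30.0),  # above 30s -> too slow
}

def check_thresholds(health: dict) -> list[str]:
    flags = []
    for metric, (kind, limit) in THRESHOLDS.items():
        value = health.get(metric)
        if value is None:
            continue
        if kind == "min" and value < limit:
            flags.append(f"{metric}={value} below {limit}")
        elif kind == "max" and value > limit:
            flags.append(f"{metric}={value} above {limit}")
    return flags

health = {
    "task_completion_rate": 64.0,
    "avg_rephrase_rate": 1.2,
    "avg_time_to_value_seconds": 41.0,
}
print(check_thresholds(health))  # flags completion rate and time to value
```

Keeping the limits in one table makes it trivial to tune them per task type later.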
A/B Testing UX Changes
Test UX changes rigorously before rolling them out:
import hashlib
from dataclasses import dataclass

@dataclass
class ABTestConfig:
    test_id: str
    variants: dict[str, dict]  # variant_name -> config
    traffic_split: dict[str, float]  # variant_name -> fraction (0-1)
    primary_metric: str
    minimum_sample_size: int

def assign_variant(user_id: str, test: ABTestConfig) -> str:
    """Deterministically assign a user to a test variant."""
    hash_input = f"{test.test_id}:{user_id}"
    hash_value = int(hashlib.sha256(hash_input.encode()).hexdigest(), 16)
    bucket = (hash_value % 1000) / 1000.0
    cumulative = 0.0
    for variant, split in test.traffic_split.items():
        cumulative += split
        if bucket < cumulative:
            return variant
    return list(test.traffic_split.keys())[-1]

# Example: Testing a new greeting format
greeting_test = ABTestConfig(
    test_id="greeting_v2_2026_03",
    variants={
        "control": {
            "greeting_style": "list_capabilities",
            "max_greeting_length": 200,
        },
        "treatment": {
            "greeting_style": "single_question",
            "max_greeting_length": 50,
        },
    },
    traffic_split={"control": 0.5, "treatment": 0.5},
    primary_metric="task_completion_rate",
    minimum_sample_size=500,
)
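Two properties matter for hash-based assignment: the same user must always land in the same variant, and the split must actually match the configured fractions. A self-contained check using the same SHA-256 bucketing scheme:

```python
import hashlib

def bucket(user_id: str, test_id: str) -> float:
    """Same hash-bucketing scheme as assign_variant above."""
    h = int(hashlib.sha256(f"{test_id}:{user_id}".encode()).hexdigest(), 16)
    return (h % 1000) / 1000.0

# Determinism: the same user always gets the same bucket
assert bucket("user-42", "greeting_v2_2026_03") == bucket("user-42", "greeting_v2_2026_03")

# Split sanity check: roughly half of 10,000 simulated users fall below 0.5
treatment = sum(
    1 for i in range(10_000)
    if bucket(f"user-{i}", "greeting_v2_2026_03") < 0.5
)
print(treatment / 10_000)  # close to 0.5
```

Because the hash input includes the test ID, assignments across different tests are independent: a user in the greeting treatment is not systematically in the treatment of every other test.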
Building an Analytics Dashboard
Aggregate all metrics into a single view that surfaces problems early:
@dataclass
class AgentHealthDashboard:
    """Daily snapshot of agent UX health."""

    date: str
    csat_score: float
    task_completion_rate: float
    avg_turns_to_completion: float
    escalation_rate: float
    error_rate: float
    error_recovery_rate: float
    avg_time_to_value: float
    avg_rephrase_rate: float
    active_ab_tests: list[str]
    alerts: list[str]

def generate_daily_alerts(dashboard: AgentHealthDashboard) -> list[str]:
    """Generate alerts when metrics cross thresholds."""
    alerts = []
    if dashboard.csat_score < 70:
        alerts.append(
            f"CSAT dropped to {dashboard.csat_score}% — "
            "investigate recent changes"
        )
    if dashboard.task_completion_rate < 65:
        alerts.append(
            f"Task completion at {dashboard.task_completion_rate}% — "
            "check for broken flows"
        )
    if dashboard.escalation_rate > 30:
        alerts.append(
            f"Escalation rate at {dashboard.escalation_rate}% — "
            "agent may be failing common intents"
        )
    if dashboard.avg_rephrase_rate > 2.0:
        alerts.append(
            f"Users rephrasing {dashboard.avg_rephrase_rate}x on average — "
            "NLU needs tuning"
        )
    if dashboard.avg_time_to_value > 45:
        alerts.append(
            f"Time to value at {dashboard.avg_time_to_value}s — "
            "first response too slow"
        )
    return alerts
Wire these alerts into your team's notification system (Slack, PagerDuty) so regressions are caught the same day they happen.
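For Slack, the incoming-webhook API accepts a JSON body with a `text` field; the webhook URL itself is per-workspace configuration. A sketch of formatting the daily alerts into that payload (the helper name is illustrative):

```python
import json

def build_slack_payload(alerts: list[str]) -> str:
    """Format daily alerts as a Slack incoming-webhook JSON body.
    POST this string to your configured webhook URL."""
    lines = "\n".join(f"• {a}" for a in alerts)
    return json.dumps({"text": f":rotating_light: Agent UX alerts\n{lines}"})

payload = build_slack_payload([
    "CSAT dropped to 68.0% — investigate recent changes",
    "Time to value at 52.3s — first response too slow",
])
print(payload)
```

Keeping the payload builder separate from the HTTP call makes it easy to unit-test the message format and to swap in a different notifier (PagerDuty, email) behind the same alert list.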
FAQ
How often should I collect CSAT surveys without causing survey fatigue?
Limit surveys to one per user session and no more than once per week for the same user. Rotate between end-of-task surveys and periodic in-depth surveys (like SUS). A 10-15% survey response rate is normal for in-product surveys — do not try to survey everyone. If your response rate drops below 5%, your survey prompt is too intrusive or too frequent.
What is the most important single metric for agent UX?
Task completion rate. If users cannot complete the task they came for, no amount of personality, formatting, or speed matters. Track it by task type (order lookup, returns, FAQ) so you can identify which specific flows are broken. A high overall completion rate can mask a 20% completion rate on a specific task that affects thousands of users.
How do I isolate whether a UX change or a model change caused a metric shift?
Never ship a UX change and a model change simultaneously. If your A/B test changes the greeting format at the same time you update the underlying model, you cannot attribute the metric movement. Use staged rollouts: ship the model change first, let metrics stabilize for a week, then launch the UX A/B test. If you must do both, use a 2x2 factorial design (old model + old UX, old model + new UX, new model + old UX, new model + new UX) but this requires 4x the sample size.
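The 2x2 factorial assignment can reuse the same hash-bucketing idea with two independent salts, one per factor, so each of the four cells gets roughly a quarter of traffic. A sketch (function names and salts are illustrative):

```python
import hashlib

def hash_bucket(user_id: str, salt: str) -> float:
    h = int(hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest(), 16)
    return (h % 1000) / 1000.0

def factorial_cell(user_id: str) -> tuple[str, str]:
    """Assign on two independently salted hashes so the four
    model x UX cells each receive ~25% of traffic."""
    model = "new_model" if hash_bucket(user_id, "model_test") < 0.5 else "old_model"
    ux = "new_ux" if hash_bucket(user_id, "ux_test") < 0.5 else "old_ux"
    return model, ux

# Same user -> same cell on every request
assert factorial_cell("user-7") == factorial_cell("user-7")
print(factorial_cell("user-7"))
```

Because the two factors hash with different salts, the model assignment is statistically independent of the UX assignment, which is exactly what the factorial analysis requires.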
#UXMetrics #CSAT #Analytics #AIAgents #ABTesting #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.