Agent Conversation Analytics: Understanding User Behavior and Agent Performance
Build conversation analytics for AI agents that measure success rates, identify drop-off points, track user satisfaction, and surface patterns that drive product and prompt improvements.
Beyond Uptime: Understanding How Agents Actually Perform
An agent can be online, fast, and error-free while still failing its users. If 40% of conversations end with the user rephrasing their question three times and then leaving, your monitoring will show green dashboards while your users are frustrated. Conversation analytics bridges this gap by measuring what matters from the user's perspective: Did the agent solve the problem? How many turns did it take? Where did users give up?
These analytics feed directly into product decisions — which features to build, which prompts to rewrite, and where to invest in better tooling.
Defining Conversation Events
Capture structured events throughout the conversation lifecycle. These events form the raw data for all downstream analytics.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
import uuid


class ConversationEvent(Enum):
    STARTED = "started"
    USER_MESSAGE = "user_message"
    AGENT_RESPONSE = "agent_response"
    TOOL_CALLED = "tool_called"
    HANDOFF_REQUESTED = "handoff_requested"
    FEEDBACK_RECEIVED = "feedback_received"
    COMPLETED = "completed"
    ABANDONED = "abandoned"


@dataclass
class EventRecord:
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    conversation_id: str = ""
    user_id: str = ""
    event_type: ConversationEvent = ConversationEvent.STARTED
    # Timezone-aware timestamps; datetime.utcnow() is deprecated and naive
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    metadata: dict = field(default_factory=dict)


class ConversationTracker:
    def __init__(self, event_store):
        self.store = event_store

    async def record(
        self,
        conversation_id: str,
        user_id: str,
        event_type: ConversationEvent,
        **metadata,
    ):
        event = EventRecord(
            conversation_id=conversation_id,
            user_id=user_id,
            event_type=event_type,
            metadata=metadata,
        )
        await self.store.insert(event)
        return event
Instrumenting the Agent
Emit events at each meaningful point in the conversation flow.
tracker = ConversationTracker(event_store)


async def run_conversation(user_message: str, user_id: str, conversation_id: str):
    await tracker.record(
        conversation_id, user_id,
        ConversationEvent.STARTED,
        channel="web",
    )
    turn_count = 0
    while True:
        turn_count += 1
        await tracker.record(
            conversation_id, user_id,
            ConversationEvent.USER_MESSAGE,
            message_length=len(user_message),
            turn=turn_count,
        )

        response = await agent.run(user_message)

        if response.tool_calls:
            for tc in response.tool_calls:
                await tracker.record(
                    conversation_id, user_id,
                    ConversationEvent.TOOL_CALLED,
                    tool_name=tc.function.name,
                    turn=turn_count,
                )

        await tracker.record(
            conversation_id, user_id,
            ConversationEvent.AGENT_RESPONSE,
            response_length=len(response.content),
            turn=turn_count,
            model=response.model,
        )

        if is_conversation_complete(response):
            await tracker.record(
                conversation_id, user_id,
                ConversationEvent.COMPLETED,
                total_turns=turn_count,
            )
            break

        user_message = await get_next_user_message()
        if user_message is None:  # User left
            await tracker.record(
                conversation_id, user_id,
                ConversationEvent.ABANDONED,
                abandoned_at_turn=turn_count,
            )
            break

    return response.content
Key Analytics Queries
With events stored in a database, calculate the metrics that matter.
from sqlalchemy import text


async def get_conversation_metrics(db, days: int = 7):
    """Core conversation performance metrics."""
    result = await db.execute(text("""
        WITH conversations AS (
            SELECT
                conversation_id,
                MIN(CASE WHEN event_type = 'started' THEN timestamp END) AS start_time,
                MAX(CASE WHEN event_type = 'completed' THEN timestamp END) AS end_time,
                BOOL_OR(event_type = 'completed') AS was_completed,
                BOOL_OR(event_type = 'abandoned') AS was_abandoned,
                BOOL_OR(event_type = 'handoff_requested') AS had_handoff,
                COUNT(CASE WHEN event_type = 'user_message' THEN 1 END) AS user_turns
            FROM conversation_events
            -- Bind parameters are not expanded inside quoted literals like
            -- INTERVAL ':days days', so build the interval with make_interval
            WHERE timestamp >= NOW() - make_interval(days => :days)
            GROUP BY conversation_id
        )
        SELECT
            COUNT(*) AS total_conversations,
            ROUND(AVG(CASE WHEN was_completed THEN 1.0 ELSE 0.0 END) * 100, 1) AS completion_rate,
            ROUND(AVG(CASE WHEN was_abandoned THEN 1.0 ELSE 0.0 END) * 100, 1) AS abandonment_rate,
            ROUND(AVG(CASE WHEN had_handoff THEN 1.0 ELSE 0.0 END) * 100, 1) AS handoff_rate,
            ROUND(AVG(user_turns), 1) AS avg_turns,
            ROUND(AVG(EXTRACT(EPOCH FROM (end_time - start_time))), 0) AS avg_duration_seconds
        FROM conversations
    """), {"days": days})
    return result.fetchone()
async def get_drop_off_analysis(db, days: int = 7):
    """Find which turn users most commonly abandon at."""
    result = await db.execute(text("""
        SELECT
            (metadata->>'abandoned_at_turn')::int AS abandon_turn,
            COUNT(*) AS abandon_count
        FROM conversation_events
        WHERE event_type = 'abandoned'
          AND timestamp >= NOW() - make_interval(days => :days)
        GROUP BY abandon_turn
        ORDER BY abandon_count DESC
        LIMIT 10
    """), {"days": days})
    return result.fetchall()
Measuring User Satisfaction
Capture explicit feedback (thumbs up/down) and infer implicit satisfaction from behavior signals.
async def calculate_satisfaction_score(db, conversation_id: str) -> float:
    """Combine explicit and implicit satisfaction signals."""
    events = await db.execute(text("""
        SELECT event_type, metadata
        FROM conversation_events
        WHERE conversation_id = :cid
        ORDER BY timestamp
    """), {"cid": conversation_id})
    rows = events.fetchall()

    signals = []
    for row in rows:
        if row.event_type == "feedback_received":
            rating = row.metadata.get("rating")
            if rating == "positive":
                signals.append(1.0)
            elif rating == "negative":
                signals.append(0.0)

    # Implicit signals
    user_messages = [r for r in rows if r.event_type == "user_message"]
    completed = any(r.event_type == "completed" for r in rows)
    handoff = any(r.event_type == "handoff_requested" for r in rows)

    if completed and len(user_messages) <= 3:
        signals.append(0.9)  # Resolved quickly
    elif handoff:
        signals.append(0.3)  # Needed human help
    elif not completed:
        signals.append(0.1)  # Abandoned

    # High turn counts are a proxy for rephrasing: each user message
    # beyond the third chips away at the implied satisfaction
    if len(user_messages) > 2:
        rephrase_penalty = max(0, (len(user_messages) - 3) * 0.1)
        signals.append(max(0.0, 0.8 - rephrase_penalty))

    return sum(signals) / len(signals) if signals else 0.5
FAQ
How do I detect that a user is rephrasing their question out of frustration?
Compare consecutive user messages using embedding similarity. If two sequential messages have cosine similarity above 0.85 but the words are different, the user is likely rephrasing because the agent did not understand or adequately address their first attempt. Track the rephrase rate as a key quality indicator — a rising rephrase rate is an early warning of prompt or retrieval degradation.
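The comparison described above can be sketched in a few lines. This assumes an `embed` callable that maps a message to a vector — a placeholder for whatever embedding model you use, not a specific library's API — and the 0.85 threshold is a starting point to tune:

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def count_rephrases(messages, embed, threshold=0.85):
    """Count consecutive user messages that are semantically close
    but not verbatim repeats -- the likely-rephrase signal."""
    vectors = [embed(m) for m in messages]
    rephrases = 0
    for i in range(1, len(messages)):
        same_text = messages[i] == messages[i - 1]
        similar = cosine_similarity(vectors[i - 1], vectors[i]) >= threshold
        if similar and not same_text:
            rephrases += 1
    return rephrases
```

Dividing the rephrase count by total user turns per conversation, then averaging across conversations, gives the rephrase rate to track over time.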
What is a good conversation completion rate?
It depends on the agent's domain. Customer support agents that handle well-scoped tasks should target 70-85% completion. General-purpose assistants might see 50-60% because users often explore or ask questions outside the agent's scope. More important than the absolute number is the trend — a 5% drop in completion rate over a week signals a real problem worth investigating.
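The trend check from that answer is easy to make concrete. A minimal sketch, with function names and the 5-point default being illustrative choices of ours, that compares week-over-week completion rates built from (completed, total) conversation counts:

```python
def completion_rate(completed: int, total: int) -> float:
    """Completion rate as a percentage, guarding against empty weeks."""
    return 100.0 * completed / total if total else 0.0


def completion_drop_alert(last_week, this_week, max_drop_pts=5.0):
    """Flag a week-over-week fall in completion rate larger than
    `max_drop_pts` percentage points; each argument is (completed, total)."""
    prev = completion_rate(*last_week)
    curr = completion_rate(*this_week)
    return (prev - curr) > max_drop_pts
```

Comparing rates rather than raw counts keeps the alert stable when traffic volume fluctuates between weeks.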
Should I track analytics per agent or per conversation?
Both. Per-conversation analytics help you debug individual interactions and identify specific failure patterns. Per-agent analytics reveal systemic trends — which agent types perform best, which need prompt improvements, and how performance compares across models. Aggregate first by agent, then drill down into conversations for root cause analysis.
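One way to do that roll-up in application code — a sketch, with the function name and input shape being our own convention — is to average per-conversation satisfaction scores by agent before drilling into individual conversations:

```python
from collections import defaultdict
from statistics import mean


def aggregate_scores_by_agent(scores):
    """Roll (agent_id, satisfaction_score) pairs up into per-agent
    averages -- the first view to scan before root-causing
    individual conversations."""
    by_agent = defaultdict(list)
    for agent_id, score in scores:
        by_agent[agent_id].append(score)
    return {agent: round(mean(vals), 2) for agent, vals in by_agent.items()}
```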
CallSphere Team
Expert insights on AI voice agents and customer communication automation.