AI Voice Agent Analytics: The KPIs That Actually Matter
The 15 KPIs that matter for AI voice agent operations — from answer rate and FCR to cost per successful resolution.
If you are not measuring these, you are guessing
Voice agent dashboards tend to show whatever was easiest to build — total calls, total minutes, maybe sentiment. None of those tell you whether the agent is good at its job. This post lays out the 15 KPIs that actually matter for operating an AI voice agent and shows how to compute each one against a standard call log schema.
Every metric answers a question:
• Did callers reach us?
• Did the agent solve their problem?
• How much did it cost?
• Did anything go wrong?
Architecture overview
┌────────────────────┐
│ Voice agent runtime│
└─────────┬──────────┘
│ call events
▼
┌────────────────────┐
│ calls table (OLTP) │
└─────────┬──────────┘
│ CDC / copy
▼
┌────────────────────┐
│ analytics store │
│ (ClickHouse / BQ) │
└─────────┬──────────┘
│
▼
┌────────────────────┐
│ dashboards + alerts│
└────────────────────┘
Prerequisites
- A calls table with at minimum: call_id, started_at, ended_at, duration_sec, outcome, escalated, language, cost_cents.
- A call_turns table with transcripts.
- A call_events table (or enum column) with outcomes like resolved, escalated, abandoned.
The 15 KPIs
1. Answer rate
Percentage of inbound attempts that the agent actually picked up.
SELECT
COUNT(*) FILTER (WHERE status = 'answered') * 1.0 / COUNT(*) AS answer_rate
FROM calls
WHERE started_at >= now() - interval '7 days';
2. Time to first word
How long from ring to the first syllable of the agent's greeting.
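A sketch of how this could be queried, assuming the calls table also records a first_agent_audio_at timestamp for when greeting audio began (an assumed column, not part of the minimal schema in this post):

```sql
-- Median and p95 time to first word over the last 7 days.
-- Assumes a first_agent_audio_at TIMESTAMPTZ column on calls.
SELECT
  percentile_cont(0.5)  WITHIN GROUP (ORDER BY EXTRACT(EPOCH FROM (first_agent_audio_at - started_at))) AS ttfw_p50_sec,
  percentile_cont(0.95) WITHIN GROUP (ORDER BY EXTRACT(EPOCH FROM (first_agent_audio_at - started_at))) AS ttfw_p95_sec
FROM calls
WHERE status = 'answered'
  AND started_at >= now() - interval '7 days';
```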
3. Average handle time (AHT)
Average duration of answered calls, from pickup to hang-up.
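Using the calls table standardized in the walkthrough below, AHT over the same 7-day window looks like:

```sql
SELECT AVG(duration_sec) AS aht_sec
FROM calls
WHERE status = 'answered'
  AND started_at >= now() - interval '7 days';
```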
4. First-contact resolution (FCR)
SELECT
  COUNT(*) FILTER (WHERE outcome = 'resolved' AND NOT followup_required) * 1.0 / COUNT(*) AS fcr
FROM calls
WHERE started_at >= now() - interval '7 days';
5. Escalation rate
Percentage of answered calls handed off to a human.
6. Containment rate
Inverse of escalation — the percentage of calls fully handled by the agent.
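Escalation and containment can come from one query over the escalated flag; a sketch over the same 7-day window:

```sql
SELECT
  COUNT(*) FILTER (WHERE escalated) * 1.0 / COUNT(*)     AS escalation_rate,
  COUNT(*) FILTER (WHERE NOT escalated) * 1.0 / COUNT(*) AS containment_rate
FROM calls
WHERE status = 'answered'
  AND started_at >= now() - interval '7 days';
```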
7. Abandon rate
Percentage of inbound attempts where the caller hung up before the agent answered.
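A sketch, assuming abandoned calls land in the status column as 'abandoned' (an assumption; your telephony layer may use a different enum):

```sql
SELECT
  COUNT(*) FILTER (WHERE status = 'abandoned') * 1.0 / COUNT(*) AS abandon_rate
FROM calls
WHERE started_at >= now() - interval '7 days';
```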
8. Booking rate (for scheduling verticals)
Percentage of answered calls that end with a confirmed appointment.
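A sketch, assuming booked appointments are recorded as outcome = 'booked' (your outcome enum may differ):

```sql
SELECT
  COUNT(*) FILTER (WHERE outcome = 'booked') * 1.0 /
  NULLIF(COUNT(*) FILTER (WHERE status = 'answered'), 0) AS booking_rate
FROM calls
WHERE started_at >= now() - interval '7 days';
```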
9. Sentiment score
Aggregate from the post-call pipeline.
10. Cost per successful resolution
SELECT
  SUM(cost_cents) * 1.0 / NULLIF(COUNT(*) FILTER (WHERE outcome = 'resolved'), 0) AS cpsr_cents
FROM calls
WHERE started_at >= now() - interval '7 days';
11. STT word error rate (WER)
Sample 1% of calls, have humans transcribe, compare.
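WER itself is edit distance over word tokens, divided by the reference word count. A minimal, dependency-free sketch (a hypothetical helper, not part of any STT SDK):

```typescript
// Word error rate: Levenshtein distance over word tokens,
// normalized by the number of words in the human reference transcript.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = reference.toLowerCase().split(/\s+/).filter(Boolean);
  const hyp = hypothesis.toLowerCase().split(/\s+/).filter(Boolean);
  // dp[i][j] = edit distance between ref[0..i) and hyp[0..j)
  const dp: number[][] = Array.from({ length: ref.length + 1 }, (_, i) =>
    Array.from({ length: hyp.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0)),
  );
  for (let i = 1; i <= ref.length; i++) {
    for (let j = 1; j <= hyp.length; j++) {
      const sub = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      dp[i][j] = Math.min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + sub);
    }
  }
  return ref.length === 0 ? 0 : dp[ref.length][hyp.length] / ref.length;
}
```

Run this only on the human-transcribed 1% sample; aggregate the per-call values into the rollup like any other metric.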
12. Tool call success rate
Percentage of tool invocations (calendar lookups, CRM writes, transfers) that complete without error.
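A sketch, assuming the call_events table carries an event_type and a success flag per tool invocation (both assumed columns):

```sql
SELECT
  COUNT(*) FILTER (WHERE success) * 1.0 / COUNT(*) AS tool_success_rate
FROM call_events
WHERE event_type = 'tool_call'
  AND created_at >= now() - interval '7 days';
```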
13. Hallucination flag rate
From the post-call QA pipeline.
14. CSAT (when available)
Post-call survey score; where surveys are absent, a transcript-based satisfaction proxy can stand in, validated periodically against real surveys.
15. Latency p95
95th percentile of response latency: from the end of caller speech to the start of agent audio.
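A sketch, assuming each row in call_turns records a latency_ms value for that turn (an assumed column):

```sql
SELECT percentile_cont(0.95) WITHIN GROUP (ORDER BY latency_ms) AS latency_p95_ms
FROM call_turns
WHERE created_at >= now() - interval '7 days';
```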
Step-by-step walkthrough
1. Standardize the call log schema
CREATE TABLE calls (
call_id TEXT PRIMARY KEY,
started_at TIMESTAMPTZ NOT NULL,
ended_at TIMESTAMPTZ,
duration_sec INT,
status TEXT NOT NULL,
outcome TEXT,
escalated BOOLEAN DEFAULT FALSE,
followup_required BOOLEAN DEFAULT FALSE,
language TEXT,
cost_cents INT,
agent_version TEXT
);
2. Compute metrics in batches
Run a 5-minute rollup job for dashboards and an hourly rollup for historical trends.
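A minimal 5-minute rollup sketch, assuming a kpi_rollup target table (hypothetical; create it with the columns your dashboard queries) and a scheduler such as cron or pg_cron:

```sql
-- Run every 5 minutes; each row summarizes one window.
INSERT INTO kpi_rollup (period_start, period_end, answer_rate, fcr, aht_sec)
SELECT
  now() - interval '5 minutes',
  now(),
  COUNT(*) FILTER (WHERE status = 'answered') * 1.0 / NULLIF(COUNT(*), 0),
  COUNT(*) FILTER (WHERE outcome = 'resolved' AND NOT followup_required) * 1.0 / NULLIF(COUNT(*), 0),
  AVG(duration_sec) FILTER (WHERE status = 'answered')
FROM calls
WHERE started_at >= now() - interval '5 minutes';
```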
3. Set SLOs and alert on p95
Pick a target per KPI and alert on sustained breaches (two or more consecutive rollup windows), not single spikes.
4. Expose the metrics in an admin UI
// Read precomputed KPI rows from the rollup table for a date range.
async function fetchKpis(from: string, to: string) {
  return await db.manyOrNone(
    "SELECT * FROM kpi_rollup WHERE period_start >= $1 AND period_end <= $2",
    [from, to],
  );
}
5. Build an evaluation harness
Take real calls, mask PII, and replay them against a staging agent to compare FCR and AHT across prompt versions.
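A deliberately naive masking pass for the replay step (a hypothetical helper; production pipelines should add an NER pass for names and addresses on top of regex rules):

```typescript
// Redact phone numbers and email addresses from a transcript
// before it leaves the production boundary for staging replay.
function maskPii(transcript: string): string {
  return transcript
    .replace(/[\w.+-]+@[\w-]+\.[\w.]+/g, "[EMAIL]")
    .replace(/\+?\d[\d\s().-]{7,}\d/g, "[PHONE]");
}
```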
Production considerations
- Sampling: WER and hallucination checks need human labelers; sample, do not inspect all.
- Cost attribution: Realtime API + TTS + Twilio + STT all contribute; track separately.
- Version pinning: record which agent version handled each call for A/B comparisons.
- PII in dashboards: mask caller IDs and names at the dashboard layer.
- Retention: raw transcripts are sensitive; delete or tokenize after 30-90 days depending on vertical.
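The retention point above can be a one-line scheduled job, assuming call_turns rows carry a created_at timestamp:

```sql
-- Purge raw transcripts past the retention window; adjust per vertical.
DELETE FROM call_turns
WHERE created_at < now() - interval '90 days';
```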
CallSphere's real implementation
CallSphere runs a GPT-4o-mini post-call analytics pipeline that writes sentiment, intent, lead score, satisfaction, and escalation flags into per-vertical Postgres databases. Those columns feed the 15 KPIs above in an admin dashboard every customer gets access to. The live voice plane runs the OpenAI Realtime API (gpt-4o-realtime-preview-2025-06-03) at 24kHz PCM16 with server VAD.
Across 14 healthcare tools, 10 real estate agents, 4 salon agents, 7 after-hours escalation tools, IT helpdesk agents with 10+ RAG-backed tools, and the 5-specialist ElevenLabs sales pod, KPIs are computed identically so customers can compare performance across verticals. The OpenAI Agents SDK orchestrates handoffs. CallSphere supports 57+ languages with sub-second end-to-end latency.
Common pitfalls
- Averaging everything: p95 is what customers feel.
- Counting minutes, not outcomes: minutes do not pay the bills, resolutions do.
- Ignoring hallucination rate: it is the single biggest trust killer.
- Skipping version tags: you cannot prove a prompt improvement without them.
- Dashboards nobody looks at: build alerts before dashboards.
FAQ
What is a good FCR for an AI voice agent?
60-80% for well-scoped verticals, lower for open-ended support.
How do I measure CSAT without a post-call survey?
Use the GPT-4o-mini satisfaction score on the transcript as a proxy, validated by periodic real surveys.
What is a reasonable answer-rate target?
Aim for 95%+ for always-on agents; misses beyond that usually trace to configuration errors or carrier outages.
How do I avoid biasing the post-call LLM scorer?
Run it blind to agent version and spot-check with humans.
Can I compare my agent to humans directly?
Only against matched caller intents and with the same KPI definitions.
Next steps
Want a dashboard wired to real voice-agent KPIs? Book a demo, read the technology page, or see pricing.
#CallSphere #Analytics #KPIs #VoiceAI #Observability #Metrics #AIVoiceAgents
Written by
CallSphere Team
Expert insights on AI voice agents and customer communication automation.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.