AI Voice Agent Analytics: The KPIs That Actually Matter
The 15 KPIs that matter for AI voice agent operations — from answer rate and FCR to cost per successful resolution.
If you are not measuring these, you are guessing
Voice agent dashboards tend to show whatever was easiest to build — total calls, total minutes, maybe sentiment. None of those tell you whether the agent is good at its job. This post lays out the 15 KPIs that actually matter for operating an AI voice agent and shows how to compute each one against a standard call log schema.
Every metric answers a question:
• Did callers reach us?
• Did the agent solve their problem?
• How much did it cost?
• Did anything go wrong?
Architecture overview
┌────────────────────┐
│ Voice agent runtime│
└─────────┬──────────┘
│ call events
▼
┌────────────────────┐
│ calls table (OLTP) │
└─────────┬──────────┘
│ CDC / copy
▼
┌────────────────────┐
│ analytics store │
│ (ClickHouse / BQ) │
└─────────┬──────────┘
│
▼
┌────────────────────┐
│ dashboards + alerts│
└────────────────────┘
Prerequisites
- A calls table with at minimum: call_id, started_at, ended_at, duration_sec, outcome, escalated, language, cost_cents.
- A call_turns table with transcripts.
- A call_events table (or enum column) with outcomes like resolved, escalated, abandoned.
The 15 KPIs
1. Answer rate
Percentage of inbound attempts that the agent actually picked up.
SELECT
COUNT(*) FILTER (WHERE status = 'answered') * 1.0 / COUNT(*) AS answer_rate
FROM calls
WHERE started_at >= now() - interval '7 days';
2. Time to first word
How long from ring to the first syllable of the agent's greeting.
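A sketch of how this could be queried, assuming the calls table also records a first_agent_audio_at timestamp for when greeting audio began (an assumed column, not part of the minimal schema in this post):

```sql
-- Median and p95 time to first word over the last 7 days.
-- Assumes a first_agent_audio_at TIMESTAMPTZ column on calls.
SELECT
  percentile_cont(0.5)  WITHIN GROUP (ORDER BY EXTRACT(EPOCH FROM (first_agent_audio_at - started_at))) AS ttfw_p50_sec,
  percentile_cont(0.95) WITHIN GROUP (ORDER BY EXTRACT(EPOCH FROM (first_agent_audio_at - started_at))) AS ttfw_p95_sec
FROM calls
WHERE status = 'answered'
  AND started_at >= now() - interval '7 days';
```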
3. Average handle time (AHT)
Average duration of answered calls, from pickup to hang-up.
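Using the calls table standardized in the walkthrough below, AHT over the same 7-day window looks like:

```sql
SELECT AVG(duration_sec) AS aht_sec
FROM calls
WHERE status = 'answered'
  AND started_at >= now() - interval '7 days';
```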
4. First-contact resolution (FCR)
SELECT
  COUNT(*) FILTER (WHERE outcome = 'resolved' AND NOT followup_required) * 1.0 / COUNT(*) AS fcr
FROM calls
WHERE started_at >= now() - interval '7 days';
5. Escalation rate
Percentage of answered calls handed off to a human.
6. Containment rate
Inverse of escalation — the percentage of calls fully handled by the agent.
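Escalation and containment can come from one query over the escalated flag; a sketch over the same 7-day window:

```sql
SELECT
  COUNT(*) FILTER (WHERE escalated) * 1.0 / COUNT(*)     AS escalation_rate,
  COUNT(*) FILTER (WHERE NOT escalated) * 1.0 / COUNT(*) AS containment_rate
FROM calls
WHERE status = 'answered'
  AND started_at >= now() - interval '7 days';
```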
7. Abandon rate
Percentage of inbound attempts where the caller hung up before the agent answered.
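A sketch, assuming abandoned calls land in the status column as 'abandoned' (an assumption; your telephony layer may use a different enum):

```sql
SELECT
  COUNT(*) FILTER (WHERE status = 'abandoned') * 1.0 / COUNT(*) AS abandon_rate
FROM calls
WHERE started_at >= now() - interval '7 days';
```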
8. Booking rate (for scheduling verticals)
Percentage of answered calls that end with a confirmed appointment.
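A sketch, assuming booked appointments are recorded as outcome = 'booked' (your outcome enum may differ):

```sql
SELECT
  COUNT(*) FILTER (WHERE outcome = 'booked') * 1.0 /
  NULLIF(COUNT(*) FILTER (WHERE status = 'answered'), 0) AS booking_rate
FROM calls
WHERE started_at >= now() - interval '7 days';
```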
9. Sentiment score
Aggregate from the post-call pipeline.
10. Cost per successful resolution
SELECT
  SUM(cost_cents) * 1.0 / NULLIF(COUNT(*) FILTER (WHERE outcome = 'resolved'), 0) AS cpsr_cents
FROM calls
WHERE started_at >= now() - interval '7 days';
11. STT word error rate (WER)
Sample 1% of calls, have humans transcribe, compare.
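WER itself is edit distance over word tokens, divided by the reference word count. A minimal, dependency-free sketch (a hypothetical helper, not part of any STT SDK):

```typescript
// Word error rate: Levenshtein distance over word tokens,
// normalized by the number of words in the human reference transcript.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = reference.toLowerCase().split(/\s+/).filter(Boolean);
  const hyp = hypothesis.toLowerCase().split(/\s+/).filter(Boolean);
  // dp[i][j] = edit distance between ref[0..i) and hyp[0..j)
  const dp: number[][] = Array.from({ length: ref.length + 1 }, (_, i) =>
    Array.from({ length: hyp.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0)),
  );
  for (let i = 1; i <= ref.length; i++) {
    for (let j = 1; j <= hyp.length; j++) {
      const sub = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      dp[i][j] = Math.min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + sub);
    }
  }
  return ref.length === 0 ? 0 : dp[ref.length][hyp.length] / ref.length;
}
```

Run this only on the human-transcribed 1% sample; aggregate the per-call values into the rollup like any other metric.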
12. Tool call success rate
Percentage of tool invocations (calendar lookups, CRM writes, transfers) that complete without error.
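A sketch, assuming the call_events table carries an event_type and a success flag per tool invocation (both assumed columns):

```sql
SELECT
  COUNT(*) FILTER (WHERE success) * 1.0 / COUNT(*) AS tool_success_rate
FROM call_events
WHERE event_type = 'tool_call'
  AND created_at >= now() - interval '7 days';
```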
13. Hallucination flag rate
From the post-call QA pipeline.
14. CSAT (when available)
Post-call survey score; where surveys are absent, a transcript-based satisfaction proxy can stand in, validated periodically against real surveys.
15. Latency p95
95th percentile of response latency: from the end of caller speech to the start of agent audio.
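A sketch, assuming each row in call_turns records a latency_ms value for that turn (an assumed column):

```sql
SELECT percentile_cont(0.95) WITHIN GROUP (ORDER BY latency_ms) AS latency_p95_ms
FROM call_turns
WHERE created_at >= now() - interval '7 days';
```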
Step-by-step walkthrough
1. Standardize the call log schema
CREATE TABLE calls (
call_id TEXT PRIMARY KEY,
started_at TIMESTAMPTZ NOT NULL,
ended_at TIMESTAMPTZ,
duration_sec INT,
status TEXT NOT NULL,
outcome TEXT,
escalated BOOLEAN DEFAULT FALSE,
followup_required BOOLEAN DEFAULT FALSE,
language TEXT,
cost_cents INT,
agent_version TEXT
);
2. Compute metrics in batches
Run a 5-minute rollup job for dashboards and an hourly rollup for historical trends.
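A minimal 5-minute rollup sketch, assuming a kpi_rollup target table (hypothetical; create it with the columns your dashboard queries) and a scheduler such as cron or pg_cron:

```sql
-- Run every 5 minutes; each row summarizes one window.
INSERT INTO kpi_rollup (period_start, period_end, answer_rate, fcr, aht_sec)
SELECT
  now() - interval '5 minutes',
  now(),
  COUNT(*) FILTER (WHERE status = 'answered') * 1.0 / NULLIF(COUNT(*), 0),
  COUNT(*) FILTER (WHERE outcome = 'resolved' AND NOT followup_required) * 1.0 / NULLIF(COUNT(*), 0),
  AVG(duration_sec) FILTER (WHERE status = 'answered')
FROM calls
WHERE started_at >= now() - interval '5 minutes';
```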
3. Set SLOs and alert on p95
Pick a target per KPI and alert on sustained breaches (two or more consecutive rollup windows), not single spikes.
4. Expose the metrics in an admin UI
// Read precomputed KPI rows from the rollup table for a date range.
async function fetchKpis(from: string, to: string) {
  return await db.manyOrNone(
    "SELECT * FROM kpi_rollup WHERE period_start >= $1 AND period_end <= $2",
    [from, to],
  );
}
5. Build an evaluation harness
Take real calls, mask PII, and replay them against a staging agent to compare FCR and AHT across prompt versions.
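A deliberately naive masking pass for the replay step (a hypothetical helper; production pipelines should add an NER pass for names and addresses on top of regex rules):

```typescript
// Redact phone numbers and email addresses from a transcript
// before it leaves the production boundary for staging replay.
function maskPii(transcript: string): string {
  return transcript
    .replace(/[\w.+-]+@[\w-]+\.[\w.]+/g, "[EMAIL]")
    .replace(/\+?\d[\d\s().-]{7,}\d/g, "[PHONE]");
}
```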
Production considerations
- Sampling: WER and hallucination checks need human labelers; sample, do not inspect all.
- Cost attribution: Realtime API + TTS + Twilio + STT all contribute; track separately.
- Version pinning: record which agent version handled each call for A/B comparisons.
- PII in dashboards: mask caller IDs and names at the dashboard layer.
- Retention: raw transcripts are sensitive; delete or tokenize after 30-90 days depending on vertical.
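The retention point above can be a one-line scheduled job, assuming call_turns rows carry a created_at timestamp:

```sql
-- Purge raw transcripts past the retention window; adjust per vertical.
DELETE FROM call_turns
WHERE created_at < now() - interval '90 days';
```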
CallSphere's real implementation
CallSphere runs a GPT-4o-mini post-call analytics pipeline that writes sentiment, intent, lead score, satisfaction, and escalation flags into per-vertical Postgres databases. Those columns feed the 15 KPIs above in an admin dashboard every customer gets access to. The live voice plane runs the OpenAI Realtime API (gpt-4o-realtime-preview-2025-06-03) at 24kHz PCM16 with server VAD.
Across 14 healthcare tools, 10 real estate agents, 4 salon agents, 7 after-hours escalation tools, IT helpdesk agents with 10+ RAG-backed tools, and the 5-specialist ElevenLabs sales pod, KPIs are computed identically so customers can compare performance across verticals. The OpenAI Agents SDK orchestrates handoffs. CallSphere supports 57+ languages with sub-second end-to-end latency.
Common pitfalls
- Averaging everything: p95 is what customers feel.
- Counting minutes, not outcomes: minutes do not pay the bills, resolutions do.
- Ignoring hallucination rate: it is the single biggest trust killer.
- Skipping version tags: you cannot prove a prompt improvement without them.
- Dashboards nobody looks at: build alerts before dashboards.
FAQ
What is a good FCR for an AI voice agent?
60-80% for well-scoped verticals, lower for open-ended support.
How do I measure CSAT without a post-call survey?
Use the GPT-4o-mini satisfaction score on the transcript as a proxy, validated by periodic real surveys.
What is a reasonable answer-rate target?
Aim for 95%+ for always-on agents; misses beyond that usually trace to configuration errors or carrier outages.
How do I avoid biasing the post-call LLM scorer?
Run it blind to agent version and spot-check with humans.
Can I compare my agent to humans directly?
Only against matched caller intents and with the same KPI definitions.
Next steps
Want a dashboard wired to real voice-agent KPIs? Book a demo, read the technology page, or see pricing.
#CallSphere #Analytics #KPIs #VoiceAI #Observability #Metrics #AIVoiceAgents
Written by
CallSphere Team
Expert insights on AI voice agents and customer communication automation.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.