Agentic AI KPIs: The Metrics That Matter for Agent Performance
Comprehensive metrics framework for agentic AI: task completion, tool accuracy, latency, cost per conversation, and dashboard design.
You Cannot Improve What You Do Not Measure
Most agentic AI teams launch their agent, watch a handful of conversations, declare it working, and move on to building the next feature. Six months later, they cannot answer basic questions about their own system. Is the agent getting better or worse over time? Which types of conversations does it handle well and which does it botch? Where exactly is it failing and how often? How much does each conversation actually cost to serve?
The absence of a structured metrics framework is one of the most common failure modes in production agentic AI deployments. Without measurement, teams optimize based on anecdotes and gut feelings rather than data. They miss gradual performance degradation until customers start complaining. They cannot quantify the impact of prompt changes, model upgrades, or new tool integrations.
This guide defines a comprehensive KPI framework for agentic AI systems organized into three tiers: business outcome metrics that leadership and customers care about, agent quality metrics that the AI engineering team uses to improve behavior, and operational metrics that keep the system running reliably. It also covers dashboard design and alerting strategy based on patterns from CallSphere's production deployments across healthcare, real estate, and IT helpdesk verticals.
Tier 1: Business Outcome Metrics
These metrics measure whether the agent is delivering tangible business value. They are the metrics you report to customers, investors, and executive leadership.
Task Completion Rate
This is the single most important metric for any agentic AI system. It measures the percentage of conversations where the agent successfully completed the user's intended task without requiring human intervention.
To measure it accurately, define what completion means for each task type in your system. For appointment scheduling, completion means a confirmed appointment was created in the system. For lead qualification, completion means the lead was scored and routed to the appropriate sales rep. For IT ticket triage, completion means the ticket was categorized, prioritized, and either resolved automatically or assigned to the correct team.
Track the numerator and denominator separately. The denominator is total conversations where a completable task was identified. The numerator is conversations where that task was completed by the agent alone. This distinction matters because some conversations are purely informational and should not count against completion rate.
Mature deployments typically achieve 75 to 85 percent task completion. New deployments often start at 50 to 65 percent and improve through prompt iteration and tool refinement over four to eight weeks.
Segment this metric aggressively. Break it down by task type, by customer or tenant, by time of day, by conversation channel (voice versus chat), and by conversation complexity (number of turns required). Aggregated completion rates hide important patterns. An agent might complete 92 percent of simple appointment bookings but only 45 percent of complex reschedule requests involving insurance verification.
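The numerator/denominator bookkeeping and segmentation described above can be sketched in a few lines. This is a minimal illustration, assuming each conversation is logged as a dict with hypothetical `task_identified`, `completed_by_agent`, and `task_type` fields; your logging schema will differ.

```python
from collections import defaultdict

def completion_rates_by_segment(conversations, segment_key="task_type"):
    """Per-segment task completion rate, tracking the numerator
    (completed by the agent alone) and denominator (conversations
    with a completable task) separately."""
    totals = defaultdict(lambda: {"completable": 0, "completed": 0})
    for conv in conversations:
        if not conv.get("task_identified"):
            continue  # purely informational: excluded from the denominator
        seg = totals[conv[segment_key]]
        seg["completable"] += 1
        if conv.get("completed_by_agent"):
            seg["completed"] += 1
    return {
        name: {**counts, "rate": counts["completed"] / counts["completable"]}
        for name, counts in totals.items()
    }

convs = [
    {"task_identified": True, "task_type": "booking", "completed_by_agent": True},
    {"task_identified": True, "task_type": "booking", "completed_by_agent": False},
    {"task_identified": True, "task_type": "reschedule", "completed_by_agent": False},
    {"task_identified": False, "task_type": "faq"},  # informational, not counted
]
rates = completion_rates_by_segment(convs)
```

Swapping `segment_key` for a tenant ID, channel, or time-of-day bucket gives the other segmentations with the same function.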
Escalation and Handoff Rate
This measures the percentage of conversations transferred to a human agent. Track two sub-categories separately because they have different implications. Agent-initiated escalations happen when the agent determines it cannot handle the request and proactively transfers to a human. This is healthy behavior that indicates the agent knows its limits. User-requested escalations happen when the caller or user explicitly asks to speak with a person. High rates here indicate the agent is failing to build trust or provide adequate assistance.
A combined escalation rate of 15 to 25 percent is normal for voice agents. Chat agents tend to be lower at 10 to 20 percent. Some escalation is healthy — you want the agent to recognize situations it cannot handle. Zero escalation rate would be a red flag indicating the agent is attempting tasks beyond its capabilities.
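Tracking the two sub-categories separately is straightforward once each conversation records how it ended. A sketch, assuming a hypothetical `escalation` field with values `"agent_initiated"`, `"user_requested"`, or `None`:

```python
def escalation_rates(conversations):
    """Split escalations into agent-initiated (healthy: the agent knows
    its limits) and user-requested (a trust/quality warning sign)."""
    total = len(conversations)
    agent = sum(1 for c in conversations if c.get("escalation") == "agent_initiated")
    user = sum(1 for c in conversations if c.get("escalation") == "user_requested")
    return {
        "agent_initiated": agent / total,
        "user_requested": user / total,
        "combined": (agent + user) / total,
    }

calls = (
    [{"escalation": None}] * 8
    + [{"escalation": "agent_initiated"}]
    + [{"escalation": "user_requested"}]
)
rates = escalation_rates(calls)
```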
Customer Satisfaction Score
Measure how satisfied end users are with the agent interaction through three channels: post-conversation surveys (one question, one-to-five scale, sent via SMS or shown in the chat interface), automated sentiment analysis of the conversation transcript, and indirect signals such as whether the user called back about the same issue within 24 hours, which indicates the agent failed to resolve it.
Target 4.0 or higher on a five-point scale. Voice agents typically score lower than chat agents because voice interactions carry higher expectations and any latency or misunderstanding is more jarring in a spoken conversation.
Cost Per Conversation
This is the fully loaded cost to handle one conversation, including LLM API costs, speech-to-text and text-to-speech processing, telephony or messaging platform fees, and a proportional allocation of infrastructure costs.
Instrument your agent pipeline to track costs at each step. Log the model used, input token count, output token count, and calculated cost for every LLM call. Log STT minutes and TTS characters processed. Log telephony minutes and any per-message charges. Aggregate these per conversation and per customer.
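Once each step is logged, cost per conversation is a simple fold over the event log. A sketch with illustrative unit prices (the `PRICES` values below are placeholders, not real provider rates):

```python
# Illustrative unit prices -- substitute your providers' actual rates.
PRICES = {
    "llm_input_per_1k_tokens": 0.003,
    "llm_output_per_1k_tokens": 0.015,
    "stt_per_minute": 0.006,
    "tts_per_1k_chars": 0.015,
    "telephony_per_minute": 0.0085,
}

def conversation_cost(events):
    """Sum the fully loaded cost of one conversation from per-step logs."""
    cost = 0.0
    for e in events:
        if e["kind"] == "llm":
            cost += e["input_tokens"] / 1000 * PRICES["llm_input_per_1k_tokens"]
            cost += e["output_tokens"] / 1000 * PRICES["llm_output_per_1k_tokens"]
        elif e["kind"] == "stt":
            cost += e["minutes"] * PRICES["stt_per_minute"]
        elif e["kind"] == "tts":
            cost += e["characters"] / 1000 * PRICES["tts_per_1k_chars"]
        elif e["kind"] == "telephony":
            cost += e["minutes"] * PRICES["telephony_per_minute"]
    return round(cost, 4)

events = [
    {"kind": "llm", "input_tokens": 2000, "output_tokens": 500},
    {"kind": "telephony", "minutes": 2},
]
```

Aggregating these per-conversation totals by customer gives the per-tenant cost breakdown used in billing and optimization analysis.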
CallSphere tracks cost per conversation across all verticals and uses this data to identify optimization opportunities. For example, analyzing cost distributions revealed that appointment confirmation calls — short, scripted interactions — were using the same expensive model as complex scheduling conversations. Routing confirmations to a cheaper model reduced costs on those calls by 60 percent without any quality impact.
Tier 2: Agent Quality Metrics
These metrics measure the technical quality of agent behavior. They help the AI engineering and prompt engineering teams identify specific areas for improvement.
Tool Call Accuracy
This measures the percentage of tool calls where the agent selected the correct tool and provided correct parameters. Tool call errors are the most impactful failure mode in agentic AI because they lead to incorrect real-world actions — booking the wrong appointment time, sending a confirmation to the wrong patient, or looking up the wrong property listing.
Measuring tool accuracy requires either human evaluation where you sample 50 to 100 conversations per week and grade each tool call, or LLM-as-judge evaluation where a separate model reviews the conversation context and evaluates whether each tool call was appropriate. Both approaches have tradeoffs. Human evaluation is accurate but expensive and slow. LLM-as-judge is scalable but may miss subtle errors or disagree with human judgment on edge cases.
Target 95 percent or higher tool call accuracy. Track common failure patterns including wrong tool selected, correct tool with wrong parameters, unnecessary tool calls that waste tokens and latency, and missing tool calls where the agent responded without consulting a necessary data source.
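However tool calls are graded (by humans or an LLM judge), the tally reduces to counting labels per failure pattern. A sketch, assuming each reviewed call is labeled `"correct"` or with one of the failure modes listed above:

```python
from collections import Counter

FAILURE_MODES = {"wrong_tool", "wrong_parameters", "unnecessary_call", "missing_call"}

def tool_accuracy(graded_calls):
    """graded_calls: labels from human or LLM-as-judge review.
    'correct' counts toward accuracy; anything else is a failure mode."""
    counts = Counter(graded_calls)
    total = sum(counts.values())
    correct = counts.get("correct", 0)
    failures = {m: counts[m] for m in FAILURE_MODES if counts.get(m)}
    return {"accuracy": correct / total, "failures": failures}

graded = ["correct"] * 18 + ["wrong_parameters", "missing_call"]
result = tool_accuracy(graded)
```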
Hallucination Rate
This measures the percentage of agent responses containing factually incorrect information not supported by the available tools, context, or system prompt. For business-facing agents, hallucinations are particularly dangerous because they can include fabricated appointment times that do not actually exist, incorrect business information like wrong addresses or hours, promises the business cannot keep, and medical or legal claims that create liability.
Sample conversations regularly and evaluate whether every factual claim the agent makes is traceable to data from a tool call, the system prompt, or the user's own statements. Target a hallucination rate below 2 percent. Even at 2 percent, if you handle 10,000 conversations per month, 200 of them contain false information.
Response Relevance and Helpfulness
Use LLM-as-judge evaluation with a structured scoring rubric to rate each agent response on relevance (did the response address the user's actual question or need), helpfulness (did the response move the conversation toward task completion), and completeness (did the response include all necessary information without requiring the user to ask follow-up questions for obvious details).
Score each dimension on a one-to-five scale. Target an average of 4.0 or higher across all dimensions.
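The judge prompt and score aggregation might look like the following sketch. The rubric wording and the JSON reply format are assumptions; the actual LLM call is omitted, and `aggregate_scores` consumes whatever JSON replies the judge returns:

```python
import json
from statistics import mean

RUBRIC = (
    "Rate the assistant's final response on a 1-5 scale for each of: "
    "relevance, helpfulness, completeness. Reply with JSON only, e.g. "
    '{"relevance": 4, "helpfulness": 5, "completeness": 3}.'
)

def build_judge_prompt(transcript):
    """Assemble the prompt sent to the judge model (call not shown)."""
    return f"{RUBRIC}\n\nConversation:\n{transcript}"

def aggregate_scores(judge_replies):
    """Average each rubric dimension across a batch of judge replies."""
    parsed = [json.loads(r) for r in judge_replies]
    return {dim: mean(p[dim] for p in parsed)
            for dim in ("relevance", "helpfulness", "completeness")}

scores = aggregate_scores([
    '{"relevance": 4, "helpfulness": 5, "completeness": 3}',
    '{"relevance": 4, "helpfulness": 3, "completeness": 5}',
])
```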
Tier 3: Operational Metrics
These metrics help the engineering team maintain system health and identify infrastructure issues before they affect users.
Latency Percentiles
Measure how long it takes for the agent to generate a response at P50 (median), P95, and P99. For chat agents, target P50 under 2 seconds and P95 under 5 seconds. For voice agents, targets must be tighter because pauses in spoken conversation feel unnatural — target P50 under 1 second and P95 under 2 seconds.
Decompose latency into components: LLM inference time, tool execution time (database queries, API calls), network overhead, and speech processing time for voice agents. This decomposition reveals exactly where to focus optimization. If 70 percent of P95 latency comes from a single slow tool call, optimizing the LLM is wasted effort.
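Computing per-component percentiles needs nothing more than nearest-rank percentiles over timing logs. A sketch, assuming each turn is logged as a dict of component timings in milliseconds (field names are illustrative):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile; sufficient for dashboard reporting."""
    ordered = sorted(samples)
    k = math.ceil(p * len(ordered) / 100) - 1
    return ordered[max(0, k)]

def latency_breakdown(turns):
    """turns: per-response dicts of component timings in ms,
    e.g. {"llm": 600, "tools": 250}. Returns P50/P95/P99 for the
    total and for each component."""
    out = {"total": {}}
    totals = [sum(t.values()) for t in turns]
    for p in (50, 95, 99):
        out["total"][f"p{p}"] = percentile(totals, p)
    for comp in turns[0]:
        values = [t[comp] for t in turns]
        out[comp] = {f"p{p}": percentile(values, p) for p in (50, 95, 99)}
    return out

turns = [{"llm": v, "tools": 100} for v in range(1, 101)]
stats = latency_breakdown(turns)
```

With this breakdown in hand, the 70-percent-of-P95-in-one-tool case described above falls straight out of the numbers.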
Token Usage and Efficiency
Log input and output tokens for every LLM call. Aggregate by conversation and track trends over time. Rising token counts with stable conversation lengths indicate prompt bloat — the system prompt or conversation context is growing without corresponding benefit. A conversation that uses 20,000 tokens to book an appointment is less efficient than one that accomplishes the same task in 8,000 tokens.
Error Rate by Category
Track errors in distinct categories: LLM API errors (rate limits, timeouts, 500s from the provider), tool execution failures (database connection errors, external API failures, timeouts), conversation flow errors (agent stuck in a loop, lost context, contradicted itself), and input processing errors (failed to parse user intent, STT transcription errors for voice).
Each category has different root causes and remediation paths. Target a combined error rate below 1 percent with separate thresholds and alerting rules for each category.
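Per-category thresholds can be checked with a small report function. The threshold values below are illustrative placeholders, not recommendations:

```python
ERROR_THRESHOLDS = {  # illustrative per-category alert thresholds
    "llm_api": 0.005,
    "tool_execution": 0.005,
    "conversation_flow": 0.002,
    "input_processing": 0.01,
}

def error_report(error_counts, total_conversations):
    """Per-category error rates plus which categories breached threshold."""
    rates = {cat: n / total_conversations for cat, n in error_counts.items()}
    breaches = [cat for cat, rate in rates.items()
                if rate > ERROR_THRESHOLDS.get(cat, 0.0)]
    rates["combined"] = sum(error_counts.values()) / total_conversations
    return {"rates": rates, "breaches": breaches}

report = error_report({"llm_api": 2, "tool_execution": 10}, 1000)
```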
Dashboard Design
Operations Dashboard
The primary engineering dashboard should show a real-time view of system health on a single screen. Include current conversation volume with a comparison to the same hour last week, error rate by category with automatic alerting at threshold, latency percentiles (P50, P95, P99) with an SLA indicator showing green, yellow, or red status, LLM provider status and response times, tool call success rates broken down by individual tool, and escalation queue depth showing conversations waiting for human handoff.
Quality Dashboard
The AI and prompt engineering team needs a dashboard focused on agent behavior quality. Include task completion rate with a 30-day trend line and segmentation by task type, hallucination rate from the most recent evaluation batch, tool call accuracy breakdown by individual tool, failed conversation samples with full conversation traces available for review, and prompt version tracking showing which prompt version is active and when it was deployed.
Per-Customer Dashboard
For multi-tenant products, provide customer-specific dashboards showing conversation volume and trends for the specific tenant, task completion rate compared to the fleet-wide average, the five most common failure modes for this customer's conversations, cost breakdown and projected monthly bill, and integration health showing whether connections to the customer's external systems are functioning.
Alerting Strategy
Define alert thresholds for critical metrics and route them to the appropriate team. Task completion rate dropping below 70 percent over a one-hour window indicates a systemic issue and should page the on-call engineer. P95 latency exceeding 10 seconds requires immediate investigation of infrastructure or provider issues. Error rate exceeding 5 percent suggests a deployment issue or external dependency failure. Hallucination rate exceeding 5 percent in a batch evaluation indicates a prompt or model regression. Cost per conversation spiking more than 50 percent above the seven-day rolling average may indicate a prompt change that increased token usage.
Operational alerts (latency, errors) go to the on-call engineering rotation. Quality alerts (completion rate, hallucination rate) go to the AI and prompt engineering team lead. Cost alerts go to the engineering manager or technical lead.
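The thresholds and routing above can be encoded as a small rule table; the route names below are placeholders for whatever paging targets your deployment uses:

```python
# Rules mirror the alerting strategy above; thresholds are illustrative
# and should be tuned per deployment.
ALERT_RULES = [
    ("task_completion_rate", lambda v, base: v < 0.70, "oncall-engineering"),
    ("p95_latency_seconds", lambda v, base: v > 10, "oncall-engineering"),
    ("error_rate", lambda v, base: v > 0.05, "oncall-engineering"),
    ("hallucination_rate", lambda v, base: v > 0.05, "ai-team-lead"),
    ("cost_per_conversation",
     lambda v, base: base and v > 1.5 * base,  # vs. 7-day rolling average
     "eng-manager"),
]

def evaluate_alerts(metrics, baselines=None):
    """Return (metric, route) pairs for every breached rule."""
    baselines = baselines or {}
    fired = []
    for name, breached, route in ALERT_RULES:
        if name in metrics and breached(metrics[name], baselines.get(name)):
            fired.append((name, route))
    return fired

fired = evaluate_alerts(
    {"task_completion_rate": 0.65, "p95_latency_seconds": 3,
     "cost_per_conversation": 0.4},
    baselines={"cost_per_conversation": 0.2},
)
```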
Frequently Asked Questions
How many conversations do I need before metrics are statistically meaningful?
For task completion rate and escalation rate, you need at least 100 conversations per segment to get reliable numbers with reasonable confidence intervals. For metrics that require sampling and evaluation like hallucination rate and tool accuracy, aim for at least 50 evaluated conversations per segment. At low volumes in the early days, focus on manually reviewing every single conversation rather than relying on aggregate statistics.
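To see why roughly 100 conversations per segment is the floor, compute a Wilson score interval for the observed rate. At 80 completions out of 100, the 95 percent interval is already about plus or minus 8 points wide:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a proportion such as completion rate."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (center - half, center + half)

lo, hi = wilson_interval(80, 100)      # roughly (0.71, 0.87)
lo2, hi2 = wilson_interval(800, 1000)  # same rate, much tighter interval
```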
Should I use LLM-as-judge evaluation or human evaluation for quality metrics?
Use both at different frequencies. Run LLM-as-judge evaluation continuously on every conversation or a large sample for real-time quality monitoring. Conduct human evaluation on a smaller sample of 30 to 50 conversations per week to calibrate the LLM judge and catch errors the automated evaluation misses. If the LLM judge and human evaluators consistently disagree on specific types of issues, update the evaluation prompt to align them.
What is a good benchmark for cost per conversation?
It depends on the conversation type and channel. Simple FAQ-style chat conversations should cost $0.01 to $0.05. Transactional conversations involving booking or ordering should cost $0.05 to $0.20. Complex voice conversations with multiple tool calls typically cost $0.15 to $0.50. If your costs are significantly above these ranges, investigate model selection, prompt efficiency, unnecessary tool calls, and whether you are sending excessive context in each turn.
How quickly should I see metric improvements after a prompt change?
You should see measurable changes within 100 to 200 conversations after deploying a prompt update. If the metric does not move, the prompt change either did not affect the specific behavior you are measuring, or the change was too subtle to produce a statistically significant difference. Always define the expected metric impact before making a prompt change and measure against that expectation. This prevents the common failure mode of making many small prompt tweaks without knowing which ones actually helped.
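Whether an observed movement is real or noise can be checked with a standard two-proportion z-test over the before/after windows. A sketch (the function name and 1.96 critical value for a 95 percent level are the usual conventions, not anything specific to this system):

```python
import math

def completion_rate_change_significant(before_ok, before_n,
                                       after_ok, after_n, z_crit=1.96):
    """Two-proportion z-test: did the prompt change move the rate?
    Returns (significant, z)."""
    p1, p2 = before_ok / before_n, after_ok / after_n
    pooled = (before_ok + after_ok) / (before_n + after_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / before_n + 1 / after_n))
    z = (p2 - p1) / se if se else 0.0
    return abs(z) > z_crit, z

# 60% -> 80% over 200 conversations each: clearly significant.
significant, z = completion_rate_change_significant(120, 200, 160, 200)
# 60% -> 62.5%: indistinguishable from noise at this sample size.
sig2, z2 = completion_rate_change_significant(120, 200, 125, 200)
```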
How do I measure quality for voice conversations versus chat?
Voice conversations require the same quality metrics as chat plus additional voice-specific measurements. Run speech-to-text on the full conversation to generate a transcript, then apply the standard quality evaluation pipeline. Additionally measure speech overlap rate (how often the agent talks over the user), silence duration (pauses exceeding 2 seconds that indicate slow processing), and word error rate from the STT provider to understand whether transcription quality is affecting agent comprehension.
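Silence duration, for instance, falls out of the timestamped STT segments directly. A sketch, assuming segments arrive as `(speaker, start_s, end_s)` tuples ordered by start time (a hypothetical shape; real STT providers each have their own segment schema):

```python
def long_silences(segments, threshold_s=2.0):
    """Return (timestamp, duration) for every gap between consecutive
    segments longer than threshold_s -- pauses that usually indicate
    slow agent processing."""
    gaps = []
    for (_, _, prev_end), (_, start, _) in zip(segments, segments[1:]):
        gap = start - prev_end
        if gap > threshold_s:
            gaps.append((prev_end, round(gap, 2)))
    return gaps

segments = [
    ("user", 0.0, 2.0),
    ("agent", 4.5, 6.0),  # 2.5 s gap before the agent spoke
    ("user", 6.2, 7.0),
]
```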
CallSphere Team
Expert insights on AI voice agents and customer communication automation.