Latency Benchmarking for AI Agents: Measuring Time-to-First-Token and Total Response Time
A hands-on guide to measuring AI agent latency at every stage of the pipeline, from time-to-first-token through tool execution to total response time, with percentile reporting and SLA compliance tracking.
Why Latency Matters More Than You Think
Users tolerate a slow webpage for a few seconds. They abandon a slow conversational agent in moments. Research consistently shows that perceived agent intelligence drops as response times grow: an answer delivered in 800 milliseconds feels smarter than the identical answer delivered in 8 seconds. For AI agents, latency is not just a performance metric. It directly impacts perceived quality.
Agent latency is also more complex than web latency. A single response might involve an LLM call, two tool executions, another LLM call to synthesize results, and a final formatting step. You need to measure each segment independently to know where your time goes.
Defining Measurement Points
Agent latency has multiple stages. Instrument each one separately.
```python
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class LatencyStage(Enum):
    PREPROCESSING = "preprocessing"
    LLM_FIRST_TOKEN = "llm_first_token"
    LLM_COMPLETE = "llm_complete"
    TOOL_EXECUTION = "tool_execution"
    POSTPROCESSING = "postprocessing"
    TOTAL = "total"


@dataclass
class LatencyMeasurement:
    stage: LatencyStage
    duration_ms: float
    metadata: dict = field(default_factory=dict)


class LatencyTracer:
    def __init__(self):
        self.measurements: list[LatencyMeasurement] = []
        self._timers: dict[str, float] = {}
        self._total_start: Optional[float] = None

    def start_total(self):
        self._total_start = time.perf_counter()

    def start(self, stage: LatencyStage):
        self._timers[stage.value] = time.perf_counter()

    def stop(self, stage: LatencyStage, metadata: Optional[dict] = None):
        if stage.value not in self._timers:
            return
        elapsed = (time.perf_counter() - self._timers[stage.value]) * 1000
        self.measurements.append(LatencyMeasurement(
            stage=stage,
            duration_ms=round(elapsed, 2),
            metadata=metadata or {},
        ))
        del self._timers[stage.value]

    def stop_total(self):
        if self._total_start is None:
            return
        elapsed = (time.perf_counter() - self._total_start) * 1000
        self.measurements.append(LatencyMeasurement(
            stage=LatencyStage.TOTAL,
            duration_ms=round(elapsed, 2),
        ))

    def summary(self) -> dict[str, float]:
        return {m.stage.value: m.duration_ms for m in self.measurements}
```
Use time.perf_counter() rather than time.time() — it provides monotonic, high-resolution timing that is not affected by system clock adjustments.
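The same start/stop pattern can be condensed into a context manager, which makes it harder to forget a `stop()` call. The `timed_stage` helper below is not part of the tracer above, just a minimal standalone sketch of the idea:

```python
import time
from contextlib import contextmanager


@contextmanager
def timed_stage(timings: dict, stage: str):
    """Record a stage's elapsed milliseconds into a shared dict."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = round((time.perf_counter() - start) * 1000, 2)


timings: dict[str, float] = {}
with timed_stage(timings, "preprocessing"):
    time.sleep(0.01)  # stand-in for real work
with timed_stage(timings, "postprocessing"):
    time.sleep(0.005)

print(timings)  # e.g. {'preprocessing': 10.2, 'postprocessing': 5.1}
```

Because the timer stops in a `finally` block, the stage is recorded even if the wrapped work raises, which keeps partial traces usable for debugging failures.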
Measuring Time-to-First-Token
Time-to-first-token (TTFT) is the most important latency metric for user experience. It determines how long the user stares at a blank screen before seeing any response.
```python
import asyncio
import time


async def measure_ttft(
    llm_client,
    messages: list[dict],
    model: str = "gpt-4o",
) -> dict:
    start = time.perf_counter()
    first_token_time = None
    full_response = []

    stream = await llm_client.chat.completions.create(
        model=model,
        messages=messages,
        stream=True,
    )
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_time is None:
                first_token_time = time.perf_counter()
            full_response.append(chunk.choices[0].delta.content)

    end = time.perf_counter()
    ttft = (
        (first_token_time - start) * 1000
        if first_token_time is not None
        else None
    )
    return {
        "ttft_ms": round(ttft, 2) if ttft is not None else None,
        "total_ms": round((end - start) * 1000, 2),
        "token_count": len(full_response),
        "tokens_per_second": round(
            len(full_response) / (end - start), 1
        ) if end > start else 0,
    }
```
TTFT under 500 milliseconds feels instant to users. Between 500ms and 1500ms is noticeable but acceptable. Above 2 seconds, you need a loading indicator or progressive streaming to maintain engagement.
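Those thresholds can be encoded as a small bucketing helper for dashboards or alerts. The bucket names and the "borderline" band between 1.5 and 2 seconds are illustrative choices, not part of any standard:

```python
def classify_ttft(ttft_ms: float) -> str:
    """Bucket a time-to-first-token measurement per the thresholds above."""
    if ttft_ms < 500:
        return "instant"
    if ttft_ms <= 1500:
        return "noticeable_but_acceptable"
    if ttft_ms <= 2000:
        return "borderline"
    return "needs_loading_indicator"


print(classify_ttft(320))   # instant
print(classify_ttft(2400))  # needs_loading_indicator
```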
Percentile-Based Reporting
Averages hide the worst experiences. Report latency using percentiles.
```python
import statistics
from typing import Sequence


def latency_percentiles(
    measurements_ms: Sequence[float],
) -> dict:
    if not measurements_ms:
        return {}
    sorted_ms = sorted(measurements_ms)
    n = len(sorted_ms)

    def percentile(p: float) -> float:
        # Nearest-rank percentile: simple and sufficient for reporting.
        idx = min(int(p / 100 * n), n - 1)
        return round(sorted_ms[idx], 2)

    return {
        "count": n,
        "p50": percentile(50),
        "p75": percentile(75),
        "p90": percentile(90),
        "p95": percentile(95),
        "p99": percentile(99),
        "mean": round(statistics.mean(sorted_ms), 2),
        "stdev": round(
            statistics.stdev(sorted_ms), 2
        ) if n > 1 else 0.0,
        "min": round(sorted_ms[0], 2),
        "max": round(sorted_ms[-1], 2),
    }
```
Focus your SLA on p95 or p99, not the mean. If your p50 is 400ms but your p99 is 12 seconds, one in a hundred users is having a terrible experience and your average hides it completely.
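A quick synthetic illustration of that hiding effect, using the same nearest-rank percentile as the function above and made-up sample values:

```python
import statistics

# 99 fast requests plus a single 12-second outlier.
samples = [400.0] * 99 + [12000.0]

mean = statistics.mean(samples)
sorted_ms = sorted(samples)
p99 = sorted_ms[min(int(0.99 * len(sorted_ms)), len(sorted_ms) - 1)]

print(f"mean={mean:.0f}ms p99={p99:.0f}ms")  # mean=516ms p99=12000ms
```

The mean barely moves while the p99 captures the full 12-second outlier, which is exactly the experience that one-in-a-hundred user gets.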
SLA Compliance Tracking
Define latency SLAs per operation type and track compliance rates.
```python
@dataclass
class LatencySLA:
    operation: str
    target_ms: float
    percentile: float  # e.g., 95.0 for p95


class SLATracker:
    def __init__(self):
        self.slas: list[LatencySLA] = []
        self.measurements: dict[str, list[float]] = {}

    def register_sla(self, sla: LatencySLA):
        self.slas.append(sla)
        self.measurements.setdefault(sla.operation, [])

    def record(self, operation: str, latency_ms: float):
        if operation in self.measurements:
            self.measurements[operation].append(latency_ms)

    def compliance_report(self) -> list[dict]:
        report = []
        for sla in self.slas:
            data = self.measurements.get(sla.operation, [])
            if not data:
                report.append({
                    "operation": sla.operation,
                    "status": "no_data",
                })
                continue
            percs = latency_percentiles(data)
            p_key = f"p{int(sla.percentile)}"
            actual = percs.get(p_key, 0)
            compliant = actual <= sla.target_ms
            report.append({
                "operation": sla.operation,
                "sla_target_ms": sla.target_ms,
                "sla_percentile": sla.percentile,
                "actual_ms": actual,
                "compliant": compliant,
                "margin_ms": round(sla.target_ms - actual, 2),
                "sample_count": len(data),
            })
        return report
```
When compliance margin turns negative, you know exactly which operation is breaching its SLA and by how much. This drives targeted optimization rather than guessing.
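To keep the example self-contained, here is a condensed standalone sketch of the same compliance check for a single SLA, with an illustrative operation name and synthetic latencies that deliberately breach the target:

```python
from dataclasses import dataclass


@dataclass
class LatencySLA:
    operation: str
    target_ms: float
    percentile: float


def check_sla(sla: LatencySLA, data: list[float]) -> dict:
    """Compare the SLA's percentile against its target (nearest-rank)."""
    sorted_ms = sorted(data)
    idx = min(int(sla.percentile / 100 * len(sorted_ms)), len(sorted_ms) - 1)
    actual = sorted_ms[idx]
    return {
        "operation": sla.operation,
        "actual_ms": actual,
        "compliant": actual <= sla.target_ms,
        "margin_ms": round(sla.target_ms - actual, 2),
    }


sla = LatencySLA("chat_response", target_ms=800.0, percentile=95.0)
latencies = [300.0] * 95 + [900.0] * 5  # the slowest 5% breach 800ms

print(check_sla(sla, latencies))  # compliant=False, margin_ms=-100.0
```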
Common Latency Optimization Strategies
Once you know where your time goes, apply targeted fixes. Preprocessing overhead can often be reduced by caching prompt templates. Tool execution latency drops with parallel tool calls when tools are independent. LLM latency improves with shorter prompts, smaller models for simple tasks, or prompt caching features offered by providers.
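The parallel-tool-calls point is worth a sketch. Assuming two independent async tools (the tool functions and their sleep-based latencies here are stand-ins), `asyncio.gather` runs them concurrently, so total tool latency approaches the slowest tool rather than the sum of all tools:

```python
import asyncio
import time


async def fetch_weather() -> str:
    await asyncio.sleep(0.1)  # simulated 100ms external call
    return "sunny"


async def fetch_calendar() -> str:
    await asyncio.sleep(0.1)  # simulated 100ms external call
    return "free at 3pm"


async def run_tools():
    start = time.perf_counter()
    # Both tools are independent, so they can run concurrently.
    results = await asyncio.gather(fetch_weather(), fetch_calendar())
    elapsed_ms = (time.perf_counter() - start) * 1000
    return results, elapsed_ms


results, elapsed_ms = asyncio.run(run_tools())
print(results, f"{elapsed_ms:.0f}ms")  # ~100ms total, not ~200ms
```

Only parallelize tools whose outputs do not feed into each other; dependent tool chains still have to run sequentially.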
FAQ
What is a reasonable TTFT target for a production AI agent?
For chat-based agents, target a TTFT under 800 milliseconds at p95. For voice agents, you need under 500 milliseconds to feel conversational. If your agent uses tool calls before responding, consider sending a "thinking" indicator while tools execute, then stream the final answer. Users tolerate delays better when they see progress.
Should I measure latency in my evaluation pipeline or in production?
Both, but they measure different things. Evaluation pipeline latency tells you how fast the model and tools can run under controlled conditions. Production latency includes network hops, load balancer overhead, queue wait times, and contention from concurrent requests. Your evaluation pipeline sets a floor, and production metrics tell you how far above that floor you actually are.
How do I handle latency spikes from upstream LLM providers?
Implement circuit breakers with fallback models. If your primary model's latency exceeds a threshold for three consecutive requests, route to a faster fallback model. Track provider latency separately from your own processing time so you can distinguish between problems you can fix and problems you need to route around.
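A minimal sketch of that "three consecutive slow requests" breaker follows; the class name, threshold, and model names are illustrative assumptions, and a production version would also need a cool-down period before retrying the primary model:

```python
class LatencyCircuitBreaker:
    def __init__(self, threshold_ms: float, trip_after: int = 3):
        self.threshold_ms = threshold_ms
        self.trip_after = trip_after
        self._consecutive_slow = 0

    def record(self, latency_ms: float) -> None:
        if latency_ms > self.threshold_ms:
            self._consecutive_slow += 1
        else:
            self._consecutive_slow = 0  # any fast response resets the streak

    @property
    def tripped(self) -> bool:
        return self._consecutive_slow >= self.trip_after


breaker = LatencyCircuitBreaker(threshold_ms=2000)
for latency in [900.0, 2500.0, 3100.0, 2800.0]:
    breaker.record(latency)

model = "fallback-fast-model" if breaker.tripped else "primary-model"
print(model)  # fallback-fast-model
```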
CallSphere Team