
Latency Benchmarking for AI Agents: Measuring Time-to-First-Token and Total Response Time

A hands-on guide to measuring AI agent latency at every stage of the pipeline, from time-to-first-token through tool execution to total response time, with percentile reporting and SLA compliance tracking.

Why Latency Matters More Than You Think

Users tolerate a slow webpage for a few seconds. They abandon a slow conversational agent in moments. Research consistently shows that perceived agent intelligence drops as response time grows: the same answer feels smarter delivered in 800 milliseconds than in 8 seconds. For AI agents, latency is not just a performance metric; it directly shapes perceived quality.

Agent latency is also more complex than web latency. A single response might involve an LLM call, two tool executions, another LLM call to synthesize results, and a final formatting step. You need to measure each segment independently to know where your time goes.

Defining Measurement Points

Agent latency has multiple stages. Instrument each one separately.

import time
from dataclasses import dataclass, field
from typing import Optional
from enum import Enum

class LatencyStage(Enum):
    PREPROCESSING = "preprocessing"
    LLM_FIRST_TOKEN = "llm_first_token"
    LLM_COMPLETE = "llm_complete"
    TOOL_EXECUTION = "tool_execution"
    POSTPROCESSING = "postprocessing"
    TOTAL = "total"

@dataclass
class LatencyMeasurement:
    stage: LatencyStage
    duration_ms: float
    metadata: dict = field(default_factory=dict)

class LatencyTracer:
    def __init__(self):
        self.measurements: list[LatencyMeasurement] = []
        self._timers: dict[str, float] = {}
        self._total_start: Optional[float] = None

    def start_total(self):
        self._total_start = time.perf_counter()

    def start(self, stage: LatencyStage):
        self._timers[stage.value] = time.perf_counter()

    def stop(
        self, stage: LatencyStage, metadata: Optional[dict] = None
    ):
        if stage.value not in self._timers:
            return
        elapsed = (
            time.perf_counter() - self._timers[stage.value]
        ) * 1000
        self.measurements.append(LatencyMeasurement(
            stage=stage,
            duration_ms=round(elapsed, 2),
            metadata=metadata or {},
        ))
        del self._timers[stage.value]

    def stop_total(self):
        if self._total_start is None:
            return
        elapsed = (
            time.perf_counter() - self._total_start
        ) * 1000
        self.measurements.append(LatencyMeasurement(
            stage=LatencyStage.TOTAL,
            duration_ms=round(elapsed, 2),
        ))

    def summary(self) -> dict[str, float]:
        return {
            m.stage.value: m.duration_ms
            for m in self.measurements
        }

Use time.perf_counter() rather than time.time() — it provides monotonic, high-resolution timing that is not affected by system clock adjustments.
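The tracer slots around each pipeline stage. Here is a minimal, self-contained sketch of the same pattern, with `time.sleep` standing in for real work (the stage names and delays are illustrative, not taken from a real agent):

```python
import time

# Stand-in pipeline instrumented stage by stage,
# mirroring the LatencyTracer pattern
timers: dict[str, float] = {}
durations_ms: dict[str, float] = {}

def start(stage: str) -> None:
    timers[stage] = time.perf_counter()

def stop(stage: str) -> None:
    durations_ms[stage] = (time.perf_counter() - timers.pop(stage)) * 1000

start("total")
start("preprocessing")
time.sleep(0.01)   # stand-in for prompt assembly
stop("preprocessing")
start("llm_complete")
time.sleep(0.05)   # stand-in for the model call
stop("llm_complete")
stop("total")

print({k: round(v, 1) for k, v in durations_ms.items()})
```

Because "total" brackets both stages, its duration is always at least the sum of the parts; any gap between them is unaccounted-for overhead worth investigating.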

Measuring Time-to-First-Token

Time-to-first-token (TTFT) is the most important latency metric for user experience. It determines how long the user stares at a blank screen before seeing any response.

import asyncio

async def measure_ttft(
    llm_client,
    messages: list[dict],
    model: str = "gpt-4o",
) -> dict:
    start = time.perf_counter()
    first_token_time = None
    full_response = []

    stream = await llm_client.chat.completions.create(
        model=model,
        messages=messages,
        stream=True,
    )

    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_time is None:
                first_token_time = time.perf_counter()
            full_response.append(
                chunk.choices[0].delta.content
            )

    end = time.perf_counter()
    ttft = (
        (first_token_time - start) * 1000
        if first_token_time is not None
        else None
    )

    return {
        "ttft_ms": round(ttft, 2) if ttft is not None else None,
        "total_ms": round((end - start) * 1000, 2),
        # Chunk count is a proxy for token count: streamed
        # chunks usually carry one token each
        "token_count": len(full_response),
        "tokens_per_second": round(
            len(full_response) / (end - start), 1
        ) if end > start else 0,
    }

TTFT under 500 milliseconds feels instant to users. Between 500ms and 1500ms is noticeable but acceptable. Above 2 seconds, you need a loading indicator or progressive streaming to maintain engagement.
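To see the mechanics without a network call, the streaming loop can be exercised against a stubbed stream. The `fake_stream` helper below is an assumption for illustration only; it mimics the shape of an OpenAI-style streaming chunk:

```python
import asyncio
import time
from types import SimpleNamespace

# Stubbed token stream standing in for a streaming LLM response
# (assumption: each chunk carries one token)
async def fake_stream(tokens, delay_s=0.01):
    for tok in tokens:
        await asyncio.sleep(delay_s)
        yield SimpleNamespace(
            choices=[SimpleNamespace(delta=SimpleNamespace(content=tok))]
        )

async def measure(stream) -> dict:
    start = time.perf_counter()
    first_token_time = None
    tokens = []
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_time is None:
                first_token_time = time.perf_counter()
            tokens.append(chunk.choices[0].delta.content)
    end = time.perf_counter()
    return {
        "ttft_ms": (first_token_time - start) * 1000,
        "total_ms": (end - start) * 1000,
        "token_count": len(tokens),
    }

result = asyncio.run(measure(fake_stream(["Hel", "lo", "!"])))
print(result)
```

The same harness works in tests: swap in recorded chunk sequences and assert that TTFT stays under your budget.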

Percentile-Based Reporting

Averages hide the worst experiences. Report latency using percentiles.


import statistics
from typing import Sequence

def latency_percentiles(
    measurements_ms: Sequence[float],
) -> dict:
    if not measurements_ms:
        return {}

    sorted_ms = sorted(measurements_ms)
    n = len(sorted_ms)

    def percentile(p: float) -> float:
        # Floor-index variant of the nearest-rank method
        idx = int(p / 100 * n)
        idx = min(idx, n - 1)
        return round(sorted_ms[idx], 2)

    return {
        "count": n,
        "p50": percentile(50),
        "p75": percentile(75),
        "p90": percentile(90),
        "p95": percentile(95),
        "p99": percentile(99),
        "mean": round(statistics.mean(sorted_ms), 2),
        "stdev": round(
            statistics.stdev(sorted_ms), 2
        ) if n > 1 else 0.0,
        "min": round(sorted_ms[0], 2),
        "max": round(sorted_ms[-1], 2),
    }

Focus your SLA on p95 or p99, not the mean. If your p50 is 400ms but your p99 is 12 seconds, one in a hundred users is having a terrible experience and your average hides it completely.
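A synthetic example makes the point concrete: ninety-nine fast responses plus a single 12-second outlier barely move the mean, while p99 surfaces the outlier. The floor-index rule below matches `latency_percentiles` above:

```python
import statistics

# Synthetic data: 99 fast responses plus one 12-second outlier
samples = [400.0] * 99 + [12_000.0]
sorted_ms = sorted(samples)
n = len(sorted_ms)

def percentile(p: float) -> float:
    # Same floor-index rule as latency_percentiles
    idx = min(int(p / 100 * n), n - 1)
    return sorted_ms[idx]

print(f"mean={statistics.mean(samples):.0f}ms "
      f"p50={percentile(50):.0f}ms p99={percentile(99):.0f}ms")
```

The mean lands at 516 ms, which looks healthy, while p99 reports the full 12,000 ms that one unlucky user actually experienced.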

SLA Compliance Tracking

Define latency SLAs per operation type and track compliance rates.

@dataclass
class LatencySLA:
    operation: str
    target_ms: float
    percentile: float  # e.g., 95.0 for p95

class SLATracker:
    def __init__(self):
        self.slas: list[LatencySLA] = []
        self.measurements: dict[str, list[float]] = {}

    def register_sla(self, sla: LatencySLA):
        self.slas.append(sla)
        self.measurements.setdefault(sla.operation, [])

    def record(self, operation: str, latency_ms: float):
        if operation in self.measurements:
            self.measurements[operation].append(latency_ms)

    def compliance_report(self) -> list[dict]:
        report = []
        for sla in self.slas:
            data = self.measurements.get(sla.operation, [])
            if not data:
                report.append({
                    "operation": sla.operation,
                    "status": "no_data",
                })
                continue

            percs = latency_percentiles(data)
            p_key = f"p{int(sla.percentile)}"
            actual = percs.get(p_key, 0)
            compliant = actual <= sla.target_ms

            report.append({
                "operation": sla.operation,
                "sla_target_ms": sla.target_ms,
                "sla_percentile": sla.percentile,
                "actual_ms": actual,
                "compliant": compliant,
                "margin_ms": round(sla.target_ms - actual, 2),
                "sample_count": len(data),
            })
        return report

When compliance margin turns negative, you know exactly which operation is breaching its SLA and by how much. This drives targeted optimization rather than guessing.
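A worked example of the compliance math, using made-up samples for a hypothetical `chat_reply` operation with a 1000 ms p95 target:

```python
# Hypothetical SLA: chat_reply must stay under 1000 ms at p95
target_ms = 1000.0
samples = sorted([300.0] * 90 + [900.0] * 5 + [1500.0] * 5)

# p95 via the same floor-index rule used by latency_percentiles
idx = min(int(95 / 100 * len(samples)), len(samples) - 1)
actual_p95 = samples[idx]

margin_ms = round(target_ms - actual_p95, 2)
compliant = actual_p95 <= target_ms
print(f"p95={actual_p95}ms margin={margin_ms}ms compliant={compliant}")
```

Here five slow requests out of a hundred push p95 to 1500 ms, so the margin goes to -500 ms: the breach and its size are both immediately visible.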

Common Latency Optimization Strategies

Once you know where your time goes, apply targeted fixes. Preprocessing overhead can often be reduced by caching prompt templates. Tool execution latency drops with parallel tool calls when tools are independent. LLM latency improves with shorter prompts, smaller models for simple tasks, or prompt caching features offered by providers.
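The parallel-tool-calls point is easy to verify with `asyncio.gather`. Both tools below are hypothetical stand-ins with simulated I/O delays:

```python
import asyncio
import time

# Two independent, hypothetical tools with simulated I/O latency
async def weather_tool() -> str:
    await asyncio.sleep(0.05)
    return "sunny"

async def calendar_tool() -> str:
    await asyncio.sleep(0.05)
    return "free at 3pm"

async def sequential() -> float:
    start = time.perf_counter()
    await weather_tool()
    await calendar_tool()
    return (time.perf_counter() - start) * 1000

async def parallel() -> float:
    start = time.perf_counter()
    # Independent tools can run concurrently
    await asyncio.gather(weather_tool(), calendar_tool())
    return (time.perf_counter() - start) * 1000

seq_ms = asyncio.run(sequential())
par_ms = asyncio.run(parallel())
print(f"sequential={seq_ms:.0f}ms parallel={par_ms:.0f}ms")
```

With two 50 ms tools, the sequential path takes roughly 100 ms while the parallel path takes roughly 50 ms. The caveat: only parallelize tools whose inputs do not depend on each other's outputs.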

FAQ

What is a reasonable TTFT target for a production AI agent?

For chat-based agents, target a TTFT under 800 milliseconds at p95. For voice agents, you need under 500 milliseconds to feel conversational. If your agent uses tool calls before responding, consider sending a "thinking" indicator while tools execute, then stream the final answer. Users tolerate delays better when they see progress.

Should I measure latency in my evaluation pipeline or in production?

Both, but they measure different things. Evaluation pipeline latency tells you how fast the model and tools can run under controlled conditions. Production latency includes network hops, load balancer overhead, queue wait times, and contention from concurrent requests. Your evaluation pipeline sets a floor, and production metrics tell you how far above that floor you actually are.

How do I handle latency spikes from upstream LLM providers?

Implement circuit breakers with fallback models. If your primary model's latency exceeds a threshold for three consecutive requests, route to a faster fallback model. Track provider latency separately from your own processing time so you can distinguish between problems you can fix and problems you need to route around.
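A minimal sketch of such a breaker, assuming a simple consecutive-slow-request counter (the class, thresholds, and model names here are illustrative, not a production implementation):

```python
# Minimal latency circuit breaker: after three consecutive slow
# responses from the primary model, route to a fallback model
class LatencyCircuitBreaker:
    def __init__(self, threshold_ms: float, trip_after: int = 3):
        self.threshold_ms = threshold_ms
        self.trip_after = trip_after
        self.consecutive_slow = 0
        self.open = False  # open = routing to fallback

    def record(self, latency_ms: float) -> None:
        if latency_ms > self.threshold_ms:
            self.consecutive_slow += 1
            if self.consecutive_slow >= self.trip_after:
                self.open = True
        else:
            # One fast response resets the breaker
            self.consecutive_slow = 0
            self.open = False

    def choose_model(self, primary: str, fallback: str) -> str:
        return fallback if self.open else primary

breaker = LatencyCircuitBreaker(threshold_ms=2000)
for latency in [900.0, 2500.0, 3100.0, 2800.0]:
    breaker.record(latency)
model = breaker.choose_model("gpt-4o", "gpt-4o-mini")
print(model)
```

A production version would also add a cooldown before retrying the primary; this sketch resets as soon as one fast response arrives.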


#Latency #Performance #Benchmarking #SLA #Python #AgenticAI #LearnAI #AIEngineering

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
