Voice AI Latency: Why Sub-Second Response Time Matters (And How to Hit It)
A technical breakdown of voice AI latency budgets — STT, LLM, TTS, network — and how to hit sub-second end-to-end response times.
The conversational cliff
Humans expect a reply within roughly 500-700ms in natural conversation. Push past one second and callers feel like they are talking to a computer. Push past two seconds and they start talking over the agent and abandoning the call. Latency is not a nice-to-have in voice AI; it is the single biggest determinant of whether the conversation feels real.
This post walks through the full latency budget for a modern voice agent and the techniques that get you reliably under one second.
total = network + vad + stt + llm_first_token + llm_reasoning + tts_first_frame + playback
Architecture overview
caller                        time budget
│
├─► network_in       ─────►  40ms
├─► VAD decision     ─────► 150ms
├─► STT partial      ─────► 150ms (overlaps VAD)
├─► LLM first token  ─────► 250ms
├─► LLM finish       ─────► 150ms (streams during TTS)
├─► TTS first audio  ─────► 120ms
├─► network_out      ─────►  40ms
└─► speaker
                     ─────────
             total → ~750ms
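The budget arithmetic above can be sanity-checked in a few lines. This is a sketch, not a measurement: the stage values simply mirror the diagram, and llm_finish is excluded from the critical path because its tokens stream into TTS while audio is already playing.

```python
# Stage timings (ms) taken from the diagram above.
serial_ms = {
    "network_in": 40,
    "vad": 150,
    "stt_partial": 150,     # counted serially here; in practice it overlaps VAD
    "llm_first_token": 250,
    "tts_first_frame": 120,
    "network_out": 40,
}
overlapped_ms = {"llm_finish": 150}  # streams during TTS, off the critical path

total = sum(serial_ms.values())
print(f"critical path: ~{total}ms")  # ~750ms
```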
Prerequisites
- A working voice agent pipeline.
- An OpenTelemetry tracing backend (Honeycomb, Tempo, Jaeger).
- The ability to measure wall-clock times at every boundary.
Step-by-step walkthrough
1. Instrument every stage with spans
from opentelemetry import trace

tracer = trace.get_tracer("voice-agent")

async def handle_turn(audio_in):
    with tracer.start_as_current_span("turn"):
        with tracer.start_as_current_span("vad"):
            ...  # VAD decision
        with tracer.start_as_current_span("stt"):
            ...  # streaming transcription, first partial
        with tracer.start_as_current_span("llm_first_token"):
            ...  # time to first streamed token
        with tracer.start_as_current_span("tts_first_frame"):
            ...  # time to first synthesized audio frame
2. Stream everything
Never wait for a stage to finish before starting the next. STT should emit partials, the LLM should stream tokens, TTS should stream audio frames. The end-of-turn signal is the only blocking event.
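The shape of a fully streamed pipeline is three chained generators. The sketch below uses toy stand-ins (the function names and string handling are illustrative, not a real STT/LLM/TTS API); the point is that each stage consumes the previous stage's output as it arrives, so the first audio frame exists long before the turn fully resolves.

```python
import asyncio

async def stt_partials(audio):
    for word in audio.split():
        yield word                    # emit partial transcripts as they arrive

async def llm_tokens(partial_stream):
    async for partial in partial_stream:
        yield partial.upper()         # stream tokens; never wait for the full transcript

async def tts_frames(token_stream):
    async for token in token_stream:
        yield f"<frame:{token}>"      # synthesize audio frame-by-frame

async def main():
    frames = []
    async for frame in tts_frames(llm_tokens(stt_partials("book me a table"))):
        frames.append(frame)          # playback can begin at the first frame
    return frames

frames = asyncio.run(main())
print(frames[0])
```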
3. Collapse the pipeline
The OpenAI Realtime API removes three network hops by doing STT, LLM, and TTS in one WebSocket. That alone saves 200-400ms versus a DIY stack of Whisper + GPT + ElevenLabs as separate HTTP calls.
ws.send(JSON.stringify({
  type: "session.update",
  session: {
    turn_detection: { type: "server_vad", silence_duration_ms: 400 },
    input_audio_format: "pcm16",
    output_audio_format: "pcm16",
  },
}));
4. Prewarm everything
At call setup, open the Realtime WebSocket before the caller says "hello". The TLS handshake and model load dominate first-turn latency otherwise.
async def on_incoming_ring(call_sid: str):
    session = await open_realtime_session()  # TLS + handshake now, not mid-call
    sessions[call_sid] = session
5. Keep tool calls off the hot path when possible
If a tool call takes >300ms, the agent should speak a filler ("let me pull that up") and stream it while the tool runs. The Realtime API makes this easy with response.create plus an instructions override.
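The generic pattern behind this, independent of any particular API, is a race between the tool task and a filler timer. In this sketch, speak() and run_tool() are hypothetical stand-ins for your TTS send and tool-execution helpers:

```python
import asyncio

FILLER_THRESHOLD_S = 0.3  # speak a filler if the tool takes longer than ~300ms

async def call_tool_with_filler(speak, run_tool, args):
    # speak() and run_tool() are hypothetical: substitute your own helpers.
    tool = asyncio.create_task(run_tool(args))
    done, _ = await asyncio.wait({tool}, timeout=FILLER_THRESHOLD_S)
    if not done:
        # Tool is still running: stream a filler so the caller hears something.
        await speak("Let me pull that up for you.")
    return await tool
```

Fast tools return silently; only slow ones trigger the filler, so the agent never sounds padded on easy turns.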
6. Measure p50, p95, and p99 separately
Average latency hides the calls that feel broken. Track percentiles per stage and alert on p95.
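Computing the percentiles takes only the standard library. A minimal sketch (sample values are invented for illustration):

```python
from statistics import quantiles

def latency_percentiles(samples_ms):
    """p50/p95/p99 from per-turn latency samples, in milliseconds."""
    cuts = quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

# Nine healthy turns and one broken one: the mean barely moves,
# but p95/p99 expose the call that felt broken.
turns = [620, 640, 655, 700, 710, 745, 760, 800, 980, 2400]
print(latency_percentiles(turns))
```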
Production considerations
- Geography: keep the edge, the model, and the carrier in the same region. Cross-region adds 60-150ms.
- Cold starts: if you run on serverless, warm pools are mandatory.
- Network path: use private connectivity to your carrier if they offer it.
- GC pauses: Node and Python both have them; profile under load.
- Audio codec conversion: each resample costs 5-15ms. Do it once per direction.
CallSphere's real implementation
CallSphere targets and maintains sub-one-second end-to-end response time across every production vertical. The voice plane runs on the OpenAI Realtime API with gpt-4o-realtime-preview-2025-06-03, PCM16 at 24kHz, and server VAD — a single WebSocket per call, pre-warmed at ring, terminated at a FastAPI edge co-located with Twilio's media region.
The multi-agent topologies — 14 tools for healthcare, 10 for real estate, 4 for salon, 7 for after-hours escalation, 10 plus RAG for IT helpdesk, and the 5-specialist ElevenLabs sales pod — are all orchestrated through the OpenAI Agents SDK. Handoffs between agents reuse the same session so there is no TLS renegotiation mid-call, and post-call analytics from a GPT-4o-mini pipeline run asynchronously so they never contend with the hot audio path. CallSphere supports 57+ languages with the same budget.
Common pitfalls
- Buffering audio for "smoothing": it adds latency for negligible quality gain.
- Running STT in a separate HTTP request: you lose streaming.
- Serial tool calls: parallelize them when the arguments are independent.
- Logging in the hot path: async log emit, never block.
- Ignoring tail percentiles: if 5% of calls feel broken, that is a 5% churn signal.
FAQ
What is a realistic target?
Under 1 second at p50, under 1.4 seconds at p95.
Does the LLM model size matter?
Yes, but less than you think. The Realtime API's gpt-4o variant is already tuned for low first-token latency.
How much does TLS handshake cost?
40-120ms the first time, free on reuse.
Is WebRTC faster than Twilio Media Streams?
Marginally, because WebRTC uses UDP. Twilio over WebSocket is still plenty fast for production.
Can I reduce latency by running a local model?
Only if your local model beats the Realtime API end-to-end, which is rarely true today.
Next steps
Want to measure latency on your current stack? Book a demo to see how CallSphere hits sub-second on live traffic, read the technology page, or compare pricing.
#CallSphere #Latency #VoiceAI #Performance #OpenAIRealtime #Observability #AIVoiceAgents
Written by
CallSphere Team
Expert insights on AI voice agents and customer communication automation.