How AI Voice Agents Actually Work: Technical Deep Dive (2026 Edition)
A full technical walkthrough of how modern AI voice agents work — speech-to-text, LLM orchestration, TTS, tool calling, and sub-second latency.
The Problem Nobody Warns You About
The first time you build a voice agent that actually works, you notice something strange: the model is smart, the transcription is correct, the voice sounds great — and yet the conversation feels broken. The caller says "hello" and waits two full seconds. They interrupt and the agent keeps talking over them. They ask a question and the agent hallucinates a policy that doesn't exist in your knowledge base.
None of those problems are language model problems. They are systems problems. Voice agents are a distributed, soft-real-time pipeline where every component — microphone capture, VAD, STT, LLM, tool execution, TTS, speaker playback — has to hit a latency budget measured in milliseconds, and has to fail gracefully when any stage misbehaves.
Here is the shape of the pipeline most teams miss when they read "just use the Realtime API":
caller mic
↓ (PCM16 @ 24kHz)
carrier / WebRTC bridge
↓
server VAD → interruption signal
↓
STT (streaming)
↓ (partial transcripts)
LLM reasoning + tool calls
↓ (token stream)
TTS (streaming)
↓ (audio frames)
speaker
This post is a full technical walkthrough of how modern AI voice agents work in 2026. It is based on the architecture CallSphere runs in production across healthcare, real estate, salon, after-hours escalation, IT helpdesk, and sales verticals — all of which handle live phone traffic today.
Architecture overview
┌─────────────────────────────────────────────────────────────┐
│ Caller (PSTN / WebRTC) │
└─────────────────────────────────────────────────────────────┘
│ G.711 ulaw / Opus
▼
┌─────────────────────────────────────────────────────────────┐
│ Twilio Media Streams ←→ Edge bridge (FastAPI WebSocket) │
└─────────────────────────────────────────────────────────────┘
│ PCM16 @ 24kHz
▼
┌─────────────────────────────────────────────────────────────┐
│ OpenAI Realtime API (gpt-4o-realtime-preview-2025-06-03) │
│ • Server VAD • Streaming STT │
│ • Function calling • Streaming TTS │
└─────────────────────────────────────────────────────────────┘
│ tool calls + audio frames
▼
┌─────────────────────────────────────────────────────────────┐
│ Tool layer: calendar, CRM, DB, payments, handoff │
│ Observability: OpenTelemetry spans per stage │
│ Post-call: GPT-4o-mini summary + sentiment + lead score │
└─────────────────────────────────────────────────────────────┘
Prerequisites
- Working knowledge of WebSockets and async Python or Node.js.
- An OpenAI account with Realtime API access.
- A Twilio account (or any SIP provider that supports Media Streams / bidirectional audio).
- Familiarity with audio formats: PCM16, sample rates, and G.711 ulaw.
- A Postgres database for session state and call logs.
- Comfort with OpenTelemetry or an equivalent tracing backend.
Step-by-step walkthrough
1. Capture audio at the edge
Your edge service receives audio frames over a WebSocket from the carrier and must forward them to the model without blocking. Back-pressure matters: if you buffer too much, latency explodes; if you buffer too little, you clip the caller.
from fastapi import FastAPI, WebSocket
import asyncio, base64, json, os

import websockets

app = FastAPI()

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
OPENAI_WS = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2025-06-03"

@app.websocket("/twilio/stream")
async def twilio_stream(ws: WebSocket):
    await ws.accept()
    # Twilio requires the streamSid (from the stream's "start" event) on outbound messages
    stream_sid = {"value": None}
    async with websockets.connect(
        OPENAI_WS,
        extra_headers={
            "Authorization": f"Bearer {OPENAI_API_KEY}",
            "OpenAI-Beta": "realtime=v1",
        },
    ) as oai:
        # Configure the session before any audio flows
        await oai.send(json.dumps({
            "type": "session.update",
            "session": {
                "voice": "alloy",
                "input_audio_format": "pcm16",
                "output_audio_format": "pcm16",
                "turn_detection": {"type": "server_vad", "silence_duration_ms": 400},
                "instructions": "You are a concise, friendly receptionist.",
            },
        }))

        async def from_twilio():
            # Carrier -> model: decode ulaw to PCM16 and append to the input buffer.
            # ulaw_to_pcm16 is a carrier-boundary conversion helper defined elsewhere.
            async for msg in ws.iter_text():
                data = json.loads(msg)
                if data.get("event") == "start":
                    stream_sid["value"] = data["start"]["streamSid"]
                elif data.get("event") == "media":
                    pcm = ulaw_to_pcm16(base64.b64decode(data["media"]["payload"]))
                    await oai.send(json.dumps({
                        "type": "input_audio_buffer.append",
                        "audio": base64.b64encode(pcm).decode(),
                    }))

        async def from_openai():
            # Model -> carrier: forward each audio frame as soon as it arrives.
            # pcm16_to_ulaw_b64 is the inverse conversion helper.
            async for msg in oai:
                evt = json.loads(msg)
                if evt["type"] == "response.audio.delta":
                    await ws.send_text(json.dumps({
                        "event": "media",
                        "streamSid": stream_sid["value"],
                        "media": {"payload": pcm16_to_ulaw_b64(evt["delta"])},
                    }))

        await asyncio.gather(from_twilio(), from_openai())
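The `ulaw_to_pcm16` helper above is assumed; here is a minimal pure-Python sketch of the G.711 mu-law decode. The per-byte loop is fine for illustration — production code would use a 256-entry lookup table or `audioop.ulaw2lin` (stdlib through Python 3.12). Note Twilio delivers 8 kHz audio, so you still need to resample to 24 kHz afterward, or configure the session with `input_audio_format: "g711_ulaw"` and skip the conversion entirely.

```python
import struct

def ulaw_to_pcm16(ulaw_bytes: bytes) -> bytes:
    """Decode G.711 mu-law samples to little-endian 16-bit PCM (ITU-T G.711)."""
    out = bytearray()
    for b in ulaw_bytes:
        b = ~b & 0xFF                  # mu-law stores the bitwise complement
        sign = b & 0x80
        exponent = (b >> 4) & 0x07
        mantissa = b & 0x0F
        sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
        out += struct.pack("<h", -sample if sign else sample)
    return bytes(out)
```

The inverse (`pcm16_to_ulaw_b64`) follows the same structure in reverse, plus a base64 encode at the end.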
2. Let the model handle VAD and interruptions
Server-side VAD is the difference between a conversation and a monologue. When the caller starts speaking while the agent is mid-sentence, the Realtime API fires input_audio_buffer.speech_started — your edge must immediately stop the downstream audio playback so the caller is not talked over.
if evt["type"] == "input_audio_buffer.speech_started":
    # Barge-in: flush Twilio's playback buffer, then cancel the in-flight response.
    # Twilio's clear message needs the streamSid captured from the stream's "start" event.
    await ws.send_text(json.dumps({"event": "clear", "streamSid": stream_sid["value"]}))
    await oai.send(json.dumps({"type": "response.cancel"}))
3. Wire up tool calls
The LLM is only as useful as the tools you give it. Define a small, strongly-typed tool schema, keep the arguments minimal, and validate the output on the server before returning it to the model.
TOOLS = [{
    "type": "function",
    "name": "book_appointment",
    "description": "Book a medical appointment for a patient.",
    "parameters": {
        "type": "object",
        "properties": {
            "patient_id": {"type": "string"},
            "provider_id": {"type": "string"},
            "start_iso": {"type": "string", "description": "ISO 8601 start time"},
            "reason": {"type": "string"},
        },
        "required": ["patient_id", "provider_id", "start_iso"],
    },
}]
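On the wire, a tool call arrives as streamed argument deltas that end with a `response.function_call_arguments.done` event. A minimal round-trip handler might look like the sketch below — `run_tool` is a hypothetical dispatcher you would supply, and the event field names follow the Realtime API's function-calling flow:

```python
import json

async def run_tool(name: str, args: dict) -> dict:
    # Hypothetical dispatcher: route to your calendar/CRM/DB layer and
    # validate both arguments and results server-side before returning.
    return {"status": "booked"}

async def handle_tool_call(oai, evt: dict):
    """Execute a completed function call and hand the result back to the model."""
    args = json.loads(evt["arguments"])
    result = await run_tool(evt["name"], args)
    # Return the tool output as a conversation item tied to the call_id
    await oai.send(json.dumps({
        "type": "conversation.item.create",
        "item": {
            "type": "function_call_output",
            "call_id": evt["call_id"],
            "output": json.dumps(result),
        },
    }))
    # Ask the model to continue speaking now that it has the tool result
    await oai.send(json.dumps({"type": "response.create"}))
```

Run this in its own task (`asyncio.create_task`) so a slow CRM lookup never stalls the audio loop.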
4. Stream TTS back to the caller
The Realtime API emits response.audio.delta events as the model speaks. You forward each frame to the carrier without waiting for the full response. End-of-turn is signaled by response.audio.done.
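Toward the carrier it helps to emit fixed-size frames: Twilio's media stream carries 8 kHz mu-law, where 160 bytes is 20 ms of audio. A small chunking helper (the sizes are the conventional values, not a hard API requirement):

```python
def chunk_frames(ulaw: bytes, frame_bytes: int = 160) -> list[bytes]:
    """Split outbound mu-law audio into 20 ms frames (160 bytes at 8 kHz)."""
    return [ulaw[i:i + frame_bytes] for i in range(0, len(ulaw), frame_bytes)]
```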
5. Persist everything for post-call analytics
After the call ends, push the transcript and metadata to a queue so a GPT-4o-mini worker can extract sentiment, intent, and lead score without blocking the hot path.
async def on_call_end(call_id: str, transcript: list[dict]):
    # Fire-and-forget: analytics run off the hot path
    await queue.publish("post_call", {"call_id": call_id, "transcript": transcript})
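The worker on the other side of the queue just needs a prompt. One way to sketch the request payload for the analytics model — the prompt wording and field names here are illustrative; adapt them to your client library and schema:

```python
def build_analysis_request(transcript: list[dict]) -> dict:
    """Build a chat-completions request asking gpt-4o-mini for structured call analytics."""
    convo = "\n".join(f"{turn['role']}: {turn['text']}" for turn in transcript)
    return {
        "model": "gpt-4o-mini",
        "response_format": {"type": "json_object"},  # force parseable JSON output
        "messages": [
            {"role": "system",
             "content": "Analyze this call transcript. Return JSON with keys: "
                        "sentiment (positive/neutral/negative), intent, lead_score (0-100)."},
            {"role": "user", "content": convo},
        ],
    }
```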
Production considerations
- Latency budget: target 800ms end-to-end. Allocate 150ms network, 200ms STT partial, 250ms LLM first token, 150ms TTS first frame, 50ms edge.
- Observability: emit an OpenTelemetry span for each stage with the call SID as the trace ID.
- Cost: Realtime minutes are the biggest line item. Hang up aggressively on silence and cap max session duration.
- Scale: one Python worker can handle 20-40 concurrent sessions before event-loop contention bites. Scale horizontally behind a sticky load balancer.
- Failure modes: if OpenAI returns 5xx mid-call, fall back to a canned "one moment please" and retry once before handing off to a human.
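The failure-mode bullet above can be sketched as a small wrapper — `play_filler` and `handoff` are hypothetical callbacks into your audio and escalation layers:

```python
import asyncio

async def call_with_fallback(attempt, play_filler, handoff, retries: int = 1):
    """Try an upstream call; on failure, play a filler line, retry, then escalate."""
    for i in range(retries + 1):
        try:
            return await attempt()
        except Exception:
            if i < retries:
                await play_filler()        # "One moment please..."
                await asyncio.sleep(0.5)   # brief backoff before the retry
            else:
                await handoff()            # route the caller to a human
```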
CallSphere's real implementation
CallSphere runs this exact architecture in production. The voice and chat agents use the OpenAI Realtime API with gpt-4o-realtime-preview-2025-06-03, server VAD, and PCM16 at 24kHz. Post-call analytics are handled by a GPT-4o-mini pipeline that writes sentiment, intent, and lead score into per-vertical Postgres databases. Telephony goes through Twilio with a WebRTC fallback for in-browser testing.
Each vertical has a different multi-agent topology: 14 tools for the healthcare voice stack, 10 agents for real estate (buyer, seller, rental, tour, qualification, and more), 4 for salon, 7 for after-hours escalation, 10 tools plus RAG for IT helpdesk, and a sales pod that pairs ElevenLabs TTS with 5 GPT-4 specialists. Handoffs between agents are orchestrated with the OpenAI Agents SDK. The platform supports 57+ languages, and end-to-end response times stay under 1 second on our production traffic.
Common pitfalls
- Buffering audio too long: you will hear obvious lag. Flush frames as soon as they arrive.
- Ignoring the VAD speech-started event: the agent will talk over interrupting callers.
- Sharing one HTTP client across calls improperly: connection pool exhaustion under load.
- Letting tool calls block the audio loop: always run tools in a separate task.
- Logging raw PCM: you will blow out disk. Log metadata only.
- Hardcoding a single voice: different verticals and languages need different voices; parameterize it.
FAQ
Why not stitch separate STT, LLM, and TTS services together?
You can, and some teams do, but each hop adds 100-300ms of latency and makes interruption handling much harder. The Realtime API collapses the pipeline into one WebSocket and gives you a clean speech-started signal for free.
What sample rate should I use?
24kHz PCM16 end to end. Convert to and from G.711 ulaw only at the carrier boundary. Resampling in the middle of the pipeline is a common source of audio artifacts.
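To see the rate math at the carrier boundary: the crudest possible upsampler from 8 kHz to 24 kHz is zero-order hold (repeat each sample three times). Production code should use a proper resampler — soxr, or `audioop.ratecv` on Python up to 3.12 — since sample repetition adds exactly the kind of artifacts this answer warns about; the sketch only illustrates the ratio:

```python
import struct

def upsample_8k_to_24k(pcm: bytes) -> bytes:
    """Triple each 16-bit sample: 8000 Hz * 3 = 24000 Hz (zero-order hold, lossy)."""
    samples = struct.unpack(f"<{len(pcm) // 2}h", pcm)
    out = bytearray()
    for s in samples:
        out += struct.pack("<hhh", s, s, s)
    return bytes(out)
```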
How do I prevent the model from hallucinating facts about my business?
Constrain it with tool calls. The model should look up availability, prices, and policies through functions, not recall them from the system prompt.
What is a realistic concurrent-call number per worker?
With a tight async loop and no blocking tool calls, 20-40 sessions per Python worker is achievable. Beyond that, scale horizontally.
How do I handle a caller who speaks a different language than expected?
Detect the language from the first user turn and reload the session with the matching voice and instructions. CallSphere supports 57+ languages this way.
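Reloading the session is just another `session.update`. A sketch — the voice-per-language mapping is an application choice, not an API feature, and note the Realtime API may reject a voice change once the model has already spoken, in which case you reconnect with a fresh session instead:

```python
import json

VOICE_BY_LANG = {"es": "alloy", "fr": "shimmer"}  # illustrative mapping

async def switch_language(oai, lang: str):
    """Update the live session's voice and instructions after language detection."""
    await oai.send(json.dumps({
        "type": "session.update",
        "session": {
            "voice": VOICE_BY_LANG.get(lang, "alloy"),
            "instructions": f"You are a concise, friendly receptionist. Respond in '{lang}'.",
        },
    }))
```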
Next steps
Ready to see a real voice agent running this architecture? Book a demo, explore the technology page, or check pricing to understand how CallSphere packages this stack for production use.
#CallSphere #AIVoiceAgents #OpenAIRealtime #VoiceAI #Twilio #RealtimeAPI #TechnicalGuide
Written by
CallSphere Team
Expert insights on AI voice agents and customer communication automation.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.