AI Voice Agent Architecture: Real-Time STT, LLM, and TTS Pipeline
Deep dive into the real-time STT → LLM → TTS pipeline that powers modern AI voice agents — latency, streaming, and error recovery.
The three-stage pipeline, done right
Even with the OpenAI Realtime API collapsing STT, LLM, and TTS into one endpoint, it is still useful to understand the pipeline as three distinct stages. You will still debug issues by stage. You will still profile latency by stage. And when a customer wants to swap in their own TTS (ElevenLabs, Cartesia, PlayHT) you need to know where the seams are.
This post is a deep dive into the real-time STT → LLM → TTS pipeline, including the streaming, back-pressure, and error-recovery patterns that separate production systems from demos.
mic/carrier ──► STT ──► LLM ──► TTS ──► speaker/carrier
                 │       │       │
                 ▼       ▼       ▼
             partials  tokens  audio frames
Architecture overview
┌──────────────┐  PCM16   ┌──────────────┐  tokens  ┌──────────────┐
│  STT stage   │─────────►│  LLM stage   │─────────►│  TTS stage   │
│  streaming   │          │  streaming   │          │  streaming   │
└──────────────┘          └──────────────┘          └──────────────┘
       ▲                         │                         │
       │                         │                         │
       └── interrupt on VAD ◄────┘                         ▼
                                                   carrier / speaker
Prerequisites
- A working audio pipeline from the carrier to your service.
- Either the Realtime API or separate STT/LLM/TTS providers.
- An understanding of streaming event semantics.
Step-by-step walkthrough
1. Streaming STT
Batch STT will not work for real-time. You need partial transcripts that arrive every 100-300ms.
# Example using Deepgram streaming as an STT-only alternative
# (deepgram-sdk v3 async client; DG_KEY holds your API key)
from deepgram import DeepgramClient, LiveOptions, LiveTranscriptionEvents

dg = DeepgramClient(DG_KEY)
conn = dg.listen.asyncwebsocket.v("1")

async def on_stt_message(self, result, **kwargs):
    # Act only on finalized segments; interim results drive barge-in and UI
    if result.is_final:
        await on_user_utterance(result.channel.alternatives[0].transcript)

conn.on(LiveTranscriptionEvents.Transcript, on_stt_message)

await conn.start(LiveOptions(
    model="nova-2",
    encoding="linear16",   # raw PCM16
    sample_rate=24000,
    interim_results=True,  # emit partial transcripts every few hundred ms
    endpointing=300,       # ms of trailing silence before finalizing
))
2. Streaming LLM
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def stream_llm(messages):
    # The SDK's streaming helper yields typed events as tokens arrive
    async with client.chat.completions.stream(
        model="gpt-4o",
        messages=messages,
    ) as stream:
        async for event in stream:
            if event.type == "content.delta":
                yield event.delta
3. Streaming TTS
# ElevenLabs streaming example (blocking; call via asyncio.to_thread from async code)
import requests

def stream_tts(text: str, voice_id: str):
    url = f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream"
    with requests.post(
        url,
        headers={"xi-api-key": EL_KEY},
        json={"text": text, "model_id": "eleven_turbo_v2_5"},
        stream=True,
    ) as r:
        r.raise_for_status()
        for chunk in r.iter_content(chunk_size=1024):
            yield chunk
4. Gluing the pipeline together
async def handle_final_user_turn(text: str, session):
    session.messages.append({"role": "user", "content": text})
    buffer = ""
    async for token in stream_llm(session.messages):
        buffer += token
        # Flush on sentence boundaries so TTS gets natural prosody
        # (rstrip: tokens often arrive with trailing whitespace)
        if buffer.rstrip().endswith((".", "!", "?")):
            for audio_chunk in stream_tts(buffer, session.voice_id):
                await session.send_audio(audio_chunk)
            buffer = ""
    if buffer:  # flush whatever remains when the LLM stream ends
        for audio_chunk in stream_tts(buffer, session.voice_id):
            await session.send_audio(audio_chunk)
5. Handling interruption mid-pipeline
When VAD fires speech_started, you must cancel the in-flight LLM stream, drop any queued TTS chunks, and clear the carrier's playback buffer. Anything less and the caller will hear the agent keep talking over them.
async def on_interrupt(session):
    session.llm_cancel_event.set()        # stop consuming LLM tokens
    while not session.tts_queue.empty():  # Queue.empty() only checks; drain it
        session.tts_queue.get_nowait()
    await session.carrier.clear_playback()
6. Error recovery
- STT dropout: play a "sorry, could you repeat that?" prompt and restart the stream.
- LLM 5xx: fall back to a canned "one moment please", retry once, then escalate.
- TTS 5xx: switch to a backup voice provider; never let an outage turn into dead air.
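The retry-then-escalate rules above can be sketched as a provider chain. The provider callables here are placeholders for real client calls (ElevenLabs, a backup vendor, etc.):

```python
import time

def tts_with_fallback(text, providers, retries=1):
    """Try each TTS provider in order, retrying transient failures before
    moving to the next. `providers` is a list of callables: text -> audio bytes."""
    last_err = None
    for synth in providers:
        for attempt in range(retries + 1):
            try:
                return synth(text)
            except Exception as err:  # production: catch only 5xx/timeouts
                last_err = err
                time.sleep(0.1 * (attempt + 1))  # brief backoff before retrying
    raise RuntimeError("all TTS providers failed") from last_err
```

The same shape works for the LLM leg; only the "escalate" branch differs.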
Production considerations
- Sentence boundaries: TTS sounds best when you flush at sentence boundaries. Do not stream word-by-word.
- Audio format conversion: do it once at each seam, never in the middle.
- Backpressure: if TTS cannot keep up with LLM, queue text and slow the LLM stream.
- Observability: span per stage, ideally with first-token and first-frame timestamps.
- Voice consistency: pin a voice per session; do not switch mid-response.
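The backpressure point deserves a sketch: a bounded asyncio.Queue between the LLM and TTS stages makes the producer block when the consumer falls behind, which is exactly the slowdown you want. A minimal illustration with placeholder stages:

```python
import asyncio

async def llm_to_tts(token_stream, tts_worker, max_buffered=8):
    """Bounded queue between the LLM and TTS stages. When TTS falls behind,
    queue.put() blocks, which naturally throttles the LLM consumer."""
    queue = asyncio.Queue(maxsize=max_buffered)

    async def producer():
        async for token in token_stream:
            await queue.put(token)   # blocks when the queue is full
        await queue.put(None)        # sentinel: stream finished

    async def consumer():
        while (item := await queue.get()) is not None:
            await tts_worker(item)

    await asyncio.gather(producer(), consumer())
```

With a streaming LLM API this indirectly slows token consumption; if the provider does not support flow control, the queue at least bounds memory.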
CallSphere's real implementation
CallSphere uses the OpenAI Realtime API with gpt-4o-realtime-preview-2025-06-03 for the STT → LLM → TTS pipeline in most verticals because collapsing all three into one WebSocket keeps first-word latency under 1 second and simplifies interruption handling. The sales vertical swaps the TTS leg for ElevenLabs streaming, driven by 5 GPT-4 specialists orchestrated through the OpenAI Agents SDK; the rest — 14 healthcare tools, 10 real estate agents, 4 salon agents, 7 after-hours escalation tools, 10+ RAG IT helpdesk tools — stay on the unified Realtime pipeline.
Audio is PCM16 at 24kHz end-to-end; conversion to G.711 ulaw happens only at the Twilio boundary. Server VAD drives interruption. A GPT-4o-mini post-call pipeline writes sentiment, intent, lead score, satisfaction, and escalation flags into per-vertical Postgres databases. CallSphere supports 57+ languages with sub-second end-to-end response times.
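To make the "convert once at the boundary" rule concrete, here is an illustrative PCM16 → G.711 μ-law companding step of the kind done at a telephony boundary. This is a standalone sketch, not CallSphere's production code; a real pipeline also low-pass filters before decimating 24 kHz down to telephony's 8 kHz:

```python
def pcm16_to_ulaw(samples):
    """Encode a list of PCM16 samples as G.711 mu-law bytes (one per sample)."""
    BIAS, CLIP = 0x84, 32635
    out = bytearray()
    for sample in samples:
        sign = 0x80 if sample < 0 else 0
        magnitude = min(abs(sample), CLIP) + BIAS
        # Segment (exponent): position of the highest set bit above bit 7
        exponent, mask = 7, 0x4000
        while exponent > 0 and not (magnitude & mask):
            mask >>= 1
            exponent -= 1
        mantissa = (magnitude >> (exponent + 3)) & 0x0F
        out.append(~(sign | (exponent << 4) | mantissa) & 0xFF)  # G.711 inverts bits
    return bytes(out)

def downsample_24k_to_8k(samples):
    """Keep every third sample (24000 / 8000 = 3). Real code filters first
    to avoid aliasing; this is only the decimation step."""
    return samples[::3]
```

Doing this once, at the Twilio seam, is what keeps the rest of the pipeline free of format drift.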
Common pitfalls
- Streaming word-by-word to TTS: robotic cadence.
- Ignoring the interruption path: talking over callers.
- Separate audio format per stage: drift and artifacts.
- Treating the LLM stream as atomic: you lose the ability to speak while reasoning.
- No fallback TTS: one provider outage = total outage.
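The word-by-word pitfall is avoided with a sentence-boundary chunker between the LLM and TTS. A deliberately simple sketch; production splitters also handle abbreviations ("Dr.", "e.g.") and decimals ("3.14"):

```python
import re

# Split on whitespace that follows sentence-ending punctuation
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def chunk_sentences(buffer: str):
    """Return (complete_sentences, remainder). Send each complete sentence
    to TTS; keep the remainder until more LLM tokens arrive."""
    parts = SENTENCE_END.split(buffer)
    return parts[:-1], parts[-1]
```

Called on every token append, this replaces the `endswith((".", "!", "?"))` check in the glue code with something that survives multi-sentence tokens.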
FAQ
Should I build this on top of the Realtime API or compose three providers?
Start with the Realtime API. Compose only if you need a specific voice or a specific STT model.
What about open-source TTS?
XTTS, Orpheus, and Coqui all work but add latency and operational overhead. Fine for staging, rarely for production.
Can I cache common responses?
For greetings and holding phrases yes. Cache the audio and replay it directly.
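A minimal sketch of that cache, keyed by voice as well as text so a voice swap never replays the wrong speaker (the `synthesize` callable is a placeholder for a real TTS client):

```python
class AudioCache:
    """Cache synthesized audio for canned phrases (greetings, hold messages)."""

    def __init__(self, synthesize):
        self._synthesize = synthesize  # callable (text, voice_id) -> audio bytes
        self._store = {}

    def get(self, text, voice_id):
        key = (voice_id, text)
        if key not in self._store:
            self._store[key] = self._synthesize(text, voice_id)
        return self._store[key]
```

Replaying cached audio also removes TTS latency entirely for the first words of a call.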
How do I handle overlapping speech?
Rely on server VAD to detect it and cancel the current response.
What sample rate is ideal?
24kHz PCM16 matches the Realtime API and ElevenLabs Turbo. 16kHz works for STT-only stacks.
Next steps
Want to see the full pipeline running on real traffic? Book a demo, read the technology page, or see pricing.
#CallSphere #STT #TTS #VoiceAI #Architecture #Streaming #AIVoiceAgents
Written by
CallSphere Team
Expert insights on AI voice agents and customer communication automation.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.