Voice AI Architecture: Understanding the STT-LLM-TTS Pipeline
Learn the three-stage pipeline that powers every voice AI agent — speech-to-text, language model reasoning, and text-to-speech — including latency budgets, streaming strategies, and practical implementation patterns.
The Three Stages of a Voice AI Agent
Every voice AI agent — whether it is a customer service bot, a voice assistant, or a conversational IVR — follows the same fundamental pipeline. Audio comes in from a microphone, gets converted to text, passes through a language model for reasoning, and the response gets converted back to speech. This is the STT-LLM-TTS pipeline, and understanding each stage is essential for building responsive voice agents.
The pipeline looks deceptively simple, but each stage introduces latency, and the cumulative delay determines whether your agent feels natural or robotic.
Stage 1: Speech-to-Text (STT)
The STT stage converts raw audio into text that the language model can process. Modern STT engines use transformer-based models trained on thousands of hours of multilingual speech data.
```python
from deepgram import DeepgramClient, LiveTranscriptionEvents, LiveOptions


class STTProcessor:
    def __init__(self, api_key: str):
        self.client = DeepgramClient(api_key)
        self.transcript_buffer = []

    async def start_streaming(self, on_transcript):
        connection = self.client.listen.asynclive.v("1")

        async def on_message(self, result, **kwargs):
            transcript = result.channel.alternatives[0].transcript
            if transcript:
                on_transcript(transcript, result.is_final)

        connection.on(LiveTranscriptionEvents.Transcript, on_message)

        options = LiveOptions(
            model="nova-2",
            language="en",
            encoding="linear16",
            sample_rate=16000,
            interim_results=True,  # Get partial results for faster feedback
            endpointing=300,       # Silence threshold in ms
            vad_events=True,       # Voice activity detection
        )
        await connection.start(options)
        return connection
```
Key STT considerations include model accuracy (measured by Word Error Rate), streaming versus batch mode, and endpointing — detecting when the user has finished speaking. Streaming STT returns interim results as the user speaks, which enables the pipeline to start LLM processing before the user finishes their sentence.
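Handling the two kinds of results correctly matters: interim transcripts are unstable and overwrite each other, while final transcripts can be committed. A minimal sketch of this bookkeeping (a hypothetical helper, not part of any SDK) might look like:

```python
class TranscriptAssembler:
    """Accumulates final STT segments; tracks the latest interim separately."""

    def __init__(self):
        self.finals = []   # committed, final segments
        self.interim = ""  # latest partial hypothesis (may still change)

    def on_transcript(self, text: str, is_final: bool):
        if is_final:
            self.finals.append(text)
            self.interim = ""
        else:
            self.interim = text  # interim results replace one another

    def current_text(self) -> str:
        # Everything the user has said so far, including the unstable tail
        parts = self.finals + ([self.interim] if self.interim else [])
        return " ".join(parts)


asm = TranscriptAssembler()
asm.on_transcript("turn on the", is_final=False)
asm.on_transcript("turn on the lights", is_final=True)
print(asm.current_text())  # turn on the lights
```

The `current_text` view is what you would feed to the LLM for early context warm-up, while only final segments should be appended to the durable conversation history.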
Stage 2: Language Model (LLM)
Once text is available, it is sent to a language model for reasoning. The LLM maintains conversation context, interprets intent, calls tools if needed, and generates a response.
```python
import openai


class LLMProcessor:
    def __init__(self, model: str = "gpt-4o-mini"):
        self.client = openai.AsyncOpenAI()
        self.model = model
        self.messages = []

    async def process_streaming(self, user_text: str):
        self.messages.append({"role": "user", "content": user_text})
        stream = await self.client.chat.completions.create(
            model=self.model,
            messages=self.messages,
            stream=True,
            max_tokens=200,  # Keep responses concise for voice
            temperature=0.7,
        )
        full_response = []
        async for chunk in stream:
            delta = chunk.choices[0].delta.content
            if delta:
                full_response.append(delta)
                yield delta  # Stream tokens to TTS immediately
        self.messages.append({
            "role": "assistant",
            "content": "".join(full_response),
        })
```
For voice agents, the LLM should generate short, conversational responses. Long paragraphs that work in chat feel unnatural when spoken aloud. System prompts should instruct the model to keep answers under two or three sentences.
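One way to enforce this is to prepend a voice-specific system prompt on every turn. The prompt wording and helper below are illustrative assumptions, not a fixed recipe:

```python
VOICE_SYSTEM_PROMPT = (
    "You are a voice assistant. Keep every answer under two or three "
    "sentences. Use plain spoken language: no markdown, no bullet points, "
    "no code blocks."
)


def build_messages(history: list[dict], user_text: str) -> list[dict]:
    # Prepend the system prompt so the model is steered on every turn,
    # regardless of how long the conversation history grows.
    return (
        [{"role": "system", "content": VOICE_SYSTEM_PROMPT}]
        + history
        + [{"role": "user", "content": user_text}]
    )


msgs = build_messages([], "What's the weather like?")
print(msgs[0]["role"])  # system
```

Keeping the system prompt out of the stored history and re-attaching it per request also makes it easy to swap prompts between voice and chat surfaces.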
Stage 3: Text-to-Speech (TTS)
The final stage converts the LLM response into audio. Modern TTS engines produce remarkably natural speech with appropriate prosody, emotion, and pacing.
```python
import httpx


class TTSProcessor:
    def __init__(self, api_key: str):
        self.api_key = api_key
        self.base_url = "https://api.elevenlabs.io/v1"

    async def synthesize_streaming(self, text_chunks):
        """Stream TTS as text tokens arrive from the LLM."""
        buffer = ""
        async for chunk in text_chunks:
            buffer += chunk
            # Send to TTS at sentence boundaries for natural prosody
            if any(buffer.endswith(p) for p in [".", "!", "?", ","]):
                audio = await self._synthesize(buffer.strip())
                yield audio
                buffer = ""
        if buffer.strip():
            yield await self._synthesize(buffer.strip())

    async def _synthesize(self, text: str) -> bytes:
        async with httpx.AsyncClient() as client:
            response = await client.post(
                f"{self.base_url}/text-to-speech/voice_id/stream",
                headers={"xi-api-key": self.api_key},
                json={"text": text, "model_id": "eleven_turbo_v2_5"},
            )
            return response.content
```
Latency Budget Breakdown
A responsive voice agent needs end-to-end latency under 800ms. Here is a typical budget:
- STT endpointing: 200-400ms (silence detection after user stops)
- STT final transcription: 100-300ms
- LLM first token: 200-500ms
- TTS first audio byte: 100-300ms
- Network overhead: 50-100ms
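Summing the figures above shows why a purely sequential pipeline only meets the 800ms target at the optimistic end of each range:

```python
# Latency budget from the article, as (best, worst) milliseconds per stage
budget_ms = {
    "stt_endpointing": (200, 400),
    "stt_final": (100, 300),
    "llm_first_token": (200, 500),
    "tts_first_byte": (100, 300),
    "network": (50, 100),
}

best = sum(lo for lo, _ in budget_ms.values())   # 650 ms
worst = sum(hi for _, hi in budget_ms.values())  # 1600 ms
print(best, worst)  # 650 1600
```

The best case squeaks under 800ms, but the worst case is double the budget, which is exactly the gap that stage-overlapping streaming has to close.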
The key optimization is streaming at every stage. Instead of waiting for each stage to complete, you stream partial results to the next stage. Interim STT results can warm up the LLM context. Streaming LLM tokens feed directly into streaming TTS. This overlapping approach can cut perceived latency by 40-60%.
Putting It All Together
```python
class VoiceAgentPipeline:
    def __init__(self, stt, llm, tts):
        self.stt = stt
        self.llm = llm
        self.tts = tts

    async def handle_audio(self, audio_stream):
        # STT processes audio and emits transcripts
        transcript = await self.stt.transcribe(audio_stream)
        # LLM streams response tokens
        token_stream = self.llm.process_streaming(transcript)
        # TTS converts tokens to audio as they arrive
        async for audio_chunk in self.tts.synthesize_streaming(token_stream):
            yield audio_chunk  # Send to client immediately
```
FAQ
What is the biggest bottleneck in the voice AI pipeline?
The LLM stage typically contributes the most latency, especially the time to first token (TTFT). Using smaller models like GPT-4o-mini, or deploying local models with vLLM, can significantly reduce this bottleneck. Streaming the LLM output so TTS can start before the full response is generated is the single most impactful optimization.
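TTFT is easy to measure against any token stream. The sketch below uses a fake stream for illustration; in practice you would pass the text deltas from your streaming completion call:

```python
import asyncio
import time


async def measure_ttft(stream):
    """Return seconds until the first non-empty text delta arrives.

    `stream` is any async iterator of text deltas; this helper is a
    generic illustration, not tied to a specific SDK.
    """
    start = time.perf_counter()
    async for delta in stream:
        if delta:
            return time.perf_counter() - start
    return None


async def fake_stream():
    await asyncio.sleep(0.05)  # simulate ~50 ms to first token
    yield "Hello"
    yield " there"


ttft = asyncio.run(measure_ttft(fake_stream()))
print(f"TTFT: {ttft * 1000:.0f} ms")
```

Tracking this number per model and prompt length is the simplest way to decide whether a smaller model or a shorter context is worth the trade-off.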
Can I run the entire pipeline locally without cloud APIs?
Yes. You can use Whisper for STT, a local LLM via Ollama or vLLM, and Piper or Coqui TTS for speech synthesis. Local pipelines eliminate network latency entirely but require a GPU-equipped machine for acceptable performance; a machine with an NVIDIA RTX 4090 can typically run the full pipeline with sub-500ms latency, depending on the model sizes chosen.
How does the pipeline handle overlapping speech or interruptions?
This is called barge-in handling. The STT stage uses Voice Activity Detection (VAD) to detect when the user starts speaking during agent output. When barge-in is detected, the pipeline cancels the current TTS playback, processes the new user input, and generates a fresh response.
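The cancellation half of barge-in maps naturally onto asyncio task cancellation. A minimal sketch, with playback simulated by a sleep (the class and names are illustrative assumptions, not a standard API):

```python
import asyncio


class BargeInController:
    """Cancels in-flight TTS playback when user speech is detected (sketch)."""

    def __init__(self):
        self.playback_task = None

    def start_playback(self, coro):
        self.playback_task = asyncio.create_task(coro)
        return self.playback_task

    def on_user_speech(self):
        # VAD fired while the agent was talking: stop playback immediately
        if self.playback_task and not self.playback_task.done():
            self.playback_task.cancel()


async def demo():
    ctl = BargeInController()

    async def speak():
        await asyncio.sleep(10)  # stands in for long TTS playback
        return "finished"

    task = ctl.start_playback(speak())
    await asyncio.sleep(0.01)
    ctl.on_user_speech()  # user interrupts mid-utterance
    try:
        await task
    except asyncio.CancelledError:
        return "interrupted"


print(asyncio.run(demo()))  # interrupted
```

In a real agent, `on_user_speech` would be wired to the STT provider's VAD event, and cancellation would also flush any audio already queued on the output device.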
CallSphere Team
Expert insights on AI voice agents and customer communication automation.