Voice AI Architecture: Understanding the STT-LLM-TTS Pipeline
Learn the three-stage pipeline that powers every voice AI agent — speech-to-text, language model reasoning, and text-to-speech — including latency budgets, streaming strategies, and practical implementation patterns.
The Three Stages of a Voice AI Agent
Every voice AI agent — whether it is a customer service bot, a voice assistant, or a conversational IVR — follows the same fundamental pipeline. Audio comes in from a microphone, gets converted to text, passes through a language model for reasoning, and the response gets converted back to speech. This is the STT-LLM-TTS pipeline, and understanding each stage is essential for building responsive voice agents.
The pipeline looks deceptively simple, but each stage introduces latency, and the cumulative delay determines whether your agent feels natural or robotic.
Stage 1: Speech-to-Text (STT)
The STT stage converts raw audio into text that the language model can process. Modern STT engines use transformer-based models trained on thousands of hours of multilingual speech data.
```python
from deepgram import DeepgramClient, LiveTranscriptionEvents, LiveOptions


class STTProcessor:
    def __init__(self, api_key: str):
        self.client = DeepgramClient(api_key)
        self.transcript_buffer = []

    async def start_streaming(self, on_transcript):
        connection = self.client.listen.asynclive.v("1")

        async def on_message(self, result, **kwargs):
            transcript = result.channel.alternatives[0].transcript
            if transcript:
                on_transcript(transcript, result.is_final)

        connection.on(LiveTranscriptionEvents.Transcript, on_message)

        options = LiveOptions(
            model="nova-2",
            language="en",
            encoding="linear16",
            sample_rate=16000,
            interim_results=True,  # Get partial results for faster feedback
            endpointing=300,       # Silence threshold in ms
            vad_events=True,       # Voice activity detection
        )
        await connection.start(options)
        return connection
```
Key STT considerations include model accuracy (measured by Word Error Rate), streaming versus batch mode, and endpointing — detecting when the user has finished speaking. Streaming STT returns interim results as the user speaks, which enables the pipeline to start LLM processing before the user finishes their sentence.
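Handling the two kinds of results correctly matters: interim transcripts are unstable and overwrite each other, while final transcripts can be committed. A minimal sketch of this bookkeeping (a hypothetical helper, not part of any SDK) might look like:

```python
class TranscriptAssembler:
    """Accumulates final STT segments; tracks the latest interim separately."""

    def __init__(self):
        self.finals = []   # committed, final segments
        self.interim = ""  # latest partial hypothesis (may still change)

    def on_transcript(self, text: str, is_final: bool):
        if is_final:
            self.finals.append(text)
            self.interim = ""
        else:
            self.interim = text  # interim results replace one another

    def current_text(self) -> str:
        # Everything the user has said so far, including the unstable tail
        parts = self.finals + ([self.interim] if self.interim else [])
        return " ".join(parts)


asm = TranscriptAssembler()
asm.on_transcript("turn on the", is_final=False)
asm.on_transcript("turn on the lights", is_final=True)
print(asm.current_text())  # turn on the lights
```

The `current_text` view is what you would feed to the LLM for early context warm-up, while only final segments should be appended to the durable conversation history.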
Stage 2: Language Model (LLM)
Once text is available, it is sent to a language model for reasoning. The LLM maintains conversation context, interprets intent, calls tools if needed, and generates a response.
```python
import openai


class LLMProcessor:
    def __init__(self, model: str = "gpt-4o-mini"):
        self.client = openai.AsyncOpenAI()
        self.model = model
        self.messages = []

    async def process_streaming(self, user_text: str):
        self.messages.append({"role": "user", "content": user_text})
        stream = await self.client.chat.completions.create(
            model=self.model,
            messages=self.messages,
            stream=True,
            max_tokens=200,  # Keep responses concise for voice
            temperature=0.7,
        )
        full_response = []
        async for chunk in stream:
            delta = chunk.choices[0].delta.content
            if delta:
                full_response.append(delta)
                yield delta  # Stream tokens to TTS immediately
        self.messages.append({
            "role": "assistant",
            "content": "".join(full_response),
        })
```
For voice agents, the LLM should generate short, conversational responses. Long paragraphs that work in chat feel unnatural when spoken aloud. System prompts should instruct the model to keep answers under two or three sentences.
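One way to enforce this is to prepend a voice-specific system prompt on every turn. The prompt wording and helper below are illustrative assumptions, not a fixed recipe:

```python
VOICE_SYSTEM_PROMPT = (
    "You are a voice assistant. Keep every answer under two or three "
    "sentences. Use plain spoken language: no markdown, no bullet points, "
    "no code blocks."
)


def build_messages(history: list[dict], user_text: str) -> list[dict]:
    # Prepend the system prompt so the model is steered on every turn,
    # regardless of how long the conversation history grows.
    return (
        [{"role": "system", "content": VOICE_SYSTEM_PROMPT}]
        + history
        + [{"role": "user", "content": user_text}]
    )


msgs = build_messages([], "What's the weather like?")
print(msgs[0]["role"])  # system
```

Keeping the system prompt out of the stored history and re-attaching it per request also makes it easy to swap prompts between voice and chat surfaces.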
Stage 3: Text-to-Speech (TTS)
The final stage converts the LLM response into audio. Modern TTS engines produce remarkably natural speech with appropriate prosody, emotion, and pacing.
```python
import httpx


class TTSProcessor:
    def __init__(self, api_key: str):
        self.api_key = api_key
        self.base_url = "https://api.elevenlabs.io/v1"

    async def synthesize_streaming(self, text_chunks):
        """Stream TTS as text tokens arrive from the LLM."""
        buffer = ""
        async for chunk in text_chunks:
            buffer += chunk
            # Send to TTS at sentence boundaries for natural prosody
            if any(buffer.endswith(p) for p in [".", "!", "?", ","]):
                audio = await self._synthesize(buffer.strip())
                yield audio
                buffer = ""
        if buffer.strip():
            yield await self._synthesize(buffer.strip())

    async def _synthesize(self, text: str) -> bytes:
        async with httpx.AsyncClient() as client:
            response = await client.post(
                f"{self.base_url}/text-to-speech/voice_id/stream",
                headers={"xi-api-key": self.api_key},
                json={"text": text, "model_id": "eleven_turbo_v2_5"},
            )
            return response.content
```
Latency Budget Breakdown
A responsive voice agent needs end-to-end latency under 800ms. Here is a typical budget:
- STT endpointing: 200-400ms (silence detection after user stops)
- STT final transcription: 100-300ms
- LLM first token: 200-500ms
- TTS first audio byte: 100-300ms
- Network overhead: 50-100ms
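Summing the figures above shows why a purely sequential pipeline only meets the 800ms target at the optimistic end of each range:

```python
# Latency budget from the article, as (best, worst) milliseconds per stage
budget_ms = {
    "stt_endpointing": (200, 400),
    "stt_final": (100, 300),
    "llm_first_token": (200, 500),
    "tts_first_byte": (100, 300),
    "network": (50, 100),
}

best = sum(lo for lo, _ in budget_ms.values())   # 650 ms
worst = sum(hi for _, hi in budget_ms.values())  # 1600 ms
print(best, worst)  # 650 1600
```

The best case squeaks under 800ms, but the worst case is double the budget, which is exactly the gap that stage-overlapping streaming has to close.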
The key optimization is streaming at every stage. Instead of waiting for each stage to complete, you stream partial results to the next stage. Interim STT results can warm up the LLM context. Streaming LLM tokens feed directly into streaming TTS. This overlapping approach can cut perceived latency by 40-60%.
Putting It All Together
```python
class VoiceAgentPipeline:
    def __init__(self, stt, llm, tts):
        self.stt = stt
        self.llm = llm
        self.tts = tts

    async def handle_audio(self, audio_stream):
        # STT processes audio and emits transcripts
        transcript = await self.stt.transcribe(audio_stream)
        # LLM streams response tokens
        token_stream = self.llm.process_streaming(transcript)
        # TTS converts tokens to audio as they arrive
        async for audio_chunk in self.tts.synthesize_streaming(token_stream):
            yield audio_chunk  # Send to client immediately
```
FAQ
What is the biggest bottleneck in the voice AI pipeline?
The LLM stage typically contributes the most latency, especially the time to first token (TTFT). Using smaller models like GPT-4o-mini, or deploying local models with vLLM, can significantly reduce this bottleneck. Streaming the LLM output so TTS can start before the full response is generated is the single most impactful optimization.
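TTFT is easy to measure against any token stream. The sketch below uses a fake stream for illustration; in practice you would pass the text deltas from your streaming completion call:

```python
import asyncio
import time


async def measure_ttft(stream):
    """Return seconds until the first non-empty text delta arrives.

    `stream` is any async iterator of text deltas; this helper is a
    generic illustration, not tied to a specific SDK.
    """
    start = time.perf_counter()
    async for delta in stream:
        if delta:
            return time.perf_counter() - start
    return None


async def fake_stream():
    await asyncio.sleep(0.05)  # simulate ~50 ms to first token
    yield "Hello"
    yield " there"


ttft = asyncio.run(measure_ttft(fake_stream()))
print(f"TTFT: {ttft * 1000:.0f} ms")
```

Tracking this number per model and prompt length is the simplest way to decide whether a smaller model or a shorter context is worth the trade-off.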
Can I run the entire pipeline locally without cloud APIs?
Yes. You can use Whisper for STT, a local LLM via Ollama or vLLM, and Piper or Coqui TTS for speech synthesis. Local pipelines eliminate network latency entirely but require a GPU-equipped machine for acceptable performance; a machine with an NVIDIA RTX 4090 can typically run the full pipeline with sub-500ms latency, depending on the model sizes chosen.
How does the pipeline handle overlapping speech or interruptions?
This is called barge-in handling. The STT stage uses Voice Activity Detection (VAD) to detect when the user starts speaking during agent output. When barge-in is detected, the pipeline cancels the current TTS playback, processes the new user input, and generates a fresh response.
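The cancellation half of barge-in maps naturally onto asyncio task cancellation. A minimal sketch, with playback simulated by a sleep (the class and names are illustrative assumptions, not a standard API):

```python
import asyncio


class BargeInController:
    """Cancels in-flight TTS playback when user speech is detected (sketch)."""

    def __init__(self):
        self.playback_task = None

    def start_playback(self, coro):
        self.playback_task = asyncio.create_task(coro)
        return self.playback_task

    def on_user_speech(self):
        # VAD fired while the agent was talking: stop playback immediately
        if self.playback_task and not self.playback_task.done():
            self.playback_task.cancel()


async def demo():
    ctl = BargeInController()

    async def speak():
        await asyncio.sleep(10)  # stands in for long TTS playback
        return "finished"

    task = ctl.start_playback(speak())
    await asyncio.sleep(0.01)
    ctl.on_user_speech()  # user interrupts mid-utterance
    try:
        await task
    except asyncio.CancelledError:
        return "interrupted"


print(asyncio.run(demo()))  # interrupted
```

In a real agent, `on_user_speech` would be wired to the STT provider's VAD event, and cancellation would also flush any audio already queued on the output device.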
CallSphere Team
Expert insights on AI voice agents and customer communication automation.