Speech-to-Text for AI Agents: Comparing Whisper, Deepgram, and AssemblyAI
A practical comparison of the three leading STT engines for voice AI agents — OpenAI Whisper, Deepgram, and AssemblyAI — covering accuracy, latency, streaming capabilities, language support, and pricing.
Why STT Choice Matters for Voice Agents
The speech-to-text engine is the entry point for every voice AI agent. If transcription is slow, the entire pipeline stalls. If it is inaccurate, the language model receives garbled input and produces irrelevant responses. Choosing the right STT provider is one of the most consequential decisions in voice agent architecture.
This guide compares three production-grade options: OpenAI Whisper (self-hosted), Deepgram Nova, and AssemblyAI Universal. Each excels in different scenarios.
OpenAI Whisper: The Open-Source Powerhouse
Whisper is an open-source model from OpenAI trained on 680,000 hours of multilingual audio. It runs locally or via the OpenAI API, giving you full control over cost and privacy.
```python
import whisper
import numpy as np

class WhisperSTT:
    def __init__(self, model_size: str = "base"):
        # Model sizes: tiny, base, small, medium, large-v3
        self.model = whisper.load_model(model_size)

    def transcribe_file(self, audio_path: str) -> dict:
        result = self.model.transcribe(
            audio_path,
            language="en",
            fp16=True,  # Use half precision on GPU
            condition_on_previous_text=True,
        )
        return {
            "text": result["text"],
            "segments": result["segments"],
            "language": result["language"],
        }

    def transcribe_array(self, audio_array: np.ndarray) -> str:
        """Transcribe raw audio from a NumPy array (16 kHz mono, float32)."""
        result = self.model.transcribe(audio_array)
        return result["text"]

# Usage
stt = WhisperSTT("small")
result = stt.transcribe_file("call_recording.wav")
print(result["text"])
```
Strengths: Free when self-hosted, excellent accuracy on clean audio, supports 99 languages, full data privacy. Weaknesses: No native streaming support (batch-only), requires GPU for real-time performance, higher latency than cloud APIs.
For real-time agents, you can use faster-whisper, a CTranslate2 port that runs 4x faster than the original:
```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.wav", beam_size=5, vad_filter=True)
for segment in segments:
    print(f"[{segment.start:.1f}s -> {segment.end:.1f}s] {segment.text}")
```
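Because Whisper is batch-only, near-real-time pipelines typically buffer the microphone and transcribe overlapping windows. A minimal chunker is sketched below; the helper name, window size, and overlap are illustrative choices, not part of any SDK:

```python
import numpy as np

def chunk_audio(samples: np.ndarray, sample_rate: int = 16000,
                window_s: float = 5.0, overlap_s: float = 1.0):
    """Yield overlapping windows of 16 kHz mono audio so each Whisper
    call sees some context from the end of the previous chunk."""
    window = int(window_s * sample_rate)
    step = window - int(overlap_s * sample_rate)
    for start in range(0, len(samples), step):
        chunk = samples[start:start + window]
        if chunk.size:
            yield chunk

# Each chunk would then be passed to model.transcribe(chunk), with the
# overlapping text de-duplicated before being handed to the LLM.
```

The overlap trades extra GPU work for fewer words clipped at chunk boundaries.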
Deepgram Nova: Built for Real-Time
Deepgram Nova-2 is purpose-built for low-latency streaming transcription. It consistently achieves the fastest time-to-first-transcript among cloud providers.
```python
import asyncio

from deepgram import DeepgramClient, LiveTranscriptionEvents, LiveOptions

class DeepgramSTT:
    def __init__(self, api_key: str):
        self.client = DeepgramClient(api_key)

    async def stream_microphone(self, callback):
        connection = self.client.listen.asynclive.v("1")

        # Per the SDK's handler convention, the first argument is the
        # connection itself (hence "self" here, shadowing the outer one).
        async def on_transcript(self, result, **kwargs):
            alt = result.channel.alternatives[0]
            if alt.transcript:
                callback(
                    text=alt.transcript,
                    is_final=result.is_final,
                    confidence=alt.confidence,
                    words=alt.words,
                )

        connection.on(LiveTranscriptionEvents.Transcript, on_transcript)

        options = LiveOptions(
            model="nova-2",
            language="en-US",
            smart_format=True,      # Auto punctuation and formatting
            diarize=True,           # Speaker identification
            interim_results=True,
            endpointing=300,        # ms of silence before finalizing
            filler_words=False,     # Remove "um", "uh"
            utterance_end_ms=1000,
        )
        await connection.start(options)
        return connection

# Usage
stt = DeepgramSTT("your-api-key")

def handle_transcript(text, is_final, confidence, words):
    prefix = "FINAL" if is_final else "INTERIM"
    print(f"[{prefix}] ({confidence:.2f}) {text}")

# In production you would keep the event loop alive and stream
# audio frames into the returned connection.
asyncio.run(stt.stream_microphone(handle_transcript))
```
Strengths: Sub-200ms streaming latency, built-in diarization, smart formatting, excellent for real-time agents. Weaknesses: Cloud-only (no self-hosted option), cost scales with usage.
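The streaming sketch above opens a connection but never pushes audio. In practice you feed raw PCM frames to the connection's send method; a small pump is sketched below, where `pump_wav` is a hypothetical helper (not part of the Deepgram SDK) assuming 16-bit mono WAV input:

```python
import wave

def pump_wav(source, send, frame_ms: int = 20):
    """Read 16-bit mono PCM from a WAV file (path or file-like object)
    and push fixed-duration frames to a send() callable, e.g. the send
    method of a live STT connection."""
    with wave.open(source, "rb") as wav:
        frames_per_chunk = wav.getframerate() * frame_ms // 1000
        while True:
            data = wav.readframes(frames_per_chunk)
            if not data:
                break
            send(data)
```

In a live agent the frames come from the microphone rather than a file, and when replaying a recording you would pace the sends (roughly one frame every `frame_ms` milliseconds) to simulate real time.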
AssemblyAI Universal: Best-in-Class Accuracy
AssemblyAI Universal-2 leads accuracy benchmarks, especially on noisy audio, accented speech, and domain-specific vocabulary.
```python
import assemblyai as aai

class AssemblyAISTT:
    def __init__(self, api_key: str):
        aai.settings.api_key = api_key

    def transcribe_with_analysis(self, audio_url: str) -> dict:
        config = aai.TranscriptionConfig(
            speech_model=aai.SpeechModel.best,
            speaker_labels=True,
            auto_highlights=True,
            sentiment_analysis=True,
            entity_detection=True,
        )
        transcriber = aai.Transcriber()
        transcript = transcriber.transcribe(audio_url, config=config)
        return {
            "text": transcript.text,
            "utterances": [
                {"speaker": u.speaker, "text": u.text}
                for u in transcript.utterances
            ],
            "sentiment": transcript.sentiment_analysis,
            "entities": transcript.entities,
        }

    def stream_realtime(self, on_data):
        transcriber = aai.RealtimeTranscriber(
            sample_rate=16000,
            on_data=on_data,
            on_error=lambda e: print(f"Error: {e}"),
        )
        transcriber.connect()
        return transcriber
```
Strengths: Highest accuracy on difficult audio, built-in NLU features (sentiment, entity detection, summarization), excellent streaming. Weaknesses: Higher per-minute pricing, fewer language options than Whisper.
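The diarized utterances returned by `transcribe_with_analysis` above are easy to render as a readable call transcript. A small formatting helper, operating on the dict shape defined in that method:

```python
def format_transcript(result: dict) -> str:
    """Render diarized utterances as 'Speaker X: ...' lines."""
    return "\n".join(
        f"Speaker {u['speaker']}: {u['text']}"
        for u in result["utterances"]
    )
```

Useful for logging, QA review, or feeding a whole-conversation summary prompt to an LLM.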
Comparison Matrix
| Feature | Whisper (self-hosted) | Deepgram Nova-2 | AssemblyAI Universal-2 |
|---|---|---|---|
| Streaming | No (batch only) | Yes (sub-200ms) | Yes (sub-300ms) |
| WER (clean audio) | ~5% | ~6% | ~4.5% |
| Languages | 99 | 36 | 20+ |
| Self-hosted | Yes | No | No |
| Diarization | No (needs add-on) | Built-in | Built-in |
| Price | Free (GPU cost) | $0.0043/min | $0.0062/min |
Choosing the Right Engine
For real-time voice agents where latency is critical, Deepgram Nova-2 is the strongest choice. For offline processing or when data privacy is paramount, self-hosted Whisper with faster-whisper gives you full control. For high-accuracy scenarios with challenging audio (call centers, medical transcription), AssemblyAI leads on accuracy benchmarks.
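That decision logic can be reduced to a small helper. This is a deliberate simplification of the comparison matrix; real deployments also weigh language support, price, and compliance requirements:

```python
def choose_engine(realtime: bool, self_hosted: bool,
                  accuracy_critical: bool) -> str:
    """Pick an STT engine from the trade-offs in the comparison matrix."""
    if self_hosted:
        return "whisper"       # only option with full data control
    if realtime:
        return "deepgram"      # lowest streaming latency
    if accuracy_critical:
        return "assemblyai"    # best WER on difficult audio
    return "deepgram"          # reasonable cloud default
```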
FAQ
Can I combine multiple STT engines for better results?
Yes, a common production pattern is to use Deepgram for real-time streaming during the conversation (optimizing for speed) and then re-transcribe the full recording with AssemblyAI or Whisper large-v3 afterward for analytics and compliance. This gives you the best of both worlds.
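This dual-pass pattern can be sketched engine-agnostically. In the sketch below, `batch_transcribe` stands in for an AssemblyAI or Whisper large-v3 call, and `on_streaming_final` would be wired to the realtime engine's final-transcript callback; the class itself is illustrative, not from any SDK:

```python
from typing import Callable

class DualPassPipeline:
    """Fast streaming transcripts during the call; accurate re-transcription after."""

    def __init__(self, batch_transcribe: Callable[[str], str]):
        self.batch_transcribe = batch_transcribe
        self.live_finals: list[str] = []

    def on_streaming_final(self, text: str) -> None:
        # Called for each final result from the realtime engine;
        # this is also the text that feeds the LLM turn by turn.
        self.live_finals.append(text)

    def finalize(self, recording_path: str) -> dict:
        # After hangup: re-run the full recording through the accurate engine
        # for analytics and compliance records.
        return {
            "live_transcript": " ".join(self.live_finals),
            "archival_transcript": self.batch_transcribe(recording_path),
        }
```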
How do I handle background noise and accents?
All three engines handle moderate noise well, but preprocessing helps. Apply noise reduction before sending audio to the STT engine. For accents, AssemblyAI consistently performs best. You can also fine-tune Whisper on domain-specific audio data to improve accuracy for your specific use case.
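A minimal NumPy-only cleanup pass (DC-offset removal, pre-emphasis, peak normalization) is sketched below. This is light conditioning, not real noise suppression; for the latter, reach for a dedicated denoising library. The 0.95 pre-emphasis coefficient is a conventional choice, not a tuned value:

```python
import numpy as np

def preprocess(audio: np.ndarray) -> np.ndarray:
    """Light cleanup before STT: remove DC offset, apply pre-emphasis
    to boost high-frequency consonants, and peak-normalize to [-1, 1]."""
    audio = audio - audio.mean()
    emphasized = np.append(audio[0], audio[1:] - 0.95 * audio[:-1])
    peak = np.abs(emphasized).max()
    return emphasized / peak if peak > 0 else emphasized
```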
What sample rate and format should I send audio in?
For all three providers, 16kHz mono PCM (linear16) is the standard. Higher sample rates like 48kHz do not improve accuracy and waste bandwidth. If your source audio is stereo, mix it to mono before sending.
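Stereo-to-mono plus 48 kHz to 16 kHz conversion can be sketched with naive 3:1 block averaging, as below. Production code should apply an anti-aliasing low-pass filter before decimating (e.g. `scipy.signal.resample_poly`); the naive version is for illustration only:

```python
import numpy as np

def to_mono_16k(stereo_48k: np.ndarray) -> np.ndarray:
    """Convert float audio of shape (n, 2) at 48 kHz to mono at 16 kHz.
    Naive 3:1 block averaging; no anti-aliasing filter is applied."""
    mono = stereo_48k.mean(axis=1)                 # mix channels to mono
    trimmed = mono[: len(mono) - len(mono) % 3]    # drop remainder samples
    return trimmed.reshape(-1, 3).mean(axis=1)     # average every 3 samples
```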
#SpeechtoText #Whisper #Deepgram #AssemblyAI #VoiceAI #STT #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.