Speech-to-Text for AI Agents: Comparing Whisper, Deepgram, and AssemblyAI
A practical comparison of the three leading STT engines for voice AI agents — OpenAI Whisper, Deepgram, and AssemblyAI — covering accuracy, latency, streaming capabilities, language support, and pricing.
Why STT Choice Matters for Voice Agents
The speech-to-text engine is the entry point for every voice AI agent. If transcription is slow, the entire pipeline stalls. If it is inaccurate, the language model receives garbled input and produces irrelevant responses. Choosing the right STT provider is one of the most consequential decisions in voice agent architecture.
This guide compares three production-grade options: OpenAI Whisper (self-hosted), Deepgram Nova, and AssemblyAI Universal. Each excels in different scenarios.
OpenAI Whisper: The Open-Source Powerhouse
Whisper is an open-source model from OpenAI trained on 680,000 hours of multilingual audio. It runs locally or via the OpenAI API, giving you full control over cost and privacy.
```python
import whisper
import numpy as np

class WhisperSTT:
    def __init__(self, model_size: str = "base"):
        # Model sizes: tiny, base, small, medium, large-v3
        self.model = whisper.load_model(model_size)

    def transcribe_file(self, audio_path: str) -> dict:
        result = self.model.transcribe(
            audio_path,
            language="en",
            fp16=True,  # Use half precision on GPU
            condition_on_previous_text=True,
        )
        return {
            "text": result["text"],
            "segments": result["segments"],
            "language": result["language"],
        }

    def transcribe_array(self, audio_array: np.ndarray) -> str:
        """Transcribe raw audio from a NumPy array (16 kHz mono, float32)."""
        result = self.model.transcribe(audio_array)
        return result["text"]

# Usage
stt = WhisperSTT("small")
result = stt.transcribe_file("call_recording.wav")
print(result["text"])
```
Strengths: Free when self-hosted, excellent accuracy on clean audio, supports 99 languages, full data privacy. Weaknesses: No native streaming support (batch-only), requires GPU for real-time performance, higher latency than cloud APIs.
For real-time agents, you can use faster-whisper, a CTranslate2 port that runs 4x faster than the original:
```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.wav", beam_size=5, vad_filter=True)
for segment in segments:
    print(f"[{segment.start:.1f}s -> {segment.end:.1f}s] {segment.text}")
```
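Because Whisper is batch-only, near-real-time pipelines typically buffer the microphone and transcribe overlapping windows. A minimal chunker is sketched below; the helper name, window size, and overlap are illustrative choices, not part of any SDK:

```python
import numpy as np

def chunk_audio(samples: np.ndarray, sample_rate: int = 16000,
                window_s: float = 5.0, overlap_s: float = 1.0):
    """Yield overlapping windows of 16 kHz mono audio so each Whisper
    call sees some context from the end of the previous chunk."""
    window = int(window_s * sample_rate)
    step = window - int(overlap_s * sample_rate)
    for start in range(0, len(samples), step):
        chunk = samples[start:start + window]
        if chunk.size:
            yield chunk

# Each chunk would then be passed to model.transcribe(chunk), with the
# overlapping text de-duplicated before being handed to the LLM.
```

The overlap trades extra GPU work for fewer words clipped at chunk boundaries.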
Deepgram Nova: Built for Real-Time
Deepgram Nova-2 is purpose-built for low-latency streaming transcription. It consistently achieves the fastest time-to-first-transcript among cloud providers.
```python
import asyncio

from deepgram import DeepgramClient, LiveTranscriptionEvents, LiveOptions

class DeepgramSTT:
    def __init__(self, api_key: str):
        self.client = DeepgramClient(api_key)

    async def stream_microphone(self, callback):
        connection = self.client.listen.asynclive.v("1")

        # Per the SDK's handler convention, the first argument is the
        # connection itself (hence "self" here, shadowing the outer one).
        async def on_transcript(self, result, **kwargs):
            alt = result.channel.alternatives[0]
            if alt.transcript:
                callback(
                    text=alt.transcript,
                    is_final=result.is_final,
                    confidence=alt.confidence,
                    words=alt.words,
                )

        connection.on(LiveTranscriptionEvents.Transcript, on_transcript)

        options = LiveOptions(
            model="nova-2",
            language="en-US",
            smart_format=True,      # Auto punctuation and formatting
            diarize=True,           # Speaker identification
            interim_results=True,
            endpointing=300,        # ms of silence before finalizing
            filler_words=False,     # Remove "um", "uh"
            utterance_end_ms=1000,
        )
        await connection.start(options)
        return connection

# Usage
stt = DeepgramSTT("your-api-key")

def handle_transcript(text, is_final, confidence, words):
    prefix = "FINAL" if is_final else "INTERIM"
    print(f"[{prefix}] ({confidence:.2f}) {text}")

# In production you would keep the event loop alive and stream
# audio frames into the returned connection.
asyncio.run(stt.stream_microphone(handle_transcript))
```
Strengths: Sub-200ms streaming latency, built-in diarization, smart formatting, excellent for real-time agents. Weaknesses: Cloud-only (no self-hosted option), cost scales with usage.
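The streaming sketch above opens a connection but never pushes audio. In practice you feed raw PCM frames to the connection's send method; a small pump is sketched below, where `pump_wav` is a hypothetical helper (not part of the Deepgram SDK) assuming 16-bit mono WAV input:

```python
import wave

def pump_wav(source, send, frame_ms: int = 20):
    """Read 16-bit mono PCM from a WAV file (path or file-like object)
    and push fixed-duration frames to a send() callable, e.g. the send
    method of a live STT connection."""
    with wave.open(source, "rb") as wav:
        frames_per_chunk = wav.getframerate() * frame_ms // 1000
        while True:
            data = wav.readframes(frames_per_chunk)
            if not data:
                break
            send(data)
```

In a live agent the frames come from the microphone rather than a file, and when replaying a recording you would pace the sends (roughly one frame every `frame_ms` milliseconds) to simulate real time.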
AssemblyAI Universal: Best-in-Class Accuracy
AssemblyAI Universal-2 leads accuracy benchmarks, especially on noisy audio, accented speech, and domain-specific vocabulary.
```python
import assemblyai as aai

class AssemblyAISTT:
    def __init__(self, api_key: str):
        aai.settings.api_key = api_key

    def transcribe_with_analysis(self, audio_url: str) -> dict:
        config = aai.TranscriptionConfig(
            speech_model=aai.SpeechModel.best,
            speaker_labels=True,
            auto_highlights=True,
            sentiment_analysis=True,
            entity_detection=True,
        )
        transcriber = aai.Transcriber()
        transcript = transcriber.transcribe(audio_url, config=config)
        return {
            "text": transcript.text,
            "utterances": [
                {"speaker": u.speaker, "text": u.text}
                for u in transcript.utterances
            ],
            "sentiment": transcript.sentiment_analysis,
            "entities": transcript.entities,
        }

    def stream_realtime(self, on_data):
        transcriber = aai.RealtimeTranscriber(
            sample_rate=16000,
            on_data=on_data,
            on_error=lambda e: print(f"Error: {e}"),
        )
        transcriber.connect()
        return transcriber
```
Strengths: Highest accuracy on difficult audio, built-in NLU features (sentiment, entity detection, summarization), excellent streaming. Weaknesses: Higher per-minute pricing, fewer language options than Whisper.
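The diarized utterances returned by `transcribe_with_analysis` above are easy to render as a readable call transcript. A small formatting helper, operating on the dict shape defined in that method:

```python
def format_transcript(result: dict) -> str:
    """Render diarized utterances as 'Speaker X: ...' lines."""
    return "\n".join(
        f"Speaker {u['speaker']}: {u['text']}"
        for u in result["utterances"]
    )
```

Useful for logging, QA review, or feeding a whole-conversation summary prompt to an LLM.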
Comparison Matrix
| Feature | Whisper (self-hosted) | Deepgram Nova-2 | AssemblyAI Universal-2 |
|---|---|---|---|
| Streaming | No (batch only) | Yes (sub-200ms) | Yes (sub-300ms) |
| WER (clean audio) | ~5% | ~6% | ~4.5% |
| Languages | 99 | 36 | 20+ |
| Self-hosted | Yes | No | No |
| Diarization | No (needs add-on) | Built-in | Built-in |
| Price | Free (GPU cost) | $0.0043/min | $0.0062/min |
Choosing the Right Engine
For real-time voice agents where latency is critical, Deepgram Nova-2 is the strongest choice. For offline processing or when data privacy is paramount, self-hosted Whisper with faster-whisper gives you full control. For high-accuracy scenarios with challenging audio (call centers, medical transcription), AssemblyAI leads on accuracy benchmarks.
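That decision logic can be reduced to a small helper. This is a deliberate simplification of the comparison matrix; real deployments also weigh language support, price, and compliance requirements:

```python
def choose_engine(realtime: bool, self_hosted: bool,
                  accuracy_critical: bool) -> str:
    """Pick an STT engine from the trade-offs in the comparison matrix."""
    if self_hosted:
        return "whisper"       # only option with full data control
    if realtime:
        return "deepgram"      # lowest streaming latency
    if accuracy_critical:
        return "assemblyai"    # best WER on difficult audio
    return "deepgram"          # reasonable cloud default
```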
FAQ
Can I combine multiple STT engines for better results?
Yes, a common production pattern is to use Deepgram for real-time streaming during the conversation (optimizing for speed) and then re-transcribe the full recording with AssemblyAI or Whisper large-v3 afterward for analytics and compliance. This gives you the best of both worlds.
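This dual-pass pattern can be sketched engine-agnostically. In the sketch below, `batch_transcribe` stands in for an AssemblyAI or Whisper large-v3 call, and `on_streaming_final` would be wired to the realtime engine's final-transcript callback; the class itself is illustrative, not from any SDK:

```python
from typing import Callable

class DualPassPipeline:
    """Fast streaming transcripts during the call; accurate re-transcription after."""

    def __init__(self, batch_transcribe: Callable[[str], str]):
        self.batch_transcribe = batch_transcribe
        self.live_finals: list[str] = []

    def on_streaming_final(self, text: str) -> None:
        # Called for each final result from the realtime engine;
        # this is also the text that feeds the LLM turn by turn.
        self.live_finals.append(text)

    def finalize(self, recording_path: str) -> dict:
        # After hangup: re-run the full recording through the accurate engine
        # for analytics and compliance records.
        return {
            "live_transcript": " ".join(self.live_finals),
            "archival_transcript": self.batch_transcribe(recording_path),
        }
```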
How do I handle background noise and accents?
All three engines handle moderate noise well, but preprocessing helps. Apply noise reduction before sending audio to the STT engine. For accents, AssemblyAI consistently performs best. You can also fine-tune Whisper on domain-specific audio data to improve accuracy for your specific use case.
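A minimal NumPy-only cleanup pass (DC-offset removal, pre-emphasis, peak normalization) is sketched below. This is light conditioning, not real noise suppression; for the latter, reach for a dedicated denoising library. The 0.95 pre-emphasis coefficient is a conventional choice, not a tuned value:

```python
import numpy as np

def preprocess(audio: np.ndarray) -> np.ndarray:
    """Light cleanup before STT: remove DC offset, apply pre-emphasis
    to boost high-frequency consonants, and peak-normalize to [-1, 1]."""
    audio = audio - audio.mean()
    emphasized = np.append(audio[0], audio[1:] - 0.95 * audio[:-1])
    peak = np.abs(emphasized).max()
    return emphasized / peak if peak > 0 else emphasized
```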
What sample rate and format should I send audio in?
For all three providers, 16kHz mono PCM (linear16) is the standard. Higher sample rates like 48kHz do not improve accuracy and waste bandwidth. If your source audio is stereo, mix it to mono before sending.
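Stereo-to-mono plus 48 kHz to 16 kHz conversion can be sketched with naive 3:1 block averaging, as below. Production code should apply an anti-aliasing low-pass filter before decimating (e.g. `scipy.signal.resample_poly`); the naive version is for illustration only:

```python
import numpy as np

def to_mono_16k(stereo_48k: np.ndarray) -> np.ndarray:
    """Convert float audio of shape (n, 2) at 48 kHz to mono at 16 kHz.
    Naive 3:1 block averaging; no anti-aliasing filter is applied."""
    mono = stereo_48k.mean(axis=1)                 # mix channels to mono
    trimmed = mono[: len(mono) - len(mono) % 3]    # drop remainder samples
    return trimmed.reshape(-1, 3).mean(axis=1)     # average every 3 samples
```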
#SpeechtoText #Whisper #Deepgram #AssemblyAI #VoiceAI #STT #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.