Voice Agent Error Recovery: Handling Network Issues, Transcription Failures, and Timeouts
Build resilient voice AI agents that handle failures gracefully — covering retry strategies, fallback messages, circuit breakers, and graceful degradation patterns for network outages, STT errors, and LLM timeouts.
Why Voice Agents Need Robust Error Handling
Voice agents operate in a uniquely unforgiving environment. When a web page hits an API error, it can show a loading spinner or an error message while the user waits. When a voice agent goes silent for three seconds because of an unhandled error, the user assumes the call dropped. They hang up, and you lose the interaction.
Every component in the voice pipeline can fail: STT services return empty transcripts, LLM APIs time out, TTS services produce garbled audio, and network connections drop mid-conversation. Building a production voice agent means planning for every failure mode and ensuring the agent always has something to say.
The Error Recovery Framework
A comprehensive error recovery system has four layers: detection, classification, recovery, and user communication.
from enum import Enum
from dataclasses import dataclass
import asyncio
import time

class ErrorSeverity(Enum):
    TRANSIENT = "transient"  # Retry likely to succeed
    DEGRADED = "degraded"    # Partial functionality available
    CRITICAL = "critical"    # Cannot continue normally

class ErrorCategory(Enum):
    STT_FAILURE = "stt_failure"
    LLM_TIMEOUT = "llm_timeout"
    LLM_ERROR = "llm_error"
    TTS_FAILURE = "tts_failure"
    NETWORK = "network"
    AUDIO_QUALITY = "audio_quality"

@dataclass
class VoiceError:
    category: ErrorCategory
    severity: ErrorSeverity
    message: str
    timestamp: float
    retryable: bool = True

class ErrorRecoveryManager:
    def __init__(self):
        self.error_history = []
        self.circuit_breakers = {}
        self.fallback_audio = {}  # Pre-synthesized fallback messages

    def classify_error(self, exception: Exception, stage: str) -> VoiceError:
        """Classify an exception into a structured VoiceError."""
        if isinstance(exception, asyncio.TimeoutError):
            if stage == "llm":
                return VoiceError(
                    category=ErrorCategory.LLM_TIMEOUT,
                    severity=ErrorSeverity.TRANSIENT,
                    message="LLM response timed out",
                    timestamp=time.time(),
                )
            return VoiceError(
                category=ErrorCategory.NETWORK,
                severity=ErrorSeverity.TRANSIENT,
                message=f"Timeout in {stage}",
                timestamp=time.time(),
            )
        if isinstance(exception, ConnectionError):
            return VoiceError(
                category=ErrorCategory.NETWORK,
                severity=ErrorSeverity.DEGRADED,
                message=str(exception),
                timestamp=time.time(),
            )
        return VoiceError(
            category=ErrorCategory.LLM_ERROR,
            severity=ErrorSeverity.CRITICAL,
            message=str(exception),
            timestamp=time.time(),
            retryable=False,
        )
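Once an error is classified, downstream code can key purely off severity and retryability. A minimal routing sketch (the function and action names are illustrative, not part of the framework above):

```python
# Map a classified error's severity to a recovery action. Illustrative
# sketch only; the real recovery layer would take the full VoiceError.
def choose_recovery(severity: str, retryable: bool) -> str:
    if severity == "transient" and retryable:
        return "retry"      # fast retry with bounded backoff
    if severity == "degraded":
        return "fallback"   # switch to a degraded path immediately
    return "escalate"       # play fallback audio or transfer the call

print(choose_recovery("transient", True))   # retry
print(choose_recovery("critical", False))   # escalate
```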
Retry Strategies with Exponential Backoff
For transient errors, retries are the first line of defense. But voice agents cannot afford the long backoff delays typical in backend systems — the user is waiting in real time.
import asyncio

class VoiceRetryPolicy:
    """Fast retry policy optimized for real-time voice interactions."""

    def __init__(
        self,
        max_retries: int = 2,
        initial_delay_ms: int = 100,
        max_delay_ms: int = 500,
        backoff_factor: float = 2.0,
    ):
        self.max_retries = max_retries
        self.initial_delay_ms = initial_delay_ms
        self.max_delay_ms = max_delay_ms
        self.backoff_factor = backoff_factor

    async def execute(self, func, *args, **kwargs):
        """Execute with retries, returning result or raising last error."""
        last_error = None
        delay_ms = self.initial_delay_ms
        for attempt in range(self.max_retries + 1):
            try:
                return await asyncio.wait_for(
                    func(*args, **kwargs),
                    timeout=2.0,  # Hard timeout per attempt
                )
            except Exception as e:
                last_error = e
                if attempt < self.max_retries:
                    await asyncio.sleep(delay_ms / 1000)
                    delay_ms = min(
                        delay_ms * self.backoff_factor,
                        self.max_delay_ms,
                    )
        raise last_error

# Usage
retry = VoiceRetryPolicy(max_retries=2, initial_delay_ms=100)
try:
    result = await retry.execute(llm_client.generate, prompt)
except Exception:
    # All retries exhausted — use fallback
    result = get_fallback_response(prompt)
Circuit Breaker Pattern
When a service is consistently failing, retries waste time and degrade the user experience. A circuit breaker stops attempting calls to a failing service and switches to a fallback immediately.
import asyncio
import time

class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout_s: float = 30.0,
        name: str = "default",
    ):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.name = name
        self.failure_count = 0
        self.last_failure_time = 0
        self.state = "closed"  # closed = normal, open = failing, half-open = trial

    def can_execute(self) -> bool:
        if self.state == "closed":
            return True
        # Check if enough time has passed to retry (half-open)
        elapsed = time.time() - self.last_failure_time
        if elapsed >= self.reset_timeout_s:
            self.state = "half-open"
            return True
        return False

    def record_success(self):
        self.failure_count = 0
        self.state = "closed"

    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = "open"
            print(f"Circuit breaker [{self.name}] OPEN — using fallback")

class ResilientLLMClient:
    def __init__(self, primary_client, fallback_client):
        self.primary = primary_client
        self.fallback = fallback_client
        self.breaker = CircuitBreaker(name="llm", failure_threshold=3)

    async def generate(self, messages: list) -> str:
        if self.breaker.can_execute():
            try:
                result = await asyncio.wait_for(
                    self.primary.chat(messages), timeout=3.0
                )
                self.breaker.record_success()
                return result
            except Exception:
                self.breaker.record_failure()
        # Fallback to secondary LLM
        return await self.fallback.chat(messages)
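To see the failover path in isolation, here is a self-contained sketch with stub clients: a primary that always fails and a fallback that answers. The breaker is omitted for brevity, and the stub names are illustrative, not real SDK classes.

```python
import asyncio

class FailingPrimary:
    async def chat(self, messages):
        raise ConnectionError("primary LLM unreachable")

class WorkingFallback:
    async def chat(self, messages):
        return "fallback answer"

async def generate(primary, fallback, messages):
    try:
        # Bound the primary call so a hang cannot stall the conversation
        return await asyncio.wait_for(primary.chat(messages), timeout=3.0)
    except Exception:
        # Primary failed: route the same messages to the secondary model
        return await fallback.chat(messages)

reply = asyncio.run(generate(FailingPrimary(), WorkingFallback(), []))
print(reply)  # fallback answer
```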
Handling STT Failures
STT failures fall into two categories: empty transcripts (the engine returned nothing) and low-confidence transcripts (the engine returned unreliable text).
class STTErrorHandler:
    def __init__(self):
        self.consecutive_empty = 0
        self.max_empty_before_prompt = 3

    async def handle_transcript(
        self, text: str, confidence: float, is_final: bool
    ) -> dict:
        if not is_final:
            return {"action": "wait", "text": text}

        # Empty transcript
        if not text or not text.strip():
            self.consecutive_empty += 1
            if self.consecutive_empty >= self.max_empty_before_prompt:
                self.consecutive_empty = 0
                return {
                    "action": "prompt_user",
                    "message": "I'm having trouble hearing you. "
                    "Could you speak a bit louder or move "
                    "closer to your microphone?",
                }
            return {"action": "ignore"}

        # Low confidence transcript
        if confidence < 0.6:
            return {
                "action": "confirm",
                "message": f'I think you said "{text}". Is that right?',
                "original_text": text,
            }

        # Good transcript
        self.consecutive_empty = 0
        return {"action": "process", "text": text}
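The handler returns action dicts rather than acting directly, so the caller decides what each action means. A sketch of that dispatch, with RecordingAgent standing in for the real pipeline (the speak and process_turn method names are assumptions):

```python
import asyncio

class RecordingAgent:
    def __init__(self):
        self.spoken = []
        self.processed = []

    async def speak(self, message: str):
        self.spoken.append(message)

    async def process_turn(self, text: str):
        self.processed.append(text)

async def dispatch(result: dict, agent: RecordingAgent) -> None:
    action = result["action"]
    if action == "process":
        await agent.process_turn(result["text"])
    elif action in ("prompt_user", "confirm"):
        # Both cases speak back to the user before continuing
        await agent.speak(result["message"])
    # "wait" and "ignore" need no response

agent = RecordingAgent()
asyncio.run(dispatch({"action": "process", "text": "book a table"}, agent))
asyncio.run(dispatch({"action": "confirm", "message": "Is that right?"}, agent))
print(agent.processed, agent.spoken)
```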
Pre-Synthesized Fallback Audio
The worst thing a voice agent can do is go silent during an error. Pre-synthesize fallback messages at startup so they are always available, even if the TTS service is down.
class FallbackAudioLibrary:
    def __init__(self):
        self.audio_cache = {}

    async def preload(self, tts_client):
        """Pre-synthesize all fallback messages at startup."""
        fallback_messages = {
            "generic_error": "I'm sorry, I'm having a technical "
            "issue right now. Let me try again.",
            "network_error": "It seems we're having connection "
            "issues. Please hold on a moment.",
            "cant_hear": "I'm having trouble hearing you. Could "
            "you try speaking a little louder?",
            "timeout": "I apologize for the delay. Let me look "
            "into that for you.",
            "repeat": "I'm sorry, could you repeat that?",
            "transfer": "Let me connect you with a human agent "
            "who can help you better.",
            "goodbye": "Thank you for calling. Goodbye!",
        }
        for key, message in fallback_messages.items():
            try:
                self.audio_cache[key] = await tts_client.synthesize(message)
                print(f"Pre-loaded fallback: {key}")
            except Exception as e:
                print(f"Warning: Could not pre-load {key}: {e}")

    def get(self, key: str) -> bytes | None:
        return self.audio_cache.get(key)
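At speak time, the pattern is: try live TTS first, serve cached audio when it fails. A self-contained sketch, where the stub classes stand in for a real TTS client and the library above:

```python
import asyncio

class FailingTTS:
    async def synthesize(self, text: str) -> bytes:
        raise ConnectionError("TTS unavailable")

class StubLibrary:
    def __init__(self):
        self.audio_cache = {"generic_error": b"cached-audio"}

    def get(self, key: str):
        return self.audio_cache.get(key)

async def speak_or_fallback(text, tts_client, library, key="generic_error"):
    try:
        return await tts_client.synthesize(text)
    except Exception:
        # Live TTS failed: reuse audio synthesized at startup
        return library.get(key)

audio = asyncio.run(speak_or_fallback("Hello", FailingTTS(), StubLibrary()))
print(audio == b"cached-audio")  # True
```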
Network Disconnection and Reconnection
WebSocket and WebRTC connections can drop at any time. Implement automatic reconnection with state recovery.
class ResilientConnection {
  constructor(url, options = {}) {
    this.url = url;
    this.maxRetries = options.maxRetries || 5;
    this.baseDelay = options.baseDelay || 1000;
    this.retryCount = 0;
    this.ws = null;
    this.messageQueue = [];
    this.onMessage = options.onMessage || (() => {});
    this.onReconnect = options.onReconnect || (() => {});
  }

  connect() {
    this.ws = new WebSocket(this.url);

    this.ws.onopen = () => {
      console.log('Connected');
      this.retryCount = 0;
      // Flush queued messages
      while (this.messageQueue.length > 0) {
        this.ws.send(this.messageQueue.shift());
      }
      this.onReconnect();
    };

    this.ws.onmessage = (event) => this.onMessage(event);

    this.ws.onclose = (event) => {
      if (event.code !== 1000) {
        // Abnormal closure — attempt reconnect
        this.reconnect();
      }
    };

    this.ws.onerror = () => {
      // Error will trigger onclose, which handles reconnection
    };
  }

  reconnect() {
    if (this.retryCount >= this.maxRetries) {
      console.error('Max reconnection attempts reached');
      return;
    }
    const delay = this.baseDelay * Math.pow(2, this.retryCount);
    const jitter = delay * 0.2 * Math.random();
    this.retryCount++;
    console.log(
      'Reconnecting in ' + Math.round(delay + jitter) + 'ms ' +
      '(attempt ' + this.retryCount + '/' + this.maxRetries + ')'
    );
    setTimeout(() => this.connect(), delay + jitter);
  }

  send(data) {
    if (this.ws && this.ws.readyState === WebSocket.OPEN) {
      this.ws.send(data);
    } else {
      // Queue messages during disconnection
      this.messageQueue.push(data);
    }
  }
}
Graceful Degradation Strategy
When multiple components fail, degrade gracefully rather than crashing. Define a degradation hierarchy.
class DegradationManager:
    """Manage graceful degradation when services fail."""

    def __init__(self):
        self.service_status = {
            "stt": True,
            "llm": True,
            "tts": True,
        }

    def get_degradation_level(self) -> str:
        if all(self.service_status.values()):
            return "full"       # All services operational
        if self.service_status["llm"]:
            return "limited"    # Can still reason, but degraded I/O
        return "emergency"      # Cannot reason, transfer to human

    async def handle_request(self, audio_input, pipeline, transfer_fn):
        level = self.get_degradation_level()
        if level == "full":
            return await pipeline.full_process(audio_input)
        elif level == "limited":
            # STT or TTS down — use text fallback
            if not self.service_status["stt"]:
                # Ask user to type instead
                return pipeline.get_fallback_audio("type_instead")
            if not self.service_status["tts"]:
                # Return text response for display
                transcript = await pipeline.stt_process(audio_input)
                return await pipeline.llm_process(transcript)
        else:
            # Emergency — transfer to human
            await transfer_fn()
            return pipeline.get_fallback_audio("transfer")
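The decision itself is small enough to test in isolation. This sketch mirrors the get_degradation_level logic above as a standalone function:

```python
# Mirrors DegradationManager.get_degradation_level: the LLM is the
# load-bearing service, so its status decides limited vs. emergency.
def degradation_level(status: dict) -> str:
    if all(status.values()):
        return "full"
    if status["llm"]:
        return "limited"
    return "emergency"

print(degradation_level({"stt": True, "llm": True, "tts": True}))    # full
print(degradation_level({"stt": False, "llm": True, "tts": True}))   # limited
print(degradation_level({"stt": True, "llm": False, "tts": True}))   # emergency
```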
FAQ
How many retries should a voice agent attempt before falling back?
For real-time voice, limit retries to 1-2 attempts with very short delays (100-200ms). The total retry budget should not exceed 500ms. Users are waiting in silence during retries, and even a half-second of silence feels awkward. It is better to play a brief fallback message ("One moment, please") and retry in the background than to leave the user in silence while retrying.
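The "acknowledge first, finish in the background" idea can be sketched with asyncio: if the model misses the latency budget, play a filler line, then deliver the real answer when it arrives. The helper names here (play_filler, slow_generate) are stand-ins for your pipeline, and the demo timings are shortened for illustration:

```python
import asyncio

async def respond_with_filler(generate, play_filler, budget_s=0.5):
    task = asyncio.create_task(generate())
    try:
        # Answer arrived inside the budget: use it directly
        return await asyncio.wait_for(asyncio.shield(task), timeout=budget_s)
    except asyncio.TimeoutError:
        # Budget blown: fill the silence, then await the real answer.
        # shield() kept the task alive through the timeout.
        await play_filler("One moment, please.")
        return await task

async def slow_generate():
    await asyncio.sleep(0.2)  # simulated slow LLM call
    return "Here is your answer."

fillers = []

async def play_filler(message):
    fillers.append(message)

result = asyncio.run(respond_with_filler(slow_generate, play_filler, budget_s=0.05))
print(result, fillers)
```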
Should the agent tell the user when an error occurs?
Yes, but frame it conversationally, not technically. Instead of "I experienced a transcription error," say "I didn't quite catch that — could you say that again?" Users do not need to know about your internal architecture. The goal is to keep the conversation flowing naturally even when things go wrong behind the scenes. Only escalate to explicit error messaging ("I'm having technical difficulties") when the problem persists across multiple exchanges.
How do I test error recovery in voice agents?
Use chaos engineering principles. Build a test harness that injects failures at each pipeline stage: drop STT connections mid-stream, return empty transcripts, add 5-second LLM delays, and corrupt TTS audio. Run automated conversations through this harness and verify that the agent always responds within your latency budget and never goes silent. Record these test sessions and listen to them to verify the recovery experience sounds natural.
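A fault-injection harness can be as simple as a wrapper that, with some probability, raises an error or injects latency instead of calling the real pipeline stage. A hedged sketch (all names illustrative; a deterministic rng forces the failure branch for the demo):

```python
import asyncio
import random

def chaos_wrap(stage_fn, failure_rate=0.3, delay_s=5.0, rng=random.random):
    async def wrapped(*args, **kwargs):
        roll = rng()
        if roll < failure_rate / 2:
            raise ConnectionError("injected failure")
        if roll < failure_rate:
            await asyncio.sleep(delay_s)  # injected latency
        return await stage_fn(*args, **kwargs)
    return wrapped

async def stt_stage(audio: bytes) -> str:
    return "hello"

# rng pinned to 0.0 guarantees the error branch fires
flaky_stt = chaos_wrap(stt_stage, failure_rate=1.0, rng=lambda: 0.0)
try:
    asyncio.run(flaky_stt(b""))
except ConnectionError as e:
    print(f"caught: {e}")  # caught: injected failure
```

Wrap each stage (STT, LLM, TTS) this way in your test environment and assert that the agent still responds within its latency budget.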
#ErrorRecovery #VoiceAI #Resilience #RetryStrategies #GracefulDegradation #FaultTolerance #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.