Fallback Model Chains: Automatic Failover Between LLM Providers
Build automatic failover systems that seamlessly switch between LLM providers when your primary model is unavailable. Learn provider health checks, quality comparison, and cost-aware routing.
Why Single-Provider Agents Are a Liability
If your AI agent depends on a single LLM provider and that provider goes down, your entire product stops. OpenAI, Anthropic, and Google all experience outages. Rate limits spike during peak hours. Regional networking issues block API calls from specific geographies.
A fallback model chain is an ordered list of LLM providers that your agent tries in sequence. If the primary fails, the agent automatically routes to the next provider with minimal latency impact and no user-visible error.
Designing the Provider Abstraction
The first step is abstracting the LLM call behind a uniform interface so your agent code never references a specific provider.
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Optional
import httpx
import time
@dataclass
class LLMResponse:
content: str
model: str
provider: str
latency_ms: float
input_tokens: int = 0
output_tokens: int = 0
class LLMProvider(ABC):
def __init__(self, name: str, api_key: str, model: str, cost_per_1k_tokens: float):
self.name = name
self.api_key = api_key
self.model = model
self.cost_per_1k_tokens = cost_per_1k_tokens
self.healthy = True
self.last_failure: float = 0
@abstractmethod
async def complete(self, messages: list[dict], temperature: float = 0.7) -> LLMResponse:
pass
def mark_unhealthy(self):
self.healthy = False
self.last_failure = time.time()
def should_retry_health(self, cooldown: float = 60.0) -> bool:
return time.time() - self.last_failure >= cooldown
Implementing Provider-Specific Adapters
Each provider gets a thin adapter that translates between the universal interface and the provider-specific API.
class OpenAIProvider(LLMProvider):
async def complete(self, messages: list[dict], temperature: float = 0.7) -> LLMResponse:
start = time.time()
async with httpx.AsyncClient() as client:
resp = await client.post(
"https://api.openai.com/v1/chat/completions",
json={"model": self.model, "messages": messages, "temperature": temperature},
headers={"Authorization": f"Bearer {self.api_key}"},
timeout=30.0,
)
resp.raise_for_status()
data = resp.json()
return LLMResponse(
content=data["choices"][0]["message"]["content"],
model=self.model,
provider=self.name,
latency_ms=(time.time() - start) * 1000,
input_tokens=data["usage"]["prompt_tokens"],
output_tokens=data["usage"]["completion_tokens"],
)
class AnthropicProvider(LLMProvider):
async def complete(self, messages: list[dict], temperature: float = 0.7) -> LLMResponse:
start = time.time()
async with httpx.AsyncClient() as client:
resp = await client.post(
"https://api.anthropic.com/v1/messages",
json={
"model": self.model,
"max_tokens": 4096,
"messages": messages,
"temperature": temperature,
},
headers={
"x-api-key": self.api_key,
"anthropic-version": "2023-06-01",
},
timeout=30.0,
)
resp.raise_for_status()
data = resp.json()
return LLMResponse(
content=data["content"][0]["text"],
model=self.model,
provider=self.name,
latency_ms=(time.time() - start) * 1000,
input_tokens=data["usage"]["input_tokens"],
output_tokens=data["usage"]["output_tokens"],
)
The Failover Chain
The chain tries each provider in priority order. Failed providers are marked unhealthy and periodically re-checked.
See AI Voice Agents Handle Real Calls
Book a free demo or calculate how much you can save with AI voice automation.
import logging
logger = logging.getLogger("agent.failover")
class FailoverChain:
def __init__(self, providers: list[LLMProvider]):
self.providers = providers
async def complete(self, messages: list[dict], temperature: float = 0.7) -> LLMResponse:
errors = []
for provider in self.providers:
if not provider.healthy:
if provider.should_retry_health():
logger.info(f"Re-checking health of {provider.name}")
else:
continue
try:
response = await provider.complete(messages, temperature)
if not provider.healthy:
provider.healthy = True
logger.info(f"{provider.name} recovered")
return response
except Exception as exc:
provider.mark_unhealthy()
errors.append((provider.name, exc))
logger.warning(f"{provider.name} failed: {exc}, trying next")
error_summary = "; ".join(f"{name}: {exc}" for name, exc in errors)
raise RuntimeError(f"All providers failed: {error_summary}")
# Usage
chain = FailoverChain([
OpenAIProvider("openai", "sk-...", "gpt-4o", cost_per_1k_tokens=0.03),
AnthropicProvider("anthropic", "sk-ant-...", "claude-sonnet-4-20250514", cost_per_1k_tokens=0.015),
])
Cost-Aware Routing
In non-emergency situations, you may prefer the cheapest healthy provider instead of strict priority ordering. Add a routing mode to the chain that sorts healthy providers by cost before iterating.
class SmartFailoverChain(FailoverChain):
def __init__(self, providers: list[LLMProvider], strategy: str = "priority"):
super().__init__(providers)
self.strategy = strategy
async def complete(self, messages: list[dict], temperature: float = 0.7) -> LLMResponse:
if self.strategy == "cost":
self.providers.sort(key=lambda p: p.cost_per_1k_tokens)
return await super().complete(messages, temperature)
FAQ
How do I handle different prompt formats between providers?
Use a message normalization layer that converts your internal message format to each provider's expected format. OpenAI and Anthropic use slightly different schemas for system messages and tool definitions. The adapter pattern shown above is the natural place to put this translation logic.
What if the fallback model produces lower quality output?
Track quality metrics per provider — for example, average user satisfaction or task completion rate. If the fallback model consistently underperforms for certain tasks, consider maintaining task-specific chains where critical tasks always route to the highest-quality provider and only less-critical tasks accept the lower-quality fallback.
Should I run health checks proactively or only on failure?
Both. Reactive health marking (on failure) provides immediate protection. Proactive health checks using a lightweight ping or minimal completion request (run on a timer every 30-60 seconds) let you detect recovery faster and avoid sending real user requests as the first test against a potentially still-broken provider.
#LLMFailover #ModelChains #ProviderRouting #Resilience #Python #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.