LLM API Gateway Design Patterns: Rate Limiting, Caching, and Fallbacks
Design patterns for building a production LLM API gateway — including intelligent rate limiting, semantic caching, provider fallbacks, and request routing for multi-model deployments.
Why LLM Applications Need a Specialized Gateway
Standard API gateways handle authentication, rate limiting, and routing for traditional APIs. LLM APIs have additional requirements that standard gateways do not address:
- Token-based billing: Costs scale with input/output tokens, not request count
- Variable latency: Streaming responses can take 5-30 seconds
- Multi-provider routing: Most production systems use multiple LLM providers (OpenAI, Anthropic, Google) for redundancy and cost optimization
- Semantic-aware caching: Queries with the same meaning should hit the cache even when they are worded slightly differently
- Content safety: Inputs and outputs may need content filtering before reaching the LLM or the user
An LLM API gateway sits between your application and LLM providers, handling these concerns in a single layer.
Core Pattern 1: Token-Aware Rate Limiting
Standard rate limiters count requests. LLM rate limiters need to count tokens, because a single request with a 100K context window costs 100x more than a simple query.
import time

from redis.asyncio import Redis


class TokenAwareRateLimiter:
    def __init__(self, redis: Redis, tenant_limits: dict[str, int] | None = None):
        self.redis = redis
        self.tenant_limits = tenant_limits or {}  # Per-tenant token limits

    def current_window(self) -> int:
        # Fixed 1-minute windows
        return int(time.time() // 60)

    async def check_and_consume(
        self, tenant_id: str, estimated_tokens: int
    ) -> bool:
        key = f"ratelimit:{tenant_id}:{self.current_window()}"
        current = await self.redis.get(key)
        if current and int(current) + estimated_tokens > self.get_limit(tenant_id):
            return False  # Rate limited
        pipe = self.redis.pipeline()
        pipe.incrby(key, estimated_tokens)
        pipe.expire(key, 60)  # Window expires after 1 minute
        await pipe.execute()
        return True

    def get_limit(self, tenant_id: str) -> int:
        return self.tenant_limits.get(tenant_id, 100_000)  # Default 100K tokens/min
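The limiter above needs an estimated token count before the request is sent. When a provider tokenizer is not available at the gateway, a rough character-based heuristic is often good enough; the sketch below assumes roughly 4 characters per token for English text, which is a common rule of thumb, and reserves budget for the output as well since billing covers both directions.

```python
def estimate_tokens(messages: list[dict], max_output_tokens: int = 1024) -> int:
    # ~4 characters per token is a common heuristic for English text
    input_chars = sum(len(m.get("content", "")) for m in messages)
    estimated_input = input_chars // 4 + 1
    # Reserve budget for the output too, since billing covers both
    return estimated_input + max_output_tokens
```

Overestimating slightly is safer than underestimating: a request that was admitted but blows past its estimate can push a tenant over the limit before the window resets.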
Cost Budgets
Beyond rate limiting, implement cost budgets that track spending per tenant, team, or project. Alert when spending approaches the budget and hard-stop when it is exceeded.
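A minimal sketch of the alert-then-stop behavior described above; the class name, thresholds, and return values are illustrative, and a production version would persist spend in shared storage rather than in memory:

```python
class CostBudget:
    """Tracks spend against a budget; names and thresholds are illustrative."""

    def __init__(self, monthly_limit_usd: float, alert_ratio: float = 0.8):
        self.monthly_limit_usd = monthly_limit_usd
        self.alert_ratio = alert_ratio  # Alert at 80% of budget by default
        self.spent_usd = 0.0

    def record(self, cost_usd: float) -> str:
        self.spent_usd += cost_usd
        if self.spent_usd >= self.monthly_limit_usd:
            return "hard_stop"  # Reject further requests
        if self.spent_usd >= self.monthly_limit_usd * self.alert_ratio:
            return "alert"      # Notify, but keep serving
        return "ok"
```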
Core Pattern 2: Semantic Caching Layer
Cache responses for semantically similar queries to reduce costs and latency.
import time
from dataclasses import dataclass


@dataclass
class CacheResult:
    response: str
    cache_hit: bool


class SemanticCacheLayer:
    def __init__(self, vector_store, ttl_seconds: int = 3600):
        self.vector_store = vector_store
        self.ttl = ttl_seconds

    def extract_cache_key(self, messages: list[dict]) -> str:
        # Key on the last user message; earlier turns rarely match verbatim
        return next(m["content"] for m in reversed(messages) if m["role"] == "user")

    def is_expired(self, result) -> bool:
        return time.time() - result.metadata["timestamp"] > self.ttl

    async def get(self, messages: list[dict], model: str) -> CacheResult | None:
        # Build the cache key from the last user message + model
        cache_query = self.extract_cache_key(messages)
        embedding = await self.embed(cache_query)
        results = await self.vector_store.search(
            embedding, threshold=0.97, filter={"model": model}
        )
        if results and not self.is_expired(results[0]):
            return CacheResult(
                response=results[0].metadata["response"],
                cache_hit=True,
            )
        return None

    async def set(self, messages: list[dict], model: str, response: str):
        cache_query = self.extract_cache_key(messages)
        embedding = await self.embed(cache_query)
        await self.vector_store.insert(
            embedding,
            metadata={"response": response, "model": model, "timestamp": time.time()},
        )
Important: Only cache deterministic, factual queries. Do not cache creative tasks, personalized responses, or time-sensitive queries.
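One way to enforce this rule is a cacheability gate in front of the cache layer. The sketch below is a simple heuristic, not a reliable classifier; the keyword patterns and temperature cutoff are assumptions you would tune for your own traffic:

```python
import re

# Illustrative heuristic: skip caching when the request looks creative,
# personalized, or time-sensitive. The patterns are assumptions, not exhaustive.
SKIP_PATTERNS = re.compile(
    r"\b(write|draft|brainstorm|imagine|today|latest|current|my |our )",
    re.IGNORECASE,
)


def is_cacheable(query: str, temperature: float) -> bool:
    if temperature > 0.3:  # Sampling randomness makes responses diverge
        return False
    return not SKIP_PATTERNS.search(query)
```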
Core Pattern 3: Provider Fallback and Load Balancing
When your primary LLM provider experiences outages or rate limits, automatically fall back to alternatives.
import random
from itertools import groupby


class LLMProviderRouter:
    def __init__(self):
        self.providers = [
            ProviderConfig("anthropic", "claude-sonnet-4", priority=1, weight=0.6),
            ProviderConfig("openai", "gpt-4o", priority=1, weight=0.4),
            ProviderConfig("anthropic", "claude-haiku-4", priority=2, weight=1.0),  # Fallback
        ]
        # Key breakers by (provider, model) so two models from the same
        # provider fail independently
        self.circuit_breakers = {
            (p.name, p.model): CircuitBreaker() for p in self.providers
        }

    def group_by_priority(self) -> list[list[ProviderConfig]]:
        ordered = sorted(self.providers, key=lambda p: p.priority)
        return [list(g) for _, g in groupby(ordered, key=lambda p: p.priority)]

    def weighted_select(self, providers: list[ProviderConfig]) -> ProviderConfig:
        return random.choices(providers, weights=[p.weight for p in providers])[0]

    async def route(self, request: LLMRequest) -> LLMResponse:
        # Group by priority, try highest priority first
        for priority_group in self.group_by_priority():
            available = [
                p for p in priority_group
                if self.circuit_breakers[(p.name, p.model)].is_closed()
            ]
            if not available:
                continue
            # Weighted random selection within the priority group
            provider = self.weighted_select(available)
            try:
                response = await provider.complete(request)
                self.circuit_breakers[(provider.name, provider.model)].record_success()
                return response
            except (RateLimitError, TimeoutError, ServerError):
                self.circuit_breakers[(provider.name, provider.model)].record_failure()
                continue
        raise AllProvidersUnavailable()
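The router depends on a CircuitBreaker that the snippet does not define. A minimal sketch under assumed defaults (open after 5 consecutive failures, retry after a 30-second cooldown) might look like this:

```python
import time


class CircuitBreaker:
    """Minimal sketch: opens after N consecutive failures, retries after a cooldown."""

    def __init__(self, failure_threshold: int = 5, cooldown_seconds: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.opened_at: float | None = None

    def is_closed(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: allow traffic again once the cooldown elapses
        if time.time() - self.opened_at >= self.cooldown_seconds:
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.time()
```

A fuller implementation would distinguish a half-open probe (letting one request through before fully closing) from a closed breaker, but this captures the behavior the router relies on.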
Core Pattern 4: Request/Response Transformation
Normalize requests and responses across providers so your application code does not need provider-specific logic.
The gateway translates between a unified internal format and each provider's API format:
- Normalize message formats (OpenAI's messages array vs. Anthropic's format)
- Map model names to provider-specific identifiers
- Standardize tool/function calling formats
- Normalize streaming event formats
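The message-format translation above can be sketched as a single function that takes a unified message list and emits a provider-shaped payload. The field shapes follow the public OpenAI and Anthropic chat APIs (OpenAI accepts system messages inline; Anthropic takes the system prompt as a separate top-level field); the MODEL_MAP values are illustrative:

```python
# Illustrative model-name mapping; real deployments would load this from config
MODEL_MAP = {
    "openai": {"default": "gpt-4o"},
    "anthropic": {"default": "claude-sonnet-4"},
}


def to_provider_format(messages: list[dict], provider: str) -> dict:
    if provider == "openai":
        # OpenAI accepts system messages inline in the messages array
        return {"model": MODEL_MAP["openai"]["default"], "messages": messages}
    if provider == "anthropic":
        # Anthropic takes the system prompt as a separate top-level field
        system = "\n".join(m["content"] for m in messages if m["role"] == "system")
        chat = [m for m in messages if m["role"] != "system"]
        return {
            "model": MODEL_MAP["anthropic"]["default"],
            "system": system,
            "messages": chat,
        }
    raise ValueError(f"Unknown provider: {provider}")
```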
Core Pattern 5: Observability and Logging
Every request through the gateway should be logged with:
- Request/response token counts
- Cost calculation (based on model pricing)
- Latency breakdown (queue time, TTFT, total)
- Cache hit/miss status
- Provider used (primary vs. fallback)
- Content safety filter results
Structured Logging
{
"trace_id": "abc-123",
"tenant_id": "tenant-456",
"model_requested": "claude-sonnet-4",
"provider_used": "anthropic",
"input_tokens": 1523,
"output_tokens": 487,
"cost_usd": 0.0061,
"latency_ms": 2340,
"ttft_ms": 890,
"cache_hit": false,
"fallback_used": false
}
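The cost_usd field in a record like this is derived from the token counts and per-model pricing. A sketch of that calculation, with made-up per-million-token prices (real pricing changes over time, so load it from configuration rather than hardcoding):

```python
# Illustrative per-million-token prices; not actual provider pricing
PRICING_USD_PER_MTOK = {
    "claude-sonnet-4": {"input": 3.00, "output": 15.00},
    "gpt-4o": {"input": 2.50, "output": 10.00},
}


def calculate_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    prices = PRICING_USD_PER_MTOK[model]
    cost = (
        input_tokens * prices["input"] / 1_000_000
        + output_tokens * prices["output"] / 1_000_000
    )
    return round(cost, 6)
```

Computing cost at log time, rather than reconciling against provider invoices later, is what makes the per-tenant budgets from Pattern 1 enforceable in real time.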
Existing Solutions
Before building your own gateway, evaluate existing options:
- LiteLLM: Open-source proxy supporting 100+ LLM providers with a unified OpenAI-compatible API
- Portkey: Managed LLM gateway with built-in caching, fallbacks, and observability
- Helicone: Observability-focused LLM proxy with cost tracking and prompt management
For most teams, starting with LiteLLM and adding custom middleware for your specific needs is the fastest path to production.