
LLM API Gateway Design Patterns: Rate Limiting, Caching, and Fallbacks

Design patterns for building a production LLM API gateway — including intelligent rate limiting, semantic caching, provider fallbacks, and request routing for multi-model deployments.

Why LLM Applications Need a Specialized Gateway

Standard API gateways handle authentication, rate limiting, and routing for traditional APIs. LLM APIs have additional requirements that standard gateways do not address:

  • Token-based billing: Costs scale with input/output tokens, not request count
  • Variable latency: Streaming responses can take 5-30 seconds
  • Multi-provider routing: Most production systems use multiple LLM providers (OpenAI, Anthropic, Google) for redundancy and cost optimization
  • Semantic-aware caching: Semantically equivalent queries should hit the cache even when they are worded differently
  • Content safety: Inputs and outputs may need content filtering before reaching the LLM or the user

An LLM API gateway sits between your application and LLM providers, handling these concerns in a single layer.

Core Pattern 1: Token-Aware Rate Limiting

Standard rate limiters count requests. LLM rate limiters need to count tokens, because a single request that fills a 100K-token context window can cost hundreds of times more than a short query.

import time

from redis.asyncio import Redis


class TokenAwareRateLimiter:
    def __init__(self, redis: Redis, tenant_limits: dict[str, int] | None = None):
        self.redis = redis
        self.tenant_limits = tenant_limits or {}

    def current_window(self) -> int:
        # Fixed 60-second windows keyed by epoch minute
        return int(time.time() // 60)

    async def check_and_consume(
        self, tenant_id: str, estimated_tokens: int
    ) -> bool:
        key = f"ratelimit:{tenant_id}:{self.current_window()}"

        # INCRBY is atomic, so consume first and inspect the new total --
        # a separate GET-then-INCRBY would race under concurrent requests
        pipe = self.redis.pipeline()
        pipe.incrby(key, estimated_tokens)
        pipe.expire(key, 120)  # Keep the counter slightly past the 1-minute window
        new_total, _ = await pipe.execute()

        if new_total > self.get_limit(tenant_id):
            # Refund the tokens we optimistically consumed
            await self.redis.decrby(key, estimated_tokens)
            return False  # Rate limited
        return True

    def get_limit(self, tenant_id: str) -> int:
        # Per-tenant token limits
        return self.tenant_limits.get(tenant_id, 100_000)  # Default 100K/min

Cost Budgets

Beyond rate limiting, implement cost budgets that track spending per tenant, team, or project. Alert when spending approaches the budget and hard-stop when it is exceeded.
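A minimal budget tracker might look like the following sketch. The class name, thresholds, and the in-memory counter are illustrative; in production the spend counter would live in shared storage (e.g. Redis) alongside the rate limiter.

```python
from dataclasses import dataclass


@dataclass
class CostBudget:
    """Tracks spend for one tenant against a monthly limit (illustrative)."""
    monthly_limit_usd: float
    alert_threshold: float = 0.8  # Warn at 80% of budget
    spent_usd: float = 0.0

    def record(self, cost_usd: float) -> str:
        """Record a request's cost and return the budget state."""
        self.spent_usd += cost_usd
        if self.spent_usd >= self.monthly_limit_usd:
            return "exceeded"   # Hard-stop further requests
        if self.spent_usd >= self.monthly_limit_usd * self.alert_threshold:
            return "warning"    # Fire an alert, keep serving
        return "ok"


budget = CostBudget(monthly_limit_usd=100.0)
print(budget.record(50.0))  # 50.0 spent -> "ok"
print(budget.record(35.0))  # 85.0 spent -> "warning"
print(budget.record(20.0))  # 105.0 spent -> "exceeded"
```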

Core Pattern 2: Semantic Caching Layer

Cache responses for semantically similar queries to reduce costs and latency.

import time
from dataclasses import dataclass


@dataclass
class CacheResult:
    response: str
    cache_hit: bool


class SemanticCacheLayer:
    def __init__(self, vector_store, ttl_seconds: int = 3600):
        self.vector_store = vector_store
        self.ttl = ttl_seconds

    async def get(self, messages: list[dict], model: str) -> CacheResult | None:
        # Create cache key from the last user message + model
        cache_query = self.extract_cache_key(messages)
        embedding = await self.embed(cache_query)

        # A high similarity threshold (0.97) keeps false-positive hits rare
        results = await self.vector_store.search(
            embedding, threshold=0.97, filter={"model": model}
        )

        if results and not self.is_expired(results[0]):
            return CacheResult(
                response=results[0].metadata["response"],
                cache_hit=True
            )
        return None

    async def set(self, messages: list[dict], model: str, response: str):
        cache_query = self.extract_cache_key(messages)
        embedding = await self.embed(cache_query)
        await self.vector_store.insert(
            embedding,
            metadata={"response": response, "model": model, "timestamp": time.time()}
        )

    def is_expired(self, result) -> bool:
        return time.time() - result.metadata["timestamp"] > self.ttl

    def extract_cache_key(self, messages: list[dict]) -> str:
        # Key on the last user message; include the system prompt if it varies
        return next(
            (m["content"] for m in reversed(messages) if m["role"] == "user"), ""
        )

    async def embed(self, text: str) -> list[float]:
        ...  # Call your embedding model of choice

Important: Only cache deterministic, factual queries. Do not cache creative tasks, personalized responses, or time-sensitive queries.
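One way to enforce this is a cacheability gate in front of the cache layer. The heuristics below (a hypothetical `should_cache` helper, the temperature cutoff, and the keyword pattern) are illustrative starting points, not a complete policy:

```python
import re

# Queries mentioning these terms go stale quickly, so skip the cache
TIME_SENSITIVE = re.compile(
    r"\b(today|now|current|latest|this (week|month|year))\b", re.IGNORECASE
)


def should_cache(prompt: str, temperature: float = 0.0,
                 user_specific: bool = False) -> bool:
    if user_specific:
        return False  # Personalized responses must never be shared across users
    if temperature > 0.3:
        return False  # High temperature implies creative, non-deterministic output
    if TIME_SENSITIVE.search(prompt):
        return False  # "What's the latest...?" answers go stale immediately
    return True


print(should_cache("What is the capital of France?"))    # True
print(should_cache("What's the latest AI news?"))        # False
print(should_cache("Write me a poem", temperature=0.9))  # False
```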


Core Pattern 3: Provider Fallback and Load Balancing

When your primary LLM provider experiences outages or rate limits, automatically fall back to alternatives.

class LLMProviderRouter:
    def __init__(self):
        self.providers = [
            ProviderConfig("anthropic", "claude-sonnet-4", priority=1, weight=0.6),
            ProviderConfig("openai", "gpt-4o", priority=1, weight=0.4),
            ProviderConfig("anthropic", "claude-haiku-4", priority=2, weight=1.0),  # Fallback
        ]
        # Key breakers by (provider, model): keying by provider name alone
        # would collide for the two anthropic entries
        self.circuit_breakers = {
            (p.name, p.model): CircuitBreaker() for p in self.providers
        }

    async def route(self, request: LLMRequest) -> LLMResponse:
        # Group by priority, try highest priority first
        for priority_group in self.group_by_priority():
            available = [
                p for p in priority_group
                if self.circuit_breakers[(p.name, p.model)].is_closed()
            ]
            if not available:
                continue

            # Weighted random selection within priority group
            provider = self.weighted_select(available)
            try:
                response = await provider.complete(request)
                self.circuit_breakers[(provider.name, provider.model)].record_success()
                return response
            except (RateLimitError, TimeoutError, ServerError):
                self.circuit_breakers[(provider.name, provider.model)].record_failure()
                continue

        raise AllProvidersUnavailable()
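The `CircuitBreaker` referenced above can be as simple as a consecutive-failure counter with a cooldown. This is a minimal sketch (the threshold and timeout values are illustrative); production breakers usually add an explicit half-open state that limits trial traffic:

```python
import time


class CircuitBreaker:
    """Opens after N consecutive failures; allows retries after a cooldown."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at: float | None = None

    def is_closed(self) -> bool:
        if self.opened_at is None:
            return True
        # After the cooldown, let requests through again to probe recovery
        return time.time() - self.opened_at >= self.reset_timeout

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.time()
```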

Core Pattern 4: Request/Response Transformation

Normalize requests and responses across providers so your application code does not need provider-specific logic.

The gateway translates between a unified internal format and each provider's API format:

  • Normalize message formats (OpenAI's messages array vs. Anthropic's format)
  • Map model names to provider-specific identifiers
  • Standardize tool/function calling formats
  • Normalize streaming event formats
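As a concrete instance of the first bullet: Anthropic's Messages API takes the system prompt as a top-level `system` parameter, while OpenAI keeps it inside the `messages` array. A hedged sketch of the translation (payload shape simplified to just these two fields):

```python
def to_anthropic(messages: list[dict]) -> dict:
    """Convert an OpenAI-style messages array to Anthropic's shape:
    system prompt lifted out, remaining turns passed through."""
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    chat = [m for m in messages if m["role"] != "system"]
    payload = {"messages": chat}
    if system_parts:
        payload["system"] = "\n".join(system_parts)
    return payload


unified = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello"},
]
print(to_anthropic(unified))
# {'messages': [{'role': 'user', 'content': 'Hello'}],
#  'system': 'You are a helpful assistant.'}
```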

Core Pattern 5: Observability and Logging

Every request through the gateway should be logged with:

  • Request/response token counts
  • Cost calculation (based on model pricing)
  • Latency breakdown (queue time, TTFT, total)
  • Cache hit/miss status
  • Provider used (primary vs. fallback)
  • Content safety filter results

Structured Logging

{
  "trace_id": "abc-123",
  "tenant_id": "tenant-456",
  "model_requested": "claude-sonnet-4",
  "provider_used": "anthropic",
  "input_tokens": 1523,
  "output_tokens": 487,
  "cost_usd": 0.0061,
  "latency_ms": 2340,
  "ttft_ms": 890,
  "cache_hit": false,
  "fallback_used": false
}
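The `cost_usd` field is derived from the token counts and a per-model price table. The prices below are placeholders for illustration only; always pull current rates from your providers' pricing pages:

```python
# Illustrative USD prices per 1M tokens -- NOT current provider pricing
PRICING = {
    "claude-sonnet-4": {"input": 3.00, "output": 15.00},
    "gpt-4o": {"input": 2.50, "output": 10.00},
}


def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICING[model]
    cost = (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000
    return round(cost, 6)


print(calculate_cost("claude-sonnet-4", 1523, 487))  # 0.011874
```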

Existing Solutions

Before building your own gateway, evaluate existing options:

  • LiteLLM: Open-source proxy supporting 100+ LLM providers with a unified OpenAI-compatible API
  • Portkey: Managed LLM gateway with built-in caching, fallbacks, and observability
  • Helicone: Observability-focused LLM proxy with cost tracking and prompt management

For most teams, starting with LiteLLM and adding custom middleware for your specific needs is the fastest path to production.
