AI Gateway Patterns: Centralizing LLM Access Across Your Organization
Learn how to build and deploy an AI gateway that centralizes LLM access with unified authentication, rate limiting, cost tracking, and provider abstraction for enterprise teams.
The Problem: LLM Sprawl
As organizations adopt AI across teams, a familiar pattern emerges: each team creates its own API keys, builds its own prompt pipelines, and integrates directly with LLM providers. Within months, the organization faces:
- No cost visibility: Nobody knows who is spending what on which models
- Inconsistent security: API keys scattered across repositories and environment variables
- No rate limiting: One team's burst traffic affects everyone's rate limits
- Provider lock-in: Every team is tightly coupled to a specific provider
- No audit trail: No centralized log of what data is being sent to LLMs
An AI gateway addresses these problems by providing a single, managed entry point for all LLM traffic across the organization.
AI Gateway Architecture
An AI gateway sits between your applications and LLM providers, acting as a reverse proxy with AI-specific capabilities:
[Team A App] ---+
                |
[Team B App] ---+---> [AI Gateway] ---+---> [Anthropic API]
                |          |          +---> [OpenAI API]
[Team C App] ---+          |          +---> [Self-hosted Models]
                           |
                  [Gateway Features]
                  - Authentication
                  - Rate Limiting
                  - Cost Tracking
                  - Logging & Audit
                  - Caching
                  - Fallback/Routing
Core Components
from fastapi import FastAPI, Request, Depends, HTTPException
from fastapi.middleware.cors import CORSMiddleware
import httpx
import json
import os
import time
import uuid
from typing import Optional
app = FastAPI(title="AI Gateway")
# ─── Authentication ───
async def authenticate(request: Request) -> dict:
"""Validate team API key and return team context."""
api_key = request.headers.get("X-Gateway-Key")
if not api_key:
raise HTTPException(status_code=401, detail="Missing X-Gateway-Key header")
team = await get_team_by_key(api_key)
if not team:
raise HTTPException(status_code=401, detail="Invalid API key")
return team
# ─── Rate Limiting ───
class RateLimiter:
def __init__(self, redis_client):
self.redis = redis_client
async def check_rate_limit(self, team_id: str, model: str) -> bool:
"""Check if team is within their rate limits."""
key = f"rate:{team_id}:{model}:{int(time.time()) // 60}"
current = await self.redis.incr(key)
await self.redis.expire(key, 120) # 2 minute TTL
limit = await self.get_team_limit(team_id, model)
if current > limit:
return False
return True
async def get_team_limit(self, team_id: str, model: str) -> int:
"""Get per-minute rate limit for team and model."""
limits = await self.redis.hget(f"limits:{team_id}", model)
return int(limits) if limits else 60 # default: 60 RPM
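The limiter above uses a fixed one-minute window: the Redis key embeds `time // 60`, so each team/model pair gets one counter per minute. A standalone version of the key scheme makes the rollover behavior (and the classic fixed-window caveat) easy to see:

```python
def minute_bucket_key(team_id: str, model: str, now: float) -> str:
    """Reproduce the gateway's fixed-window rate key: one counter per
    team/model/minute. Counters roll over at minute boundaries, so a
    burst straddling a boundary can briefly pass up to 2x the limit --
    acceptable for most gateways, or swap in a sliding window if not."""
    return f"rate:{team_id}:{model}:{int(now) // 60}"

# Two timestamps inside the same minute share one counter:
k1 = minute_bucket_key("team-a", "gpt-4o", 120.0)
k2 = minute_bucket_key("team-a", "gpt-4o", 179.9)
# The next minute gets a fresh counter:
k3 = minute_bucket_key("team-a", "gpt-4o", 180.0)
```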
# ─── Cost Tracking ───
class CostTracker:
PRICING = {
"claude-sonnet-4-20250514": {"input": 3.0, "output": 15.0},
"claude-haiku-3-5-20241022": {"input": 0.8, "output": 4.0},
"gpt-4o": {"input": 2.5, "output": 10.0},
}
async def record_usage(self, team_id: str, model: str,
input_tokens: int, output_tokens: int):
"""Record token usage and calculate cost."""
pricing = self.PRICING.get(model, {"input": 0, "output": 0})
cost = (input_tokens / 1_000_000 * pricing["input"] +
output_tokens / 1_000_000 * pricing["output"])
await db.execute(
"""INSERT INTO usage_log (team_id, model, input_tokens, output_tokens,
cost_usd, timestamp) VALUES ($1, $2, $3, $4, $5, NOW())""",
team_id, model, input_tokens, output_tokens, cost
)
return cost
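The per-request cost arithmetic is simple enough to sanity-check by hand. The same calculation as `record_usage`, pulled out as a pure function (without the database write):

```python
# USD per million tokens, mirroring CostTracker.PRICING above.
PRICING = {
    "claude-sonnet-4-20250514": {"input": 3.0, "output": 15.0},
    "gpt-4o": {"input": 2.5, "output": 10.0},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Same arithmetic as CostTracker.record_usage; unknown models cost 0."""
    p = PRICING.get(model, {"input": 0, "output": 0})
    return (input_tokens / 1_000_000 * p["input"]
            + output_tokens / 1_000_000 * p["output"])

# 10k input + 2k output on Sonnet: 0.01 * 3.0 + 0.002 * 15.0 = $0.06
cost = estimate_cost("claude-sonnet-4-20250514", 10_000, 2_000)
```

Defaulting unknown models to zero cost keeps requests flowing when a new model is added before its pricing entry, at the price of undercounting spend -- some gateways prefer to reject unknown models instead.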
# ─── Provider Routing ───
class ProviderRouter:
"""Route requests to the appropriate LLM provider."""
PROVIDER_MAP = {
"claude-sonnet-4-20250514": {
"provider": "anthropic",
"url": "https://api.anthropic.com/v1/messages",
},
"claude-haiku-3-5-20241022": {
"provider": "anthropic",
"url": "https://api.anthropic.com/v1/messages",
},
"gpt-4o": {
"provider": "openai",
"url": "https://api.openai.com/v1/chat/completions",
},
}
async def route(self, model: str, request_body: dict) -> dict:
config = self.PROVIDER_MAP.get(model)
if not config:
raise HTTPException(status_code=400, detail=f"Unsupported model: {model}")
if config["provider"] == "anthropic":
return await self._call_anthropic(config["url"], request_body)
elif config["provider"] == "openai":
return await self._call_openai(config["url"], request_body)
    async def _call_anthropic(self, url: str, body: dict) -> dict:
        async with httpx.AsyncClient() as client:
            response = await client.post(
                url,
                headers={
                    "x-api-key": os.environ["ANTHROPIC_API_KEY"],
                    "anthropic-version": "2023-06-01",
                    "content-type": "application/json",
                },
                json=body,
                timeout=120.0,
            )
            response.raise_for_status()
            return response.json()

    async def _call_openai(self, url: str, body: dict) -> dict:
        async with httpx.AsyncClient() as client:
            response = await client.post(
                url,
                headers={
                    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
                    "content-type": "application/json",
                },
                json=body,
                timeout=120.0,
            )
            response.raise_for_status()
            return response.json()
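One wrinkle the router has to absorb: providers do not share a usage schema. Anthropic responses report `usage.input_tokens`/`usage.output_tokens`, while OpenAI chat completions report `usage.prompt_tokens`/`usage.completion_tokens`. Since the cost-tracking code downstream reads the Anthropic-style names, a small normalization step keeps OpenAI usage from silently counting as zero:

```python
def normalize_usage(provider: str, usage: dict) -> dict:
    """Map provider-specific usage fields onto the gateway's canonical
    input_tokens/output_tokens shape. Anthropic already uses these names;
    OpenAI chat completions use prompt_tokens/completion_tokens."""
    if provider == "openai":
        return {
            "input_tokens": usage.get("prompt_tokens", 0),
            "output_tokens": usage.get("completion_tokens", 0),
        }
    return {
        "input_tokens": usage.get("input_tokens", 0),
        "output_tokens": usage.get("output_tokens", 0),
    }

norm = normalize_usage("openai", {"prompt_tokens": 42, "completion_tokens": 7})
```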
The Main Gateway Endpoint
@app.post("/v1/chat")
async def gateway_chat(request: Request, team: dict = Depends(authenticate)):
"""Main gateway endpoint -- proxies to the appropriate LLM provider."""
body = await request.json()
model = body.get("model", "claude-sonnet-4-20250514")
# Rate limiting
if not await rate_limiter.check_rate_limit(team["id"], model):
raise HTTPException(status_code=429, detail="Rate limit exceeded")
# Budget check
if team.get("monthly_budget"):
current_spend = await cost_tracker.get_monthly_spend(team["id"])
if current_spend >= team["monthly_budget"]:
raise HTTPException(status_code=402, detail="Monthly budget exhausted")
# Audit logging (before sending to provider)
request_id = str(uuid.uuid4())
await audit_logger.log_request(request_id, team["id"], model, body)
# PII detection (optional)
if team.get("pii_detection_enabled"):
pii_findings = await detect_pii(body)
if pii_findings:
await audit_logger.log_pii_warning(request_id, pii_findings)
# Route to provider
start_time = time.time()
response = await router.route(model, body)
latency = time.time() - start_time
    # Track costs (assumes Anthropic-style usage fields; OpenAI responses use
    # prompt_tokens/completion_tokens, so normalize usage per provider first)
    input_tokens = response.get("usage", {}).get("input_tokens", 0)
    output_tokens = response.get("usage", {}).get("output_tokens", 0)
cost = await cost_tracker.record_usage(team["id"], model, input_tokens, output_tokens)
# Audit logging (after response)
await audit_logger.log_response(request_id, latency, input_tokens, output_tokens, cost)
# Add gateway metadata to response
response["_gateway"] = {
"request_id": request_id,
"latency_ms": round(latency * 1000),
"cost_usd": round(cost, 6),
}
return response
Fallback and Resilience Patterns
A critical gateway feature is automatic failover when a provider experiences issues:
class ResilientRouter:
"""Router with automatic failover between providers."""
FALLBACK_CHAIN = {
"claude-sonnet-4-20250514": ["claude-sonnet-4-20250514", "gpt-4o"],
"gpt-4o": ["gpt-4o", "claude-sonnet-4-20250514"],
}
async def route_with_fallback(self, model: str, body: dict) -> dict:
chain = self.FALLBACK_CHAIN.get(model, [model])
last_error = None
for fallback_model in chain:
try:
response = await self.route(fallback_model, body)
if fallback_model != model:
logger.warning(f"Used fallback model {fallback_model} for {model}")
return response
except (httpx.TimeoutException, httpx.HTTPStatusError) as e:
last_error = e
logger.error(f"Provider error for {fallback_model}: {e}")
continue
raise HTTPException(status_code=502, detail=f"All providers failed: {last_error}")
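The fallback loop is easy to exercise in isolation. A toy version with a stubbed `route()` shows a request landing on the fallback model when the primary's provider is down (`DemoRouter` and its failure simulation are illustrative only, not part of the gateway):

```python
import asyncio

class DemoRouter:
    """Toy stand-in for ResilientRouter: same fallback loop, but route()
    is stubbed so failover can be exercised without any network calls."""
    FALLBACK_CHAIN = {
        "claude-sonnet-4-20250514": ["claude-sonnet-4-20250514", "gpt-4o"],
    }

    def __init__(self, failing: set):
        self.failing = failing  # models whose provider is "down"

    async def route(self, model: str, body: dict) -> dict:
        if model in self.failing:
            raise RuntimeError(f"{model} unavailable")
        return {"model": model, "content": "ok"}

    async def route_with_fallback(self, model: str, body: dict) -> dict:
        last_error = None
        for fallback_model in self.FALLBACK_CHAIN.get(model, [model]):
            try:
                return await self.route(fallback_model, body)
            except RuntimeError as e:
                last_error = e
        raise RuntimeError(f"All providers failed: {last_error}")

# Primary provider is down, so the request lands on the fallback model:
router = DemoRouter(failing={"claude-sonnet-4-20250514"})
result = asyncio.run(router.route_with_fallback("claude-sonnet-4-20250514", {}))
```

One design note the toy version surfaces: falling back across providers changes model behavior mid-incident, so fallback chains should pair models of comparable capability and compatible prompt formats.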
Self-Service Admin Dashboard
The gateway should include an admin API for teams to manage their own usage:
@app.get("/admin/usage")
async def get_usage(team: dict = Depends(authenticate)):
"""Get team's usage statistics."""
return {
"team_id": team["id"],
"current_month": {
"total_requests": await cost_tracker.get_monthly_requests(team["id"]),
"total_tokens": await cost_tracker.get_monthly_tokens(team["id"]),
"total_cost_usd": await cost_tracker.get_monthly_spend(team["id"]),
"budget_remaining": team.get("monthly_budget", 0) -
await cost_tracker.get_monthly_spend(team["id"]),
},
"by_model": await cost_tracker.get_monthly_breakdown_by_model(team["id"]),
"daily_trend": await cost_tracker.get_daily_trend(team["id"], days=30),
}
@app.get("/admin/audit-log")
async def get_audit_log(team: dict = Depends(authenticate),
limit: int = 100, offset: int = 0):
"""Get team's audit log."""
return await audit_logger.get_team_logs(team["id"], limit, offset)
Open Source AI Gateway Options
Several open-source projects provide AI gateway functionality out of the box:
| Project | Language | Key Features |
|---|---|---|
| LiteLLM Proxy | Python | 100+ LLM providers, cost tracking, key management |
| Portkey | TypeScript | Caching, fallbacks, load balancing, observability |
| Kong AI Gateway | Lua/Go | Enterprise API gateway with AI plugins |
| Cloudflare AI Gateway | Managed | Caching, rate limiting, analytics |
For most teams, starting with LiteLLM Proxy is the fastest path to a functional AI gateway:
# litellm-config.yaml
model_list:
- model_name: claude-sonnet
litellm_params:
model: claude-sonnet-4-20250514
api_key: os.environ/ANTHROPIC_API_KEY
- model_name: gpt-4o
litellm_params:
model: gpt-4o
api_key: os.environ/OPENAI_API_KEY
general_settings:
  master_key: sk-gateway-master-key  # example only -- load from a secret store in production
database_url: postgresql://user:pass@localhost:5432/litellm
litellm_settings:
cache: true
cache_params:
type: redis
host: localhost
port: 6379
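Once the proxy is running, applications call it through its OpenAI-compatible chat completions API, authenticating with their issued key. A sketch of the request shape (the base URL assumes LiteLLM's default local port, and in practice teams would use per-team virtual keys rather than the master key):

```python
def proxy_request(model: str, messages: list, gateway_key: str,
                  base_url: str = "http://localhost:4000"):
    """Build the URL, headers, and JSON body for a call to a LiteLLM
    proxy's OpenAI-compatible endpoint. Send with any HTTP client.
    The base_url assumes LiteLLM's default local port."""
    return (
        f"{base_url}/v1/chat/completions",
        {"Authorization": f"Bearer {gateway_key}"},
        {"model": model, "messages": messages},
    )

url, headers, body = proxy_request(
    "claude-sonnet",  # the model_name alias from the config above
    [{"role": "user", "content": "Hello"}],
    "sk-gateway-master-key",
)
```

Because the proxy speaks the OpenAI wire format, existing OpenAI SDK clients can be pointed at it by changing only the base URL and API key -- no application code changes.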
Conclusion
An AI gateway is essential infrastructure for any organization running multiple LLM-powered applications. It provides unified authentication, cost visibility, rate limiting, audit logging, and provider abstraction -- all of which become critical as AI adoption scales. Start with an open-source solution like LiteLLM, customize as your needs grow, and make the gateway the single path for all LLM traffic in your organization.