
API Gateway Pattern for AI Agent Microservices: Routing, Auth, and Rate Limiting

Design an API gateway for AI agent microservices that handles intelligent routing, authentication, and rate limiting while keeping backend services focused on their core responsibilities.

Why AI Agent Systems Need an API Gateway

When an AI agent system is split into microservices — a conversation manager, a tool execution engine, a RAG retrieval service, a memory store — clients should not need to know about any of this. A mobile app sending a chat message should hit one endpoint, not three different services in sequence.

An API gateway sits between external clients and internal services. It accepts all incoming requests through a single entry point, handles cross-cutting concerns like authentication and rate limiting, and routes requests to the appropriate backend service. Without a gateway, every microservice must independently implement auth verification, CORS handling, request logging, and rate limiting.

Gateway Architecture for Agent Systems

The gateway for an AI agent system has specific routing needs. A user message might need to reach the conversation service, while an admin request to update tool configurations routes to the tool management service. Streaming LLM responses require WebSocket or SSE support at the gateway level.

Here is a gateway implementation using FastAPI that routes to multiple agent services:

from fastapi import FastAPI, Request, HTTPException, Depends
from fastapi.responses import StreamingResponse
import httpx
import time
from collections import defaultdict

app = FastAPI(title="Agent Gateway")

SERVICE_MAP = {
    "conversation": "http://conversation-manager:8000",
    "tools": "http://tool-execution:8001",
    "rag": "http://rag-retrieval:8002",
    "memory": "http://memory-service:8003",
}

# --- Simple in-process TTL cache for validated API keys ---
class TTLCache:
    def __init__(self):
        self._store: dict[str, tuple[float, dict]] = {}

    async def get(self, key: str) -> dict | None:
        entry = self._store.get(key)
        if entry and time.time() < entry[0]:
            return entry[1]
        self._store.pop(key, None)  # drop expired or missing entries
        return None

    async def set(self, key: str, value: dict, ttl: int) -> None:
        self._store[key] = (time.time() + ttl, value)

auth_cache = TTLCache()

# --- Authentication middleware ---
async def verify_api_key(request: Request) -> dict:
    api_key = request.headers.get("X-API-Key")
    if not api_key:
        raise HTTPException(status_code=401, detail="Missing API key")
    # Validate against auth service or local cache
    client_info = await auth_cache.get(api_key)
    if not client_info:
        async with httpx.AsyncClient() as client:
            resp = await client.post(
                "http://auth-service:8010/validate",
                json={"api_key": api_key},
            )
            if resp.status_code != 200:
                raise HTTPException(status_code=401, detail="Invalid API key")
            client_info = resp.json()
            await auth_cache.set(api_key, client_info, ttl=300)
    return client_info

# --- Rate limiting ---
class RateLimiter:
    def __init__(self, requests_per_minute: int = 60):
        self.rpm = requests_per_minute
        self.windows: dict[str, list[float]] = defaultdict(list)

    def check(self, client_id: str) -> bool:
        now = time.time()
        window = self.windows[client_id]
        # Remove timestamps older than 60 seconds
        self.windows[client_id] = [
            t for t in window if now - t < 60
        ]
        if len(self.windows[client_id]) >= self.rpm:
            return False
        self.windows[client_id].append(now)
        return True

rate_limiter = RateLimiter(requests_per_minute=60)

@app.post("/api/v1/chat")
async def chat_endpoint(
    request: Request,
    client: dict = Depends(verify_api_key),
):
    if not rate_limiter.check(client["client_id"]):
        raise HTTPException(
            status_code=429,
            detail="Rate limit exceeded",
        )
    body = await request.json()
    async with httpx.AsyncClient(timeout=30.0) as http:
        resp = await http.post(
            f"{SERVICE_MAP['conversation']}/handle",
            json={**body, "client_id": client["client_id"]},
        )
    # Surface backend failures as a gateway error instead of passing
    # an upstream error body through as a 200 response
    if resp.status_code >= 500:
        raise HTTPException(status_code=502, detail="Upstream service error")
    return resp.json()

Route Configuration with Path-Based Routing

A clean routing strategy maps URL path prefixes to backend services:


# gateway-routes.yaml
routes:
  - prefix: /api/v1/chat
    service: conversation
    methods: [POST]
    timeout: 30s
    retry:
      max_attempts: 2
      retry_on: [502, 503]

  - prefix: /api/v1/tools
    service: tools
    methods: [GET, POST, PUT, DELETE]
    timeout: 10s
    auth_required: true
    roles: [admin]

  - prefix: /api/v1/search
    service: rag
    methods: [POST]
    timeout: 15s
    rate_limit:
      requests_per_minute: 30

  - prefix: /api/v1/memory
    service: memory
    methods: [GET, POST, DELETE]
    timeout: 5s

  - prefix: /api/v1/chat/stream
    service: conversation
    methods: [POST]
    protocol: sse
    timeout: 120s

The gateway reads this configuration at startup and builds its routing table. The protocol: sse flag tells the gateway to handle the response as a server-sent event stream rather than buffering the full response before forwarding it.

Handling Streaming Responses

AI agent systems frequently stream LLM output token by token. The gateway must support this without buffering:

@app.post("/api/v1/chat/stream")
async def chat_stream(
    request: Request,
    client: dict = Depends(verify_api_key),
):
    body = await request.json()

    async def event_generator():
        async with httpx.AsyncClient() as http:
            async with http.stream(
                "POST",
                f"{SERVICE_MAP['conversation']}/handle/stream",
                json={**body, "client_id": client["client_id"]},
                timeout=120.0,
            ) as resp:
                async for chunk in resp.aiter_bytes():
                    yield chunk

    return StreamingResponse(
        event_generator(),
        media_type="text/event-stream",
    )

Load Balancing Across Service Instances

When Kubernetes runs multiple replicas of a backend service, the gateway can rely on Kubernetes Service DNS for basic round-robin load balancing. For more sophisticated strategies — least connections, weighted routing, or canary deployments — use a service mesh like Istio or configure the gateway to maintain its own connection pool.
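If the gateway does maintain its own pool, a least-connections strategy is straightforward to sketch. This is a hypothetical illustration, assuming the gateway tracks in-flight requests per backend instance:

```python
# Illustrative least-connections balancer: each request is routed to the
# instance with the fewest active requests, tracked by the gateway itself.
from collections import defaultdict

class LeastConnectionsBalancer:
    def __init__(self, instances: list[str]):
        self.instances = instances
        self.in_flight: dict[str, int] = defaultdict(int)

    def acquire(self) -> str:
        # Pick the instance with the fewest in-flight requests
        target = min(self.instances, key=lambda i: self.in_flight[i])
        self.in_flight[target] += 1
        return target

    def release(self, instance: str) -> None:
        # Call when the proxied request completes (or fails)
        self.in_flight[instance] -= 1

lb = LeastConnectionsBalancer([
    "http://conversation-manager-0:8000",
    "http://conversation-manager-1:8000",
])
a = lb.acquire()  # all tied at zero, first instance wins
b = lb.acquire()  # the other instance now has fewer connections
lb.release(a)
```

Least connections matters more for agent workloads than for typical CRUD services, because LLM-backed requests have highly variable durations and round-robin can pile slow requests onto one replica.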

FAQ

Should I build a custom gateway or use an off-the-shelf solution like Kong or NGINX?

For most teams, start with an off-the-shelf gateway. Kong, NGINX, or AWS API Gateway handle routing, rate limiting, and auth out of the box. Build a custom gateway only when you need agent-specific logic at the gateway layer — for example, inspecting message content to route to different model backends or implementing custom token-based billing.

How do I handle authentication for WebSocket connections used in real-time agent chat?

Authenticate during the WebSocket handshake. The client sends the API key or JWT as a query parameter or in the initial HTTP upgrade headers. The gateway validates the token before upgrading the connection to WebSocket. Once upgraded, the connection is considered authenticated for its lifetime. Implement periodic re-validation if sessions are long-lived.

What rate limiting strategy works best for AI agent APIs?

Use tiered rate limiting. Apply a global requests-per-minute limit at the gateway level (e.g., 60 RPM). Then apply a separate tokens-per-minute limit at the conversation service level, since a single request to an LLM-powered agent can consume vastly different amounts of compute depending on the input length and output generation.


#APIGateway #Microservices #AgenticAI #Authentication #RateLimiting #LearnAI #AIEngineering


CallSphere Team

Expert insights on AI voice agents and customer communication automation.
