API Gateway Pattern for AI Agent Microservices: Routing, Auth, and Rate Limiting
Design an API gateway for AI agent microservices that handles intelligent routing, authentication, and rate limiting while keeping backend services focused on their core responsibilities.
Why AI Agent Systems Need an API Gateway
When an AI agent system is split into microservices — a conversation manager, a tool execution engine, a RAG retrieval service, a memory store — clients should not need to know about any of this. A mobile app sending a chat message should hit one endpoint, not three different services in sequence.
An API gateway sits between external clients and internal services. It accepts all incoming requests through a single entry point, handles cross-cutting concerns like authentication and rate limiting, and routes requests to the appropriate backend service. Without a gateway, every microservice must independently implement auth verification, CORS handling, request logging, and rate limiting.
Gateway Architecture for Agent Systems
The gateway for an AI agent system has specific routing needs. A user message might need to reach the conversation service, while an admin request to update tool configurations routes to the tool management service. Streaming LLM responses require WebSocket or SSE support at the gateway level.
Here is a gateway implementation using FastAPI that routes to multiple agent services:
```python
from fastapi import FastAPI, Request, HTTPException, Depends
from fastapi.responses import StreamingResponse
import httpx
import time
from collections import defaultdict

app = FastAPI(title="Agent Gateway")

SERVICE_MAP = {
    "conversation": "http://conversation-manager:8000",
    "tools": "http://tool-execution:8001",
    "rag": "http://rag-retrieval:8002",
    "memory": "http://memory-service:8003",
}

# --- In-memory TTL cache for validated API keys ---
class TTLCache:
    def __init__(self):
        self._store: dict[str, tuple[float, dict]] = {}

    async def get(self, key: str) -> dict | None:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.time() > expires_at:
            del self._store[key]
            return None
        return value

    async def set(self, key: str, value: dict, ttl: int = 300) -> None:
        self._store[key] = (time.time() + ttl, value)

auth_cache = TTLCache()

# --- Authentication middleware ---
async def verify_api_key(request: Request) -> dict:
    api_key = request.headers.get("X-API-Key")
    if not api_key:
        raise HTTPException(status_code=401, detail="Missing API key")
    # Validate against auth service, caching results to avoid
    # a network round trip on every request
    client_info = await auth_cache.get(api_key)
    if not client_info:
        async with httpx.AsyncClient() as client:
            resp = await client.post(
                "http://auth-service:8010/validate",
                json={"api_key": api_key},
            )
            if resp.status_code != 200:
                raise HTTPException(status_code=401, detail="Invalid API key")
            client_info = resp.json()
            await auth_cache.set(api_key, client_info, ttl=300)
    return client_info

# --- Rate limiting (sliding window, per process) ---
class RateLimiter:
    def __init__(self, requests_per_minute: int = 60):
        self.rpm = requests_per_minute
        self.windows: dict[str, list[float]] = defaultdict(list)

    def check(self, client_id: str) -> bool:
        now = time.time()
        # Drop timestamps older than 60 seconds, then count what remains
        self.windows[client_id] = [
            t for t in self.windows[client_id] if now - t < 60
        ]
        if len(self.windows[client_id]) >= self.rpm:
            return False
        self.windows[client_id].append(now)
        return True

rate_limiter = RateLimiter(requests_per_minute=60)

@app.post("/api/v1/chat")
async def chat_endpoint(
    request: Request,
    client: dict = Depends(verify_api_key),
):
    if not rate_limiter.check(client["client_id"]):
        raise HTTPException(
            status_code=429,
            detail="Rate limit exceeded",
        )
    body = await request.json()
    async with httpx.AsyncClient(timeout=30.0) as http:
        resp = await http.post(
            f"{SERVICE_MAP['conversation']}/handle",
            json={**body, "client_id": client["client_id"]},
        )
    return resp.json()
```
Route Configuration with Path-Based Routing
A clean routing strategy maps URL path prefixes to backend services:
```yaml
# gateway-routes.yaml
routes:
  - prefix: /api/v1/chat
    service: conversation
    methods: [POST]
    timeout: 30s
    retry:
      max_attempts: 2
      retry_on: [502, 503]
  - prefix: /api/v1/tools
    service: tools
    methods: [GET, POST, PUT, DELETE]
    timeout: 10s
    auth_required: true
    roles: [admin]
  - prefix: /api/v1/search
    service: rag
    methods: [POST]
    timeout: 15s
    rate_limit:
      requests_per_minute: 30
  - prefix: /api/v1/memory
    service: memory
    methods: [GET, POST, DELETE]
    timeout: 5s
  - prefix: /api/v1/chat/stream
    service: conversation
    methods: [POST]
    protocol: sse
    timeout: 120s
```
The gateway reads this configuration at startup and builds its routing table. Matching should prefer the longest prefix, so that a request to /api/v1/chat/stream is not captured by the shorter /api/v1/chat entry. The protocol: sse flag tells the gateway to handle the response as a server-sent event stream rather than buffering the full response before forwarding it.
Handling Streaming Responses
AI agent systems frequently stream LLM output token by token. The gateway must support this without buffering:
```python
@app.post("/api/v1/chat/stream")
async def chat_stream(
    request: Request,
    client: dict = Depends(verify_api_key),
):
    body = await request.json()

    async def event_generator():
        # Open the upstream stream inside the generator so the
        # connection stays alive for the lifetime of the response
        async with httpx.AsyncClient() as http:
            async with http.stream(
                "POST",
                f"{SERVICE_MAP['conversation']}/handle/stream",
                json={**body, "client_id": client["client_id"]},
                timeout=120.0,
            ) as resp:
                async for chunk in resp.aiter_bytes():
                    yield chunk

    return StreamingResponse(
        event_generator(),
        media_type="text/event-stream",
    )
```
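On the client side, the forwarded bytes arrive as newline-delimited SSE frames. A minimal parser for the common "data: ..." line format might look like this (the function name is illustrative, and it assumes complete frames rather than chunks split mid-line):

```python
# Extract the payload of each "data: ..." line from raw SSE bytes.
# Real clients must also buffer partial frames across chunk boundaries.
def parse_sse_events(raw: bytes) -> list[str]:
    events = []
    for line in raw.decode("utf-8").splitlines():
        if line.startswith("data: "):
            events.append(line[len("data: "):])
    return events
```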
Load Balancing Across Service Instances
When Kubernetes runs multiple replicas of a backend service, the gateway can rely on Kubernetes Service DNS for basic round-robin load balancing. For more sophisticated strategies — least connections, weighted routing, or canary deployments — use a service mesh like Istio or configure the gateway to maintain its own connection pool.
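If the gateway maintains its own instance list instead of relying on Service DNS, round-robin selection is a few lines. The instance URLs below are hypothetical; in Kubernetes they would come from endpoint discovery rather than a static map:

```python
import itertools

# Hypothetical replica URLs; a real gateway would refresh this list
# from the Kubernetes Endpoints API as pods come and go.
SERVICE_INSTANCES = {
    "conversation": [
        "http://conversation-manager-0:8000",
        "http://conversation-manager-1:8000",
        "http://conversation-manager-2:8000",
    ],
}

# One persistent round-robin cycle per service
_cycles = {name: itertools.cycle(urls) for name, urls in SERVICE_INSTANCES.items()}

def pick_instance(service: str) -> str:
    return next(_cycles[service])
```

This is where strategies diverge: swapping `itertools.cycle` for a least-connections or weighted picker is straightforward, but keeping the instance list fresh is the hard part, which is why a service mesh is often the better choice.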
FAQ
Should I build a custom gateway or use an off-the-shelf solution like Kong or NGINX?
For most teams, start with an off-the-shelf gateway. Kong, NGINX, or AWS API Gateway handle routing, rate limiting, and auth out of the box. Build a custom gateway only when you need agent-specific logic at the gateway layer — for example, inspecting message content to route to different model backends or implementing custom token-based billing.
How do I handle authentication for WebSocket connections used in real-time agent chat?
Authenticate during the WebSocket handshake. The client sends the API key or JWT as a query parameter or in the initial HTTP upgrade headers. The gateway validates the token before upgrading the connection to WebSocket. Once upgraded, the connection is considered authenticated for its lifetime. Implement periodic re-validation if sessions are long-lived.
What rate limiting strategy works best for AI agent APIs?
Use tiered rate limiting. Apply a global requests-per-minute limit at the gateway level (e.g., 60 RPM). Then apply a separate tokens-per-minute limit at the conversation service level, since a single request to an LLM-powered agent can consume vastly different amounts of compute depending on the input length and output generation.