Building an Agentic AI API Gateway: Routing, Authentication, and Load Balancing
Design an API gateway for agentic AI with multi-model routing, API key management, rate limiting, WebSocket proxy, and health-based routing.
The API Gateway as the Front Door to Your Agent System
Every production agentic AI system needs a single entry point that handles authentication, routes requests to the right agent, manages rate limits, proxies WebSocket connections for streaming responses, and provides a unified interface regardless of how many agents, models, or services sit behind it.
Off-the-shelf API gateways like Kong, APISIX, or AWS API Gateway handle basic HTTP routing well. But agentic AI workloads have unique requirements: long-lived WebSocket connections, model-aware routing, token-based rate limiting (not just request-based), and health checks that verify LLM connectivity. You need a gateway layer designed for these constraints.
At CallSphere, our API gateway handles thousands of concurrent agent conversations across voice, chat, and API channels. This guide covers the architecture patterns we use.
High-Level Architecture
             +-------------------+
             |   Load Balancer   |
             |     (L4 / L7)     |
             +--------+----------+
                      |
             +--------v----------+
             |    API Gateway    |
             |                   |
             | - Auth middleware |
             | - Rate limiter    |
             | - Model router    |
             | - WS proxy        |
             | - Health checker  |
             +--------+----------+
                      |
         +------------+-------------+
         |            |             |
+--------v---+ +------v-----+ +-----v------+
|   Triage   | | Specialist | |    Tool    |
|   Agent    | |   Agents   | | Executors  |
|  Service   | |  Service   | |  Service   |
+------------+ +------------+ +------------+
The gateway sits between the load balancer and your agent services. It is the only component exposed to the internet. All agent services live in a private network.
Multi-Model Routing
Different conversations and different stages within a conversation may require different LLM models. A simple question might be handled by a fast, inexpensive model while a complex reasoning task routes to a more capable one.
Route-Based Model Selection
MODEL_ROUTES = {
    "/v1/agents/triage": {
        "default_model": "claude-3-5-haiku-20241022",
        "escalation_model": "claude-sonnet-4-20250514",
        "service": "triage-agent.agentic-ai.svc.cluster.local:8080",
    },
    "/v1/agents/specialist/*": {
        "default_model": "claude-sonnet-4-20250514",
        "service": "specialist-agent.agentic-ai.svc.cluster.local:8080",
    },
    "/v1/agents/analysis": {
        "default_model": "claude-opus-4-20250514",
        "service": "analysis-agent.agentic-ai.svc.cluster.local:8080",
    },
}
async def route_request(request):
    route_config = match_route(request.path, MODEL_ROUTES)
    if not route_config:
        return JSONResponse(status_code=404, content={"error": "Unknown agent endpoint"})

    requested_tier = request.headers.get("X-Model-Tier", "default")
    model = route_config.get(f"{requested_tier}_model", route_config["default_model"])

    proxied_headers = dict(request.headers)
    proxied_headers["X-Selected-Model"] = model
    proxied_headers["X-Route-Name"] = request.path
    return await proxy_to_service(route_config["service"], request, proxied_headers)
Cost-Aware Model Routing
A more sophisticated approach examines the request content to decide the model tier:
async def cost_aware_route(request, route_config):
    body = await request.json()
    message = body.get("message", "")

    complexity_signals = {
        "high": ["analyze", "compare", "evaluate", "calculate", "research", "complex"],
        "low": ["yes", "no", "ok", "thanks", "hello", "what time", "bye"],
    }

    message_lower = message.lower()
    if any(word in message_lower for word in complexity_signals["low"]):
        return route_config.get("fast_model", route_config["default_model"])
    if any(word in message_lower for word in complexity_signals["high"]):
        return route_config.get("escalation_model", route_config["default_model"])
    return route_config["default_model"]
In practice, you want the triage agent itself (running on a fast model) to classify the complexity, then route the actual processing to the appropriate model. The gateway handles the mechanical routing; the agent makes the intelligent decision.
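A minimal sketch of that division of labor, assuming the triage agent reports its classification in a response header (the `X-Complexity` header name, tier labels, and `resolve_model_from_triage` helper are illustrative, not a fixed contract):

```python
# Hypothetical sketch: the triage agent (on a fast model) classifies the
# request and returns a complexity label; the gateway mechanically maps
# that label to the model used for the actual processing hop.
TIER_TO_MODEL = {
    "low": "claude-3-5-haiku-20241022",
    "default": "claude-sonnet-4-20250514",
    "high": "claude-opus-4-20250514",
}

def resolve_model_from_triage(triage_response_headers: dict) -> str:
    """Map the triage agent's X-Complexity header to a concrete model."""
    tier = triage_response_headers.get("X-Complexity", "default").lower()
    return TIER_TO_MODEL.get(tier, TIER_TO_MODEL["default"])
```

The gateway stays dumb: it only translates a label into a model name, so routing policy can change without redeploying agents.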
API Key Management and Authentication
Multi-Tenant API Key Architecture
import asyncio
import hashlib
from datetime import datetime

from fastapi import FastAPI, Request, HTTPException

async def authenticate(request: Request):
    auth_header = request.headers.get("Authorization", "")
    if not auth_header.startswith("Bearer "):
        raise HTTPException(status_code=401, detail="Missing bearer token")

    api_key = auth_header[7:]
    key_hash = hashlib.sha256(api_key.encode()).hexdigest()

    # Look up key by hash (never store plaintext keys)
    key_record = await db.fetch_one(
        "SELECT * FROM api_keys WHERE key_hash = $1 AND is_active = true",
        key_hash,
    )
    if not key_record:
        raise HTTPException(status_code=401, detail="Invalid API key")
    if key_record["expires_at"] and key_record["expires_at"] < datetime.utcnow():
        raise HTTPException(status_code=401, detail="API key expired")

    # Update last_used_at asynchronously so the hot path is not blocked
    asyncio.create_task(
        db.execute("UPDATE api_keys SET last_used_at = NOW() WHERE id = $1", key_record["id"])
    )

    return {
        "tenant_id": key_record["tenant_id"],
        "scopes": key_record["scopes"],
        "rate_limit_tier": key_record["rate_limit_tier"],
    }
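The hash-only storage above implies a key-issuance step that shows the plaintext to the tenant exactly once. A minimal sketch (the `cs_` prefix convention and `generate_api_key` helper are assumptions for illustration):

```python
import hashlib
import secrets

def generate_api_key(prefix: str = "cs") -> tuple[str, str]:
    """Generate a new API key and the SHA-256 hash to persist.

    The plaintext key is returned to the caller to show the tenant once;
    only the hash is stored, matching the lookup in authenticate().
    """
    plaintext = f"{prefix}_{secrets.token_urlsafe(32)}"
    key_hash = hashlib.sha256(plaintext.encode()).hexdigest()
    return plaintext, key_hash
```

A recognizable prefix makes leaked keys easy to find with secret scanners without weakening the key's entropy.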
Scope-Based Access Control
Different API keys should have access to different agents and capabilities:
SCOPE_DEFINITIONS = {
    "agents:triage": "Access the triage agent",
    "agents:specialist:*": "Access all specialist agents",
    "agents:specialist:billing": "Access only the billing agent",
    "tools:execute": "Direct tool execution bypassing agent",
    "conversations:read": "Read conversation history",
    "conversations:write": "Create new conversations",
    "admin:metrics": "Access usage metrics and analytics",
}

def check_scope(required_scope: str, granted_scopes: list) -> bool:
    for scope in granted_scopes:
        if scope == required_scope:
            return True
        if scope.endswith(":*"):
            # Keep the trailing colon in the prefix so "agents:specialist:*"
            # matches "agents:specialist:billing" but not "agents:specialists-x"
            prefix = scope[:-1]
            if required_scope.startswith(prefix):
                return True
    return False
Rate Limiting Per Tenant
Request-count-based rate limiting alone is insufficient for agentic AI. A single complex conversation can consume 100K tokens while 50 simple interactions use 5K tokens total. Rate limit on both requests and token consumption.
import time

import redis.asyncio as aioredis

class AgentRateLimiter:
    TIERS = {
        "free": {"rpm": 20, "tpm": 40_000, "concurrent": 2},
        "starter": {"rpm": 60, "tpm": 200_000, "concurrent": 5},
        "pro": {"rpm": 300, "tpm": 1_000_000, "concurrent": 20},
        "enterprise": {"rpm": 1000, "tpm": 5_000_000, "concurrent": 50},
    }

    def __init__(self, redis_client):
        self.redis = redis_client

    async def check_rate_limit(self, tenant_id: str, tier: str) -> dict:
        limits = self.TIERS.get(tier, self.TIERS["free"])
        now = int(time.time())
        minute_key = f"rl:{tenant_id}:rpm:{now // 60}"
        concurrent_key = f"rl:{tenant_id}:concurrent"

        pipe = self.redis.pipeline()
        pipe.incr(minute_key)
        pipe.expire(minute_key, 120)
        pipe.get(concurrent_key)
        results = await pipe.execute()

        current_rpm = results[0]
        current_concurrent = int(results[2] or 0)  # results[1] is the expire() reply

        if current_rpm > limits["rpm"]:
            return {"allowed": False, "reason": "RPM limit exceeded",
                    "retry_after": 60 - (now % 60)}
        if current_concurrent >= limits["concurrent"]:
            return {"allowed": False, "reason": "Concurrent limit reached",
                    "retry_after": 5}
        return {"allowed": True, "remaining_rpm": limits["rpm"] - current_rpm}
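The check above enforces RPM and concurrency up front, but token counts are only known after the model responds, so TPM is typically settled by recording usage post-response and rejecting once the rolling window is over budget. A minimal in-process sketch of that accounting (the `TokenWindowLimiter` name is illustrative; a production gateway would keep these counters in Redis like the RPM keys above so all instances share state):

```python
import time
from collections import defaultdict, deque
from typing import Optional

class TokenWindowLimiter:
    """Per-tenant tokens-per-minute accounting over a 60-second sliding window."""

    def __init__(self, tpm_limit: int):
        self.tpm_limit = tpm_limit
        # tenant_id -> deque of (timestamp, token_count) usage events
        self.events = defaultdict(deque)

    def record_usage(self, tenant_id: str, tokens: int,
                     now: Optional[float] = None) -> None:
        """Record token usage after a model response completes."""
        self.events[tenant_id].append((now or time.time(), tokens))

    def check(self, tenant_id: str, now: Optional[float] = None) -> bool:
        """Return True if the tenant is still under its TPM budget."""
        now = now or time.time()
        window = self.events[tenant_id]
        # Drop events that have aged out of the 60-second window
        while window and window[0][0] < now - 60:
            window.popleft()
        used = sum(tokens for _, tokens in window)
        return used < self.tpm_limit
```

Because usage is settled after the fact, a tenant can overshoot by one request; the window then blocks further requests until older usage expires.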
WebSocket Proxy for Streaming Responses
Agent conversations over WebSocket enable streaming token-by-token responses, real-time status updates (thinking, calling tool, handing off), and bidirectional communication.
from fastapi import WebSocket, WebSocketDisconnect

@app.websocket("/v1/ws/conversations/{conversation_id}")
async def websocket_agent_proxy(websocket: WebSocket, conversation_id: str):
    token = websocket.query_params.get("token")
    auth = await authenticate_ws_token(token)
    if not auth:
        await websocket.close(code=4001, reason="Unauthorized")
        return

    await websocket.accept()
    await rate_limiter.increment_concurrent(auth["tenant_id"])
    agent_ws = None
    try:
        agent_ws = await connect_to_agent_service(conversation_id, auth["tenant_id"])

        async def forward_to_agent():
            async for message in websocket.iter_text():
                await agent_ws.send(message)

        async def forward_to_client():
            async for message in agent_ws:
                await websocket.send_text(message.data)

        await asyncio.gather(forward_to_agent(), forward_to_client())
    except WebSocketDisconnect:
        pass
    finally:
        await rate_limiter.decrement_concurrent(auth["tenant_id"])
        if agent_ws is not None:  # connect may have failed before assignment
            await agent_ws.close()
Health-Based Routing
Route requests to agent services based on health, not just availability. An agent service can be running but degraded if its LLM provider is slow, its tools are failing, or it is near memory limits.
import asyncio
import random

class HealthBasedRouter:
    def __init__(self, services: dict):
        # service_name -> list of "host:port" endpoints
        self.services = services
        self.service_health = {}

    async def poll_health(self):
        while True:
            for service_name, endpoints in self.services.items():
                for endpoint in endpoints:
                    try:
                        resp = await http_client.get(f"http://{endpoint}/health/ready", timeout=5)
                        health = resp.json()
                        self.service_health[endpoint] = self.calculate_score(health)
                    except Exception:
                        self.service_health[endpoint] = 0.0
            await asyncio.sleep(10)

    def calculate_score(self, health: dict) -> float:
        score = 1.0
        if health.get("llm_latency_ms", 0) > 5000:
            score -= 0.3
        if health.get("error_rate_5m", 0) > 0.05:
            score -= 0.4
        if health.get("memory_usage_pct", 0) > 85:
            score -= 0.2
        if not health.get("tools_available", True):
            score -= 0.5
        return max(score, 0.0)

    async def select_endpoint(self, service_name: str) -> str:
        endpoints = self.services.get(service_name, [])
        healthy = [
            (ep, self.service_health.get(ep, 0.0))
            for ep in endpoints
            if self.service_health.get(ep, 0.0) > 0.3
        ]
        if not healthy:
            raise HTTPException(503, "No healthy endpoints available")

        # Weighted random favoring healthier instances
        total = sum(s for _, s in healthy)
        r = random.uniform(0, total)
        cumulative = 0.0
        for ep, score in healthy:
            cumulative += score
            if r <= cumulative:
                return ep
        return healthy[-1][0]
Gateway Response Headers
Return useful metadata so clients can track usage and debug issues:
X-Request-Id: 550e8400-e29b-41d4-a716-446655440000
X-Tenant-Id: acme-corp
X-Model-Used: claude-sonnet-4-20250514
X-Token-Count: 1847
X-Rate-Limit-Remaining: 42
X-Rate-Limit-Reset: 1710432000
X-Agent-Name: billing-agent
X-Conversation-Id: conv_abc123
X-Gateway-Latency-Ms: 12
These headers provide client-side observability without requiring additional API calls to check usage or trace requests through the system.
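A sketch of assembling these headers before the response goes out (the `build_gateway_headers` helper and its exact parameter list are illustrative; a real gateway would pull these values from the request context):

```python
import time
import uuid

def build_gateway_headers(tenant_id: str, model: str, token_count: int,
                          rate_remaining: int, start_time: float) -> dict:
    """Assemble the gateway metadata headers for a completed request."""
    return {
        "X-Request-Id": str(uuid.uuid4()),
        "X-Tenant-Id": tenant_id,
        "X-Model-Used": model,
        "X-Token-Count": str(token_count),
        "X-Rate-Limit-Remaining": str(rate_remaining),
        "X-Gateway-Latency-Ms": str(int((time.time() - start_time) * 1000)),
    }
```

Generating the request ID at the gateway and propagating it to backend services lets a single ID trace a request end to end.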
Security Considerations
- Never log request or response bodies that may contain PII or sensitive data. Log metadata only (request ID, tenant ID, status code, latency).
- Use mTLS between the gateway and backend agent services. The gateway terminates external TLS and initiates internal mTLS.
- Rotate API keys automatically. Set expiration dates and notify tenants 30 days before expiry.
- Implement request signing for webhook callbacks from agents to prevent replay attacks.
- Rate limit by IP in addition to API key to prevent credential-stuffing attacks on the authentication layer.
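The webhook-signing point above can be sketched with a timestamped HMAC, a common pattern (similar to Stripe-style signatures) rather than a specific CallSphere contract; the function names and 300-second tolerance are illustrative:

```python
import hashlib
import hmac
import time

def sign_webhook(secret: bytes, body: bytes, timestamp: int) -> str:
    """Sign a webhook payload; binding the timestamp prevents replay."""
    message = f"{timestamp}.".encode() + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()

def verify_webhook(secret: bytes, body: bytes, timestamp: int,
                   signature: str, tolerance_s: int = 300) -> bool:
    """Reject signatures that do not match or are older than the tolerance."""
    if abs(time.time() - timestamp) > tolerance_s:
        return False
    expected = sign_webhook(secret, body, timestamp)
    # Constant-time comparison avoids timing side channels
    return hmac.compare_digest(expected, signature)
```

The receiver recomputes the HMAC over the timestamp and raw body, so a captured request cannot be replayed after the tolerance window closes.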
Frequently Asked Questions
Should I build a custom API gateway or use an off-the-shelf solution?
Start with an existing gateway like Kong, APISIX, or Envoy and extend it with custom plugins for model routing, token-based rate limiting, and agent health checks. Building from scratch only makes sense if your routing logic is highly specialized and cannot be expressed as plugins. Most teams waste months rebuilding basic gateway functionality that already exists in mature projects.
How do I handle WebSocket connections through a load balancer?
Use a Layer 4 (TCP) load balancer or configure your Layer 7 load balancer to support WebSocket upgrade. Enable sticky sessions so a client's WebSocket connection always reaches the same gateway instance. Alternatively, use connection-ID-based routing through Redis so any gateway instance can resume a session.
What is the best rate limiting strategy for agentic AI APIs?
Use a sliding window algorithm with dual limits: requests per minute (RPM) and tokens per minute (TPM). RPM prevents request flooding while TPM prevents cost overruns from complex queries. Add a concurrent conversation limit to prevent a single tenant from monopolizing agent capacity. Store counters in Redis for low-latency checks across gateway instances.
How do I handle gateway failover without dropping active conversations?
Run at least 3 gateway instances behind a load balancer. Store all rate limiting state and active connection metadata in Redis rather than in-process memory so any gateway instance can enforce limits and resume sessions. A gateway that cannot reach Redis or backend services should be removed from the load balancer pool via health checks.
How do I version my agent API without breaking existing integrations?
Use URL path versioning (/v1/agents/, /v2/agents/) for major breaking changes and header-based versioning (X-Agent-Version: 2026-03-14) for minor prompt or behavior changes. The gateway routes to the appropriate agent deployment based on the version. Always maintain at least two major versions simultaneously and give tenants a 90-day deprecation window.
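A minimal sketch of combining the two versioning mechanisms into a routing key (the `resolve_agent_version` helper and the `major:date` key format are assumptions for illustration):

```python
def resolve_agent_version(path: str, headers: dict) -> str:
    """Combine the URL's major version with the optional
    date-based X-Agent-Version header into one routing key."""
    major = path.split("/")[1] if path.startswith("/v") else "v1"
    behavior_date = headers.get("X-Agent-Version", "latest")
    return f"{major}:{behavior_date}"
```

The gateway can then map each routing key to a specific agent deployment, so minor behavior pins never require clients to change URLs.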
CallSphere Team
Expert insights on AI voice agents and customer communication automation.