
Building an AI Agent Gateway: Centralized Access Control, Logging, and Rate Limiting

Design and implement an AI agent gateway that provides centralized access control, structured logging, and intelligent rate limiting. Learn gateway architecture patterns, policy enforcement, and request routing for multi-agent environments.

The Case for a Dedicated Agent Gateway

When an organization runs five or ten AI agents, each handling its own authentication, logging, and rate limiting, inconsistencies creep in. One agent logs to stdout, another to a file. One checks API keys, another trusts a header value. One has no rate limiting at all, and a runaway client consumes the entire LLM budget in an afternoon.

An agent gateway sits between clients and agents, enforcing consistent policies across every request. It is not a new concept — API gateways like Kong and Envoy solve the same problem for microservices. But AI agents have unique requirements: token-based cost tracking, streaming response handling, and dynamic routing based on agent capabilities.

Gateway Architecture

The gateway operates as a reverse proxy with three processing stages: pre-request (authentication, authorization, rate limiting), routing (selecting the target agent), and post-response (logging, metrics, cost tracking).

from fastapi import FastAPI, Request, HTTPException
from fastapi.responses import StreamingResponse
from datetime import datetime
import httpx
import time
import json

app = FastAPI(title="AI Agent Gateway")

AGENT_REGISTRY = {
    "support-agent": {
        "url": "http://support-agent:8001",
        "required_role": "agent_user",
        "rate_limit": 100,  # requests per minute
        "cost_center": "customer_success",
    },
    "analytics-agent": {
        "url": "http://analytics-agent:8002",
        "required_role": "analyst",
        "rate_limit": 50,
        "cost_center": "data_team",
    },
}


class GatewayMiddleware:
    def __init__(self, redis_client):
        self.redis = redis_client

    async def check_rate_limit(self, user_id: str, agent_id: str) -> bool:
        # Fixed-window counter: one Redis key per user/agent/minute, expired
        # after 60 seconds so stale windows clean themselves up.
        config = AGENT_REGISTRY[agent_id]
        key = f"rate:{user_id}:{agent_id}:{datetime.utcnow().strftime('%Y%m%d%H%M')}"
        current = await self.redis.incr(key)
        if current == 1:
            await self.redis.expire(key, 60)
        return current <= config["rate_limit"]

    async def check_authorization(self, user_roles: list, agent_id: str) -> bool:
        required = AGENT_REGISTRY[agent_id]["required_role"]
        return required in user_roles

    def build_audit_entry(
        self, request: Request, agent_id: str, user: dict,
        status: int, latency_ms: float, token_count: int
    ) -> dict:
        return {
            "timestamp": datetime.utcnow().isoformat(),
            "user_id": user["id"],
            "agent_id": agent_id,
            "method": request.method,
            "path": str(request.url),
            "status": status,
            "latency_ms": latency_ms,
            "token_count": token_count,
            "cost_center": AGENT_REGISTRY[agent_id]["cost_center"],
        }

Request Routing and Forwarding

The routing layer resolves which backend agent handles each request. It reads the agent identifier from the URL path, validates the agent exists, and forwards the request with context headers.

async def log_audit(entry: dict) -> None:
    # Minimal audit sink: one JSON line per request; replace with your log pipeline.
    print(json.dumps(entry))


@app.post("/agents/{agent_id}/chat")
async def route_to_agent(agent_id: str, request: Request):
    if agent_id not in AGENT_REGISTRY:
        raise HTTPException(status_code=404, detail="Agent not found")

    user = request.state.user
    gateway = GatewayMiddleware(request.app.state.redis)

    if not await gateway.check_authorization(user["roles"], agent_id):
        raise HTTPException(status_code=403, detail="Insufficient permissions")

    if not await gateway.check_rate_limit(user["id"], agent_id):
        raise HTTPException(status_code=429, detail="Rate limit exceeded")

    agent_url = AGENT_REGISTRY[agent_id]["url"]
    body = await request.json()

    start = time.monotonic()
    # A fresh client per request keeps the example simple; production gateways
    # should reuse a shared AsyncClient for connection pooling.
    async with httpx.AsyncClient() as client:
        response = await client.post(
            f"{agent_url}/chat",
            json=body,
            headers={
                "X-User-Id": user["id"],
                "X-User-Roles": ",".join(user["roles"]),
                "X-Request-Id": request.state.request_id,
            },
            timeout=120.0,
        )
    latency_ms = (time.monotonic() - start) * 1000

    payload = response.json()
    # Token usage, if the agent reports it (assumes an OpenAI-style "usage" field).
    token_count = payload.get("usage", {}).get("total_tokens", 0)

    audit_entry = gateway.build_audit_entry(
        request, agent_id, user, response.status_code, latency_ms, token_count
    )
    await log_audit(audit_entry)

    # Surface backend errors instead of masking them behind a 200.
    if response.status_code >= 400:
        raise HTTPException(status_code=response.status_code, detail=payload)
    return payload

Policy Enforcement Patterns

Policies should be declarative and loaded from a configuration store, not hardcoded. This lets platform teams update rate limits, add IP allowlists, or restrict agents to certain departments without redeploying the gateway.

A policy engine evaluates rules in order. The first matching deny rule rejects the request. If no deny rules match, the request proceeds. This approach scales cleanly because new policies are additive.
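This first-match-deny evaluation can be sketched in a few lines. The `PolicyRule` shape, the `evaluate` helper, and the example conditions below are illustrative assumptions; in practice the rule list would load from the configuration store rather than being defined inline.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class PolicyRule:
    effect: str                        # only "deny" rules short-circuit evaluation
    agent_id: Optional[str]            # None matches any agent
    condition: Callable[[dict], bool]  # predicate over the request context

def evaluate(rules: list[PolicyRule], ctx: dict) -> bool:
    """First matching deny rule rejects; otherwise the request proceeds."""
    for rule in rules:
        if (rule.effect == "deny"
                and rule.agent_id in (None, ctx["agent_id"])
                and rule.condition(ctx)):
            return False
    return True

# Hypothetical rules: restrict the analytics agent to one department, and
# block an example IP range for every agent.
rules = [
    PolicyRule("deny", "analytics-agent", lambda c: c["department"] != "data_team"),
    PolicyRule("deny", None, lambda c: c["ip"].startswith("10.9.")),
]
```

Because evaluation stops at the first matching deny, adding a new restriction is a one-line rule append with no changes to existing rules.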

Streaming Response Handling

AI agents frequently stream responses token by token. The gateway must proxy these streams without buffering the entire response, or users experience unacceptable latency. Use chunked transfer encoding and forward each chunk as it arrives from the backend agent.

FAQ

Why not use an existing API gateway like Kong or Envoy for AI agents?

You can use them as a base layer for TLS termination and basic rate limiting. However, AI-specific features like token counting, cost allocation per LLM call, and dynamic agent routing based on capabilities require custom logic. A dedicated agent gateway adds this AI-aware layer on top of standard infrastructure.
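As one example of that AI-aware layer, the audit entries the gateway emits can be rolled up into per-cost-center spend. The `cost_by_center` helper and the flat per-thousand-token price below are illustrative assumptions, not a real billing model.

```python
from collections import defaultdict

def cost_by_center(audit_entries: list[dict], price_per_1k_tokens: float = 0.01) -> dict:
    """Sum token usage per cost center and convert to dollars at a flat rate."""
    totals: dict[str, float] = defaultdict(float)
    for entry in audit_entries:
        totals[entry["cost_center"]] += entry["token_count"]
    return {center: round(tokens / 1000 * price_per_1k_tokens, 4)
            for center, tokens in totals.items()}

entries = [
    {"cost_center": "customer_success", "token_count": 1200},
    {"cost_center": "customer_success", "token_count": 800},
    {"cost_center": "data_team", "token_count": 5000},
]
```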

How should the gateway handle agent failures and retries?

Implement circuit breakers per agent. If an agent returns five consecutive 500 errors within a minute, the gateway should stop forwarding requests and return a 503 with a retry-after header. This prevents cascading failures and protects shared LLM quotas from wasted retry storms.

Does the gateway add significant latency to agent requests?

The gateway adds 2 to 10 milliseconds for policy evaluation and routing, which is negligible compared to LLM inference times of 500 milliseconds to several seconds. That overhead is a small price for the centralized control and observability the gateway provides.


CallSphere Team

Expert insights on AI voice agents and customer communication automation.
