FastAPI Middleware for AI Agents: Logging, Auth, and Rate Limiting
Build a production middleware stack for AI agent APIs in FastAPI. Covers structured request logging, Bearer token authentication, sliding window rate limiting, and CORS configuration for agent frontends.
The Middleware Stack for AI Agent APIs
Middleware sits between the incoming HTTP request and your endpoint handler. For AI agent backends, a proper middleware stack handles cross-cutting concerns: logging every request for debugging, authenticating callers before they reach agent endpoints, rate limiting to prevent LLM cost overruns, and adding CORS headers for browser-based agent frontends.
FastAPI middleware wraps your endpoint like layers of an onion and executes in reverse order of addition: the last middleware added is the outermost layer, meaning it sees the request first and the response last.
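The onion model can be demonstrated without FastAPI at all. This is a minimal plain-Python sketch (hypothetical `build_stack` and `make_middleware` helpers, not Starlette APIs) that mimics how the framework builds its stack: each newly added layer wraps everything added before it.

```python
def build_stack(middlewares, endpoint):
    """Wrap endpoint so the last-added middleware is outermost."""
    handler = endpoint
    for mw in middlewares:  # iterate in addition order
        handler = mw(handler)  # each new layer wraps all previous ones
    return handler

order = []

def make_middleware(name):
    def middleware(next_handler):
        def handler(request):
            order.append(f"{name}:before")
            response = next_handler(request)
            order.append(f"{name}:after")
            return response
        return handler
    return middleware

# Added in this order: logging first, cors last.
stack = build_stack(
    [make_middleware("logging"), make_middleware("cors")],
    lambda request: "response",
)
stack("request")
# order is now:
# ['cors:before', 'logging:before', 'logging:after', 'cors:after']
```

Because `cors` was added last, it runs first on the way in and last on the way out, exactly the behavior the CORS section below relies on.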
Structured Request Logging
Every AI agent request should be logged with enough context to debug issues in production. This middleware captures timing, status codes, and request metadata:
```python
import logging
import time
import uuid

from fastapi import FastAPI, Request

app = FastAPI()
logger = logging.getLogger("agent_api")


@app.middleware("http")
async def logging_middleware(request: Request, call_next):
    request_id = str(uuid.uuid4())[:8]
    request.state.request_id = request_id
    start_time = time.monotonic()

    # Log the incoming request (request.client can be None behind some proxies)
    logger.info(
        "request_started",
        extra={
            "request_id": request_id,
            "method": request.method,
            "path": request.url.path,
            "client_ip": request.client.host if request.client else None,
        },
    )

    try:
        response = await call_next(request)
        duration_ms = (time.monotonic() - start_time) * 1000
        logger.info(
            "request_completed",
            extra={
                "request_id": request_id,
                "status_code": response.status_code,
                "duration_ms": round(duration_ms, 2),
                "path": request.url.path,
            },
        )
        response.headers["X-Request-ID"] = request_id
        response.headers["X-Response-Time"] = f"{duration_ms:.0f}ms"
        return response
    except Exception as e:
        duration_ms = (time.monotonic() - start_time) * 1000
        logger.error(
            "request_failed",
            extra={
                "request_id": request_id,
                "error": str(e),
                "duration_ms": round(duration_ms, 2),
            },
        )
        raise
```
The X-Request-ID header lets clients and support teams correlate frontend errors with backend logs.
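The same request ID can be stamped onto every log line emitted while handling the request, not just the middleware's own entries. One common approach, sketched here with an illustrative `request_id_var` contextvar and `RequestIDFilter` class (not FastAPI APIs), uses a `logging.Filter`:

```python
import contextvars
import logging

# Holds the current request's ID for the duration of that request
request_id_var = contextvars.ContextVar("request_id", default="-")

class RequestIDFilter(logging.Filter):
    """Attach the current request ID to every log record."""
    def filter(self, record):
        record.request_id = request_id_var.get()
        return True

logger = logging.getLogger("agent_api")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(request_id)s %(message)s"))
handler.addFilter(RequestIDFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# The middleware would call request_id_var.set(request_id); simulated here:
request_id_var.set("a1b2c3d4")
logger.info("agent run started")  # line is prefixed with the request ID
```

Because `ContextVar` is task-local, concurrent requests in the same worker each see their own ID.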
Token-Based Authentication Middleware
AI agent APIs should authenticate every request. This middleware validates Bearer tokens and attaches user context to the request:
```python
import jwt  # PyJWT
from fastapi import Request
from fastapi.responses import JSONResponse

# Paths that must stay reachable without a token
SKIP_AUTH_PATHS = {"/health", "/docs", "/openapi.json"}


@app.middleware("http")
async def auth_middleware(request: Request, call_next):
    if request.url.path in SKIP_AUTH_PATHS:
        return await call_next(request)

    auth_header = request.headers.get("Authorization")
    if not auth_header or not auth_header.startswith("Bearer "):
        return JSONResponse(
            status_code=401,
            content={"error": "Missing or invalid auth token"},
        )

    token = auth_header.split(" ", 1)[1]
    try:
        payload = jwt.decode(
            token,
            settings.jwt_secret,  # your application's secret config
            algorithms=["HS256"],
        )
        request.state.user_id = payload["sub"]
        request.state.user_tier = payload.get("tier", "free")
    except jwt.ExpiredSignatureError:
        return JSONResponse(status_code=401, content={"error": "Token expired"})
    except jwt.InvalidTokenError:
        return JSONResponse(status_code=401, content={"error": "Invalid token"})

    return await call_next(request)
```
Notice this returns JSONResponse instead of raising HTTPException. FastAPI's HTTPException handlers run inside the middleware stack, so an exception raised from middleware skips them and surfaces as a generic 500. Returning a response directly is safer.
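The header-parsing step is also worth factoring into a small pure function so it can be unit tested without a running app. `extract_bearer_token` is an illustrative helper name, not part of FastAPI:

```python
def extract_bearer_token(auth_header):
    """Return the token from an 'Authorization: Bearer <token>' value, or None."""
    if not auth_header:
        return None
    scheme, _, token = auth_header.partition(" ")
    if scheme.lower() != "bearer" or not token:
        return None
    return token

assert extract_bearer_token("Bearer abc.def.ghi") == "abc.def.ghi"
assert extract_bearer_token("Basic dXNlcg==") is None
assert extract_bearer_token(None) is None
```

Using `partition` plus a case-insensitive scheme check is slightly more forgiving than `startswith("Bearer ")`, since RFC 7235 treats the auth scheme as case-insensitive.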
Sliding Window Rate Limiting
AI agent APIs are expensive because every request triggers LLM calls. Rate limiting prevents abuse and cost overruns. This implementation uses Redis for a sliding window algorithm:
```python
import redis.asyncio as redis

redis_client = redis.from_url("redis://localhost:6379/2")

RATE_LIMITS = {
    "free": {"requests": 20, "window_seconds": 3600},
    "pro": {"requests": 200, "window_seconds": 3600},
    "enterprise": {"requests": 2000, "window_seconds": 3600},
}


@app.middleware("http")
async def rate_limit_middleware(request: Request, call_next):
    if request.url.path in SKIP_AUTH_PATHS:
        return await call_next(request)

    user_id = getattr(request.state, "user_id", "anonymous")
    user_tier = getattr(request.state, "user_tier", "free")
    limits = RATE_LIMITS[user_tier]

    key = f"ratelimit:{user_id}"
    now = time.time()
    window_start = now - limits["window_seconds"]

    pipe = redis_client.pipeline()
    pipe.zremrangebyscore(key, 0, window_start)  # drop entries outside the window
    pipe.zcard(key)                              # count remaining entries
    pipe.zadd(key, {str(now): now})              # record the current request
    pipe.expire(key, limits["window_seconds"])   # keep idle keys from living forever
    results = await pipe.execute()
    request_count = results[1]

    if request_count >= limits["requests"]:
        retry_after = int(limits["window_seconds"])
        return JSONResponse(
            status_code=429,
            content={
                "error": "Rate limit exceeded",
                "limit": limits["requests"],
                "window": f"{limits['window_seconds']}s",
                "retry_after": retry_after,
            },
            headers={"Retry-After": str(retry_after)},
        )

    response = await call_next(request)
    remaining = limits["requests"] - request_count - 1
    response.headers["X-RateLimit-Limit"] = str(limits["requests"])
    response.headers["X-RateLimit-Remaining"] = str(max(0, remaining))
    return response
```
The Redis sorted set tracks each request timestamp. On each new request, old entries outside the window are pruned, the current count is checked, and the new request is added. This gives an accurate sliding window rather than a fixed window that resets.
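The same algorithm can be mirrored in memory for unit tests that shouldn't depend on Redis. This sketch (illustrative `SlidingWindowLimiter` name) replicates the prune-count-add sequence, including the detail that rejected requests are still recorded, just as the `zadd` above runs before the count check:

```python
import time
from collections import defaultdict

class SlidingWindowLimiter:
    """In-memory stand-in for the Redis sorted-set sliding window."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.hits = defaultdict(list)  # user_id -> request timestamps

    def allow(self, user_id, now=None):
        now = time.time() if now is None else now
        window_start = now - self.window
        # Prune entries outside the window (zremrangebyscore equivalent)
        timestamps = [t for t in self.hits[user_id] if t > window_start]
        allowed = len(timestamps) < self.limit
        timestamps.append(now)  # record this request (zadd equivalent)
        self.hits[user_id] = timestamps
        return allowed

limiter = SlidingWindowLimiter(limit=2, window_seconds=60)
assert limiter.allow("u1", now=0.0) is True
assert limiter.allow("u1", now=1.0) is True
assert limiter.allow("u1", now=2.0) is False   # third request in the window
assert limiter.allow("u1", now=61.5) is True   # earlier entries have aged out
```

Passing `now` explicitly makes the window behavior deterministic in tests instead of depending on the wall clock.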
CORS Configuration
Browser-based agent frontends need proper CORS headers:
```python
from fastapi.middleware.cors import CORSMiddleware

app.add_middleware(
    CORSMiddleware,
    allow_origins=[
        "https://app.yourdomain.com",
        "http://localhost:3000",
    ],
    allow_credentials=True,
    allow_methods=["GET", "POST", "PUT", "DELETE"],
    allow_headers=["Authorization", "Content-Type"],
    expose_headers=[
        "X-Request-ID",
        "X-RateLimit-Remaining",
    ],
)
```
Add CORS middleware last so it is the outermost layer and properly handles preflight OPTIONS requests before any other middleware runs. Because later-added middleware wraps earlier, define the decorated middleware above in reverse execution order: rate limiting first, then authentication, then logging, then CORS last. Requests then flow CORS, logging, auth, rate limiting, so user context is set before limits are checked.
FAQ
What is the correct order for middleware in a FastAPI AI agent API?
Add middleware in this order: CORS (outermost, handles preflight), logging (captures all requests including rejected ones), authentication (rejects unauthenticated requests early), rate limiting (checks limits for authenticated users). Since FastAPI middleware wraps in reverse order of addition, add CORS last in your code so it executes first. This ensures OPTIONS preflight requests get CORS headers without triggering auth or rate limiting.
Should I use middleware or Dependencies for authentication?
Middleware is better when every endpoint needs authentication because it runs automatically without any per-endpoint configuration. Dependencies are better when only some endpoints need auth, or when different endpoints need different auth levels. A common pattern is using middleware for basic token validation and a dependency for fine-grained permission checks on specific endpoints.
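The hybrid pattern from this answer can be sketched as a pure permission check plus FastAPI wiring shown in comments, so the logic stays testable without a server. `require_tier` and `TIER_RANK` are illustrative names, not FastAPI APIs:

```python
TIER_RANK = {"free": 0, "pro": 1, "enterprise": 2}

def require_tier(minimum):
    """Return a checker that passes only for users at or above `minimum`."""
    def check(user_tier):
        return TIER_RANK.get(user_tier, 0) >= TIER_RANK[minimum]
    return check

# In FastAPI this would be wrapped as a dependency, roughly:
#
#   def pro_only(request: Request):
#       if not require_tier("pro")(request.state.user_tier):
#           raise HTTPException(status_code=403, detail="Upgrade required")
#
#   @app.post("/agents/run", dependencies=[Depends(pro_only)])
#   async def run_agent(...): ...

assert require_tier("pro")("enterprise") is True
assert require_tier("pro")("free") is False
```

The middleware has already validated the token and set `request.state.user_tier`; the dependency only decides whether this endpoint admits that tier.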
How do I handle rate limiting for streaming endpoints?
Count the initial request, not individual streamed chunks. A streaming response that sends 500 tokens is still one API request from a rate limiting perspective. However, you may want to track token usage separately for billing purposes. Use the logging middleware to record total tokens consumed per request and apply token-based quotas as a separate check from request-count rate limiting.
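A token quota tracked separately from the request count could look like this minimal sketch (illustrative `TokenQuota` class; in production the counter would live in Redis, e.g. `INCRBY` on a key with a daily TTL):

```python
class TokenQuota:
    """Track LLM tokens consumed per user against a daily ceiling."""

    def __init__(self, daily_limit):
        self.daily_limit = daily_limit
        self.used = {}  # user_id -> tokens consumed today

    def record(self, user_id, tokens):
        self.used[user_id] = self.used.get(user_id, 0) + tokens

    def within_quota(self, user_id):
        return self.used.get(user_id, 0) < self.daily_limit

quota = TokenQuota(daily_limit=10_000)
quota.record("u1", 6_000)
assert quota.within_quota("u1") is True
quota.record("u1", 5_000)
assert quota.within_quota("u1") is False
```

The request-count limiter gates how often a user can call the API; this gates how much they can spend, and the two checks fail independently.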
#FastAPI #Middleware #Authentication #RateLimiting #AIAgents #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.