API Rate Limiting for AI Agent Services: Token Bucket, Sliding Window, and Adaptive Limits
Implement effective rate limiting for AI agent APIs using token bucket, sliding window, and adaptive algorithms. Learn per-user vs global strategies, proper response headers, and how to handle rate-limited AI agents gracefully.
Why Rate Limiting Is Critical for AI Agent APIs
AI agents are aggressive API consumers. Unlike humans who click buttons with seconds between actions, agents can fire hundreds of requests per minute when processing a batch of tasks or running a chain of tool calls. Without rate limiting, a single runaway agent can exhaust your LLM budget, overwhelm your database, and degrade service for every other consumer.
Rate limiting for AI agent services also has a cost dimension that traditional APIs lack. Each request might trigger an LLM inference call costing cents to dollars. A misconfigured agent loop hitting your API 1,000 times in a minute could burn through hundreds of dollars before anyone notices.
Token Bucket Algorithm
The token bucket is the most common rate limiting algorithm. It allows bursts while enforcing a long-term average rate. Imagine a bucket that fills with tokens at a steady rate. Each request consumes one token. If the bucket is empty, the request is rejected:
```python
import time
from dataclasses import dataclass, field


@dataclass
class TokenBucket:
    capacity: int
    refill_rate: float  # tokens per second
    tokens: float = field(init=False)
    last_refill: float = field(init=False)

    def __post_init__(self):
        self.tokens = float(self.capacity)
        self.last_refill = time.monotonic()

    def consume(self, count: int = 1) -> bool:
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= count:
            self.tokens -= count
            return True
        return False

    def time_until_available(self) -> float:
        # Refill first so the estimate accounts for time elapsed
        # since the last consume() call
        elapsed = time.monotonic() - self.last_refill
        tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        if tokens >= 1:
            return 0.0
        return (1 - tokens) / self.refill_rate


# 100 requests per minute with a burst capacity of 20
bucket = TokenBucket(capacity=20, refill_rate=100 / 60)
```
The token bucket is ideal for AI agent APIs because it accommodates the bursty nature of agent activity — an agent might send 10 messages in rapid succession during a tool-call chain, then pause while waiting for results.
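That burst behavior is easy to demonstrate. The sketch below re-declares a compact version of the bucket above (so it runs standalone) and fires 25 back-to-back requests: the 20-token burst is admitted, and the rest are rejected until tokens refill.

```python
import time


class TokenBucket:
    """Compact re-statement of the bucket above, for a runnable demo."""

    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def consume(self, count: int = 1) -> bool:
        now = time.monotonic()
        self.tokens = min(
            self.capacity, self.tokens + (now - self.last_refill) * self.refill_rate
        )
        self.last_refill = now
        if self.tokens >= count:
            self.tokens -= count
            return True
        return False


# 100 requests/minute with a burst of 20
bucket = TokenBucket(capacity=20, refill_rate=100 / 60)
burst = [bucket.consume() for _ in range(25)]
print(burst.count(True))  # the 20-token burst is admitted
print(burst[20])          # request 21 is rejected until tokens refill
```

At ~1.67 tokens per second, the rejected requests become eligible again roughly 0.6 seconds apart.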
Sliding Window with Redis
For distributed systems where multiple API server instances share rate limits, use Redis-backed sliding window counters:
```python
import time
import uuid

import redis.asyncio as redis

redis_client = redis.Redis(host="localhost", port=6379, decode_responses=True)


async def sliding_window_check(
    key: str,
    limit: int,
    window_seconds: int,
) -> tuple[bool, int, float]:
    """Returns (allowed, remaining, retry_after_seconds)."""
    now = time.time()
    window_start = now - window_seconds
    # Unique member so concurrent requests at the same timestamp don't collide
    member = f"{now}:{uuid.uuid4().hex}"
    pipe = redis_client.pipeline()
    # Remove old entries outside the window
    pipe.zremrangebyscore(key, 0, window_start)
    # Count current entries
    pipe.zcard(key)
    # Add the current request
    pipe.zadd(key, {member: now})
    # Set expiry on the key so idle clients don't leak memory
    pipe.expire(key, window_seconds)
    results = await pipe.execute()
    current_count = results[1]
    if current_count >= limit:
        # Roll back the entry we just added so rejected requests don't
        # consume a slot, then compute retry-after from the oldest entry
        await redis_client.zrem(key, member)
        oldest = await redis_client.zrange(key, 0, 0, withscores=True)
        retry_after = (oldest[0][1] + window_seconds - now) if oldest else 1.0
        return False, 0, retry_after
    remaining = limit - current_count - 1
    return True, remaining, 0.0
```
The sliding window uses a Redis sorted set where each request is a member scored by its timestamp. This gives you precise rate counting without the boundary issues of fixed windows.
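The same bookkeeping can be illustrated without a Redis server. This in-memory sketch (single-process only, so it is not a substitute for the distributed version above) keeps a list of timestamps per key in place of the sorted set and applies the identical prune-count-add logic:

```python
import time
from collections import defaultdict

# key -> list of request timestamps (stands in for the Redis sorted set)
_requests: dict[str, list[float]] = defaultdict(list)


def sliding_window_check_local(
    key: str, limit: int, window_seconds: int
) -> tuple[bool, int]:
    """Returns (allowed, remaining) for a single-process sliding window."""
    now = time.time()
    window_start = now - window_seconds
    # Drop entries outside the window (the zremrangebyscore step)
    _requests[key] = [t for t in _requests[key] if t > window_start]
    if len(_requests[key]) >= limit:
        return False, 0
    _requests[key].append(now)
    return True, limit - len(_requests[key])


# Limit of 3 per 60s: the first 3 requests pass, the next 2 are rejected
results = [sliding_window_check_local("agent-1", 3, 60)[0] for _ in range(5)]
print(results)
```

Swapping the list for a Redis sorted set is what makes the counter shared across API server instances.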
FastAPI Middleware Implementation
Wire rate limiting into your FastAPI app as middleware that sets standard response headers:
```python
import time

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()

RATE_LIMITS = {
    "default": {"limit": 60, "window": 60},
    "agent": {"limit": 200, "window": 60},
    "admin": {"limit": 1000, "window": 60},
}


def get_rate_limit_tier(request: Request) -> str:
    api_key = request.headers.get("X-API-Key", "")
    # Look up the tier from a database in production
    if api_key.startswith("agent_"):
        return "agent"
    if api_key.startswith("admin_"):
        return "admin"
    return "default"


@app.middleware("http")
async def rate_limit_middleware(request: Request, call_next):
    if request.url.path.startswith("/docs"):
        return await call_next(request)
    client_key = request.headers.get("X-API-Key") or (
        request.client.host if request.client else "unknown"
    )
    tier = get_rate_limit_tier(request)
    config = RATE_LIMITS[tier]
    redis_key = f"ratelimit:{tier}:{client_key}"
    allowed, remaining, retry_after = await sliding_window_check(
        redis_key, config["limit"], config["window"]
    )
    if not allowed:
        return JSONResponse(
            status_code=429,
            content={
                "error": "rate_limit_exceeded",
                "message": f"Rate limit of {config['limit']} requests "
                f"per {config['window']}s exceeded",
                "retry_after": round(retry_after, 1),
            },
            headers={
                "Retry-After": str(int(retry_after) + 1),
                "X-RateLimit-Limit": str(config["limit"]),
                "X-RateLimit-Remaining": "0",
                "X-RateLimit-Reset": str(int(time.time()) + int(retry_after) + 1),
            },
        )
    response = await call_next(request)
    response.headers["X-RateLimit-Limit"] = str(config["limit"])
    response.headers["X-RateLimit-Remaining"] = str(remaining)
    return response
```
Adaptive Rate Limiting
Static limits work for predictable traffic, but AI agent workloads can be spiky. Adaptive rate limiting adjusts limits based on system health:
```python
import psutil


async def get_adaptive_limit(base_limit: int) -> int:
    # Note: the 0.1s sampling interval blocks; in production, sample CPU
    # in a background task and read a cached value here
    cpu_percent = psutil.cpu_percent(interval=0.1)
    # Tighten the limit as system load rises
    if cpu_percent > 90:
        return max(base_limit // 4, 5)
    if cpu_percent > 75:
        return base_limit // 2
    if cpu_percent > 60:
        return int(base_limit * 0.75)
    return base_limit
```
Monitor CPU, memory, database connection pool utilization, and LLM API response times. When any metric exceeds a threshold, tighten the rate limits dynamically. This protects your system during load spikes without permanently restricting throughput during normal operation.
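One way to combine several such signals is to take the most pessimistic scale factor across all metrics and multiply it into the base limit. A sketch of that idea, where the thresholds and the choice of metrics (CPU, memory, LLM p95 latency) are illustrative assumptions, not values from the code above:

```python
def adaptive_scale(cpu_percent: float, mem_percent: float, llm_p95_ms: float) -> float:
    """Return a multiplier in (0, 1] to apply to a base rate limit."""

    def scale(value: float, soft: float, hard: float) -> float:
        # 1.0 below the soft threshold, shrinking linearly to 0.25 at the hard one
        if value <= soft:
            return 1.0
        if value >= hard:
            return 0.25
        return 1.0 - 0.75 * (value - soft) / (hard - soft)

    # The tightest (smallest) signal wins
    return min(
        scale(cpu_percent, soft=60, hard=95),
        scale(mem_percent, soft=70, hard=95),
        scale(llm_p95_ms, soft=2000, hard=10000),
    )


# Healthy system: full limit. Saturated CPU: limit cut to a quarter.
print(int(200 * adaptive_scale(cpu_percent=50, mem_percent=50, llm_p95_ms=500)))
print(int(200 * adaptive_scale(cpu_percent=96, mem_percent=50, llm_p95_ms=500)))
```

Taking the minimum means one saturated resource is enough to throttle, which matches the goal of protecting the weakest link during a spike.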
Client-Side Rate Limit Handling
Build rate limit awareness into your agent clients so they back off gracefully:
```python
import asyncio

import httpx


async def agent_request_with_backoff(url: str, payload: dict) -> dict:
    async with httpx.AsyncClient() as client:
        for attempt in range(5):
            response = await client.post(url, json=payload)
            if response.status_code != 429:
                response.raise_for_status()
                return response.json()
            # Honor the server's Retry-After header, defaulting to 1 second
            retry_after = float(response.headers.get("Retry-After", "1"))
            await asyncio.sleep(retry_after)
    raise RuntimeError("Rate limit not recovered after 5 retries")
```
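When the server omits Retry-After, or when many agents share one key and would otherwise retry in lockstep, exponential backoff with full jitter is a common refinement. A sketch of just the delay calculation:

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Full-jitter backoff: a random delay in [0, min(cap, base * 2^attempt)]."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))


# Successive attempts wait longer on average, but never beyond the cap,
# and the randomness spreads retries from many agents apart in time
delays = [backoff_delay(attempt) for attempt in range(6)]
print([round(d, 2) for d in delays])
```

In the client above, this delay would replace the fixed 1-second default whenever the Retry-After header is missing.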
FAQ
Should I rate limit per API key, per IP, or per agent ID?
Use per-API-key as the primary dimension since it maps to a billable entity. Add per-IP limiting as a secondary defense against unauthenticated abuse. Per-agent-ID limiting is useful when a single API key runs multiple agents and you want to prevent one agent from starving the others.
How do I set appropriate rate limits for AI agent consumers?
Start by measuring actual agent traffic patterns. Most agents have a natural request rate determined by their processing loop. Set limits at 2-3x the observed peak rate to accommodate legitimate bursts while catching runaway loops. Monitor 429 response rates — if legitimate agents are consistently hitting limits, your limits are too tight.
What is the difference between token bucket and sliding window in practice?
Token bucket allows larger bursts (up to the bucket capacity) followed by a steady flow. Sliding window enforces a strict count within any rolling time period. For AI agents, token bucket is usually better because agents naturally work in bursts — sending a flurry of requests during a tool-call chain, then pausing.
CallSphere Team
Expert insights on AI voice agents and customer communication automation.