Horizontal Scaling for AI Agents: Running Thousands of Concurrent Agent Sessions
Learn how to horizontally scale AI agent systems to handle thousands of concurrent sessions using stateless design, session affinity, load balancing, and auto-scaling strategies that maintain conversation coherence under heavy load.
Why Horizontal Scaling Matters for AI Agents
A single AI agent server can typically handle 50 to 200 concurrent conversations before response latency degrades. Each conversation involves holding context in memory, making LLM API calls that block for seconds, and streaming responses back to clients. Vertical scaling — adding more CPU and RAM to one machine — hits a ceiling quickly because the bottleneck is I/O-bound concurrency, not raw compute.
Horizontal scaling adds more server instances behind a load balancer so that thousands of concurrent sessions are distributed across a fleet. The challenge is that AI agent conversations are stateful — each turn depends on the history of previous turns. Designing around this statefulness is the core engineering problem.
Stateless Agent Design
The first principle is to externalize all conversation state. Instead of holding session data in memory on the server process, persist it to a shared store like Redis or a database after every turn:
import json

import redis.asyncio as redis
from fastapi import FastAPI, Request

app = FastAPI()
r = redis.Redis(host="redis-cluster", port=6379, decode_responses=True)

SESSION_TTL = 3600  # 1 hour

async def get_session(session_id: str) -> dict:
    data = await r.get(f"agent:session:{session_id}")
    if data is None:
        return {"messages": [], "metadata": {}}
    return json.loads(data)

async def save_session(session_id: str, session: dict):
    await r.setex(
        f"agent:session:{session_id}",
        SESSION_TTL,
        json.dumps(session),
    )

@app.post("/chat/{session_id}")
async def chat(session_id: str, request: Request):
    body = await request.json()
    session = await get_session(session_id)
    session["messages"].append({"role": "user", "content": body["message"]})
    # Call the LLM with the full conversation history
    response = await call_agent(session["messages"], session["metadata"])
    session["messages"].append({"role": "assistant", "content": response})
    await save_session(session_id, session)
    return {"response": response}
With this pattern, any server instance can handle any request for any session. The server itself holds no state between requests — it reads state from Redis, processes the turn, and writes state back.
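One refinement worth noting: because the full history is replayed on every turn, both the Redis payload and the LLM prompt grow without bound as a conversation continues. A common mitigation, sketched below as a hypothetical helper (not part of the code above), is to trim the history to the most recent turns while preserving any leading system message:

```python
def trim_history(messages: list[dict], max_messages: int = 40) -> list[dict]:
    """Keep the most recent turns, preserving a leading system message.

    Bounds both the Redis payload size and the prompt sent to the LLM.
    """
    if len(messages) <= max_messages:
        return messages
    # Preserve the system message, if present, and reserve budget for it.
    head = messages[:1] if messages and messages[0].get("role") == "system" else []
    return head + messages[-(max_messages - len(head)):]
```

Calling this before `save_session` keeps per-session state bounded; production systems often use token-based budgets or summarization instead of a simple message count.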
Load Balancer Configuration
For stateless agent servers, round-robin or least-connections load balancing works well. However, if your agent uses WebSocket streaming, you need session affinity (sticky sessions) for the duration of a single streaming response:
# Kubernetes Ingress with sticky sessions for WebSocket
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: agent-ingress
  annotations:
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/session-cookie-name: "agent-route"
    nginx.ingress.kubernetes.io/session-cookie-max-age: "300"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "120"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "120"
spec:
  rules:
    - host: agents.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: agent-service
                port:
                  number: 8000
The cookie-based affinity ensures a client reconnects to the same pod during an active streaming session, while new sessions are distributed evenly across the fleet.
Auto-Scaling Based on Concurrent Connections
CPU-based auto-scaling is a poor fit for AI agent workloads because servers spend most of their time waiting on LLM API calls. Instead, scale based on active connection count or request concurrency:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: agent-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: agent-deployment
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Pods
      pods:
        metric:
          name: active_ws_connections
        target:
          type: AverageValue
          averageValue: "100"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Pods
          value: 5
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 2
          periodSeconds: 120
This scales up aggressively when each pod averages over 100 active WebSocket connections and scales down conservatively to avoid dropping live conversations.
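The `active_ws_connections` metric has to come from somewhere: each pod must count its own live connections and export that number. The sketch below is a minimal stdlib-only version of that counter, under the assumption that in production it would be published via a metrics library such as prometheus_client and surfaced to the HPA through a metrics adapter:

```python
import asyncio
from contextlib import asynccontextmanager

class ConnectionGauge:
    """In-process count of live WebSocket connections.

    In production this value would be exported as the
    active_ws_connections metric so the autoscaler can read it.
    """

    def __init__(self) -> None:
        self._count = 0

    @asynccontextmanager
    async def track(self):
        # Increment on connect, decrement on disconnect, even if the
        # handler raises mid-stream.
        self._count += 1
        try:
            yield
        finally:
            self._count -= 1

    @property
    def value(self) -> int:
        return self._count

gauge = ConnectionGauge()

async def handle_connection():
    async with gauge.track():
        # ... serve the WebSocket session here ...
        await asyncio.sleep(0)
```

Because asyncio runs handlers on a single event loop thread, the plain integer increment is safe without a lock.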
Graceful Shutdown and Connection Draining
When scaling down, pods must finish in-flight conversations before terminating. Configure a preStop hook and a generous termination grace period:
spec:
  terminationGracePeriodSeconds: 120
  containers:
    - name: agent
      lifecycle:
        preStop:
          exec:
            command:
              - /bin/sh
              - -c
              - "curl -s localhost:8000/drain && sleep 90"
The drain endpoint tells the server to stop accepting new connections and wait for active conversations to complete their current turn before shutting down.
FAQ
How many concurrent sessions can a single Python agent server handle?
With asyncio and an async framework like FastAPI, a single server can handle 100 to 300 concurrent sessions when the primary bottleneck is waiting on LLM API responses. The actual limit depends on memory per session (typically 50 to 200 KB of conversation context) and the timeout duration of LLM calls.
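As a quick sanity check on those numbers, the context memory is rarely the binding constraint; even the worst case above fits easily in a small pod:

```python
def session_memory_mb(sessions: int, kb_per_session: int) -> float:
    """Rough resident memory needed just for conversation context."""
    return sessions * kb_per_session / 1024

# Worst case from the figures above: 300 sessions at 200 KB each.
print(f"{session_memory_mb(300, 200):.0f} MB")  # prints "59 MB"
```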
Should I use sticky sessions or fully stateless routing?
Use fully stateless routing when you externalize all session state to Redis or a database — this gives maximum flexibility for scaling. Use sticky sessions only for the duration of a single WebSocket streaming response, not for the entire conversation lifecycle.
What happens to active conversations during a deployment rollout?
Configure rolling updates with maxSurge: 1 and maxUnavailable: 0 so that new pods come up before old ones terminate. Combined with connection draining and a termination grace period, active conversations complete their current turn on the old pod, and the next turn routes to a new pod.
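In the Deployment spec, those rollout settings might be sketched like this (assuming the same `agent-deployment` name used by the HPA example above):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: agent-deployment
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # bring one new pod up first
      maxUnavailable: 0  # never take a serving pod away early
  # ... replicas, selector, template, etc. ...
```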
#HorizontalScaling #AIAgents #LoadBalancing #AutoScaling #DistributedSystems #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.