
Horizontal Scaling for AI Agents: Running Thousands of Concurrent Agent Sessions

Learn how to horizontally scale AI agent systems to handle thousands of concurrent sessions using stateless design, session affinity, load balancing, and auto-scaling strategies that maintain conversation coherence under heavy load.

Why Horizontal Scaling Matters for AI Agents

A single AI agent server can typically handle 50 to 200 concurrent conversations before response latency degrades. Each conversation involves holding context in memory, making LLM API calls that block for seconds, and streaming responses back to clients. Vertical scaling — adding more CPU and RAM to one machine — hits a ceiling quickly because the bottleneck is I/O-bound concurrency, not raw compute.

Horizontal scaling adds more server instances behind a load balancer so that thousands of concurrent sessions are distributed across a fleet. The challenge is that AI agent conversations are stateful — each turn depends on the history of previous turns. Designing around this statefulness is the core engineering problem.

Stateless Agent Design

The first principle is to externalize all conversation state. Instead of holding session data in memory on the server process, persist it to a shared store like Redis or a database after every turn:

import json

import redis.asyncio as redis
from fastapi import FastAPI, Request

app = FastAPI()

# Use the async Redis client so calls don't block the event loop
r = redis.Redis(host="redis-cluster", port=6379, decode_responses=True)

SESSION_TTL = 3600  # expire idle sessions after 1 hour

async def get_session(session_id: str) -> dict:
    data = await r.get(f"agent:session:{session_id}")
    if data is None:
        return {"messages": [], "metadata": {}}
    return json.loads(data)

async def save_session(session_id: str, session: dict) -> None:
    await r.setex(
        f"agent:session:{session_id}",
        SESSION_TTL,
        json.dumps(session),
    )

@app.post("/chat/{session_id}")
async def chat(session_id: str, request: Request):
    body = await request.json()
    session = await get_session(session_id)
    session["messages"].append({"role": "user", "content": body["message"]})

    # Call the LLM with the full conversation history
    response = await call_agent(session["messages"], session["metadata"])

    session["messages"].append({"role": "assistant", "content": response})
    await save_session(session_id, session)
    return {"response": response}

With this pattern, any server instance can handle any request for any session. The server itself holds no state between requests — it reads state from Redis, processes the turn, and writes state back.
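One thing the handler glosses over is unbounded history growth: messages accumulate every turn, so Redis payloads and LLM token costs grow with the conversation. A minimal trimming step applied before the LLM call might look like the sketch below — the character budget is a crude stand-in for real token counting, and the function name is illustrative:

```python
def trim_history(messages: list[dict], max_chars: int = 24_000) -> list[dict]:
    """Keep the most recent messages that fit within a rough size budget.

    Character count is a crude proxy for tokens; a production system
    would use the model's tokenizer instead.
    """
    kept: list[dict] = []
    total = 0
    # Walk from newest to oldest, stopping once the budget is exceeded.
    # The newest message is always kept, even if it alone exceeds the budget.
    for msg in reversed(messages):
        size = len(msg.get("content", ""))
        if kept and total + size > max_chars:
            break
        kept.append(msg)
        total += size
    return list(reversed(kept))
```

A call like `session["messages"] = trim_history(session["messages"])` just before invoking the LLM keeps both the Redis payload and the prompt size bounded.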

Load Balancer Configuration

For stateless agent servers, round-robin or least-connections load balancing works well. However, if your agent uses WebSocket streaming, you need session affinity (sticky sessions) for the duration of a single streaming response:

# Kubernetes Ingress with sticky sessions for WebSocket
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: agent-ingress
  annotations:
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/session-cookie-name: "agent-route"
    nginx.ingress.kubernetes.io/session-cookie-max-age: "300"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "120"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "120"
spec:
  rules:
    - host: agents.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: agent-service
                port:
                  number: 8000

The cookie-based affinity ensures a client reconnects to the same pod during an active streaming session, while new sessions are distributed evenly across the fleet.


Auto-Scaling Based on Concurrent Connections

CPU-based auto-scaling is a poor fit for AI agent workloads because servers spend most of their time waiting on LLM API calls. Instead, scale based on active connection count or request concurrency:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: agent-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: agent-deployment
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Pods
      pods:
        metric:
          name: active_ws_connections
        target:
          type: AverageValue
          averageValue: "100"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Pods
          value: 5
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 2
          periodSeconds: 120

This scales up aggressively when each pod averages over 100 active WebSocket connections and scales down conservatively to avoid dropping live conversations.
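For the HPA to see `active_ws_connections`, the application must export it as a custom metric (typically via a Prometheus gauge scraped by a metrics adapter). A stdlib-only sketch of the counter such an exporter would read — the `ConnectionGauge` class here is illustrative, standing in for something like `prometheus_client.Gauge`:

```python
import threading

class ConnectionGauge:
    """Thread-safe count of active WebSocket connections on this pod."""

    def __init__(self):
        self._lock = threading.Lock()
        self._count = 0

    def inc(self) -> None:
        with self._lock:
            self._count += 1

    def dec(self) -> None:
        with self._lock:
            self._count -= 1

    def value(self) -> int:
        with self._lock:
            return self._count

ACTIVE_WS = ConnectionGauge()

# In the WebSocket handler: call ACTIVE_WS.inc() after accepting the
# connection and ACTIVE_WS.dec() in a finally block on disconnect, so
# the gauge stays accurate even when a connection drops mid-stream.
```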

Graceful Shutdown and Connection Draining

When scaling down, pods must finish in-flight conversations before terminating. Configure a preStop hook and a generous termination grace period:

spec:
  terminationGracePeriodSeconds: 120
  containers:
    - name: agent
      lifecycle:
        preStop:
          exec:
            command:
              - /bin/sh
              - -c
              - "curl -s localhost:8000/drain && sleep 90"

The drain endpoint tells the server to stop accepting new connections and wait for active conversations to complete their current turn before shutting down.
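A sketch of the coordination the drain endpoint relies on — refuse new work, then wait for in-flight turns to finish. The endpoint wiring is omitted and the class name is illustrative; the draining flag and the idle-wait are the essential parts:

```python
import asyncio

class DrainController:
    """Tracks in-flight conversation turns and coordinates graceful shutdown."""

    def __init__(self):
        self.draining = False
        self._active = 0
        self._idle = asyncio.Event()
        self._idle.set()  # no work in flight initially

    def accept(self) -> bool:
        """Called when a new connection arrives; refused once draining."""
        if self.draining:
            return False
        self._active += 1
        self._idle.clear()
        return True

    def release(self) -> None:
        """Called when a conversation finishes its current turn."""
        self._active -= 1
        if self._active == 0:
            self._idle.set()

    async def drain(self, timeout: float = 90.0) -> None:
        """Stop accepting new work and wait for in-flight turns to complete."""
        self.draining = True
        await asyncio.wait_for(self._idle.wait(), timeout)
```

The `/drain` handler would call `drain()` with a timeout slightly below the pod's termination grace period, so the server exits cleanly before Kubernetes sends SIGKILL.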

FAQ

How many concurrent sessions can a single Python agent server handle?

With asyncio and an async framework like FastAPI, a single server can handle 100 to 300 concurrent sessions when the primary bottleneck is waiting on LLM API responses. The actual limit depends on memory per session (typically 50 to 200 KB of conversation context) and the timeout duration of LLM calls.
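As a back-of-the-envelope check using the upper end of those ranges (illustrative numbers only):

```python
sessions = 300        # concurrent sessions on one server
kb_per_session = 200  # upper end of typical context size
total_mb = sessions * kb_per_session / 1024
print(f"{total_mb:.0f} MB of session state")  # → 59 MB
```

Even at the high end, conversation context is a minor memory cost; the practical ceiling usually comes from the async framework's per-connection overhead and from long-running LLM calls holding connections open.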

Should I use sticky sessions or fully stateless routing?

Use fully stateless routing when you externalize all session state to Redis or a database — this gives maximum flexibility for scaling. Use sticky sessions only for the duration of a single WebSocket streaming response, not for the entire conversation lifecycle.

What happens to active conversations during a deployment rollout?

Configure rolling updates with maxSurge: 1 and maxUnavailable: 0 so that new pods come up before old ones terminate. Combined with connection draining and a termination grace period, active conversations complete their current turn on the old pod, and the next turn routes to a new pod.
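Those rollout settings map to a deployment strategy fragment like the following (assuming the same `agent-deployment` used in the HPA example):

```yaml
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # start one new pod before removing an old one
      maxUnavailable: 0  # never drop below the current replica count
```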


#HorizontalScaling #AIAgents #LoadBalancing #AutoScaling #DistributedSystems #AgenticAI #LearnAI #AIEngineering


CallSphere Team

Expert insights on AI voice agents and customer communication automation.
