Scaling AI Agents to 10,000 Concurrent Users: Architecture Patterns and Load Testing
Learn how to scale agentic AI systems to handle 10,000 concurrent users with connection pooling, async processing, horizontal scaling, and k6 load testing strategies.
Why Agent Systems Break at Scale
Scaling a traditional REST API to 10,000 concurrent users is a solved problem: add stateless application servers behind a load balancer, scale the database with read replicas, and cache aggressively. Scaling an AI agent system is fundamentally harder because agents are stateful, long-running, and computationally expensive.
A single agent interaction might involve 5-15 LLM calls, each taking 1-10 seconds. The agent maintains conversational state across these calls. It holds connections to external tools, databases, and APIs. And it consumes significant memory for context windows that can exceed 100K tokens.
At 10,000 concurrent users, you are not managing 10,000 HTTP request-response cycles. You are managing 10,000 concurrent state machines, each executing multi-step workflows with variable latency and resource consumption. This post covers the architecture patterns that make this possible.
The Core Architecture: Separating Concerns
The first principle of agent scaling is separating the components that scale differently:
Gateway Layer: Handles WebSocket connections, authentication, rate limiting. Scales horizontally with minimal state.
Router Layer: Classifies incoming requests and dispatches to the appropriate agent pool. Lightweight, fast, scales easily.
Agent Worker Pool: Executes agent logic. This is the bottleneck. Each worker manages one or more agent sessions, making LLM calls and tool invocations. Scaling requires careful resource management.
State Store: Persists conversation state, agent memory, and session data. Must handle high read/write throughput with low latency.
Tool Execution Layer: Manages connections to external services, databases, and APIs. Needs connection pooling and circuit breaking.
# Agent scaling architecture with FastAPI and Redis
import asyncio
import json
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional

from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from redis.asyncio import Redis


@dataclass
class AgentSession:
    session_id: str
    user_id: str
    agent_type: str
    messages: list[dict] = field(default_factory=list)
    state: dict = field(default_factory=dict)
    created_at: float = 0.0
    last_active: float = 0.0


class AgentSessionManager:
    """Manages agent sessions with Redis-backed state."""

    def __init__(self, redis: Redis, ttl: int = 3600):
        self.redis = redis
        self.ttl = ttl

    async def create_session(self, user_id: str, agent_type: str) -> AgentSession:
        now = time.time()
        session = AgentSession(
            session_id=str(uuid.uuid4()),
            user_id=user_id,
            agent_type=agent_type,
            created_at=now,
            last_active=now,
        )
        await self._save(session)
        return session

    async def get_session(self, session_id: str) -> Optional[AgentSession]:
        data = await self.redis.get(f"agent:session:{session_id}")
        if not data:
            return None
        return AgentSession(**json.loads(data))

    async def update_session(self, session: AgentSession):
        session.last_active = time.time()
        await self._save(session)

    async def _save(self, session: AgentSession):
        key = f"agent:session:{session.session_id}"
        await self.redis.setex(
            key,
            self.ttl,
            json.dumps({
                "session_id": session.session_id,
                "user_id": session.user_id,
                "agent_type": session.agent_type,
                "messages": session.messages[-50:],  # Keep last 50 messages
                "state": session.state,
                "created_at": session.created_at,
                "last_active": session.last_active,
            }),
        )


app = FastAPI()
redis = Redis.from_url("redis://redis-cluster:6379/0")
session_mgr = AgentSessionManager(redis)

# Connection tracking for backpressure
active_connections: dict[str, WebSocket] = {}
MAX_CONCURRENT_SESSIONS = 10000


@app.websocket("/ws/agent/{agent_type}")
async def agent_websocket(websocket: WebSocket, agent_type: str):
    if len(active_connections) >= MAX_CONCURRENT_SESSIONS:
        await websocket.close(code=1013, reason="Server at capacity")
        return

    await websocket.accept()
    session = await session_mgr.create_session(
        user_id=websocket.headers.get("x-user-id", "anonymous"),
        agent_type=agent_type,
    )
    active_connections[session.session_id] = websocket

    try:
        while True:
            message = await websocket.receive_text()
            # Dispatch to agent worker pool via queue
            await redis.lpush(
                f"agent:queue:{agent_type}",
                json.dumps({
                    "session_id": session.session_id,
                    "message": message,
                }),
            )
            # Relay streamed chunks from the worker until the turn completes.
            # Workers publish "chunk" messages followed by a "done" (or
            # "error") message, so forward everything until one of those.
            pubsub = redis.pubsub()
            await pubsub.subscribe(f"agent:response:{session.session_id}")
            try:
                async for msg in pubsub.listen():
                    if msg["type"] != "message":
                        continue
                    payload = json.loads(msg["data"])
                    await websocket.send_text(json.dumps(payload))
                    if payload["type"] in ("done", "error"):
                        break
            finally:
                await pubsub.unsubscribe()
                await pubsub.close()
    except WebSocketDisconnect:
        pass
    finally:
        active_connections.pop(session.session_id, None)
Connection Pooling for LLM API Calls
The single largest bottleneck in agent scaling is LLM API calls. Each agent session makes multiple calls, and these calls are the slowest operations in the pipeline (1-10 seconds each). Without careful connection management, you will exhaust your HTTP connection pool long before you hit CPU or memory limits.
# LLM connection pool with concurrency limiting and retry logic
import asyncio
import os
from dataclasses import dataclass

import httpx


@dataclass
class LLMPoolConfig:
    max_connections: int = 200
    max_keepalive: int = 100
    timeout_seconds: float = 60.0
    max_concurrent_requests: int = 150
    retry_attempts: int = 3
    retry_backoff_base: float = 1.0


class LLMConnectionPool:
    def __init__(self, config: LLMPoolConfig):
        self.config = config
        self.semaphore = asyncio.Semaphore(config.max_concurrent_requests)
        self.client = httpx.AsyncClient(
            limits=httpx.Limits(
                max_connections=config.max_connections,
                max_keepalive_connections=config.max_keepalive,
            ),
            timeout=httpx.Timeout(config.timeout_seconds),
        )
        self._active = 0
        self._request_count = 0
        self._error_count = 0

    async def chat_completion(
        self, messages: list[dict], model: str, **kwargs
    ) -> dict:
        async with self.semaphore:
            self._active += 1
            self._request_count += 1
            try:
                for attempt in range(self.config.retry_attempts):
                    try:
                        response = await self.client.post(
                            "https://api.anthropic.com/v1/messages",
                            json={
                                "model": model,
                                "messages": messages,
                                "max_tokens": kwargs.pop("max_tokens", 4096),
                                **kwargs,
                            },
                            headers={
                                "x-api-key": self._get_api_key(),
                                "anthropic-version": "2023-06-01",
                            },
                        )
                        if response.status_code == 429:
                            # Rate limited: exponential backoff
                            wait = self.config.retry_backoff_base * (2 ** attempt)
                            await asyncio.sleep(wait)
                            continue
                        if response.status_code == 529:
                            # Overloaded: back off more aggressively
                            wait = self.config.retry_backoff_base * (3 ** attempt)
                            await asyncio.sleep(wait)
                            continue
                        response.raise_for_status()
                        return response.json()
                    except httpx.TimeoutException:
                        if attempt == self.config.retry_attempts - 1:
                            self._error_count += 1
                            raise
                        # Back off before retrying a timed-out request
                        await asyncio.sleep(
                            self.config.retry_backoff_base * (2 ** attempt)
                        )
                raise RuntimeError("Max retries exceeded")
            finally:
                self._active -= 1

    @property
    def utilization(self) -> float:
        """Current pool utilization (0.0 to 1.0)."""
        return self._active / self.config.max_concurrent_requests

    def _get_api_key(self) -> str:
        return os.environ["ANTHROPIC_API_KEY"]
Horizontal Scaling with Worker Pools
Agent workers consume significant resources: memory for context windows, CPU for response parsing, and network I/O for tool calls. Scaling horizontally means running multiple worker processes across multiple machines, with a message queue distributing work.
# Agent worker that processes tasks from a Redis queue.
# Uses AgentSessionManager and LLMConnectionPool from the blocks above.
import asyncio
import json
import signal
from typing import Callable

from redis.asyncio import Redis


class AgentWorker:
    """
    A worker process that pulls agent tasks from a Redis queue
    and executes them. Run multiple instances for horizontal scaling.
    """

    def __init__(
        self,
        redis: Redis,
        llm_pool: LLMConnectionPool,
        agent_factory: Callable,
        queue_name: str,
        max_concurrent: int = 50,
    ):
        self.redis = redis
        self.llm_pool = llm_pool
        self.agent_factory = agent_factory
        self.queue_name = queue_name
        self.semaphore = asyncio.Semaphore(max_concurrent)
        self.running = True
        self.active_tasks = 0
        self._background_tasks: set[asyncio.Task] = set()

    async def start(self):
        """Main worker loop: pull tasks and process them."""
        # Graceful shutdown handling
        loop = asyncio.get_running_loop()
        for sig in (signal.SIGINT, signal.SIGTERM):
            loop.add_signal_handler(sig, self._shutdown)

        while self.running:
            try:
                # Block-wait for a task (with timeout for shutdown checks)
                result = await self.redis.brpop(self.queue_name, timeout=5)
                if result is None:
                    continue
                _, task_data = result
                task = json.loads(task_data)
                # Process in background with concurrency limit; hold a
                # reference so the task is not garbage-collected mid-flight
                t = asyncio.create_task(self._process_task(task))
                self._background_tasks.add(t)
                t.add_done_callback(self._background_tasks.discard)
            except Exception as e:
                print(f"Worker error: {e}")
                await asyncio.sleep(1)

    async def _process_task(self, task: dict):
        async with self.semaphore:
            self.active_tasks += 1
            session_id = task["session_id"]
            try:
                # Load session state
                session_mgr = AgentSessionManager(self.redis)
                session = await session_mgr.get_session(session_id)
                if not session:
                    return

                # Create agent instance
                agent = self.agent_factory(
                    agent_type=session.agent_type,
                    llm_pool=self.llm_pool,
                )

                # Execute agent with streaming
                response_parts = []
                async for chunk in agent.run_streaming(
                    message=task["message"],
                    history=session.messages,
                    state=session.state,
                ):
                    response_parts.append(chunk)
                    # Stream partial responses to the user
                    await self.redis.publish(
                        f"agent:response:{session_id}",
                        json.dumps({"type": "chunk", "content": chunk}),
                    )

                # Send completion signal
                full_response = "".join(response_parts)
                await self.redis.publish(
                    f"agent:response:{session_id}",
                    json.dumps({"type": "done", "content": full_response}),
                )

                # Update session state
                session.messages.append({"role": "user", "content": task["message"]})
                session.messages.append({"role": "assistant", "content": full_response})
                await session_mgr.update_session(session)
            except Exception as e:
                await self.redis.publish(
                    f"agent:response:{session_id}",
                    json.dumps({"type": "error", "content": str(e)}),
                )
            finally:
                self.active_tasks -= 1

    def _shutdown(self):
        self.running = False
WebSocket Management at Scale
At 10,000 concurrent users, WebSocket management becomes a significant concern. Each WebSocket connection consumes a file descriptor, memory for buffers, and periodic keepalive bandwidth.
Key strategies for WebSocket scaling:
Connection limits per pod: Set explicit limits (2,000-3,000 connections per pod) and use Kubernetes Horizontal Pod Autoscaler to add pods as connections grow.
Heartbeat and cleanup: Implement server-side heartbeats to detect dead connections. A connection that misses 3 heartbeats should be closed and its resources freed.
Sticky sessions: Use session affinity in the load balancer so that reconnecting clients return to the same pod where their session state is cached in memory.
Graceful degradation: When the system is at capacity, fall back to HTTP long-polling rather than rejecting users outright. Long-polling is less efficient but allows the system to serve more users during peak load.
Load Testing with k6
Load testing agent systems requires simulating realistic multi-turn conversations, not just HTTP request floods. The k6 framework supports WebSocket testing, making it ideal for agent load testing.
// k6 load test for agent WebSocket endpoint
import ws from "k6/ws";
import { check } from "k6";
import { Counter, Trend } from "k6/metrics";

const responseTime = new Trend("agent_response_time", true);
const errorCount = new Counter("agent_errors");
const messagesProcessed = new Counter("messages_processed");

export const options = {
  scenarios: {
    ramp_to_10k: {
      executor: "ramping-vus",
      startVUs: 100,
      stages: [
        { duration: "2m", target: 1000 },
        { duration: "3m", target: 5000 },
        { duration: "5m", target: 10000 },
        { duration: "10m", target: 10000 }, // Sustain peak
        { duration: "3m", target: 0 },
      ],
    },
  },
  thresholds: {
    agent_response_time: ["p(95)<15000"], // 95th percentile under 15s
    agent_errors: ["count<100"],
  },
};

const CONVERSATION_TURNS = [
  "What is the status of my last order?",
  "Can you look up order #12345?",
  "I need to change the shipping address",
  "Please update it to 123 Main St, New York, NY 10001",
  "When will it arrive with the new address?",
];

export default function () {
  const url = "wss://api.example.com/ws/agent/customer-support";
  const params = {
    headers: {
      "x-user-id": `load-test-user-${__VU}`,
      Authorization: `Bearer ${__ENV.TEST_TOKEN}`,
    },
  };

  const res = ws.connect(url, params, function (socket) {
    let turnIndex = 0;
    let start = 0;

    socket.on("open", function () {
      // Send first message
      start = Date.now();
      socket.send(JSON.stringify({ message: CONVERSATION_TURNS[turnIndex] }));
    });

    socket.on("message", function (msg) {
      const data = JSON.parse(msg);
      if (data.type === "done") {
        // Measure each turn, not just the first
        responseTime.add(Date.now() - start);
        messagesProcessed.add(1);
        turnIndex++;
        if (turnIndex < CONVERSATION_TURNS.length) {
          // Simulate human think time (2-8 seconds). Use socket.setTimeout
          // rather than sleep(), which would block the socket event loop.
          const delay = 2000 + Math.random() * 6000;
          socket.setTimeout(function () {
            start = Date.now();
            socket.send(
              JSON.stringify({ message: CONVERSATION_TURNS[turnIndex] })
            );
          }, delay);
        } else {
          socket.close();
        }
      }
      if (data.type === "error") {
        errorCount.add(1);
        socket.close();
      }
    });

    socket.on("error", function (e) {
      errorCount.add(1);
    });

    socket.setTimeout(function () {
      socket.close();
    }, 120000); // 2-minute timeout per conversation
  });

  check(res, {
    "WebSocket connected": (r) => r && r.status === 101,
  });
}
Performance Benchmarking Metrics
When scaling agent systems, track these metrics:
Time to First Token (TTFT): How long until the user sees the first response chunk. Target: under 2 seconds. This is the perceived responsiveness of the system.
End-to-End Latency: Total time from user message to complete response. Target: under 15 seconds for 95th percentile. Agent responses are inherently slower than API responses, so user expectations are different.
Throughput: Conversations per minute the system can sustain. Measure at steady state, not burst.
Error Rate: Percentage of interactions that fail (timeout, LLM error, tool error). Target: under 1%.
Resource Efficiency: Cost per conversation at peak load. Track LLM API costs, compute costs, and infrastructure costs separately to identify optimization opportunities.
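As a concrete example of checking the 95th-percentile latency target, here is a minimal nearest-rank percentile helper over recorded end-to-end latency samples. The sample values are made up for illustration; in practice these would come from your metrics pipeline.

```python
# Nearest-rank percentile over latency samples (sample data is illustrative)
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile; assumes non-empty samples and 0 < p <= 100."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]


# End-to-end latencies in milliseconds for ten hypothetical conversations
latencies_ms = [4200, 6100, 3900, 12800, 5400, 7700, 9100, 4800, 6600, 14200]
p95 = percentile(latencies_ms, 95)  # compare against the 15,000 ms target
```

The same helper works for TTFT samples; only the target changes (2,000 ms instead of 15,000 ms).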
FAQ
How much does it cost to run 10,000 concurrent agent sessions?
The dominant cost is LLM API calls. At 10,000 concurrent users averaging 5 messages per conversation and 3 LLM calls per message (and assuming each user completes roughly one conversation per hour), you are making roughly 150,000 LLM calls per hour at peak. Using a mid-tier model at approximately 3 dollars per million input tokens and 15 dollars per million output tokens, the LLM cost alone is approximately 200-500 dollars per hour depending on context length. Infrastructure costs (compute, Redis, networking) are typically 10-20% of the LLM cost. Model tiering (using cheap models for routing and expensive models for reasoning) can reduce total cost by 40-60%.
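The arithmetic can be sketched as a back-of-envelope model. The per-call token counts below are assumptions chosen for illustration, not measurements; agents carrying long contexts (which can exceed 100K tokens) push the input cost up sharply.

```python
# Back-of-envelope LLM cost model (all inputs are illustrative assumptions)
users = 10_000
messages_per_conversation = 5
llm_calls_per_message = 3
calls_per_hour = users * messages_per_conversation * llm_calls_per_message

input_tokens_per_call = 500    # assumed short contexts; long contexts cost far more
output_tokens_per_call = 100   # assumed terse responses

input_cost = calls_per_hour * input_tokens_per_call / 1e6 * 3.0     # $3 per 1M input tokens
output_cost = calls_per_hour * output_tokens_per_call / 1e6 * 15.0  # $15 per 1M output tokens
total_per_hour = input_cost + output_cost
```

Doubling the assumed context length roughly doubles the input term, which is why aggressive history truncation (like the 50-message cap in the session manager above) matters at this scale.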
Should I use WebSockets or Server-Sent Events for agent streaming?
WebSockets are better when the client needs to send multiple messages during a conversation (multi-turn agents). Server-Sent Events (SSE) are simpler and work better with HTTP/2 when the client sends a single request and receives a streaming response. For most agent use cases, WebSockets are the right choice because conversations are inherently bidirectional.
How do you handle agent state when a pod crashes?
Externalize all session state to Redis or a similar store. The agent worker should be stateless: it loads session state from Redis at the start of each message processing, executes the agent logic, and writes the updated state back. If a pod crashes, the session state is preserved in Redis, and the next message from the user will be picked up by another pod that loads the same state.
What is the optimal number of concurrent agent sessions per worker pod?
This depends on your workload profile, but a good starting point is 50-100 concurrent sessions per pod with 2 CPU cores and 4GB RAM. The limiting factor is usually not CPU or memory but the number of concurrent outbound HTTP connections to LLM APIs. Profile your specific workload with realistic traffic patterns before setting final numbers.
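A first pod count can be derived from these numbers. This is a sizing sketch, not a benchmark: the 75-sessions-per-pod midpoint and the 30% headroom factor are assumptions to be replaced by your own profiling results.

```python
# Rough pod-count estimate from the starting point above (assumptions, not benchmarks)
import math

target_sessions = 10_000
sessions_per_pod = 75    # midpoint of the 50-100 range
headroom = 1.3           # assumed 30% headroom for spikes and failover

pods = math.ceil(target_sessions / sessions_per_pod * headroom)
```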
Written by
CallSphere Team
Expert insights on AI voice agents and customer communication automation.