Scaling AI Voice Agents to 1000+ Concurrent Calls: Architecture Guide
Architecture patterns for scaling AI voice agents to 1000+ concurrent calls — horizontal scaling, connection pooling, and queue management.
Ten calls is easy, a thousand is a different animal
A voice agent that handles ten calls on a single pod is a prototype. A voice agent that handles a thousand simultaneous calls is a distributed system with all the problems that come with it — sticky sessions, connection limits, queue back-pressure, graceful drain, regional failover. The transition from ten to a thousand is where most teams ship an outage.
This post walks through the architecture patterns CallSphere uses to scale its voice plane horizontally without losing the sub-second latency budget.
1 pod × 20-40 calls → horizontal scaling
50-200 pods → sticky routing
sticky routing → regional failover
regional failover → global queue drain
Architecture overview
┌──────────────────────────────────────┐
│        Twilio / SIP carriers         │
└────────────────┬─────────────────────┘
                 │
                 ▼
┌──────────────────────────────────────┐
│        Global Anycast ingress        │
│    (session affinity by Call SID)    │
└────────────────┬─────────────────────┘
                 │
     ┌───────────┼───────────┐
     ▼           ▼           ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│  Pod 1  │ │  Pod 2  │ │  Pod N  │
│ 30 calls│ │ 30 calls│ │ 30 calls│
└─────┬───┘ └────┬────┘ └────┬────┘
      │          │           │
      └──────────┴───────────┘
                 │
                 ▼
┌──────────────────────────────────────┐
│         OpenAI Realtime API          │
│     (org-level concurrent limit)     │
└──────────────────────────────────────┘
Prerequisites
- Kubernetes (or equivalent container orchestrator).
- An ingress that supports WebSocket session affinity.
- Autoscaling based on custom metrics (active calls per pod).
- A global control plane for routing and failover.
Step-by-step walkthrough
1. Right-size the per-pod call count
One FastAPI process can handle 20-40 concurrent Realtime sessions before event-loop contention bites. Use that as your per-pod capacity.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: voice-edge
spec:
  replicas: 30
  selector:
    matchLabels: {app: voice-edge}
  template:
    metadata:
      labels: {app: voice-edge}
    spec:
      containers:
        - name: edge
          image: ghcr.io/yourco/voice-edge:latest
          resources:
            requests: {cpu: "1", memory: "1Gi"}
            limits: {cpu: "2", memory: "2Gi"}
          readinessProbe:
            httpGet: {path: /ready, port: 8080}
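Choosing the replica count is simple arithmetic; a quick sketch (the 25-calls-per-pod target and the 20% headroom figure are assumptions to tune against your own load tests):

```python
import math

def pods_needed(target_calls: int, calls_per_pod: int = 25, headroom: float = 0.2) -> int:
    """Pods required for a target concurrency, with burst headroom on top."""
    base = math.ceil(target_calls / calls_per_pod)
    return math.ceil(base * (1 + headroom))

print(pods_needed(1000))  # 40 base pods + 20% headroom -> 48
```

Headroom matters more here than in stateless services: during a rolling deploy, draining pods stop accepting calls but still hold capacity.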
2. Use sticky routing keyed by Call SID
apiVersion: v1
kind: Service
metadata:
  name: voice-edge
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
spec:
  selector:
    app: voice-edge
  ports:
    - port: 8080
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 3600
For HTTP ingress, use cookie-based affinity and include the Call SID in the routing header.
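With ingress-nginx, for instance, cookie-based affinity can be declared via annotations. A sketch (the controller choice, cookie name, and path are assumptions; adapt to your ingress):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: voice-edge
  annotations:
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/session-cookie-name: "call-affinity"
    nginx.ingress.kubernetes.io/session-cookie-max-age: "3600"
spec:
  rules:
    - http:
        paths:
          - path: /voice
            pathType: Prefix
            backend:
              service: {name: voice-edge, port: {number: 8080}}
```

The max-age should match or exceed your longest expected call so affinity does not expire mid-session.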
3. Scale on active calls, not CPU
CPU is a lagging indicator. Expose an active_calls metric and scale on it directly.
from prometheus_client import Gauge

ACTIVE = Gauge("voice_active_calls", "concurrent calls on this pod")

async def on_call_start():
    ACTIVE.inc()

async def on_call_end():
    ACTIVE.dec()
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: voice-edge-hpa
spec:
  scaleTargetRef: {apiVersion: apps/v1, kind: Deployment, name: voice-edge}
  minReplicas: 10
  maxReplicas: 200
  metrics:
    - type: Pods
      pods:
        metric: {name: voice_active_calls}
        target: {type: AverageValue, averageValue: "25"}
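A type: Pods metric is only visible to the HPA if a custom-metrics adapter exports it. With prometheus-adapter, a rule roughly like this (the 2-minute window and label names are illustrative) maps the gauge into the custom.metrics.k8s.io API:

```yaml
rules:
  - seriesQuery: 'voice_active_calls{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: namespace}
        pod: {resource: pod}
    name:
      matches: "voice_active_calls"
      as: "voice_active_calls"
    metricsQuery: 'max_over_time(voice_active_calls{<<.LabelMatchers>>}[2m])'
```

Using max_over_time rather than the instantaneous value keeps the HPA from scaling down during brief dips between calls.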
4. Implement graceful drain
On shutdown, stop accepting new calls but keep existing sessions alive until they end or hit a max drain timeout.
import signal

from fastapi import Response

shutting_down = False

def handle_sigterm(*_):
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

@app.post("/voice")
async def voice(req):
    if shutting_down:
        # Refuse new calls; existing sessions keep running until they end.
        return Response(status_code=503)
    return accept_call(req)
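Refusing new calls is only half of drain; the process also has to stay alive until in-flight calls finish or a deadline passes. A minimal sketch, assuming you can supply a callable that returns the current call count (the 300-second deadline is illustrative):

```python
import asyncio

async def wait_for_drain(active_calls, max_drain_seconds: float = 300.0,
                         poll: float = 0.1) -> bool:
    """Block until no calls remain or the drain deadline passes.

    `active_calls` is any callable returning the current call count.
    Returns True if the pod drained cleanly, False if it timed out.
    """
    deadline = asyncio.get_running_loop().time() + max_drain_seconds
    while active_calls() > 0:
        if asyncio.get_running_loop().time() >= deadline:
            return False  # deadline hit; remaining calls will be cut
        await asyncio.sleep(poll)
    return True
```

Wire this into your framework's shutdown hook (e.g. a FastAPI lifespan handler) so the process exits only after the loop returns, and set terminationGracePeriodSeconds in the pod spec to at least the drain deadline.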
5. Handle OpenAI concurrent limits
OpenAI rate-limits concurrent Realtime sessions per org. Track usage in Redis and back-pressure at the ingress if you are at the ceiling.
async def try_reserve_slot() -> bool:
    # Atomically bump the org-wide counter; Redis INCR is safe across pods.
    count = await r.incr("openai:active")
    if count > MAX_ORG_CONCURRENT:
        # Over the ceiling: undo the increment and signal back-pressure.
        await r.decr("openai:active")
        return False
    return True
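The reservation must be paired with a release on every call-end path, or the counter leaks slots until the ceiling is permanently "full". A self-contained sketch of the pairing, with a plain in-memory counter standing in for Redis (the quota of 3 is illustrative):

```python
MAX_ORG_CONCURRENT = 3  # illustrative; set to your org's Realtime session quota
active = 0              # stands in for the Redis "openai:active" key

def try_reserve_slot() -> bool:
    """Claim a slot; roll back immediately if over the ceiling."""
    global active
    active += 1
    if active > MAX_ORG_CONCURRENT:
        active -= 1
        return False
    return True

def release_slot() -> None:
    """Must run on every call-end path: normal hangup, error, or timeout."""
    global active
    active = max(0, active - 1)
```

In production, also put a TTL or periodic reconciliation on the counter so a crashed pod's unreleased slots eventually return to the pool.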
6. Multi-region for disaster recovery
Run the full stack in two regions. Use Twilio's regional endpoints and Anycast DNS for failover.
Production considerations
- Connection pooling: keep HTTP clients alive across calls; do not recreate per session.
- Memory: audio buffers and transcripts grow during long calls; cap them.
- Queue depth: post-call workers must drain faster than inflow.
- Chaos testing: kill pods under load; make sure ongoing calls survive failover.
- Observability: p95 latency per pod, queue depth, OpenAI quota usage.
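The memory point above is enforceable with a bounded buffer; a sketch using collections.deque (the 1000-turn cap is an assumption to size against your context needs):

```python
from collections import deque

class BoundedTranscript:
    """Keeps only the most recent turns so hour-long calls can't grow unbounded."""

    def __init__(self, max_turns: int = 1000):
        # deque with maxlen silently evicts the oldest entry on overflow.
        self.turns = deque(maxlen=max_turns)

    def append(self, speaker: str, text: str) -> None:
        self.turns.append((speaker, text))

t = BoundedTranscript(max_turns=2)
t.append("caller", "hi")
t.append("agent", "hello")
t.append("caller", "bye")  # the "hi" turn is evicted here
```

The same pattern applies to raw audio buffers, where the cap should be expressed in bytes rather than turns.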
CallSphere's real implementation
CallSphere's voice edge runs on Kubernetes with FastAPI pods co-located with Twilio's media regions. Each pod handles 20-40 concurrent Realtime sessions using gpt-4o-realtime-preview-2025-06-03 at 24kHz PCM16 with server VAD. Autoscaling is driven by the active_calls Prometheus metric, graceful drain is wired to SIGTERM, and OpenAI org-level concurrency is tracked in Redis so back-pressure kicks in before the API returns 429s.
The multi-agent verticals (14 healthcare tools, 10 real estate agents, 4 salon agents, 7 after-hours escalation tools, 10+ RAG-backed IT helpdesk tools, and the 5-specialist ElevenLabs sales pod) all share the same edge plane, distinguished only by which tool schema they load at session setup. OpenAI Agents SDK handoffs stay inside a single Realtime session, so horizontal scaling never splits a multi-agent handoff across pods. CallSphere supports 57+ languages and sub-second end-to-end latency at scale.
Common pitfalls
- Scaling on CPU: you will under-provision under bursty voice load.
- Re-creating HTTP clients per call: socket exhaustion.
- No graceful drain: rolling deploys will kill live calls.
- Single region: a regional outage = full outage.
- Skipping rate-limit awareness: you will hit OpenAI 429s in production.
FAQ
How many pods do I need for 1000 concurrent calls?
At 25 calls per pod, 40 pods; add roughly 20% headroom for bursts and draining pods, so plan for about 48.
What about stateful DB connections?
Use pgbouncer or a managed pool; do not open per-call.
Can I run this on Fargate or Cloud Run?
Fargate yes. Cloud Run does support WebSockets, but its request timeout (60 minutes at most) can cut long-lived media streams, so treat it with caution for call-length sessions.
What is the bottleneck past 1000 calls?
Usually OpenAI quota and DB connections, not CPU.
How do I test scaling?
Use a WebSocket load generator that simulates Twilio Media Streams.
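The generator side does not need real audio: Twilio Media Streams frames are JSON envelopes carrying base64 payloads. A stdlib sketch of building one "media" event (the frame shape follows Twilio's documented events, but the streamSid here is made up and the payload is a dummy μ-law chunk):

```python
import base64
import json

def media_frame(stream_sid: str, chunk: bytes, seq: int) -> str:
    """Build one Media Streams 'media' event as Twilio would send it."""
    return json.dumps({
        "event": "media",
        "sequenceNumber": str(seq),          # Twilio sends this as a string
        "streamSid": stream_sid,
        "media": {"payload": base64.b64encode(chunk).decode("ascii")},
    })

# Feed frames like this into /voice with any WebSocket client to simulate a call.
frame = media_frame("MZ0123456789abcdef0123456789abcdef", b"\xff" * 160, 1)
```

Pace the frames at real time (one 160-byte chunk per 20 ms for 8 kHz μ-law), otherwise the load test exercises throughput but not the latency behavior you actually care about.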
Next steps
Planning a high-concurrency rollout? Book a demo, explore the technology page, or compare pricing.
#CallSphere #Scaling #Kubernetes #VoiceAI #Performance #Architecture #AIVoiceAgents
Written by
CallSphere Team
Expert insights on AI voice agents and customer communication automation.