Scaling AI Voice Agents to 1000+ Concurrent Calls: Architecture Guide
Architecture patterns for scaling AI voice agents to 1000+ concurrent calls — horizontal scaling, connection pooling, and queue management.
Ten calls is easy, a thousand is a different animal
A voice agent that handles ten calls on a single pod is a prototype. A voice agent that handles a thousand simultaneous calls is a distributed system with all the problems that come with it — sticky sessions, connection limits, queue back-pressure, graceful drain, regional failover. The transition from ten to a thousand is where most teams ship an outage.
This post walks through the architecture patterns CallSphere uses to scale its voice plane horizontally without losing the sub-second latency budget.
1 pod × 20-40 calls → horizontal scaling
50-200 pods → sticky routing
sticky routing → regional failover
regional failover → global queue drain
Architecture overview
┌──────────────────────────────────────┐
│        Twilio / SIP carriers         │
└────────────────┬─────────────────────┘
                 │
                 ▼
┌──────────────────────────────────────┐
│        Global Anycast ingress        │
│    (session affinity by Call SID)    │
└────────────────┬─────────────────────┘
                 │
     ┌───────────┼───────────┐
     ▼           ▼           ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│  Pod 1  │ │  Pod 2  │ │  Pod N  │
│ 30 calls│ │ 30 calls│ │ 30 calls│
└─────┬───┘ └────┬────┘ └────┬────┘
      │          │           │
      └──────────┴───────────┘
                 │
                 ▼
┌──────────────────────────────────────┐
│         OpenAI Realtime API          │
│     (org-level concurrent limit)     │
└──────────────────────────────────────┘
Prerequisites
- Kubernetes (or equivalent container orchestrator).
- An ingress that supports WebSocket session affinity.
- Autoscaling based on custom metrics (active calls per pod).
- A global control plane for routing and failover.
Step-by-step walkthrough
1. Right-size the per-pod call count
One FastAPI process can handle 20-40 concurrent Realtime sessions before event-loop contention bites. Use that as your per-pod capacity.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: voice-edge
spec:
  replicas: 30
  selector:
    matchLabels: {app: voice-edge}
  template:
    metadata:
      labels: {app: voice-edge}
    spec:
      containers:
        - name: edge
          image: ghcr.io/yourco/voice-edge:latest
          resources:
            requests: {cpu: "1", memory: "1Gi"}
            limits: {cpu: "2", memory: "2Gi"}
          readinessProbe:
            httpGet: {path: /ready, port: 8080}
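Choosing the replica count is simple arithmetic; a quick sketch (the 25-calls-per-pod target and the 20% headroom figure are assumptions to tune against your own load tests):

```python
import math

def pods_needed(target_calls: int, calls_per_pod: int = 25, headroom: float = 0.2) -> int:
    """Pods required for a target concurrency, with burst headroom on top."""
    base = math.ceil(target_calls / calls_per_pod)
    return math.ceil(base * (1 + headroom))

print(pods_needed(1000))  # 40 base pods + 20% headroom -> 48
```

Headroom matters more here than in stateless services: during a rolling deploy, draining pods stop accepting calls but still hold capacity.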
2. Use sticky routing keyed by Call SID
apiVersion: v1
kind: Service
metadata:
  name: voice-edge
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
spec:
  selector:
    app: voice-edge
  ports:
    - port: 8080
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 3600
For HTTP ingress, use cookie-based affinity and include the Call SID in the routing header.
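With ingress-nginx, for instance, cookie-based affinity can be declared via annotations. A sketch (the controller choice, cookie name, and path are assumptions; adapt to your ingress):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: voice-edge
  annotations:
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/session-cookie-name: "call-affinity"
    nginx.ingress.kubernetes.io/session-cookie-max-age: "3600"
spec:
  rules:
    - http:
        paths:
          - path: /voice
            pathType: Prefix
            backend:
              service: {name: voice-edge, port: {number: 8080}}
```

The max-age should match or exceed your longest expected call so affinity does not expire mid-session.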
3. Scale on active calls, not CPU
CPU is a lagging indicator. Expose an active_calls metric and scale on it directly.
from prometheus_client import Gauge

ACTIVE = Gauge("voice_active_calls", "concurrent calls on this pod")

async def on_call_start():
    ACTIVE.inc()

async def on_call_end():
    ACTIVE.dec()
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: voice-edge-hpa
spec:
  scaleTargetRef: {apiVersion: apps/v1, kind: Deployment, name: voice-edge}
  minReplicas: 10
  maxReplicas: 200
  metrics:
    - type: Pods
      pods:
        metric: {name: voice_active_calls}
        target: {type: AverageValue, averageValue: "25"}
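A type: Pods metric is only visible to the HPA if a custom-metrics adapter exports it. With prometheus-adapter, a rule roughly like this (the 2-minute window and label names are illustrative) maps the gauge into the custom.metrics.k8s.io API:

```yaml
rules:
  - seriesQuery: 'voice_active_calls{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: namespace}
        pod: {resource: pod}
    name:
      matches: "voice_active_calls"
      as: "voice_active_calls"
    metricsQuery: 'max_over_time(voice_active_calls{<<.LabelMatchers>>}[2m])'
```

Using max_over_time rather than the instantaneous value keeps the HPA from scaling down during brief dips between calls.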
4. Implement graceful drain
On shutdown, stop accepting new calls but keep existing sessions alive until they end or hit a max drain timeout.
import signal

from fastapi import Response

shutting_down = False

def handle_sigterm(*_):
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

@app.post("/voice")
async def voice(req):
    if shutting_down:
        # Refuse new calls; existing sessions keep running until they end.
        return Response(status_code=503)
    return accept_call(req)
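Refusing new calls is only half of drain; the process also has to stay alive until in-flight calls finish or a deadline passes. A minimal sketch, assuming you can supply a callable that returns the current call count (the 300-second deadline is illustrative):

```python
import asyncio

async def wait_for_drain(active_calls, max_drain_seconds: float = 300.0,
                         poll: float = 0.1) -> bool:
    """Block until no calls remain or the drain deadline passes.

    `active_calls` is any callable returning the current call count.
    Returns True if the pod drained cleanly, False if it timed out.
    """
    deadline = asyncio.get_running_loop().time() + max_drain_seconds
    while active_calls() > 0:
        if asyncio.get_running_loop().time() >= deadline:
            return False  # deadline hit; remaining calls will be cut
        await asyncio.sleep(poll)
    return True
```

Wire this into your framework's shutdown hook (e.g. a FastAPI lifespan handler) so the process exits only after the loop returns, and set terminationGracePeriodSeconds in the pod spec to at least the drain deadline.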
5. Handle OpenAI concurrent limits
OpenAI rate-limits concurrent Realtime sessions per org. Track usage in Redis and back-pressure at the ingress if you are at the ceiling.
async def try_reserve_slot() -> bool:
    # Atomically bump the org-wide counter; Redis INCR is safe across pods.
    count = await r.incr("openai:active")
    if count > MAX_ORG_CONCURRENT:
        # Over the ceiling: undo the increment and signal back-pressure.
        await r.decr("openai:active")
        return False
    return True
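The reservation must be paired with a release on every call-end path, or the counter leaks slots until the ceiling is permanently "full". A self-contained sketch of the pairing, with a plain in-memory counter standing in for Redis (the quota of 3 is illustrative):

```python
MAX_ORG_CONCURRENT = 3  # illustrative; set to your org's Realtime session quota
active = 0              # stands in for the Redis "openai:active" key

def try_reserve_slot() -> bool:
    """Claim a slot; roll back immediately if over the ceiling."""
    global active
    active += 1
    if active > MAX_ORG_CONCURRENT:
        active -= 1
        return False
    return True

def release_slot() -> None:
    """Must run on every call-end path: normal hangup, error, or timeout."""
    global active
    active = max(0, active - 1)
```

In production, also put a TTL or periodic reconciliation on the counter so a crashed pod's unreleased slots eventually return to the pool.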
6. Multi-region for disaster recovery
Run the full stack in two regions. Use Twilio's regional endpoints and Anycast DNS for failover.
Production considerations
- Connection pooling: keep HTTP clients alive across calls; do not recreate per session.
- Memory: audio buffers and transcripts grow during long calls; cap them.
- Queue depth: post-call workers must drain faster than inflow.
- Chaos testing: kill pods under load; make sure ongoing calls survive failover.
- Observability: p95 latency per pod, queue depth, OpenAI quota usage.
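The memory point above is enforceable with a bounded buffer; a sketch using collections.deque (the 1000-turn cap is an assumption to size against your context needs):

```python
from collections import deque

class BoundedTranscript:
    """Keeps only the most recent turns so hour-long calls can't grow unbounded."""

    def __init__(self, max_turns: int = 1000):
        # deque with maxlen silently evicts the oldest entry on overflow.
        self.turns = deque(maxlen=max_turns)

    def append(self, speaker: str, text: str) -> None:
        self.turns.append((speaker, text))

t = BoundedTranscript(max_turns=2)
t.append("caller", "hi")
t.append("agent", "hello")
t.append("caller", "bye")  # the "hi" turn is evicted here
```

The same pattern applies to raw audio buffers, where the cap should be expressed in bytes rather than turns.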
CallSphere's real implementation
CallSphere's voice edge runs on Kubernetes with FastAPI pods co-located with Twilio's media regions. Each pod handles 20-40 concurrent Realtime sessions using gpt-4o-realtime-preview-2025-06-03 at 24kHz PCM16 with server VAD. Autoscaling is driven by the active_calls Prometheus metric, graceful drain is wired to SIGTERM, and OpenAI org-level concurrency is tracked in Redis so back-pressure kicks in before the API returns 429s.
The multi-agent verticals (14 healthcare tools, 10 real estate agents, 4 salon agents, 7 after-hours escalation tools, 10+ RAG-backed IT helpdesk tools, and the 5-specialist ElevenLabs sales pod) all share the same edge plane, distinguished only by which tool schema they load at session setup. OpenAI Agents SDK handoffs stay inside a single Realtime session, so horizontal scaling never splits a multi-agent handoff across pods. CallSphere supports 57+ languages and sub-second end-to-end latency at scale.
Common pitfalls
- Scaling on CPU: you will under-provision under bursty voice load.
- Re-creating HTTP clients per call: socket exhaustion.
- No graceful drain: rolling deploys will kill live calls.
- Single region: a regional outage = full outage.
- Skipping rate-limit awareness: you will hit OpenAI 429s in production.
FAQ
How many pods do I need for 1000 concurrent calls?
At 25 calls per pod, 40 pods; add roughly 20% headroom for bursts and draining pods, so plan for about 48.
What about stateful DB connections?
Use pgbouncer or a managed pool; do not open per-call.
Can I run this on Fargate or Cloud Run?
Fargate yes. Cloud Run does support WebSockets, but its request timeout (60 minutes at most) can cut long-lived media streams, so treat it with caution for call-length sessions.
What is the bottleneck past 1000 calls?
Usually OpenAI quota and DB connections, not CPU.
How do I test scaling?
Use a WebSocket load generator that simulates Twilio Media Streams.
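The generator side does not need real audio: Twilio Media Streams frames are JSON envelopes carrying base64 payloads. A stdlib sketch of building one "media" event (the frame shape follows Twilio's documented events, but the streamSid here is made up and the payload is a dummy μ-law chunk):

```python
import base64
import json

def media_frame(stream_sid: str, chunk: bytes, seq: int) -> str:
    """Build one Media Streams 'media' event as Twilio would send it."""
    return json.dumps({
        "event": "media",
        "sequenceNumber": str(seq),          # Twilio sends this as a string
        "streamSid": stream_sid,
        "media": {"payload": base64.b64encode(chunk).decode("ascii")},
    })

# Feed frames like this into /voice with any WebSocket client to simulate a call.
frame = media_frame("MZ0123456789abcdef0123456789abcdef", b"\xff" * 160, 1)
```

Pace the frames at real time (one 160-byte chunk per 20 ms for 8 kHz μ-law), otherwise the load test exercises throughput but not the latency behavior you actually care about.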
Next steps
Planning a high-concurrency rollout? Book a demo, explore the technology page, or compare pricing.
#CallSphere #Scaling #Kubernetes #VoiceAI #Performance #Architecture #AIVoiceAgents
Written by
CallSphere Team
Expert insights on AI voice agents and customer communication automation.