Twilio + AI Voice Agent Setup Guide: End-to-End Production Architecture

The gap between "hello world" and production

Twilio's quickstart will get you a phone number and a TwiML Bin that reads "hello world" in about five minutes. That is a demo, not a product. A production AI voice agent on Twilio has to answer inbound calls, open a bidirectional media stream to your LLM, survive carrier hiccups, record for compliance, and write every call into a database — all without the caller hearing a single glitch.

This guide walks through the exact wiring, from buying a number to running a bidirectional Media Streams bridge that pipes audio into the OpenAI Realtime API. Every snippet below is written to match what CallSphere runs in production for its healthcare, real estate, and sales verticals.

PSTN caller
   │
   ▼
Twilio Number  ──TwiML──►  your /voice webhook
   │
   ▼
<Connect><Stream url="wss://edge.yourapp.com/twilio/stream" />
   │
   ▼
FastAPI edge  ←──PCM16──►  OpenAI Realtime API
   │
   ▼
Postgres (call log)   Queue (post-call analytics)

Architecture overview

┌──────────────┐   TwiML    ┌──────────────┐
│ Twilio Voice │──────────► │ /voice route │
└──────────────┘            └──────┬───────┘
       │                           │ <Stream>
       ▼                           ▼
┌──────────────────────────────────────────┐
│ FastAPI edge (WebSocket /twilio/stream)  │
│ • ulaw↔pcm16 resampler                   │
│ • speech-started interruption            │
│ • tool dispatcher                        │
└─────────────┬────────────────────────────┘
              │
              ▼
┌──────────────────────────────────────────┐
│ OpenAI Realtime API                      │
└──────────────────────────────────────────┘

Prerequisites

A Twilio account with a verified phone number.
Access to the OpenAI Realtime API.
A publicly reachable HTTPS endpoint for the /voice webhook and a wss:// endpoint for Media Streams.
Python 3.11+ or Node 20+.
A Postgres database (we use per-vertical schemas; any single instance is fine to start).

Step-by-step walkthrough

1. Buy a number and point it at your webhook

In the Twilio console, buy a number with Voice capability. Set the "A call comes in" webhook to POST https://edge.yourapp.com/voice. Add a fallback URL so you degrade gracefully when your service is down.

2. Return TwiML that opens a Media Stream

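The /voice endpoint responds with TwiML that starts a bidirectional stream. With <Connect><Stream>, Twilio sends you the caller's audio (the inbound track) and plays whatever audio you send back over the socket. The track attribute (inbound_track, outbound_track, both_tracks) applies to one-way <Start><Stream> streams; to capture both sides of a bidirectional call, use call recording (step 5).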

from fastapi import FastAPI, Request, Response

app = FastAPI()

@app.post("/voice")
async def voice(req: Request):
    # Reuse the hostname Twilio called so staging and prod share this code.
    host = req.url.hostname
    # <Connect><Stream> opens a bidirectional Media Stream to the WebSocket route.
    twiml = f"""
    <Response>
      <Connect>
        <Stream url="wss://{host}/twilio/stream" />
      </Connect>
    </Response>
    """.strip()
    return Response(content=twiml, media_type="application/xml")
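
Twilio does not accept query-string parameters on the <Stream> URL, so per-call context such as the caller's number has to travel as <Parameter> children, which arrive in the stream's "start" message under customParameters. The sketch below builds the same TwiML with the standard library's XML builder so that values are escaped correctly. The function name build_stream_twiml and the parameter names "from" and "vertical" are illustrative assumptions, not part of the original setup.

```python
import xml.etree.ElementTree as ET


def build_stream_twiml(host: str, params: dict[str, str]) -> str:
    """Return TwiML that connects the call to a bidirectional Media Stream."""
    response = ET.Element("Response")
    connect = ET.SubElement(response, "Connect")
    stream = ET.SubElement(connect, "Stream", url=f"wss://{host}/twilio/stream")
    for name, value in params.items():
        # Each <Parameter> is delivered once, in the "start" message.
        ET.SubElement(stream, "Parameter", name=name, value=value)
    # ElementTree escapes &, < and quotes in attribute values for us.
    return ET.tostring(response, encoding="unicode")


twiml = build_stream_twiml(
    "edge.yourapp.com",
    {"from": "+15550100", "vertical": "healthcare"},  # hypothetical values
)
```

Inside the /voice handler the returned string can be passed to Response(content=..., media_type="application/xml") exactly as before.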

3. Run the bidirectional bridge

Twilio sends G.711 ulaw audio at 8kHz, base64-encoded inside JSON "media" messages. Decode each payload and convert it to PCM16 at 24kHz before forwarding it to OpenAI, then convert the model's audio back to 8kHz ulaw on the return path.

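import audioop  # deprecated since Python 3.11 and removed in 3.13; install audioop-lts there
import base64

class CallResampler:
    """Per-call converter. Carrying ratecv state across chunks avoids clicks at frame boundaries."""

    def __init__(self):
        self._up_state = None
        self._down_state = None

    def ulaw_to_pcm16_24k(self, ulaw_bytes: bytes) -> bytes:
        pcm8k = audioop.ulaw2lin(ulaw_bytes, 2)  # 8-bit ulaw -> 16-bit linear PCM
        pcm24k, self._up_state = audioop.ratecv(pcm8k, 2, 1, 8000, 24000, self._up_state)
        return pcm24k

    def pcm16_24k_to_ulaw_b64(self, pcm24k_b64: str) -> str:
        pcm24k = base64.b64decode(pcm24k_b64)
        pcm8k, self._down_state = audioop.ratecv(pcm24k, 2, 1, 24000, 8000, self._down_state)
        return base64.b64encode(audioop.lin2ulaw(pcm8k, 2)).decode()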
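
The converters only handle audio bytes; the bridge also has to translate between the two JSON protocols. The following is a minimal sketch of that translation as two pure functions, which makes it testable without a live socket. It assumes the Twilio message shapes ("start", "media", "stop", and the outbound "clear") and the OpenAI Realtime event names input_audio_buffer.append, response.audio.delta, and input_audio_buffer.speech_started as used by the preview API; check both against the current references before relying on them. The session dict and its "decode" and "encode" hooks are hypothetical names that stand in for the step-3 converters.

```python
import base64
import json


def twilio_to_model(raw: str, session: dict) -> list[dict]:
    """Turn one Twilio Media Streams message into events for the model socket."""
    msg = json.loads(raw)
    event = msg.get("event")
    if event == "start":
        # Twilio names the stream once; outbound messages must echo this SID.
        session["stream_sid"] = msg["start"]["streamSid"]
        session["call_sid"] = msg["start"]["callSid"]
    elif event == "media":
        ulaw = base64.b64decode(msg["media"]["payload"])
        pcm = session["decode"](ulaw)  # hypothetical hook: ulaw bytes -> PCM16 bytes
        audio = base64.b64encode(pcm).decode()
        return [{"type": "input_audio_buffer.append", "audio": audio}]
    elif event == "stop":
        session["closed"] = True
    return []


def model_to_twilio(event: dict, session: dict) -> list[dict]:
    """Turn one model event into messages for the Twilio socket."""
    sid = session.get("stream_sid")
    if sid is None:
        return []  # nothing can be sent before Twilio's "start" message
    kind = event.get("type")
    if kind == "response.audio.delta":
        payload = session["encode"](event["delta"])  # base64 PCM16 -> base64 ulaw
        return [{"event": "media", "streamSid": sid, "media": {"payload": payload}}]
    if kind == "input_audio_buffer.speech_started":
        # Barge-in: tell Twilio to drop agent audio it has buffered but not played.
        return [{"event": "clear", "streamSid": sid}]
    return []
```

In the WebSocket handler, each returned dict would be serialized with json.dumps and sent on the opposite socket. Keeping the translation free of I/O is a design choice for testability, not a requirement of either API.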

4. Log every call to Postgres

Do not rely on Twilio's call logs alone. Create your own calls table with the Twilio Call SID, your internal call ID, and a pointer to the transcript blob.

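# db is assumed to be an asyncpg pool or connection created at startup.
async def log_call_start(call_sid: str, from_: str, to: str):
    # ON CONFLICT makes webhook retries harmless; it needs a unique index on call_sid.
    await db.execute(
        "INSERT INTO calls (call_sid, from_number, to_number, started_at) "
        "VALUES ($1, $2, $3, now()) "
        "ON CONFLICT (call_sid) DO NOTHING",
        call_sid, from_, to,
    )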
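
Awaiting the insert directly inside the audio loop would stall the stream whenever Postgres is slow. One way to keep writes off the audio path is a bounded in-process queue drained by a background task, sketched below. The names db_writer, enqueue_call_event, and demo are hypothetical, and the write callable stands in for a function such as log_call_start; a production deployment might use an external queue instead.

```python
import asyncio


async def db_writer(queue: asyncio.Queue, write) -> None:
    """Drain call events and persist them, away from the audio path."""
    while True:
        item = await queue.get()
        try:
            if item is None:  # sentinel: stop the worker
                return
            await write(item)
        except Exception as exc:
            # A failed write must never take the worker (or the call) down.
            print(f"call log write failed: {exc!r}")
        finally:
            queue.task_done()


def enqueue_call_event(queue: asyncio.Queue, event) -> bool:
    """Non-blocking enqueue; returns False instead of waiting when full."""
    try:
        queue.put_nowait(event)
        return True
    except asyncio.QueueFull:
        return False


async def demo() -> list:
    written = []

    async def fake_write(item):  # stands in for a real database call
        written.append(item)

    queue = asyncio.Queue(maxsize=1000)
    worker = asyncio.create_task(db_writer(queue, fake_write))
    enqueue_call_event(queue, {"call_sid": "CA1", "event": "start"})
    enqueue_call_event(queue, None)
    await queue.join()
    await worker
    return written
```

Running asyncio.run(demo()) exercises the worker with a fake writer. Returning False on a full queue lets the caller decide whether to drop the event or count it as a metric.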

5. Handle call recording for compliance

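The <Record> verb captures a voicemail-style message and holds up the rest of the TwiML until it finishes, so it does not fit a streaming call. Start the recording through the REST API (the call's Recordings resource) once the call is in progress. Store the recording URL in your calls table and gate playback through signed URLs.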

6. Deploy behind a sticky load balancer

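Each Media Streams WebSocket must stay on the same pod for the duration of the call. A WebSocket is a single long-lived connection, so the practical risks are proxy idle timeouts and pod restarts rather than per-request routing: raise the ingress timeouts (nginx.ingress.kubernetes.io/proxy-read-timeout and proxy-send-timeout) above your maximum call length, and drain pods gracefully on deploy. Session affinity (nginx.ingress.kubernetes.io/affinity: "cookie" or equivalent) only matters if the /voice webhook keeps per-call state in memory that the WebSocket handler reads later.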

Production considerations

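Webhook signature validation: Twilio signs every webhook request with an X-Twilio-Signature header. Reject any request whose signature is missing or does not validate.
TLS everywhere: Media Streams only connects to wss:// URLs, and serving the /voice webhook over HTTPS keeps call data from crossing the network in plaintext.
Idempotency: retries happen. Key your database writes by Call SID.
Cost controls: enforce a maximum call length and a silence timeout in your edge, and hang up when either trips, to prevent runaway sessions.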
Fallback: configure the Twilio fallback URL to route to a plain IVR if your edge is down.

CallSphere's real implementation

CallSphere uses this exact Twilio wiring across every production vertical. The edge is a Python FastAPI service that bridges Twilio Media Streams to the OpenAI Realtime API with gpt-4o-realtime-preview-2025-06-03, server VAD, and PCM16 at 24kHz. Call metadata is written to per-vertical Postgres databases and a GPT-4o-mini worker handles post-call sentiment, intent, and lead scoring asynchronously.

For multi-agent verticals — 14 tools for healthcare, 10 for real estate, 4 for salon, 7 for after-hours escalation, 10 plus RAG for IT helpdesk, and an ElevenLabs sales pod with 5 GPT-4 specialists — handoffs use the OpenAI Agents SDK while the Twilio leg stays the same. The entire stack supports 57+ languages and delivers under one second end-to-end response time.

Common pitfalls

Using <Dial> instead of <Connect><Stream>: <Dial> bridges to another number, not a WebSocket.
Forgetting to resample to 24kHz: the model's PCM16 input expects 24kHz samples, so 8kHz audio sent as-is is interpreted at the wrong rate and recognition degrades noticeably.
Letting the WebSocket block on DB writes: always fire-and-forget to a queue.
Not validating the Twilio signature: public webhooks are a classic attack surface.
Hardcoding the host in TwiML: use the request hostname so staging and prod share code.
Skipping the fallback URL: a silent dead call is the worst possible failure mode.

FAQ

Do I need Twilio SIP Trunking or is a regular phone number enough?

For most SMB use cases a Twilio phone number with Media Streams is enough. You only need SIP Trunking when you are porting existing DIDs or bridging to an on-prem PBX.

How do I test Media Streams locally?

Use ngrok to expose your local server; a single HTTPS tunnel carries both the /voice webhook and the wss:// stream. Media Streams requires TLS, so a plain http tunnel does not work.

Can I run this on serverless?

Not cleanly. Long-lived WebSockets do not fit the typical serverless lifecycle. Run the edge on a long-running container.

How do I handle call transfer to a human?

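Update the live call through the REST API with new TwiML that contains <Dial>; this replaces the <Connect><Stream> and bridges the caller to the human's number. A handoff through the OpenAI Agents SDK moves the caller to another specialist agent instead, with the Twilio leg unchanged.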
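
A minimal sketch of the TwiML for such a transfer follows, built with the standard library so the number and announcement are escaped correctly. The function name build_transfer_twiml, the announcement text, and the phone number are illustrative assumptions.

```python
import xml.etree.ElementTree as ET


def build_transfer_twiml(human_number: str, announcement: str) -> str:
    """Return TwiML that announces the transfer, then dials a human."""
    response = ET.Element("Response")
    # <Say> plays first so the caller is not dropped into silence.
    ET.SubElement(response, "Say").text = announcement
    # <Dial> with a plain number as its body bridges the caller to that number.
    ET.SubElement(response, "Dial").text = human_number
    return ET.tostring(response, encoding="unicode")


transfer_twiml = build_transfer_twiml("+15550123", "Connecting you to a team member.")
```

With Twilio's Python helper library, this string would typically be passed as the twiml argument when updating the in-progress call by its Call SID.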

What is the right number of concurrent calls per edge instance?

Start at 20 per 1 vCPU and measure. Event-loop contention is the bottleneck long before CPU.
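
Whatever number you settle on, it helps to enforce it rather than discover it under load. The sketch below is one way to cap concurrent calls per instance, assuming a single asyncio event loop per process (so no lock is needed). The class name CallLimiter and the limit of 20 are illustrative.

```python
class CallLimiter:
    """Tracks active calls on this instance and refuses new ones past a cap."""

    def __init__(self, max_calls: int):
        self.max_calls = max_calls
        self.active: set[str] = set()

    def try_admit(self, call_sid: str) -> bool:
        if call_sid in self.active:
            return True  # a webhook retry for a call we already admitted
        if len(self.active) >= self.max_calls:
            return False
        self.active.add(call_sid)
        return True

    def release(self, call_sid: str) -> None:
        # Safe to call more than once, e.g. from both "stop" and socket close.
        self.active.discard(call_sid)


limiter = CallLimiter(max_calls=20)
```

When try_admit returns False, the /voice handler can return TwiML that routes the caller to the fallback IVR instead of opening a stream, and release should run whenever the WebSocket closes.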
