Twilio Voice Integration for AI Agents: Building Phone-Based AI Assistants
Learn how to connect AI agents to the phone network using Twilio Voice, TwiML, and Media Streams. Covers bi-directional audio, real-time speech processing, and production deployment patterns.
Why Twilio for AI Voice Agents
Twilio is the most widely adopted cloud communications platform, providing programmable access to the global phone network. For AI agent developers, Twilio offers the critical bridge between an intelligent language model and a traditional phone call — letting your agent answer calls, speak to callers, and process their responses in real time.
The key components you will work with are Twilio Voice (call control), TwiML (telephony markup), and Media Streams (raw audio access). Together, these let you build AI assistants that hold natural, real-time conversations with callers over an ordinary phone line.
Setting Up the Twilio Environment
Start by installing the required packages (`pip install twilio fastapi uvicorn python-multipart`) and configuring your Twilio account:
```python
import os

from twilio.rest import Client
from twilio.twiml.voice_response import VoiceResponse, Connect

account_sid = os.environ["TWILIO_ACCOUNT_SID"]
auth_token = os.environ["TWILIO_AUTH_TOKEN"]
client = Client(account_sid, auth_token)

# Purchase or use an existing phone number
phone_number = client.incoming_phone_numbers.list(limit=1)[0]
print(f"Using number: {phone_number.phone_number}")
```
You need a publicly accessible webhook URL that Twilio will call when a phone call arrives. In production, this is your server's domain. During development, tools like ngrok create a tunnel to your local machine.
Handling Incoming Calls with TwiML
When someone calls your Twilio number, Twilio sends an HTTP request to your webhook. You respond with TwiML — an XML dialect that controls call behavior:
```python
from fastapi import FastAPI, Request
from fastapi.responses import Response

app = FastAPI()

@app.post("/incoming-call")
async def handle_incoming_call(request: Request):
    """Webhook that Twilio hits when a call arrives."""
    form_data = await request.form()
    caller = form_data.get("From", "Unknown")

    response = VoiceResponse()
    response.say(
        "Hello! You have reached our AI assistant. "
        "How can I help you today?",
        voice="Polly.Joanna",
        language="en-US",
    )

    # Gather speech input from the caller
    gather = response.gather(
        input="speech",
        action="/process-speech",
        speech_timeout="auto",
        language="en-US",
    )
    gather.say("I am listening.")

    # Fallback if no input detected
    response.say("I did not hear anything. Goodbye.")
    response.hangup()

    return Response(
        content=str(response),
        media_type="application/xml",
    )
```
The <Gather> verb captures the caller's speech and sends it as text to your /process-speech endpoint, where you can feed it to an AI model.
Bi-Directional Audio with Media Streams
For real-time AI interaction — where the agent can interrupt, respond with low latency, and process audio continuously — you need Twilio Media Streams. This opens a WebSocket connection that streams raw audio in both directions:
```python
import json
import base64

from fastapi import WebSocket

@app.post("/media-stream-call")
async def media_stream_call(request: Request):
    """Route the call into a WebSocket media stream."""
    response = VoiceResponse()
    connect = Connect()
    stream = connect.stream(url="wss://yourdomain.com/media-socket")
    stream.parameter(name="caller_id", value="agent-001")
    response.append(connect)
    return Response(content=str(response), media_type="application/xml")

@app.websocket("/media-socket")
async def handle_media_socket(websocket: WebSocket):
    """Process bi-directional audio over the WebSocket."""
    await websocket.accept()
    stream_sid = None
    while True:
        message = await websocket.receive_text()
        data = json.loads(message)
        event_type = data.get("event")

        if event_type == "start":
            stream_sid = data["start"]["streamSid"]
            print(f"Stream started: {stream_sid}")
        elif event_type == "media":
            # Incoming audio from the caller (mulaw, 8 kHz)
            audio_payload = base64.b64decode(data["media"]["payload"])
            # transcribe_audio_chunk, get_ai_response, and text_to_speech
            # are placeholders for your STT, LLM, and TTS integrations
            transcript = await transcribe_audio_chunk(audio_payload)
            if transcript:
                ai_response = await get_ai_response(transcript)
                audio_bytes = await text_to_speech(ai_response)
                # Send audio back to the caller
                media_message = {
                    "event": "media",
                    "streamSid": stream_sid,
                    "media": {
                        "payload": base64.b64encode(audio_bytes).decode()
                    },
                }
                await websocket.send_text(json.dumps(media_message))
        elif event_type == "stop":
            print("Stream ended")
            break
```
Media Streams deliver audio as base64-encoded mulaw at 8000 Hz. Your pipeline must decode this, run speech-to-text, generate the AI response, synthesize speech, re-encode to mulaw, and send it back — all within a few hundred milliseconds for natural conversation flow.
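To illustrate the mulaw step, here is a minimal pure-Python G.711 µ-law codec. (The stdlib `audioop` module provided `ulaw2lin`/`lin2ulaw` for this but was removed in Python 3.13, so a small self-contained version is shown; in practice you may prefer an audio library.)

```python
BIAS = 0x84   # G.711 mu-law bias
CLIP = 32635  # clamp linear samples before encoding

def linear_to_ulaw(sample: int) -> int:
    """Encode one 16-bit signed PCM sample as a mu-law byte."""
    sign = 0x80 if sample < 0 else 0
    sample = min(abs(sample), CLIP) + BIAS
    # Find the segment (exponent): highest set bit between 7 and 14
    exponent = 7
    mask = 0x4000
    while exponent > 0 and not (sample & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (sample >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF

def ulaw_to_linear(byte: int) -> int:
    """Decode one mu-law byte back to a 16-bit signed PCM sample."""
    byte = ~byte & 0xFF
    sign = byte & 0x80
    exponent = (byte >> 4) & 0x07
    mantissa = byte & 0x0F
    sample = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -sample if sign else sample
```

Encoding is lossy by design: decoding an encoded sample returns a nearby value, not the exact original, which is why µ-law fits 16-bit dynamic range into 8 bits.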
Configuring the Webhook on Your Twilio Number
Once your server is running, point your Twilio phone number at it:
```python
phone_number = client.incoming_phone_numbers.list(limit=1)[0]
client.incoming_phone_numbers(phone_number.sid).update(
    voice_url="https://yourdomain.com/incoming-call",
    voice_method="POST",
    status_callback="https://yourdomain.com/call-status",
    status_callback_method="POST",
)
print(f"Webhook configured for {phone_number.phone_number}")
```
The status_callback URL receives events like call initiation, ringing, answered, and completed — useful for logging and analytics.
Production Considerations
For production deployments, implement these patterns: use connection pooling for your WebSocket server, handle stream reconnections gracefully, implement silence detection to avoid sending empty audio to your STT engine, and add timeout handling for calls where the caller goes silent for too long. Monitor your Twilio usage closely — Media Streams are billed per minute of active streaming.
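Silence detection, for instance, can be as simple as an RMS energy gate on the decoded 16-bit PCM; the threshold below is an assumption you would tune per deployment.

```python
import math
import struct

SILENCE_RMS = 200  # empirical threshold for 16-bit PCM; tune per deployment

def is_silent(pcm_bytes: bytes) -> bool:
    """Return True when a chunk of 16-bit little-endian PCM is near-silent."""
    if len(pcm_bytes) < 2:
        return True
    samples = struct.unpack(f"<{len(pcm_bytes) // 2}h", pcm_bytes)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms < SILENCE_RMS
```

Calling `is_silent` on each decoded chunk and skipping STT for silent ones saves both latency and per-request transcription cost.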
FAQ
What audio format does Twilio Media Streams use?
Twilio Media Streams deliver audio as base64-encoded mulaw (G.711 u-law) at 8000 Hz, mono channel. When sending audio back, you must encode in the same format. Most speech-to-text engines accept mulaw directly, but text-to-speech output often needs conversion from PCM or MP3 to mulaw before sending.
Can I use Twilio Gather instead of Media Streams for simpler use cases?
Yes. The <Gather> TwiML verb with input="speech" handles speech recognition on Twilio's side and delivers transcribed text to your webhook. This is simpler to implement but adds latency (typically 1-3 seconds) and does not support real-time interruption. Use Gather for simple menu navigation and Media Streams for conversational AI.
How do I handle concurrent calls on the same Twilio number?
Twilio automatically handles concurrent calls — each call gets its own webhook request and its own WebSocket stream. Your server must be stateless per-call (use the CallSid as a session key) and handle concurrency. In production, run multiple server instances behind a load balancer and use Redis or similar for shared call state.
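One sketch of per-call state keyed by CallSid, with an in-memory dict standing in for the shared Redis store mentioned above:

```python
from dataclasses import dataclass, field

@dataclass
class CallSession:
    call_sid: str
    history: list = field(default_factory=list)  # conversation turns so far

# In production this would live in Redis so all server instances share it
sessions: dict[str, CallSession] = {}

def get_session(call_sid: str) -> CallSession:
    """Fetch or lazily create the session for a call."""
    if call_sid not in sessions:
        sessions[call_sid] = CallSession(call_sid=call_sid)
    return sessions[call_sid]

def end_session(call_sid: str) -> None:
    """Drop state when the call completes (e.g. from the status callback)."""
    sessions.pop(call_sid, None)
```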
CallSphere Team
Expert insights on AI voice agents and customer communication automation.