Twilio Voice Integration for AI Agents: Building Phone-Based AI Assistants
Learn how to connect AI agents to the phone network using Twilio Voice, TwiML, and Media Streams. Covers bi-directional audio, real-time speech processing, and production deployment patterns.
Why Twilio for AI Voice Agents
Twilio is the most widely adopted cloud communications platform, providing programmable access to the global phone network. For AI agent developers, Twilio offers the critical bridge between an intelligent language model and a traditional phone call — letting your agent answer calls, speak to callers, and process their responses in real time.
The key components you will work with are Twilio Voice (call control), TwiML (telephony markup), and Media Streams (raw audio access). Together, these let you build AI assistants that hold natural, real-time conversations with callers over an ordinary phone line.
Setting Up the Twilio Environment
Start by installing the required packages (`pip install twilio fastapi uvicorn python-multipart`) and configuring your Twilio account:
```python
import os

from twilio.rest import Client
from twilio.twiml.voice_response import VoiceResponse, Connect

account_sid = os.environ["TWILIO_ACCOUNT_SID"]
auth_token = os.environ["TWILIO_AUTH_TOKEN"]
client = Client(account_sid, auth_token)

# Purchase or use an existing phone number
phone_number = client.incoming_phone_numbers.list(limit=1)[0]
print(f"Using number: {phone_number.phone_number}")
```
You need a publicly accessible webhook URL that Twilio will call when a phone call arrives. In production, this is your server's domain. During development, tools like ngrok create a tunnel to your local machine.
Handling Incoming Calls with TwiML
When someone calls your Twilio number, Twilio sends an HTTP request to your webhook. You respond with TwiML — an XML dialect that controls call behavior:
```python
from fastapi import FastAPI, Request
from fastapi.responses import Response

app = FastAPI()

@app.post("/incoming-call")
async def handle_incoming_call(request: Request):
    """Webhook that Twilio hits when a call arrives."""
    form_data = await request.form()
    caller = form_data.get("From", "Unknown")

    response = VoiceResponse()
    response.say(
        "Hello! You have reached our AI assistant. "
        "How can I help you today?",
        voice="Polly.Joanna",
        language="en-US",
    )

    # Gather speech input from the caller
    gather = response.gather(
        input="speech",
        action="/process-speech",
        speech_timeout="auto",
        language="en-US",
    )
    gather.say("I am listening.")

    # Fallback if no input detected
    response.say("I did not hear anything. Goodbye.")
    response.hangup()

    return Response(
        content=str(response),
        media_type="application/xml",
    )
```
The <Gather> verb captures the caller's speech and sends it as text to your /process-speech endpoint, where you can feed it to an AI model.
Bi-Directional Audio with Media Streams
For real-time AI interaction — where the agent can interrupt, respond with low latency, and process audio continuously — you need Twilio Media Streams. This opens a WebSocket connection that streams raw audio in both directions:
```python
import json
import base64

from fastapi import WebSocket

@app.post("/media-stream-call")
async def media_stream_call(request: Request):
    """Route the call into a WebSocket media stream."""
    response = VoiceResponse()
    connect = Connect()
    stream = connect.stream(url="wss://yourdomain.com/media-socket")
    stream.parameter(name="caller_id", value="agent-001")
    response.append(connect)
    return Response(content=str(response), media_type="application/xml")

@app.websocket("/media-socket")
async def handle_media_socket(websocket: WebSocket):
    """Process bi-directional audio over the WebSocket."""
    await websocket.accept()
    stream_sid = None
    while True:
        message = await websocket.receive_text()
        data = json.loads(message)
        event_type = data.get("event")

        if event_type == "start":
            stream_sid = data["start"]["streamSid"]
            print(f"Stream started: {stream_sid}")
        elif event_type == "media":
            # Incoming audio from the caller (mulaw, 8 kHz)
            audio_payload = base64.b64decode(data["media"]["payload"])
            # transcribe_audio_chunk, get_ai_response, and text_to_speech
            # are placeholders for your STT, LLM, and TTS integrations
            transcript = await transcribe_audio_chunk(audio_payload)
            if transcript:
                ai_response = await get_ai_response(transcript)
                audio_bytes = await text_to_speech(ai_response)
                # Send audio back to the caller
                media_message = {
                    "event": "media",
                    "streamSid": stream_sid,
                    "media": {
                        "payload": base64.b64encode(audio_bytes).decode()
                    },
                }
                await websocket.send_text(json.dumps(media_message))
        elif event_type == "stop":
            print("Stream ended")
            break
```
Media Streams deliver audio as base64-encoded mulaw at 8000 Hz. Your pipeline must decode this, run speech-to-text, generate the AI response, synthesize speech, re-encode to mulaw, and send it back — all within a few hundred milliseconds for natural conversation flow.
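To illustrate the mulaw step, here is a minimal pure-Python G.711 µ-law codec. (The stdlib `audioop` module provided `ulaw2lin`/`lin2ulaw` for this but was removed in Python 3.13, so a small self-contained version is shown; in practice you may prefer an audio library.)

```python
BIAS = 0x84   # G.711 mu-law bias
CLIP = 32635  # clamp linear samples before encoding

def linear_to_ulaw(sample: int) -> int:
    """Encode one 16-bit signed PCM sample as a mu-law byte."""
    sign = 0x80 if sample < 0 else 0
    sample = min(abs(sample), CLIP) + BIAS
    # Find the segment (exponent): highest set bit between 7 and 14
    exponent = 7
    mask = 0x4000
    while exponent > 0 and not (sample & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (sample >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF

def ulaw_to_linear(byte: int) -> int:
    """Decode one mu-law byte back to a 16-bit signed PCM sample."""
    byte = ~byte & 0xFF
    sign = byte & 0x80
    exponent = (byte >> 4) & 0x07
    mantissa = byte & 0x0F
    sample = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -sample if sign else sample
```

Encoding is lossy by design: decoding an encoded sample returns a nearby value, not the exact original, which is why µ-law fits 16-bit dynamic range into 8 bits.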
Configuring the Webhook on Your Twilio Number
Once your server is running, point your Twilio phone number at it:
```python
phone_number = client.incoming_phone_numbers.list(limit=1)[0]
client.incoming_phone_numbers(phone_number.sid).update(
    voice_url="https://yourdomain.com/incoming-call",
    voice_method="POST",
    status_callback="https://yourdomain.com/call-status",
    status_callback_method="POST",
)
print(f"Webhook configured for {phone_number.phone_number}")
```
The status_callback URL receives events like call initiation, ringing, answered, and completed — useful for logging and analytics.
Production Considerations
For production deployments, implement these patterns: use connection pooling for your WebSocket server, handle stream reconnections gracefully, implement silence detection to avoid sending empty audio to your STT engine, and add timeout handling for calls where the caller goes silent for too long. Monitor your Twilio usage closely — Media Streams are billed per minute of active streaming.
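Silence detection, for instance, can be as simple as an RMS energy gate on the decoded 16-bit PCM; the threshold below is an assumption you would tune per deployment.

```python
import math
import struct

SILENCE_RMS = 200  # empirical threshold for 16-bit PCM; tune per deployment

def is_silent(pcm_bytes: bytes) -> bool:
    """Return True when a chunk of 16-bit little-endian PCM is near-silent."""
    if len(pcm_bytes) < 2:
        return True
    samples = struct.unpack(f"<{len(pcm_bytes) // 2}h", pcm_bytes)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms < SILENCE_RMS
```

Calling `is_silent` on each decoded chunk and skipping STT for silent ones saves both latency and per-request transcription cost.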
FAQ
What audio format does Twilio Media Streams use?
Twilio Media Streams deliver audio as base64-encoded mulaw (G.711 u-law) at 8000 Hz, mono channel. When sending audio back, you must encode in the same format. Most speech-to-text engines accept mulaw directly, but text-to-speech output often needs conversion from PCM or MP3 to mulaw before sending.
Can I use Twilio Gather instead of Media Streams for simpler use cases?
Yes. The <Gather> TwiML verb with input="speech" handles speech recognition on Twilio's side and delivers transcribed text to your webhook. This is simpler to implement but adds latency (typically 1-3 seconds) and does not support real-time interruption. Use Gather for simple menu navigation and Media Streams for conversational AI.
How do I handle concurrent calls on the same Twilio number?
Twilio automatically handles concurrent calls — each call gets its own webhook request and its own WebSocket stream. Your server must be stateless per-call (use the CallSid as a session key) and handle concurrency. In production, run multiple server instances behind a load balancer and use Redis or similar for shared call state.
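One sketch of per-call state keyed by CallSid, with an in-memory dict standing in for the shared Redis store mentioned above:

```python
from dataclasses import dataclass, field

@dataclass
class CallSession:
    call_sid: str
    history: list = field(default_factory=list)  # conversation turns so far

# In production this would live in Redis so all server instances share it
sessions: dict[str, CallSession] = {}

def get_session(call_sid: str) -> CallSession:
    """Fetch or lazily create the session for a call."""
    if call_sid not in sessions:
        sessions[call_sid] = CallSession(call_sid=call_sid)
    return sessions[call_sid]

def end_session(call_sid: str) -> None:
    """Drop state when the call completes (e.g. from the status callback)."""
    sessions.pop(call_sid, None)
```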
CallSphere Team
Expert insights on AI voice agents and customer communication automation.