Building Real-Time Voice Agents with OpenAI Realtime API and WebRTC in 2026
Step-by-step tutorial on building production voice agents using OpenAI's Realtime API with WebRTC, server VAD, PCM16 audio streaming, and Twilio telephony integration.
Why the OpenAI Realtime API Changes Voice Agent Development
Before the Realtime API, building a voice agent required stitching together three separate services: a speech-to-text provider, an LLM for reasoning, and a text-to-speech provider. Each hop added 200-400ms of latency. A typical pipeline hit 1.2-2 seconds of total response time — noticeable enough to break conversational flow.
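That arithmetic is worth making explicit. A rough latency budget for the cascaded pipeline; all figures here are illustrative mid-range assumptions, not benchmarks:

```python
# Illustrative latency budget for a cascaded STT -> LLM -> TTS pipeline.
# Every figure is a rough mid-range assumption, not a measurement.
cascaded_ms = {
    "network_hops": 3 * 300,    # ~200-400 ms per service hop, midpoint 300
    "stt_finalization": 300,    # waiting for the recognizer to finalize text
    "llm_first_token": 400,     # time to first token from the LLM
    "tts_first_audio": 300,     # time to first synthesized audio chunk
}
total_ms = sum(cascaded_ms.values())
print(f"Cascaded pipeline: ~{total_ms} ms")  # ~1900 ms, near the top of the 1.2-2 s range
```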
The OpenAI Realtime API collapses this into a single WebSocket or WebRTC connection. Raw audio goes in, reasoned audio comes out. The model handles speech recognition, reasoning, and speech synthesis internally using GPT-4o's multimodal capabilities. Total response latency drops to 300-500ms, which falls within the range of natural human conversation pauses.
This tutorial walks through building a production voice agent from scratch using the Realtime API with WebRTC for browser-based interactions and Twilio for telephone integration.
Architecture Overview
The system has three components: a browser client using WebRTC, a backend server that manages sessions and ephemeral tokens, and the OpenAI Realtime API endpoint.
// Architecture flow:
//
//   Browser (WebRTC)  <->  OpenAI Realtime API (gpt-4o-realtime)
//          |
//    Function calls
//          |
//   Your Backend Server
//   (tool execution, DB, etc.)
WebRTC provides the transport layer. The browser captures microphone audio, sends it to OpenAI's servers via a peer connection, and receives synthesized audio back. Your backend server handles ephemeral token generation and tool execution when the model calls functions.
Step 1: Generate an Ephemeral Token
Never expose your OpenAI API key to the browser. Instead, create a short-lived ephemeral token on your backend.
// server/routes/session.ts
import express from "express";

const router = express.Router();

router.post("/api/session", async (req, res) => {
  const { voice = "alloy", instructions } = req.body;
  try {
    const response = await fetch(
      "https://api.openai.com/v1/realtime/sessions",
      {
        method: "POST",
        headers: {
          Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
          "Content-Type": "application/json",
        },
        body: JSON.stringify({
          model: "gpt-4o-realtime-preview-2026-01-21",
          voice,
          modalities: ["text", "audio"],
          instructions:
            instructions ||
            "You are a helpful customer service agent for CallSphere. " +
              "Be concise and professional. Ask clarifying questions when needed.",
          turn_detection: {
            type: "server_vad",
            threshold: 0.5,
            prefix_padding_ms: 300,
            silence_duration_ms: 600,
          },
          tools: [
            {
              type: "function",
              name: "lookup_customer",
              description: "Look up a customer by phone number or account ID",
              parameters: {
                type: "object",
                properties: {
                  phone: { type: "string", description: "Customer phone number" },
                  account_id: { type: "string", description: "Account ID" },
                },
              },
            },
            {
              type: "function",
              name: "schedule_appointment",
              description: "Schedule an appointment for the customer",
              parameters: {
                type: "object",
                properties: {
                  customer_id: { type: "string" },
                  date: { type: "string", description: "ISO 8601 date" },
                  time: { type: "string", description: "HH:MM format" },
                  service_type: { type: "string" },
                },
                required: ["customer_id", "date", "time", "service_type"],
              },
            },
          ],
        }),
      }
    );
    const data = await response.json();
    // data.client_secret.value contains the ephemeral token
    res.json({
      token: data.client_secret.value,
      expires_at: data.client_secret.expires_at,
    });
  } catch (error) {
    console.error("Session creation failed:", error);
    res.status(500).json({ error: "Failed to create session" });
  }
});

export default router;
The ephemeral token expires after 60 seconds — enough time for the browser to establish the WebRTC connection, after which the token is no longer needed.
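Because of that short lifetime, it is worth guarding against a stale token before starting the handshake. A minimal sketch, assuming expires_at is a Unix timestamp in seconds as returned by the session endpoint:

```python
import time

def token_is_usable(expires_at: float, margin_s: float = 5.0) -> bool:
    """Return True if the ephemeral token still has at least `margin_s`
    seconds of validity left. Assumes `expires_at` is a Unix timestamp
    in seconds, as returned alongside the token in the session response."""
    return (expires_at - time.time()) > margin_s

# Usage sketch: request a fresh token instead of attempting a doomed handshake.
# create_session() is a hypothetical helper that calls your /api/session route.
# if not token_is_usable(session["expires_at"]):
#     session = create_session()
```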
Step 2: Establish the WebRTC Connection
On the browser side, use the ephemeral token to create a peer connection directly to OpenAI.
// client/voice-agent.ts
class VoiceAgent {
  private pc: RTCPeerConnection | null = null;
  private dc: RTCDataChannel | null = null;
  private audioElement: HTMLAudioElement;

  constructor() {
    this.audioElement = document.createElement("audio");
    this.audioElement.autoplay = true;
  }

  async connect(): Promise<void> {
    // Step 1: Get ephemeral token from our backend
    const sessionRes = await fetch("/api/session", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        voice: "alloy",
        instructions: "You are a helpful voice assistant.",
      }),
    });
    const { token } = await sessionRes.json();

    // Step 2: Create peer connection
    this.pc = new RTCPeerConnection();

    // Step 3: Set up audio playback for model responses
    this.pc.ontrack = (event) => {
      this.audioElement.srcObject = event.streams[0];
    };

    // Step 4: Capture microphone and add track
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    stream.getTracks().forEach((track) => {
      this.pc!.addTrack(track, stream);
    });

    // Step 5: Create data channel for events (function calls, transcripts)
    this.dc = this.pc.createDataChannel("oai-events");
    this.dc.onmessage = (event) =>
      this.handleServerEvent(JSON.parse(event.data));

    // Step 6: Create and set local offer
    const offer = await this.pc.createOffer();
    await this.pc.setLocalDescription(offer);

    // Step 7: Send offer to OpenAI, get answer
    const sdpResponse = await fetch(
      "https://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2026-01-21",
      {
        method: "POST",
        headers: {
          Authorization: `Bearer ${token}`,
          "Content-Type": "application/sdp",
        },
        body: offer.sdp,
      }
    );
    const answerSdp = await sdpResponse.text();
    await this.pc.setRemoteDescription({ type: "answer", sdp: answerSdp });
    console.log("WebRTC connection established");
  }

  private handleServerEvent(event: any): void {
    switch (event.type) {
      case "response.function_call_arguments.done":
        this.executeFunction(event);
        break;
      case "conversation.item.input_audio_transcription.completed":
        console.log("User said:", event.transcript);
        break;
      case "response.audio_transcript.done":
        console.log("Agent said:", event.transcript);
        break;
      case "error":
        console.error("Realtime API error:", event.error);
        break;
    }
  }

  private async executeFunction(event: any): Promise<void> {
    const { name, arguments: args, call_id } = event;
    let result: any;
    try {
      // Execute the function on your backend
      const response = await fetch(`/api/tools/${name}`, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: args,
      });
      result = await response.json();
    } catch (error) {
      result = { error: "Tool execution failed" };
    }

    // Send the result back through the data channel
    this.dc?.send(
      JSON.stringify({
        type: "conversation.item.create",
        item: {
          type: "function_call_output",
          call_id,
          output: JSON.stringify(result),
        },
      })
    );
    // Trigger the model to continue responding
    this.dc?.send(JSON.stringify({ type: "response.create" }));
  }

  disconnect(): void {
    this.dc?.close();
    this.pc?.close();
    this.pc = null;
    this.dc = null;
  }
}
Step 3: Server VAD Configuration
Server-side Voice Activity Detection (VAD) is what makes the conversation feel natural. The model listens for speech, detects when the user stops talking, and automatically generates a response.
The three critical VAD parameters are:
- threshold (0.0-1.0): Sensitivity for detecting speech. Lower values detect quieter speech but increase false positives from background noise. Default 0.5 works for most environments.
- prefix_padding_ms: How many milliseconds of audio before detected speech to include. 300ms captures the beginning of words that might otherwise be clipped.
- silence_duration_ms: How long the user must be silent before the model considers the turn complete. 500-700ms is the sweet spot — shorter causes premature cutoffs, longer feels sluggish.
# Python example: Tuning VAD for different environments
vad_configs = {
    "quiet_office": {
        "type": "server_vad",
        "threshold": 0.4,
        "prefix_padding_ms": 200,
        "silence_duration_ms": 500,
    },
    "noisy_call_center": {
        "type": "server_vad",
        "threshold": 0.7,
        "prefix_padding_ms": 400,
        "silence_duration_ms": 700,
    },
    "phone_line": {
        "type": "server_vad",
        "threshold": 0.5,
        "prefix_padding_ms": 300,
        "silence_duration_ms": 600,
    },
}
Step 4: Twilio Integration for Phone Calls
For telephone-based voice agents, Twilio provides the bridge between PSTN phone calls and your WebSocket-based voice agent. The flow is: caller dials your Twilio number, Twilio opens a WebSocket media stream to your server, your server relays audio between Twilio and OpenAI.
# server/twilio_handler.py
import os
import json
import asyncio
import websockets
from fastapi import FastAPI, WebSocket
from fastapi.responses import Response
from twilio.twiml.voice_response import VoiceResponse, Connect

app = FastAPI()

OPENAI_WS_URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2026-01-21"

@app.post("/twilio/incoming")
async def handle_incoming_call():
    """Twilio webhook: return TwiML that connects to our WebSocket."""
    response = VoiceResponse()
    connect = Connect()
    connect.stream(
        url=f"wss://{os.environ['SERVER_HOST']}/twilio/media-stream"
    )
    response.append(connect)
    # Twilio expects XML, so return the TwiML with an explicit content type
    return Response(content=str(response), media_type="application/xml")

@app.websocket("/twilio/media-stream")
async def media_stream(ws: WebSocket):
    """Bridge between Twilio media stream and OpenAI Realtime API."""
    await ws.accept()
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    async with websockets.connect(OPENAI_WS_URL, extra_headers=headers) as openai_ws:
        stream_sid = None

        # Configure the session
        await openai_ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "voice": "alloy",
                "instructions": "You are a phone-based customer service agent.",
                "input_audio_format": "g711_ulaw",
                "output_audio_format": "g711_ulaw",
                "turn_detection": {
                    "type": "server_vad",
                    "threshold": 0.5,
                    "silence_duration_ms": 600,
                },
            },
        }))

        async def relay_twilio_to_openai():
            """Forward Twilio audio to OpenAI."""
            nonlocal stream_sid
            async for message in ws.iter_text():
                data = json.loads(message)
                if data["event"] == "media":
                    await openai_ws.send(json.dumps({
                        "type": "input_audio_buffer.append",
                        "audio": data["media"]["payload"],
                    }))
                elif data["event"] == "start":
                    stream_sid = data["start"]["streamSid"]

        async def relay_openai_to_twilio():
            """Forward OpenAI audio to Twilio."""
            async for message in openai_ws:
                event = json.loads(message)
                if event["type"] == "response.audio.delta":
                    await ws.send_json({
                        "event": "media",
                        "streamSid": stream_sid,
                        "media": {"payload": event["delta"]},
                    })
                elif event["type"] == "response.function_call_arguments.done":
                    # execute_tool is your tool-dispatch helper (not shown)
                    result = await execute_tool(event["name"], event["arguments"])
                    await openai_ws.send(json.dumps({
                        "type": "conversation.item.create",
                        "item": {
                            "type": "function_call_output",
                            "call_id": event["call_id"],
                            "output": json.dumps(result),
                        },
                    }))
                    await openai_ws.send(json.dumps({"type": "response.create"}))

        await asyncio.gather(
            relay_twilio_to_openai(),
            relay_openai_to_twilio(),
        )
Note the audio format: Twilio uses G.711 u-law encoding, so you must set input_audio_format and output_audio_format to g711_ulaw. The Realtime API handles the conversion internally.
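The u-law format also makes audio accounting easy: G.711 u-law is 8,000 samples per second at one byte per sample, so payload size maps directly to duration. A small sketch (the 160-byte frame reflects Twilio's typical 20 ms media chunks):

```python
import base64

ULAW_SAMPLE_RATE = 8000  # G.711 u-law: 8 kHz, one byte per sample

def payload_duration_ms(b64_payload: str) -> float:
    """Duration in milliseconds of a base64-encoded G.711 u-law payload,
    as carried in a Twilio media-stream "media" event."""
    num_samples = len(base64.b64decode(b64_payload))
    return num_samples * 1000.0 / ULAW_SAMPLE_RATE

# A typical Twilio media frame carries 20 ms of audio: 160 bytes = 160 samples.
frame = base64.b64encode(bytes(160)).decode()
print(payload_duration_ms(frame))  # 20.0
```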
Step 5: Handling Interruptions
Natural conversations involve interruptions. With server VAD enabled, the Realtime API handles barge-in automatically: when it detects the user speaking while the model is still generating audio, it truncates the in-flight response. (A client can also cancel a response explicitly by sending a response.cancel event.)
Your client needs to handle the truncation gracefully:
// In handleServerEvent:
case "response.audio.done":
  // Response completed normally
  this.updateUI({ status: "listening" });
  break;
case "input_audio_buffer.speech_started":
  // User started speaking — model will auto-truncate if responding
  this.updateUI({ status: "user_speaking" });
  break;
case "response.cancelled":
  // Model response was interrupted by user speech
  console.log("Response interrupted by user");
  break;
Production Considerations
Connection resilience: WebRTC connections drop. Implement automatic reconnection with exponential backoff. Cache the conversation history so the agent can resume context after reconnection.
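A minimal reconnection-delay sketch, using exponential backoff with full jitter. The base, cap, and attempt count are illustrative, and reconnect() is a hypothetical wrapper around your connect logic:

```python
import random

def backoff_delays(base_s: float = 0.5, cap_s: float = 30.0, attempts: int = 6):
    """Yield reconnect delays using exponential backoff with full jitter:
    each delay is drawn uniformly from [0, min(cap_s, base_s * 2**attempt)]."""
    for attempt in range(attempts):
        ceiling = min(cap_s, base_s * (2 ** attempt))
        yield random.uniform(0, ceiling)

# Usage sketch (reconnect() is a hypothetical async helper):
# for delay in backoff_delays():
#     await asyncio.sleep(delay)
#     if await reconnect():
#         break
```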
Audio quality monitoring: Track audio levels and report silence or noise issues. A microphone that stops sending audio should trigger a user prompt, not silent confusion.
Cost management: The Realtime API bills per audio minute for both input and output. Implement idle timeout detection — if no speech is detected for 30 seconds, prompt the user or end the session.
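An idle-timeout tracker can be as simple as a timestamp that speech events refresh. A sketch (the 30-second threshold mirrors the suggestion above; wire on_speech() to your speech-started event handler):

```python
import time

class IdleTimer:
    """Track time since the last detected speech, to decide when to
    prompt the user or end a silent session."""

    def __init__(self, timeout_s: float = 30.0):
        self.timeout_s = timeout_s
        self.last_speech = time.monotonic()

    def on_speech(self) -> None:
        # Call this from the input_audio_buffer.speech_started handler.
        self.last_speech = time.monotonic()

    def is_idle(self) -> bool:
        return (time.monotonic() - self.last_speech) > self.timeout_s
```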
Logging and compliance: For regulated industries, capture both the audio stream and the transcript. The Realtime API provides transcript events that you can log without additional STT costs.
FAQ
What is the latency difference between the WebRTC and WebSocket approaches?
WebRTC provides lower and more consistent latency because it uses UDP-based transport optimized for real-time media. Typical round-trip latency with WebRTC is 300-500ms. The WebSocket approach adds 100-200ms due to TCP overhead and the need to manually handle audio chunking. For browser-based applications, WebRTC is the recommended approach.
Can I use the Realtime API with non-English languages?
Yes. The GPT-4o Realtime model supports over 50 languages for both input and output audio. Set the language in the session instructions. Performance is strongest in English, Spanish, French, German, Japanese, and Mandarin. Less common languages may have higher word error rates.
How do I handle function calls that take more than a few seconds?
For long-running tools, send an intermediate response before the tool completes. You can use the conversation.item.create event to inject a message like "Let me look that up for you" while the tool executes. This prevents awkward silence during database queries or API calls that take 2-5 seconds.
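This pattern can be sketched as a helper that builds the injected message event. The exact content schema for assistant items may differ between Realtime API versions, so treat the payload shape as illustrative rather than canonical:

```python
import json

def filler_message_event(text: str) -> str:
    """Build a conversation.item.create event that injects an assistant
    filler line while a slow tool runs. Send a response.create afterward
    so the model voices the new context."""
    return json.dumps({
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": "assistant",
            "content": [{"type": "text", "text": text}],
        },
    })

# Usage sketch, inside the Twilio bridge:
# await openai_ws.send(filler_message_event("Let me look that up for you."))
```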
What happens when the WebRTC connection drops mid-conversation?
The connection is lost and the session ends. You need to implement reconnection logic on the client side: detect the disconnect via pc.onconnectionstatechange, request a new ephemeral token, re-establish the WebRTC connection, and optionally replay conversation context. The Realtime API does not persist sessions across connections, so your backend should maintain conversation state.
#OpenAIRealtime #WebRTC #VoiceAgents #RealTimeAI #Twilio #ConversationalAI #VoiceDev
Written by
CallSphere Team
Expert insights on AI voice agents and customer communication automation.