Building a Voice Agent with OpenAI Realtime API: Complete Tutorial
A step-by-step tutorial for building a voice AI agent using the OpenAI Realtime API — covering WebSocket setup, audio streaming, function calling, session management, and production deployment patterns.
What Is the OpenAI Realtime API?
The OpenAI Realtime API provides a single WebSocket connection that handles speech-to-text, language model reasoning, and text-to-speech in one unified pipeline. Instead of stitching together separate STT, LLM, and TTS services, you send audio in and get audio back — with built-in VAD, turn detection, and function calling.
This dramatically simplifies voice agent development and achieves lower latency than a three-stage pipeline because the model processes speech natively without intermediate text conversion. In this tutorial, you will build a complete voice agent from scratch.
Step 1: WebSocket Connection Setup
The Realtime API uses WebSocket connections with specific authentication and configuration.
```python
import asyncio
import base64
import json
import os

import websockets


class RealtimeVoiceAgent:
    def __init__(self):
        self.ws = None
        self.api_key = os.environ["OPENAI_API_KEY"]
        self.url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"

    async def connect(self):
        headers = {
            "Authorization": f"Bearer {self.api_key}",
            "OpenAI-Beta": "realtime=v1",
        }
        self.ws = await websockets.connect(
            self.url,
            additional_headers=headers,
            ping_interval=20,
            ping_timeout=10,
        )
        # Configure the session
        await self.send_event({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio"],
                "instructions": (
                    "You are a helpful voice assistant. "
                    "Keep responses concise — under 3 sentences. "
                    "Speak naturally and conversationally."
                ),
                "voice": "alloy",
                "input_audio_format": "pcm16",
                "output_audio_format": "pcm16",
                "input_audio_transcription": {
                    "model": "whisper-1",
                },
                "turn_detection": {
                    "type": "server_vad",
                    "threshold": 0.5,
                    "prefix_padding_ms": 300,
                    "silence_duration_ms": 700,
                },
                "tools": self._get_tools(),
            },
        })
        print("Connected to OpenAI Realtime API")
        return self

    async def send_event(self, event: dict):
        await self.ws.send(json.dumps(event))
```
Step 2: Audio Streaming
Send microphone audio to the API as base64-encoded PCM16 chunks, and play back the audio responses.
```python
import numpy as np
import sounddevice as sd


class AudioStreamer:
    def __init__(self, agent: RealtimeVoiceAgent):
        self.agent = agent
        self.sample_rate = 24000  # Realtime API uses 24kHz PCM16
        self.chunk_size = 2400    # 100ms chunks
        self.playback_buffer = asyncio.Queue()

    async def start_recording(self):
        """Capture microphone audio and stream it to the API."""
        loop = asyncio.get_running_loop()

        def audio_callback(indata, frames, time_info, status):
            if status:
                print(f"Audio input status: {status}")
            # Convert float32 samples to PCM16, clipping to the int16 range
            pcm16 = np.clip(indata[:, 0] * 32767, -32768, 32767).astype(np.int16)
            audio_b64 = base64.b64encode(pcm16.tobytes()).decode()
            # The callback runs on sounddevice's thread, so hop back onto the loop
            asyncio.run_coroutine_threadsafe(
                self.agent.send_event({
                    "type": "input_audio_buffer.append",
                    "audio": audio_b64,
                }),
                loop,
            )

        stream = sd.InputStream(
            samplerate=self.sample_rate,
            channels=1,
            dtype="float32",
            blocksize=self.chunk_size,
            callback=audio_callback,
        )
        stream.start()
        return stream

    async def play_audio(self):
        """Play audio chunks from the playback buffer."""
        stream = sd.OutputStream(
            samplerate=self.sample_rate,
            channels=1,
            dtype="int16",
        )
        stream.start()
        while True:
            chunk = await self.playback_buffer.get()
            if chunk is None:  # sentinel: stop playback
                break
            stream.write(chunk)  # note: briefly blocks the event loop
        stream.stop()
        stream.close()
```
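Note that the API expects 24kHz audio, but many consumer microphones default to 48kHz. If your device cannot open a 24kHz stream, you can capture at 48kHz and downsample before encoding. A minimal sketch, using naive 2:1 decimation (`downsample_48k_to_24k` is a hypothetical helper, not part of any library; a production build would use a proper resampler such as `scipy.signal.resample_poly`):

```python
import numpy as np


def downsample_48k_to_24k(frames: np.ndarray) -> np.ndarray:
    """Naive 2:1 decimation: average each pair of adjacent samples.
    Averaging acts as a crude low-pass filter; for better quality use
    scipy.signal.resample_poly(frames, 1, 2) instead."""
    frames = frames[: len(frames) // 2 * 2]          # drop a trailing odd sample
    pairs = frames.astype(np.float32).reshape(-1, 2)
    return pairs.mean(axis=1).astype(frames.dtype)
```

You would call this on `indata[:, 0]` inside the audio callback before the PCM16 conversion, with the input stream opened at 48000 Hz.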
Step 3: Event Handling
The Realtime API sends various events: transcription updates, audio deltas, function calls, and error notifications. Your agent needs to handle each type.
```python
# This method belongs to RealtimeVoiceAgent
async def listen_for_events(self):
    """Main event loop — process all events from the API."""
    async for message in self.ws:
        event = json.loads(message)
        event_type = event.get("type", "")

        if event_type == "response.audio.delta":
            # Incoming audio from the agent
            audio_bytes = base64.b64decode(event["delta"])
            audio_array = np.frombuffer(audio_bytes, dtype=np.int16)
            await self.audio_streamer.playback_buffer.put(audio_array)

        elif event_type == "response.audio_transcript.delta":
            # Real-time transcript of the agent's speech
            print(f"Agent: {event['delta']}", end="", flush=True)

        elif event_type == "conversation.item.input_audio_transcription.completed":
            # Transcript of the user's speech
            print(f"\nUser: {event['transcript']}")

        elif event_type == "response.function_call_arguments.done":
            # Function call from the agent
            await self._handle_function_call(event)

        elif event_type == "input_audio_buffer.speech_started":
            print("\n[User started speaking]")

        elif event_type == "input_audio_buffer.speech_stopped":
            print("[User stopped speaking]")

        elif event_type == "error":
            print(f"Error: {event['error']['message']}")

        elif event_type == "response.done":
            print("\n[Response complete]")
```
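One event worth special handling is barge-in: when the user starts speaking while the agent is mid-response, you usually want to cancel the in-flight response with the `response.cancel` client event and drop any audio queued locally so playback stops promptly. A sketch, with `handle_barge_in` as a hypothetical helper you could call from the `input_audio_buffer.speech_started` branch:

```python
import asyncio


async def handle_barge_in(agent, playback_buffer: asyncio.Queue) -> None:
    """Stop the in-flight response and discard audio that is buffered
    locally but not yet written to the speaker."""
    # Ask the API to stop generating the current response
    await agent.send_event({"type": "response.cancel"})
    # Drain queued-but-unplayed audio chunks
    while not playback_buffer.empty():
        playback_buffer.get_nowait()
```

Without the drain step, several seconds of already-delivered audio can keep playing after the user interrupts, which makes the agent feel unresponsive.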
Step 4: Function Calling (Tool Use)
The Realtime API supports function calling, allowing your voice agent to perform actions like checking calendars, looking up information, or booking appointments.
```python
# These methods also belong to RealtimeVoiceAgent
def _get_tools(self):
    return [
        {
            "type": "function",
            "name": "check_appointment_availability",
            "description": "Check available appointment slots for a given date",
            "parameters": {
                "type": "object",
                "properties": {
                    "date": {
                        "type": "string",
                        "description": "Date in YYYY-MM-DD format",
                    },
                    "service_type": {
                        "type": "string",
                        "enum": ["consultation", "follow-up", "emergency"],
                    },
                },
                "required": ["date"],
            },
        },
        {
            "type": "function",
            "name": "book_appointment",
            "description": "Book an appointment for the caller",
            "parameters": {
                "type": "object",
                "properties": {
                    "date": {"type": "string"},
                    "time": {"type": "string"},
                    "patient_name": {"type": "string"},
                    "service_type": {"type": "string"},
                },
                "required": ["date", "time", "patient_name"],
            },
        },
    ]

async def _handle_function_call(self, event):
    """Execute the function and send the result back to the API."""
    fn_name = event["name"]
    call_id = event["call_id"]
    args = json.loads(event["arguments"])
    print(f"\n[Calling function: {fn_name}({args})]")

    # Execute the actual function (implement these against your backend)
    if fn_name == "check_appointment_availability":
        result = await self.check_availability(args["date"], args.get("service_type"))
    elif fn_name == "book_appointment":
        result = await self.book_appointment(**args)
    else:
        result = {"error": f"Unknown function: {fn_name}"}

    # Send the function result back to the API
    await self.send_event({
        "type": "conversation.item.create",
        "item": {
            "type": "function_call_output",
            "call_id": call_id,
            "output": json.dumps(result),
        },
    })
    # Trigger a new response incorporating the function result
    await self.send_event({"type": "response.create"})
```
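The handler above calls `self.check_availability` and `self.book_appointment`, which the tutorial leaves to you. Hypothetical stub implementations, shown here as standalone functions with canned data; in practice these would be methods backed by your real scheduling system:

```python
import asyncio
from datetime import datetime


async def check_availability(date: str, service_type=None) -> dict:
    """Stub: validate the date format and return canned slots."""
    try:
        datetime.strptime(date, "%Y-%m-%d")
    except ValueError:
        return {"error": f"Invalid date {date!r}; expected YYYY-MM-DD"}
    return {
        "date": date,
        "service_type": service_type,
        "available_slots": ["09:00", "11:30", "14:00"],
    }


async def book_appointment(date: str, time: str, patient_name: str,
                           service_type: str = "consultation") -> dict:
    """Stub: pretend the booking succeeded and echo the details back."""
    return {
        "confirmed": True,
        "date": date,
        "time": time,
        "patient_name": patient_name,
        "service_type": service_type,
    }
```

Returning an `{"error": ...}` dict instead of raising matters here: the error text goes back to the model as the function output, so the agent can apologize and ask the caller to restate the date.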
Step 5: Session Management
Production voice agents need proper session lifecycle management — handling disconnections, timeouts, and cleanup.
```python
import time


class SessionManager:
    def __init__(self):
        self.sessions = {}

    async def create_session(self, session_id: str) -> RealtimeVoiceAgent:
        agent = RealtimeVoiceAgent()
        await agent.connect()
        self.sessions[session_id] = {
            "agent": agent,
            "created_at": time.monotonic(),
            "last_activity": time.monotonic(),
        }
        return agent

    async def cleanup_session(self, session_id: str):
        session = self.sessions.pop(session_id, None)
        if session and session["agent"].ws:
            await session["agent"].ws.close()
            print(f"Session {session_id} cleaned up")

    async def cleanup_stale_sessions(self, max_idle_seconds: int = 300):
        """Remove sessions idle for more than max_idle_seconds."""
        now = time.monotonic()
        stale = [
            sid for sid, data in self.sessions.items()
            if now - data["last_activity"] > max_idle_seconds
        ]
        for sid in stale:
            await self.cleanup_session(sid)
        if stale:
            print(f"Cleaned up {len(stale)} stale sessions")
```

Timestamps use `time.monotonic()` rather than `asyncio.get_event_loop().time()`, which is deprecated outside a running loop and unsafe against clock adjustments.
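To make `cleanup_stale_sessions` actually run, schedule it on a timer. A sketch of a background sweep (`run_cleanup_loop` is a hypothetical helper, not part of the class above; start it with `asyncio.create_task` alongside your server and cancel it on shutdown):

```python
import asyncio


async def run_cleanup_loop(manager, interval_seconds: float = 60.0,
                           max_idle_seconds: int = 300) -> None:
    """Sweep stale sessions on a fixed interval, forever.
    Cancel the wrapping task to stop the loop."""
    while True:
        await asyncio.sleep(interval_seconds)
        await manager.cleanup_stale_sessions(max_idle_seconds)
```

A dedicated task keeps the sweep out of the per-session code paths, so a slow cleanup never delays handling a live call.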
Step 6: Running the Complete Agent
```python
async def main():
    agent = RealtimeVoiceAgent()
    await agent.connect()
    streamer = AudioStreamer(agent)
    agent.audio_streamer = streamer

    # Start recording, then run event handling and playback concurrently
    mic_stream = await streamer.start_recording()
    try:
        await asyncio.gather(
            agent.listen_for_events(),
            streamer.play_audio(),
        )
    finally:
        mic_stream.stop()
        mic_stream.close()
        await agent.ws.close()


if __name__ == "__main__":
    try:
        asyncio.run(main())
    except KeyboardInterrupt:
        print("\nShutting down...")
```

Note that `KeyboardInterrupt` is caught around `asyncio.run()` rather than inside `main()`: Ctrl+C surfaces from the event loop itself, so a handler inside the coroutine would rarely fire, while the `finally` block still guarantees the microphone and WebSocket are closed.
Run the agent and speak into your microphone. The Realtime API handles VAD, transcription, reasoning, and speech synthesis in a single round trip.
FAQ
How does the Realtime API compare to building a separate STT-LLM-TTS pipeline?
The Realtime API is simpler to implement and achieves lower latency because audio goes directly to the model without intermediate text conversion. However, you lose flexibility — you cannot swap individual STT or TTS providers, and you are locked into OpenAI's pricing. A custom pipeline gives you more control over each stage, lets you use specialized models, and can be cheaper at scale. Many teams prototype with the Realtime API and then build a custom pipeline as they scale.
What happens if the WebSocket connection drops mid-conversation?
The Realtime API does not persist session state across connections. If the WebSocket drops, you need to reconnect and resend the session configuration. To maintain conversation context across reconnections, store the conversation history on your server and include relevant context in the new session's instructions. Implementing automatic reconnection with exponential backoff is essential for production deployments.
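A sketch of reconnection with exponential backoff and jitter (`connect_with_backoff` is a hypothetical wrapper; in this tutorial, `connect` would be something like `agent.connect`, which already resends the session configuration):

```python
import asyncio
import random


async def connect_with_backoff(connect, max_attempts: int = 5,
                               base_delay: float = 1.0):
    """Retry an async `connect` factory with exponential backoff plus
    jitter; re-raises the last error once max_attempts is exhausted."""
    for attempt in range(max_attempts):
        try:
            return await connect()
        except Exception as exc:
            if attempt == max_attempts - 1:
                raise
            # Double the delay each attempt; jitter avoids thundering herds
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            print(f"Connect failed ({exc}); retrying in {delay:.1f}s")
            await asyncio.sleep(delay)
```

The jitter term matters when many sessions drop at once (for example, after a network blip): without it, every client retries at the same instant.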
How much does the Realtime API cost compared to separate services?
The Realtime API prices audio input at approximately $0.06 per minute and audio output at approximately $0.24 per minute — significantly more expensive than separate STT plus LLM plus TTS. For low-volume applications (under a few hundred minutes per day), the development speed advantage outweighs the cost. At higher volumes, a custom pipeline with Deepgram STT plus GPT-4o-mini plus OpenAI TTS can be 3-5x cheaper.
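Using the approximate per-minute rates quoted above, you can estimate a blended cost from the share of each minute that is caller speech versus agent speech. This is a simplification for budgeting only: actual billing is based on audio tokens, not wall-clock minutes.

```python
def realtime_cost_per_minute(input_fraction: float = 0.5,
                             rate_in: float = 0.06,
                             rate_out: float = 0.24) -> float:
    """Blended cost per conversation minute: input_fraction of the minute
    is caller speech (rate_in), the remainder is agent speech (rate_out)."""
    return input_fraction * rate_in + (1 - input_fraction) * rate_out
```

At an even 50/50 split this works out to roughly $0.15 per minute, or about $9 per hour of conversation.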
CallSphere Team
Expert insights on AI voice agents and customer communication automation.