Building a Voice Agent with OpenAI Realtime API: Complete Tutorial
A step-by-step tutorial for building a voice AI agent using the OpenAI Realtime API — covering WebSocket setup, audio streaming, function calling, session management, and production deployment patterns.
What Is the OpenAI Realtime API?
The OpenAI Realtime API provides a single WebSocket connection that handles speech-to-text, language model reasoning, and text-to-speech in one unified pipeline. Instead of stitching together separate STT, LLM, and TTS services, you send audio in and get audio back — with built-in VAD, turn detection, and function calling.
This dramatically simplifies voice agent development and achieves lower latency than a three-stage pipeline because the model processes speech natively without intermediate text conversion. In this tutorial, you will build a complete voice agent from scratch.
Step 1: WebSocket Connection Setup
The Realtime API uses WebSocket connections with specific authentication and configuration.
```python
import asyncio
import base64
import json
import os

import websockets


class RealtimeVoiceAgent:
    def __init__(self):
        self.ws = None
        self.api_key = os.environ["OPENAI_API_KEY"]
        self.url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"

    async def connect(self):
        headers = {
            "Authorization": f"Bearer {self.api_key}",
            "OpenAI-Beta": "realtime=v1",
        }
        self.ws = await websockets.connect(
            self.url,
            additional_headers=headers,
            ping_interval=20,
            ping_timeout=10,
        )
        # Configure the session
        await self.send_event({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio"],
                "instructions": (
                    "You are a helpful voice assistant. "
                    "Keep responses concise — under 3 sentences. "
                    "Speak naturally and conversationally."
                ),
                "voice": "alloy",
                "input_audio_format": "pcm16",
                "output_audio_format": "pcm16",
                "input_audio_transcription": {
                    "model": "whisper-1",
                },
                "turn_detection": {
                    "type": "server_vad",
                    "threshold": 0.5,
                    "prefix_padding_ms": 300,
                    "silence_duration_ms": 700,
                },
                "tools": self._get_tools(),
            },
        })
        print("Connected to OpenAI Realtime API")
        return self

    async def send_event(self, event: dict):
        await self.ws.send(json.dumps(event))
```
Step 2: Audio Streaming
Send microphone audio to the API as base64-encoded PCM16 chunks, and play back the audio responses.
```python
import numpy as np
import sounddevice as sd


class AudioStreamer:
    def __init__(self, agent: RealtimeVoiceAgent):
        self.agent = agent
        self.sample_rate = 24000  # Realtime API uses 24kHz PCM16
        self.chunk_size = 2400    # 100ms chunks
        self.playback_buffer = asyncio.Queue()

    async def start_recording(self):
        """Capture microphone audio and stream it to the API."""
        loop = asyncio.get_running_loop()

        def audio_callback(indata, frames, time_info, status):
            if status:
                print(f"Audio input status: {status}")
            # Convert float32 samples to PCM16, clipping to the int16 range
            pcm16 = np.clip(indata[:, 0] * 32767, -32768, 32767).astype(np.int16)
            audio_b64 = base64.b64encode(pcm16.tobytes()).decode()
            # The callback runs on sounddevice's thread, so hop back onto the loop
            asyncio.run_coroutine_threadsafe(
                self.agent.send_event({
                    "type": "input_audio_buffer.append",
                    "audio": audio_b64,
                }),
                loop,
            )

        stream = sd.InputStream(
            samplerate=self.sample_rate,
            channels=1,
            dtype="float32",
            blocksize=self.chunk_size,
            callback=audio_callback,
        )
        stream.start()
        return stream

    async def play_audio(self):
        """Play audio chunks from the playback buffer."""
        stream = sd.OutputStream(
            samplerate=self.sample_rate,
            channels=1,
            dtype="int16",
        )
        stream.start()
        while True:
            chunk = await self.playback_buffer.get()
            if chunk is None:  # sentinel: stop playback
                break
            stream.write(chunk)  # note: briefly blocks the event loop
        stream.stop()
        stream.close()
```
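Note that the API expects 24kHz audio, but many consumer microphones default to 48kHz. If your device cannot open a 24kHz stream, you can capture at 48kHz and downsample before encoding. A minimal sketch, using naive 2:1 decimation (`downsample_48k_to_24k` is a hypothetical helper, not part of any library; a production build would use a proper resampler such as `scipy.signal.resample_poly`):

```python
import numpy as np


def downsample_48k_to_24k(frames: np.ndarray) -> np.ndarray:
    """Naive 2:1 decimation: average each pair of adjacent samples.
    Averaging acts as a crude low-pass filter; for better quality use
    scipy.signal.resample_poly(frames, 1, 2) instead."""
    frames = frames[: len(frames) // 2 * 2]          # drop a trailing odd sample
    pairs = frames.astype(np.float32).reshape(-1, 2)
    return pairs.mean(axis=1).astype(frames.dtype)
```

You would call this on `indata[:, 0]` inside the audio callback before the PCM16 conversion, with the input stream opened at 48000 Hz.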
Step 3: Event Handling
The Realtime API sends various events: transcription updates, audio deltas, function calls, and error notifications. Your agent needs to handle each type.
```python
# This method belongs to RealtimeVoiceAgent
async def listen_for_events(self):
    """Main event loop — process all events from the API."""
    async for message in self.ws:
        event = json.loads(message)
        event_type = event.get("type", "")

        if event_type == "response.audio.delta":
            # Incoming audio from the agent
            audio_bytes = base64.b64decode(event["delta"])
            audio_array = np.frombuffer(audio_bytes, dtype=np.int16)
            await self.audio_streamer.playback_buffer.put(audio_array)

        elif event_type == "response.audio_transcript.delta":
            # Real-time transcript of the agent's speech
            print(f"Agent: {event['delta']}", end="", flush=True)

        elif event_type == "conversation.item.input_audio_transcription.completed":
            # Transcript of the user's speech
            print(f"\nUser: {event['transcript']}")

        elif event_type == "response.function_call_arguments.done":
            # Function call from the agent
            await self._handle_function_call(event)

        elif event_type == "input_audio_buffer.speech_started":
            print("\n[User started speaking]")

        elif event_type == "input_audio_buffer.speech_stopped":
            print("[User stopped speaking]")

        elif event_type == "error":
            print(f"Error: {event['error']['message']}")

        elif event_type == "response.done":
            print("\n[Response complete]")
```
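One event worth special handling is barge-in: when the user starts speaking while the agent is mid-response, you usually want to cancel the in-flight response with the `response.cancel` client event and drop any audio queued locally so playback stops promptly. A sketch, with `handle_barge_in` as a hypothetical helper you could call from the `input_audio_buffer.speech_started` branch:

```python
import asyncio


async def handle_barge_in(agent, playback_buffer: asyncio.Queue) -> None:
    """Stop the in-flight response and discard audio that is buffered
    locally but not yet written to the speaker."""
    # Ask the API to stop generating the current response
    await agent.send_event({"type": "response.cancel"})
    # Drain queued-but-unplayed audio chunks
    while not playback_buffer.empty():
        playback_buffer.get_nowait()
```

Without the drain step, several seconds of already-delivered audio can keep playing after the user interrupts, which makes the agent feel unresponsive.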
Step 4: Function Calling (Tool Use)
The Realtime API supports function calling, allowing your voice agent to perform actions like checking calendars, looking up information, or booking appointments.
```python
# These methods also belong to RealtimeVoiceAgent
def _get_tools(self):
    return [
        {
            "type": "function",
            "name": "check_appointment_availability",
            "description": "Check available appointment slots for a given date",
            "parameters": {
                "type": "object",
                "properties": {
                    "date": {
                        "type": "string",
                        "description": "Date in YYYY-MM-DD format",
                    },
                    "service_type": {
                        "type": "string",
                        "enum": ["consultation", "follow-up", "emergency"],
                    },
                },
                "required": ["date"],
            },
        },
        {
            "type": "function",
            "name": "book_appointment",
            "description": "Book an appointment for the caller",
            "parameters": {
                "type": "object",
                "properties": {
                    "date": {"type": "string"},
                    "time": {"type": "string"},
                    "patient_name": {"type": "string"},
                    "service_type": {"type": "string"},
                },
                "required": ["date", "time", "patient_name"],
            },
        },
    ]

async def _handle_function_call(self, event):
    """Execute the function and send the result back to the API."""
    fn_name = event["name"]
    call_id = event["call_id"]
    args = json.loads(event["arguments"])
    print(f"\n[Calling function: {fn_name}({args})]")

    # Execute the actual function (implement these against your backend)
    if fn_name == "check_appointment_availability":
        result = await self.check_availability(args["date"], args.get("service_type"))
    elif fn_name == "book_appointment":
        result = await self.book_appointment(**args)
    else:
        result = {"error": f"Unknown function: {fn_name}"}

    # Send the function result back to the API
    await self.send_event({
        "type": "conversation.item.create",
        "item": {
            "type": "function_call_output",
            "call_id": call_id,
            "output": json.dumps(result),
        },
    })
    # Trigger a new response incorporating the function result
    await self.send_event({"type": "response.create"})
```
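The handler above calls `self.check_availability` and `self.book_appointment`, which the tutorial leaves to you. Hypothetical stub implementations, shown here as standalone functions with canned data; in practice these would be methods backed by your real scheduling system:

```python
import asyncio
from datetime import datetime


async def check_availability(date: str, service_type=None) -> dict:
    """Stub: validate the date format and return canned slots."""
    try:
        datetime.strptime(date, "%Y-%m-%d")
    except ValueError:
        return {"error": f"Invalid date {date!r}; expected YYYY-MM-DD"}
    return {
        "date": date,
        "service_type": service_type,
        "available_slots": ["09:00", "11:30", "14:00"],
    }


async def book_appointment(date: str, time: str, patient_name: str,
                           service_type: str = "consultation") -> dict:
    """Stub: pretend the booking succeeded and echo the details back."""
    return {
        "confirmed": True,
        "date": date,
        "time": time,
        "patient_name": patient_name,
        "service_type": service_type,
    }
```

Returning an `{"error": ...}` dict instead of raising matters here: the error text goes back to the model as the function output, so the agent can apologize and ask the caller to restate the date.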
Step 5: Session Management
Production voice agents need proper session lifecycle management — handling disconnections, timeouts, and cleanup.
```python
import time


class SessionManager:
    def __init__(self):
        self.sessions = {}

    async def create_session(self, session_id: str) -> RealtimeVoiceAgent:
        agent = RealtimeVoiceAgent()
        await agent.connect()
        self.sessions[session_id] = {
            "agent": agent,
            "created_at": time.monotonic(),
            "last_activity": time.monotonic(),
        }
        return agent

    async def cleanup_session(self, session_id: str):
        session = self.sessions.pop(session_id, None)
        if session and session["agent"].ws:
            await session["agent"].ws.close()
            print(f"Session {session_id} cleaned up")

    async def cleanup_stale_sessions(self, max_idle_seconds: int = 300):
        """Remove sessions idle for more than max_idle_seconds."""
        now = time.monotonic()
        stale = [
            sid for sid, data in self.sessions.items()
            if now - data["last_activity"] > max_idle_seconds
        ]
        for sid in stale:
            await self.cleanup_session(sid)
        if stale:
            print(f"Cleaned up {len(stale)} stale sessions")
```

Timestamps use `time.monotonic()` rather than `asyncio.get_event_loop().time()`, which is deprecated outside a running loop and unsafe against clock adjustments.
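To make `cleanup_stale_sessions` actually run, schedule it on a timer. A sketch of a background sweep (`run_cleanup_loop` is a hypothetical helper, not part of the class above; start it with `asyncio.create_task` alongside your server and cancel it on shutdown):

```python
import asyncio


async def run_cleanup_loop(manager, interval_seconds: float = 60.0,
                           max_idle_seconds: int = 300) -> None:
    """Sweep stale sessions on a fixed interval, forever.
    Cancel the wrapping task to stop the loop."""
    while True:
        await asyncio.sleep(interval_seconds)
        await manager.cleanup_stale_sessions(max_idle_seconds)
```

A dedicated task keeps the sweep out of the per-session code paths, so a slow cleanup never delays handling a live call.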
Step 6: Running the Complete Agent
```python
async def main():
    agent = RealtimeVoiceAgent()
    await agent.connect()
    streamer = AudioStreamer(agent)
    agent.audio_streamer = streamer

    # Start recording, then run event handling and playback concurrently
    mic_stream = await streamer.start_recording()
    try:
        await asyncio.gather(
            agent.listen_for_events(),
            streamer.play_audio(),
        )
    finally:
        mic_stream.stop()
        mic_stream.close()
        await agent.ws.close()


if __name__ == "__main__":
    try:
        asyncio.run(main())
    except KeyboardInterrupt:
        print("\nShutting down...")
```

Note that `KeyboardInterrupt` is caught around `asyncio.run()` rather than inside `main()`: Ctrl+C surfaces from the event loop itself, so a handler inside the coroutine would rarely fire, while the `finally` block still guarantees the microphone and WebSocket are closed.
Run the agent and speak into your microphone. The Realtime API handles VAD, transcription, reasoning, and speech synthesis in a single round trip.
FAQ
How does the Realtime API compare to building a separate STT-LLM-TTS pipeline?
The Realtime API is simpler to implement and achieves lower latency because audio goes directly to the model without intermediate text conversion. However, you lose flexibility — you cannot swap individual STT or TTS providers, and you are locked into OpenAI's pricing. A custom pipeline gives you more control over each stage, lets you use specialized models, and can be cheaper at scale. Many teams prototype with the Realtime API and then build a custom pipeline as they scale.
What happens if the WebSocket connection drops mid-conversation?
The Realtime API does not persist session state across connections. If the WebSocket drops, you need to reconnect and resend the session configuration. To maintain conversation context across reconnections, store the conversation history on your server and include relevant context in the new session's instructions. Implementing automatic reconnection with exponential backoff is essential for production deployments.
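A sketch of reconnection with exponential backoff and jitter (`connect_with_backoff` is a hypothetical wrapper; in this tutorial, `connect` would be something like `agent.connect`, which already resends the session configuration):

```python
import asyncio
import random


async def connect_with_backoff(connect, max_attempts: int = 5,
                               base_delay: float = 1.0):
    """Retry an async `connect` factory with exponential backoff plus
    jitter; re-raises the last error once max_attempts is exhausted."""
    for attempt in range(max_attempts):
        try:
            return await connect()
        except Exception as exc:
            if attempt == max_attempts - 1:
                raise
            # Double the delay each attempt; jitter avoids thundering herds
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            print(f"Connect failed ({exc}); retrying in {delay:.1f}s")
            await asyncio.sleep(delay)
```

The jitter term matters when many sessions drop at once (for example, after a network blip): without it, every client retries at the same instant.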
How much does the Realtime API cost compared to separate services?
The Realtime API prices audio input at approximately $0.06 per minute and audio output at approximately $0.24 per minute — significantly more expensive than separate STT plus LLM plus TTS. For low-volume applications (under a few hundred minutes per day), the development speed advantage outweighs the cost. At higher volumes, a custom pipeline with Deepgram STT plus GPT-4o-mini plus OpenAI TTS can be 3-5x cheaper.
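Using the approximate per-minute rates quoted above, you can estimate a blended cost from the share of each minute that is caller speech versus agent speech. This is a simplification for budgeting only: actual billing is based on audio tokens, not wall-clock minutes.

```python
def realtime_cost_per_minute(input_fraction: float = 0.5,
                             rate_in: float = 0.06,
                             rate_out: float = 0.24) -> float:
    """Blended cost per conversation minute: input_fraction of the minute
    is caller speech (rate_in), the remainder is agent speech (rate_out)."""
    return input_fraction * rate_in + (1 - input_fraction) * rate_out
```

At an even 50/50 split this works out to roughly $0.15 per minute, or about $9 per hour of conversation.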
CallSphere Team
Expert insights on AI voice agents and customer communication automation.