Building Voice Agents with the OpenAI Realtime API: Full Tutorial
Hands-on tutorial for building voice agents with the OpenAI Realtime API — WebSocket setup, PCM16 audio, server VAD, and function calling.
Why this API changed the playbook
Before the Realtime API, building a voice agent meant wiring together Whisper (or Deepgram), an LLM, and a TTS service over three separate connections, then fighting a constant battle with latency and interruption handling. The Realtime API collapses all three into one WebSocket that streams audio in and audio out and surfaces a clean event model for interruptions and tool calls.
This is a hands-on tutorial for building a working voice agent on top of the Realtime API. It does not assume a telephony provider — you can run everything locally with a laptop microphone first, then swap in Twilio later.
mic ──PCM16──► Realtime API ──PCM16──► speaker
│
├── session.created
├── input_audio_buffer.speech_started
├── response.audio.delta
├── response.function_call_arguments.done
└── response.done
Architecture overview
┌───────────────────────────────┐
│ Node.js client │
│ • arecord / aplay (audio)    │
│ • WebSocket to Realtime API │
│ • tool dispatcher │
└───────────────┬───────────────┘
│
▼
┌───────────────────────────────┐
│ OpenAI Realtime API │
│ gpt-4o-realtime-preview- │
│ 2025-06-03 │
└───────────────────────────────┘
Prerequisites
- Node.js 20+ (the examples below are Node.js; a Python client follows the same event flow).
- An OpenAI API key with Realtime access.
- Audio capture/playback tools. The examples below use arecord and aplay (Linux: apt install alsa-utils); on macOS, sox (brew install sox) provides the equivalent rec and play commands.
- Basic familiarity with WebSocket events.
Step-by-step walkthrough
1. Open the WebSocket and configure the session
import WebSocket from "ws";

const ws = new WebSocket(
  "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2025-06-03",
  {
    headers: {
      Authorization: "Bearer " + process.env.OPENAI_API_KEY,
      "OpenAI-Beta": "realtime=v1",
    },
  },
);
ws.on("open", () => {
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      voice: "alloy",
      instructions: "You are a friendly receptionist for Acme Clinic.",
      input_audio_format: "pcm16",
      output_audio_format: "pcm16",
      // Enable Whisper transcription of caller audio; without this, the
      // input-side transcript events in step 6 never fire.
      input_audio_transcription: { model: "whisper-1" },
      turn_detection: { type: "server_vad", silence_duration_ms: 400, threshold: 0.5 },
      tools: [
        {
          type: "function",
          name: "check_availability",
          description: "Check provider availability",
          parameters: {
            type: "object",
            properties: {
              provider_id: { type: "string" },
              date: { type: "string", description: "YYYY-MM-DD" },
            },
            required: ["provider_id", "date"],
          },
        },
      ],
    },
  }));
});
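Before streaming the mic it is worth waiting for the server's session.updated acknowledgement so you know the config (voice, VAD, tools) took effect. A tiny dispatcher also keeps the handlers from the later steps out of one giant message callback; the structure here is my own, only the event names come from the API:

```typescript
type RealtimeEvent = { type: string; [key: string]: unknown };

// Route each server event to the handler registered for its type;
// event types with no registered handler are silently ignored.
function makeDispatcher(handlers: Record<string, (evt: RealtimeEvent) => void>) {
  return (raw: string) => {
    const evt = JSON.parse(raw) as RealtimeEvent;
    handlers[evt.type]?.(evt);
  };
}
```

Wire it up with ws.on("message", (raw) => dispatch(raw.toString())) and register one handler per event type you care about.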
2. Stream microphone audio
import { spawn } from "child_process";

// arecord pipes raw PCM16 at 24kHz mono to stdout
const mic = spawn("arecord", ["-q", "-f", "S16_LE", "-r", "24000", "-c", "1", "-t", "raw"]);

mic.stdout.on("data", (chunk) => {
  ws.send(JSON.stringify({
    type: "input_audio_buffer.append",
    audio: chunk.toString("base64"),
  }));
});
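A live microphone naturally produces audio at realtime speed, but if you replay a recording you must pace the appends yourself, since bursty pushes confuse server VAD. At 24kHz PCM16 mono, every 48,000 bytes is one second of audio (24,000 samples/s times 2 bytes/sample). A minimal pacing sketch, with helper names of my own:

```typescript
// Duration of a PCM16 mono chunk at 24kHz: 2 bytes per sample, 24000 samples/s.
function chunkDurationMs(bytes: number, sampleRate = 24000): number {
  return (bytes / (sampleRate * 2)) * 1000;
}

// Pace appends at roughly realtime speed; `send` stands in for
// ws.send of an input_audio_buffer.append event.
async function pacedAppend(
  chunks: Buffer[],
  send: (base64Audio: string) => void,
): Promise<void> {
  for (const chunk of chunks) {
    send(chunk.toString("base64"));
    await new Promise((r) => setTimeout(r, chunkDurationMs(chunk.length)));
  }
}
```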
3. Play back the model's audio
// aplay consumes raw PCM16 at 24kHz mono from stdin
const speaker = spawn("aplay", ["-q", "-f", "S16_LE", "-r", "24000", "-c", "1"]);

ws.on("message", (raw) => {
  const evt = JSON.parse(raw.toString());
  if (evt.type === "response.audio.delta") {
    speaker.stdin.write(Buffer.from(evt.delta, "base64"));
  }
});
4. Handle function calls
ws.on("message", async (raw) => {
  const evt = JSON.parse(raw.toString());
  if (evt.type === "response.function_call_arguments.done") {
    const args = JSON.parse(evt.arguments);
    let result: unknown;
    if (evt.name === "check_availability") {
      result = await checkAvailability(args.provider_id, args.date);
    }
    // Return the tool output to the conversation, then ask for a new response.
    ws.send(JSON.stringify({
      type: "conversation.item.create",
      item: {
        type: "function_call_output",
        call_id: evt.call_id,
        output: JSON.stringify(result),
      },
    }));
    ws.send(JSON.stringify({ type: "response.create" }));
  }
});
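checkAvailability is left undefined above. Here is a hypothetical stub so the handler runs end to end; the slot shape and data are invented for illustration, a real implementation would query your scheduling backend:

```typescript
type Slot = { start: string; end: string };

// Hypothetical availability lookup with deterministic mock data.
// Replace with a call to your real scheduling system.
function availableSlots(providerId: string, date: string): Slot[] {
  if (!/^\d{4}-\d{2}-\d{2}$/.test(date)) return []; // reject malformed dates
  return [
    { start: `${date}T09:00`, end: `${date}T09:30` },
    { start: `${date}T14:00`, end: `${date}T14:30` },
  ];
}

async function checkAvailability(providerId: string, date: string) {
  return { provider_id: providerId, date, slots: availableSlots(providerId, date) };
}
```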
5. Handle interruptions
When the caller starts speaking mid-response, clear the output buffer and cancel the in-flight response.
if (evt.type === "input_audio_buffer.speech_started") {
  // Cancel the in-flight response; also stop writing any already-buffered
  // audio deltas to the speaker so playback halts immediately.
  ws.send(JSON.stringify({ type: "response.cancel" }));
}
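Beyond cancelling, you can truncate the interrupted assistant item with conversation.item.truncate so the conversation history matches what the caller actually heard, rather than what the model generated. It takes the item id and the milliseconds of audio actually played; tracking playedMs is up to your playback layer. A sketch that builds both events:

```typescript
// Build the two events to send on barge-in: cancel the in-flight
// response, then truncate the partially-played assistant item so the
// server-side history matches what the caller heard.
function interruptionEvents(itemId: string, playedMs: number) {
  return [
    { type: "response.cancel" },
    {
      type: "conversation.item.truncate",
      item_id: itemId,
      content_index: 0,
      audio_end_ms: playedMs,
    },
  ];
}
```

Send them in order: interruptionEvents(currentItemId, playedMs).forEach((e) => ws.send(JSON.stringify(e))).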
6. Log the transcript
The Realtime API emits transcripts for both sides; note that input-side transcripts only arrive if input_audio_transcription is enabled in your session config. Collect them for later analysis.
if (evt.type === "conversation.item.input_audio_transcription.completed") {
  console.log("user:", evt.transcript);
}
if (evt.type === "response.audio_transcript.done") {
  console.log("agent:", evt.transcript);
}
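The agent's transcript also streams incrementally as response.audio_transcript.delta events before the .done arrives, which is useful for live captions. A small accumulator (my own structure) keeps a running text per response:

```typescript
// Accumulate streaming transcript deltas keyed by response id.
class TranscriptBuffer {
  private parts = new Map<string, string>();

  // Append one delta (call on response.audio_transcript.delta).
  push(responseId: string, delta: string): void {
    this.parts.set(responseId, (this.parts.get(responseId) ?? "") + delta);
  }

  // Return the full transcript and clear it (call on the .done event).
  finish(responseId: string): string {
    const text = this.parts.get(responseId) ?? "";
    this.parts.delete(responseId);
    return text;
  }
}
```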
Production considerations
- Heartbeats: send a WebSocket ping every 15s to keep the connection alive through proxies.
- Reconnects: on unexpected close, reconnect with exponential backoff and replay the last session config.
- Rate limits: the Realtime API has concurrent session limits per org. Monitor and scale your quota.
- Cost: pricing is per minute of input and output audio. End calls promptly on prolonged silence.
- PII: the transcript contains everything callers say. Encrypt at rest and scope access.
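The reconnect bullet above boils down to a backoff policy plus a retry loop. A sketch, where the base delay and cap are my choices rather than anything the API mandates:

```typescript
// Exponential backoff: base * 2^attempt, capped. Jitter is omitted for
// clarity; add random jitter in production to avoid thundering-herd
// reconnects across many concurrent calls.
function backoffMs(attempt: number, baseMs = 500, capMs = 30000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Retry loop sketch: `connect` is a hypothetical function that reopens
// the WebSocket and replays the last session.update.
async function reconnectLoop(connect: () => Promise<void>): Promise<void> {
  for (let attempt = 0; ; attempt++) {
    try {
      await connect();
      return;
    } catch {
      await new Promise((r) => setTimeout(r, backoffMs(attempt)));
    }
  }
}
```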
CallSphere's real implementation
CallSphere uses the OpenAI Realtime API with gpt-4o-realtime-preview-2025-06-03 as the core of its voice and chat agents. Server VAD is on, audio is PCM16 at 24kHz, and every vertical ships its own tool schema: 14 tools for healthcare (insurance verification, appointment booking, provider lookup, and more), 10 agents for real estate, 4 for salon, 7 for after-hours escalation, 10 plus RAG for IT helpdesk, and an ElevenLabs TTS pod with 5 GPT-4 specialists for sales.
Multi-agent handoffs run through the OpenAI Agents SDK so a single caller can be routed from a triage agent to a specialist mid-call without dropping audio. Post-call analytics are handled by a GPT-4o-mini pipeline that writes sentiment, intent, and lead score into per-vertical Postgres. CallSphere supports 57+ languages and keeps end-to-end response time under one second.
Common pitfalls
- Wrong sample rate: the pcm16 format expects 24kHz; audio captured at 16kHz and sent as-is plays back pitch-shifted. Resample to 24kHz before appending.
- Not handling function_call_arguments.done: you will miss tool calls.
- Pushing audio faster than realtime: the API expects near-realtime ingest; bursty pushes confuse VAD.
- Ignoring response.done: you lose the end-of-turn signal.
- No reconnect logic: the socket will drop eventually; plan for it.
FAQ
Can I use this with a phone number?
Yes — bridge Twilio Media Streams to your WebSocket server and forward audio in both directions.
What is the difference between server VAD and client VAD?
Server VAD runs on OpenAI's side and generates speech_started events automatically. Client VAD lets you control turn-taking manually. Start with server VAD.
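With server VAD disabled (turn_detection set to null in session.update), the client decides when a turn ends: commit the input buffer and explicitly request a response. A hedged sketch of the messages involved:

```typescript
// Disable server VAD; the client now controls turn boundaries.
const manualVadSession = {
  type: "session.update",
  session: { turn_detection: null },
};

// When your own end-of-speech detection fires, commit the buffered
// audio and ask the model to respond.
function endOfTurnEvents() {
  return [{ type: "input_audio_buffer.commit" }, { type: "response.create" }];
}
```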
How do I change the voice mid-call?
Send another session.update with the new voice name. Do it between turns, not during a response.
Does it support streaming function outputs back?
Yes. Once you send the function_call_output item and follow it with response.create, the model picks up and continues speaking.
Can I use multiple tools in one turn?
Yes. The model can emit multiple tool calls, and you should respond to each before calling response.create.
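One way to gather every tool call in a turn is to read the function_call items off the response.done payload instead of tracking per-call delta streams. A sketch against that event shape; the helper name is my own:

```typescript
type FunctionCall = { name: string; call_id: string; arguments: string };

// Pull every function_call item out of a response.done event payload.
// Respond to each before sending a single response.create.
function extractToolCalls(evt: {
  response?: { output?: Array<{ type: string } & Partial<FunctionCall>> };
}): FunctionCall[] {
  return (evt.response?.output ?? [])
    .filter((item) => item.type === "function_call")
    .map((item) => ({
      name: item.name ?? "",
      call_id: item.call_id ?? "",
      arguments: item.arguments ?? "",
    }));
}
```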
Next steps
Want to see a full Realtime API deployment in production? Book a demo, explore the technology page, or browse pricing.
#CallSphere #OpenAIRealtime #VoiceAI #Tutorial #WebSocket #FunctionCalling #AIVoiceAgents
Written by
CallSphere Team
Expert insights on AI voice agents and customer communication automation.