
Agentic AI with WebRTC: Developing Real-Time Voice Agent Interfaces

Master WebRTC integration for real-time voice AI agents with peer connections, audio streaming, VAD, barge-in, and codec optimization.

The Real-Time Voice Challenge for AI Agents

Building a voice AI agent that responds in real-time is fundamentally different from building a chatbot. Text-based interactions tolerate multi-second response times. Voice interactions do not. Humans perceive response delays beyond 300 milliseconds as unnatural, and delays beyond 800 milliseconds as broken. This means every component in your voice pipeline — audio capture, speech recognition, LLM reasoning, text-to-speech, and audio delivery — must be optimized for speed.
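That budget has to be split across every stage of the pipeline. A rough accounting, using illustrative per-stage numbers rather than measurements from any specific stack, can be sketched as:

```python
# Illustrative latency budget for a voice agent pipeline. The per-stage
# values are assumptions for illustration, not benchmarks.
PIPELINE_BUDGET_MS = {
    "vad_endpointing": 300,   # waiting for silence to confirm end of utterance
    "stt_final": 100,         # streaming STT finalization
    "llm_first_token": 250,   # time to first streamed LLM token
    "tts_first_chunk": 100,   # time to first synthesized audio chunk
    "network_rtt": 50,        # WebRTC transport overhead across both legs
}

def total_latency_ms(budget: dict) -> int:
    """Sum per-stage latencies to estimate end-to-end response delay."""
    return sum(budget.values())

print(total_latency_ms(PIPELINE_BUDGET_MS))  # 800
```

Exceeding the budget in any one stage pushes the whole response past the point users perceive as broken, which is why streaming (rather than batch) APIs matter at every step.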

WebRTC (Web Real-Time Communication) is the browser-native protocol stack designed for exactly this kind of low-latency, peer-to-peer media streaming. It handles audio capture, codec negotiation, network traversal, and adaptive bitrate management out of the box. For voice AI agents, WebRTC provides the transport layer between the user's browser and your server-side agent infrastructure.

This guide covers the full integration: peer connection setup, audio streaming to LLMs, Voice Activity Detection, barge-in handling, codec selection, and NAT traversal configuration.

WebRTC Peer Connection Architecture for Voice Agents

A standard WebRTC voice agent setup involves the browser (client) establishing a peer connection with a media server that bridges to your AI backend.

Connection Flow

  1. Client requests a session from your signaling server
  2. Server creates an RTCPeerConnection and generates an SDP offer
  3. Client receives the offer and sends back an SDP answer
  4. ICE candidates are exchanged for NAT traversal
  5. Media flows — client sends audio, server sends agent audio back

Server-Side Peer Connection Setup

const { RTCPeerConnection } = require("wrtc");

async function createVoiceSession(sessionId) {
  const pc = new RTCPeerConnection({
    iceServers: [
      { urls: "stun:stun.l.google.com:19302" },
      {
        urls: "turn:turn.yourserver.com:3478",
        username: "agent",
        credential: process.env.TURN_SECRET,
      },
    ],
  });

  pc.ontrack = (event) => {
    const audioStream = event.streams[0];
    const audioTrack = audioStream.getAudioTracks()[0];
    // Route audio to STT pipeline
    routeToSTTPipeline(sessionId, audioTrack);
  };

  // Create audio track for agent responses
  const agentAudioSource = createAgentAudioSource(sessionId);
  pc.addTrack(agentAudioSource);

  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);

  return { pc, offer: pc.localDescription };
}

Client-Side Connection

async function connectToVoiceAgent() {
  const pc = new RTCPeerConnection({
    iceServers: [{ urls: "stun:stun.l.google.com:19302" }],
  });

  // Capture microphone audio
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: {
      echoCancellation: true,
      noiseSuppression: true,
      autoGainControl: true,
      sampleRate: 16000,
    },
  });

  stream.getTracks().forEach((track) => pc.addTrack(track, stream));

  // Handle agent audio playback
  pc.ontrack = (event) => {
    const audioEl = document.getElementById("agent-audio");
    audioEl.srcObject = event.streams[0];
  };

  // Exchange SDP with signaling server
  const response = await fetch("/api/voice/session", { method: "POST" });
  const { offer } = await response.json();

  await pc.setRemoteDescription(offer);
  const answer = await pc.createAnswer();
  await pc.setLocalDescription(answer);

  await fetch("/api/voice/answer", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ answer: pc.localDescription }),
  });
}

Audio Streaming to LLM Backends

Once audio arrives at your server, it needs to reach the speech-to-text engine with minimal latency. Two approaches dominate.

Chunked Streaming

Audio is buffered into small chunks (100-200ms) and streamed to the STT service as each chunk fills. This works with APIs that support streaming input, such as Deepgram and AssemblyAI.

import asyncio
from collections import deque

class AudioChunkBuffer:
    def __init__(self, chunk_duration_ms: int = 100, sample_rate: int = 16000):
        # Chunk size in samples; the audio is assumed to be PCM16,
        # so each sample occupies 2 bytes in the byte math below
        self.chunk_size = int(sample_rate * chunk_duration_ms / 1000)
        self.buffer = deque()
        self.overflow = b""

    async def add_audio(self, data: bytes):
        combined = self.overflow + data
        offset = 0
        # chunk_size * 2 converts samples to bytes (2 bytes per PCM16 sample)
        while offset + self.chunk_size * 2 <= len(combined):
            chunk = combined[offset:offset + self.chunk_size * 2]
            self.buffer.append(chunk)
            offset += self.chunk_size * 2
        # Carry any partial chunk into the next call
        self.overflow = combined[offset:]

    async def get_chunk(self) -> bytes:
        # Poll until a full chunk is available
        while not self.buffer:
            await asyncio.sleep(0.01)
        return self.buffer.popleft()
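The `chunk_size * 2` byte math above assumes PCM16 audio (two bytes per sample). A quick standalone check of that arithmetic:

```python
# Sanity check of the chunking math: PCM16 mono at 16 kHz means 2 bytes per
# sample, so a 100 ms chunk is 1600 samples = 3200 bytes.
SAMPLE_RATE = 16000
CHUNK_MS = 100
BYTES_PER_SAMPLE = 2

chunk_bytes = SAMPLE_RATE * CHUNK_MS // 1000 * BYTES_PER_SAMPLE
print(chunk_bytes)  # 3200

# Feeding 5000 bytes yields one full chunk plus 1800 bytes of overflow,
# which the buffer class carries into the next add_audio call.
full_chunks, overflow = divmod(5000, chunk_bytes)
print(full_chunks, overflow)  # 1 1800
```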

Direct WebSocket Bridge

For services like Deepgram that accept WebSocket audio streams, you can bridge the WebRTC audio track directly to a WebSocket connection, eliminating intermediate buffering.
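A minimal sketch of that bridge pattern, with an `asyncio.Queue` standing in for the WebRTC track's frame source and a generic async `send` callable standing in for the STT WebSocket client (the names here are illustrative, not a real SDK API):

```python
import asyncio

async def bridge_track_to_websocket(frame_queue: asyncio.Queue, ws_send) -> None:
    """Forward raw audio frames from a WebRTC track to an STT WebSocket.

    frame_queue: yields bytes frames from the media server (assumed interface).
    ws_send: async callable, e.g. a websockets connection's .send method.
    A None frame signals end of stream.
    """
    while True:
        frame = await frame_queue.get()
        if frame is None:  # end-of-stream sentinel
            break
        await ws_send(frame)

# Usage with a stand-in sink that just collects frames:
async def demo():
    q: asyncio.Queue = asyncio.Queue()
    sent = []
    async def fake_send(data): sent.append(data)
    for f in (b"\x00" * 320, b"\x01" * 320, None):
        q.put_nowait(f)
    await bridge_track_to_websocket(q, fake_send)
    return sent

frames = asyncio.run(demo())
print(len(frames))  # 2
```

Because there is no intermediate chunk buffer, each frame reaches the STT connection as soon as it arrives, at the cost of tying the pipeline's backpressure behavior to the WebSocket's.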

Voice Activity Detection (VAD)

VAD determines when the user is speaking versus when they are silent. This is critical for knowing when to send accumulated audio to STT and when the user has finished their utterance.

Server-Side VAD with Silero

Silero VAD is a lightweight, accurate neural network model that runs efficiently on CPU.

import torch

class SileroVAD:
    def __init__(self, threshold: float = 0.5):
        self.model, self.utils = torch.hub.load(
            "snakers4/silero-vad", "silero_vad"
        )
        self.threshold = threshold
        self.is_speaking = False
        self.silence_frames = 0
        self.silence_limit = 15  # ~480ms at 32ms per frame

    def process_frame(self, audio_frame: torch.Tensor) -> dict:
        # Silero expects fixed-size mono chunks (512 samples = 32ms at 16kHz)
        confidence = self.model(audio_frame, 16000).item()
        was_speaking = self.is_speaking

        if confidence >= self.threshold:
            self.is_speaking = True
            self.silence_frames = 0
        else:
            self.silence_frames += 1
            if self.silence_frames >= self.silence_limit:
                self.is_speaking = False

        return {
            "is_speaking": self.is_speaking,
            "confidence": confidence,
            "speech_ended": was_speaking and not self.is_speaking,
        }

When VAD detects that the user has stopped speaking (speech_ended becomes True), the accumulated audio buffer is finalized and sent to the STT service for processing.
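That endpointing loop can be sketched around the dict the VAD returns: accumulate frames while `is_speaking` is true, and flush the full utterance when `speech_ended` fires. The `send_to_stt` callback here is a placeholder, not a real API:

```python
class UtteranceAccumulator:
    """Buffer audio frames during speech; flush when the VAD signals the end."""

    def __init__(self, send_to_stt):
        self.send_to_stt = send_to_stt  # callback receiving the full utterance
        self.frames = []

    def on_frame(self, frame: bytes, vad_result: dict) -> None:
        if vad_result["is_speaking"]:
            self.frames.append(frame)
        if vad_result["speech_ended"] and self.frames:
            self.send_to_stt(b"".join(self.frames))
            self.frames = []

# Usage with faked VAD results: two speech frames, then silence ends the turn.
utterances = []
acc = UtteranceAccumulator(utterances.append)
acc.on_frame(b"ab", {"is_speaking": True, "speech_ended": False})
acc.on_frame(b"cd", {"is_speaking": True, "speech_ended": False})
acc.on_frame(b"", {"is_speaking": False, "speech_ended": True})
print(utterances)  # [b'abcd']
```

Note that the trailing silence frames before `speech_ended` fires are still buffered (since `is_speaking` stays true until the silence limit is reached), which is usually what STT services want.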

Barge-In Handling

Barge-in occurs when the user starts speaking while the agent is still talking. Proper barge-in handling is what separates a professional voice agent from a frustrating one.


Detection Strategy

Monitor the user's audio stream with VAD even while the agent is speaking. When user speech is detected during agent output, you have a barge-in event.

Response to Barge-In

  1. Immediately stop agent audio playback — Cease sending TTS audio to the client
  2. Cancel pending TTS generation — If TTS is still generating, abort the request
  3. Cancel pending LLM generation — If the agent is mid-response, abort that too
  4. Begin capturing new user utterance — Start buffering the barge-in audio for STT
  5. Preserve context — The agent should know it was interrupted and what it was saying when interrupted

class BargeInHandler:
    def __init__(self, vad: SileroVAD, agent_audio_controller):
        self.vad = vad
        self.agent_audio = agent_audio_controller
        self.is_agent_speaking = False

    async def on_user_audio_frame(self, frame):
        result = self.vad.process_frame(frame)

        if result["is_speaking"] and self.is_agent_speaking:
            # Barge-in detected
            await self.agent_audio.stop_playback()
            await self.agent_audio.cancel_pending_tts()
            return {"event": "barge_in", "frame": frame}

        return {"event": "normal", "vad": result}
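Step 5 (preserving context) is not shown in the handler above. One way to sketch it is to track how much of the agent's reply was actually spoken and record an interruption note for the next LLM turn; the message format here is an assumption, not a fixed convention:

```python
class InterruptionContext:
    """Track the agent's in-flight reply so a barge-in can be recorded."""

    def __init__(self):
        self.current_reply = ""
        self.chars_spoken = 0

    def start_reply(self, text: str) -> None:
        self.current_reply = text
        self.chars_spoken = 0

    def mark_spoken(self, n_chars: int) -> None:
        # Advance as TTS audio is actually played to the user
        self.chars_spoken = min(self.chars_spoken + n_chars, len(self.current_reply))

    def on_barge_in(self) -> dict:
        """Return a system note the next LLM call can include."""
        spoken = self.current_reply[: self.chars_spoken]
        return {
            "role": "system",
            "content": f"You were interrupted after saying: '{spoken}'",
        }

ctx = InterruptionContext()
ctx.start_reply("Your appointment is confirmed for Tuesday at 3pm.")
ctx.mark_spoken(28)
note = ctx.on_barge_in()
print(note["content"])
```

Feeding this note into the conversation history lets the agent respond naturally to "wait, actually..." style interruptions instead of repeating itself.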

Codec Selection: Opus vs PCM16

Codec choice significantly affects both latency and audio quality.

| Codec | Bitrate | Latency | Quality | Use Case |
| --- | --- | --- | --- | --- |
| Opus | 16-64 kbps | 5-20ms frame | Excellent | Production voice agents |
| PCM16 | 256 kbps | 0ms codec delay | Lossless | Direct STT integration |
| G.711 (PCMU/PCMA) | 64 kbps | 0.125ms | Acceptable | Legacy telephony bridges |

Opus is the default choice for WebRTC voice agents. It provides excellent quality at low bitrates, has built-in forward error correction for handling packet loss, and adds minimal latency. Most STT services accept Opus-encoded audio directly.

PCM16 (raw 16-bit PCM at 16kHz) is preferred when streaming directly to STT APIs that expect uncompressed audio. It avoids encode/decode overhead but uses significantly more bandwidth.

For production voice agents, use Opus for the WebRTC transport and decode to PCM16 on the server side before feeding to STT if the STT service requires raw audio.
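The bandwidth gap in the table is easy to quantify: raw PCM16 at 16 kHz is a fixed 256 kbps, while Opus at a mid-range voice bitrate (24 kbps is an assumed setting; actual rates are configurable) uses roughly a tenth of that:

```python
# Per-session audio bandwidth, PCM16 vs Opus. The Opus bitrate is an assumed
# mid-range voice setting within the 16-64 kbps range in the table above.
SAMPLE_RATE = 16000      # Hz
BITS_PER_SAMPLE = 16
OPUS_BITRATE_BPS = 24000

pcm16_bps = SAMPLE_RATE * BITS_PER_SAMPLE
print(pcm16_bps)                               # 256000 -> 256 kbps
print(round(pcm16_bps / OPUS_BITRATE_BPS, 1))  # ~10.7x Opus bandwidth
```

At scale that ratio matters: hundreds of concurrent PCM16 sessions can saturate links that Opus traffic would barely register on.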

ICE, STUN, and TURN Configuration

Network Address Translation (NAT) is the most common cause of WebRTC connection failures. ICE (Interactive Connectivity Establishment) uses STUN and TURN servers to traverse NATs.

STUN

STUN servers help clients discover their public IP address and port mapping. Google provides free STUN servers, but for production, run your own to avoid dependency on third-party infrastructure.

TURN

TURN servers relay media when direct peer-to-peer connections are impossible (symmetric NATs, restrictive firewalls). TURN is essential for production — approximately 15-20% of connections require TURN relay in typical enterprise network environments.

# coturn configuration for voice agent TURN server
listening-port=3478
tls-listening-port=5349
realm=voice.yourcompany.com
server-name=voice.yourcompany.com
lt-cred-mech
use-auth-secret
static-auth-secret=${TURN_SECRET}
total-quota=100
max-bps=256000
no-multicast-peers

Production ICE Configuration

const iceConfig = {
  iceServers: [
    { urls: "stun:stun.yourserver.com:3478" },
    {
      urls: [
        "turn:turn.yourserver.com:3478?transport=udp",
        "turn:turn.yourserver.com:3478?transport=tcp",
        "turns:turn.yourserver.com:5349?transport=tcp",
      ],
      username: dynamicUsername,
      credential: dynamicCredential,
    },
  ],
  iceTransportPolicy: "all", // Use "relay" to force TURN for testing
  iceCandidatePoolSize: 2,
};

Production Deployment Patterns

CallSphere's healthcare and real estate voice agents use a production architecture where the WebRTC media server runs as a horizontally scaled Kubernetes deployment. Each pod handles up to 50 concurrent voice sessions. A session affinity layer ensures that all signaling and media for a given session route to the same pod.

Key production considerations include health checking the media server pods with actual WebRTC connection tests (not just HTTP pings), monitoring ICE connection success rates and TURN relay percentages, implementing session draining during deployments so active calls are not dropped, and tracking end-to-end latency from user utterance to agent audio playback start.

Frequently Asked Questions

What is the minimum latency achievable for a WebRTC voice AI agent?

With optimized infrastructure, end-to-end latency (user finishes speaking to agent audio begins playing) of 500-800ms is achievable. This breaks down roughly as: VAD endpoint detection ~300ms, STT ~100ms (streaming), LLM inference ~200-400ms (streaming first tokens), TTS first chunk ~100ms. Achieving sub-500ms requires real-time STT APIs, streaming LLM output, and streaming TTS.

Do I need TURN servers for a voice AI agent?

Yes, for production. While STUN handles most connections, roughly 15-20% of users are behind symmetric NATs or restrictive firewalls that require TURN relay. Without TURN, those users simply cannot connect. Run your own TURN servers (coturn is the standard open-source option) rather than relying on third-party services for production workloads.

How does barge-in work with WebRTC voice agents?

Barge-in is handled by running Voice Activity Detection on the user's incoming audio stream continuously, even while the agent is speaking. When the VAD detects user speech during agent output, the system immediately stops agent audio playback, cancels any pending TTS and LLM generation, and begins processing the new user utterance. The key is preserving conversation context so the agent knows it was interrupted.

What audio format should I use between WebRTC and the STT service?

Use Opus for the WebRTC transport layer (it is the default and provides excellent quality with low bandwidth). On the server side, decode to PCM16 at 16kHz mono if your STT service requires raw audio. Many modern STT services (Deepgram, AssemblyAI) accept Opus directly, eliminating the decode step and reducing server-side processing.

How do CallSphere voice agents handle network quality issues?

CallSphere's voice agents use Opus codec with forward error correction enabled, adaptive bitrate based on network conditions detected through WebRTC stats, TURN fallback for clients that cannot establish direct connections, and client-side audio buffering with jitter compensation to smooth out network variability.


CallSphere Team

Expert insights on AI voice agents and customer communication automation.
