Voice Activity Detection: Knowing When Users Start and Stop Speaking

Learn how Voice Activity Detection works in voice AI agents — from energy-based methods to ML-based VAD models like Silero — including configuration, sensitivity tuning, and practical implementation.

What Is Voice Activity Detection and Why Does It Matter?

Voice Activity Detection (VAD) is the process of determining whether a given audio segment contains human speech or just background noise. In voice AI agents, VAD serves three critical functions: it tells the STT engine when to start and stop processing, it enables the agent to detect when the user has finished their turn (endpointing), and it allows barge-in detection when the user interrupts the agent.

Without good VAD, your agent either starts transcribing background noise (false positives, wasting resources and producing garbage text) or misses the beginning of user speech (false negatives, cutting off words and frustrating users).

Energy-Based VAD: The Simple Approach

The simplest VAD method measures the energy (volume) of each audio frame. If the energy exceeds a threshold, the frame is classified as speech.

import numpy as np
from collections import deque

class EnergyVAD:
    def __init__(
        self,
        threshold_db: float = -35.0,
        frame_duration_ms: int = 30,
        sample_rate: int = 16000,
        min_speech_ms: int = 200,
        min_silence_ms: int = 500,
    ):
        self.threshold_db = threshold_db
        self.frame_size = int(sample_rate * frame_duration_ms / 1000)
        self.min_speech_frames = int(min_speech_ms / frame_duration_ms)
        self.min_silence_frames = int(min_silence_ms / frame_duration_ms)
        self.speech_count = 0
        self.silence_count = 0
        self.is_speaking = False

    def compute_rms_db(self, frame: np.ndarray) -> float:
        """RMS level in dBFS, assuming 16-bit PCM input (full scale = 32768)."""
        rms = np.sqrt(np.mean(frame.astype(np.float32) ** 2))
        if rms == 0:
            return -100.0  # treat pure digital silence as an arbitrary floor
        return 20 * np.log10(rms / 32768.0)

    def process_frame(self, frame: np.ndarray) -> dict:
        energy_db = self.compute_rms_db(frame)
        is_speech_frame = energy_db > self.threshold_db

        if is_speech_frame:
            self.speech_count += 1
            self.silence_count = 0
        else:
            self.silence_count += 1
            self.speech_count = 0

        # State transitions
        event = None
        if not self.is_speaking and self.speech_count >= self.min_speech_frames:
            self.is_speaking = True
            event = "speech_start"
        elif self.is_speaking and self.silence_count >= self.min_silence_frames:
            self.is_speaking = False
            event = "speech_end"

        return {
            "is_speaking": self.is_speaking,
            "energy_db": energy_db,
            "event": event,
        }
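To build intuition for picking `threshold_db`, it helps to see what common signal levels look like in dBFS for int16 audio. This is a standalone check of the same RMS math the class uses (the 440Hz tone and noise level are arbitrary illustrative values):

```python
import numpy as np

def rms_dbfs(frame: np.ndarray) -> float:
    # RMS level of an int16 frame relative to full scale (32768)
    rms = np.sqrt(np.mean(frame.astype(np.float32) ** 2))
    return -100.0 if rms == 0 else 20 * np.log10(rms / 32768.0)

# A full-scale sine wave sits at about -3 dBFS (RMS = peak / sqrt(2))
t = np.arange(480) / 16000
sine = (32767 * np.sin(2 * np.pi * 440 * t)).astype(np.int16)
print(round(rms_dbfs(sine)))  # -3

# Low-level background noise is far below a typical -35 dBFS threshold
noise = np.random.default_rng(0).normal(0, 10, 480).astype(np.int16)
print(rms_dbfs(noise) < -35.0)  # True
```

Normal conversational speech from a well-positioned microphone typically lands somewhere between these extremes, which is why a threshold in the -30 to -40 dBFS range is a common starting point.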

Energy-based VAD is fast and requires zero dependencies, but it struggles in noisy environments. A loud air conditioner or keyboard typing can easily exceed the threshold, triggering false positives.

ML-Based VAD: Silero VAD

Silero VAD is a lightweight neural network trained specifically for voice activity detection. It runs in real time on CPU and dramatically outperforms energy-based methods in noisy conditions.

import torch
import numpy as np

class SileroVAD:
    def __init__(self, threshold: float = 0.5, sample_rate: int = 16000):
        self.model, self.utils = torch.hub.load(
            repo_or_dir="snakers4/silero-vad",
            model="silero_vad",
            trust_repo=True,
        )
        self.threshold = threshold
        self.sample_rate = sample_rate
        self.is_speaking = False
        self.speech_frames = 0
        self.silence_frames = 0

    def process_chunk(self, audio_chunk: np.ndarray) -> dict:
        """Process a 512-sample chunk (32ms at 16kHz).

        Silero expects float32 audio normalized to [-1, 1], so convert
        int16 PCM before inference.
        """
        tensor = torch.from_numpy(audio_chunk).float()
        if audio_chunk.dtype == np.int16:
            tensor = tensor / 32768.0

        # Silero returns the probability that this chunk contains speech
        speech_prob = self.model(tensor, self.sample_rate).item()

        event = None
        if speech_prob >= self.threshold:
            self.speech_frames += 1
            self.silence_frames = 0
            # 4 consecutive speech chunks ≈ 128ms before declaring speech_start
            if not self.is_speaking and self.speech_frames >= 4:
                self.is_speaking = True
                event = "speech_start"
        else:
            self.silence_frames += 1
            self.speech_frames = 0
            # 16 consecutive silent chunks ≈ 512ms before declaring speech_end
            if self.is_speaking and self.silence_frames >= 16:
                self.is_speaking = False
                event = "speech_end"

        return {
            "speech_probability": speech_prob,
            "is_speaking": self.is_speaking,
            "event": event,
        }

# Usage
vad = SileroVAD(threshold=0.5)

def handle_audio_frame(frame):
    result = vad.process_chunk(frame)
    if result["event"] == "speech_start":
        print("User started speaking — activate STT")
    elif result["event"] == "speech_end":
        print("User stopped speaking — finalize transcript")

Silero VAD runs at less than 1ms per chunk on CPU, making it suitable for real-time applications. The model is only about 2MB, so it can even run in the browser via ONNX Runtime.

Browser-Side VAD with JavaScript

Running VAD in the browser reduces server load and enables faster speech detection because there is no network round-trip.

class BrowserVAD {
  constructor(options = {}) {
    this.threshold = options.threshold || 0.5;
    this.onSpeechStart = options.onSpeechStart || (() => {});
    this.onSpeechEnd = options.onSpeechEnd || (() => {});
    this.isSpeaking = false;
  }

  async init() {
    // Load the Silero VAD ONNX model in the browser.
    // Note: @ricky0123/vad-web exposes the MicVAD class;
    // the useMicVAD hook lives in @ricky0123/vad-react.
    const { MicVAD } = await import('@ricky0123/vad-web');

    this.vad = await MicVAD.new({
      positiveSpeechThreshold: this.threshold,
      negativeSpeechThreshold: this.threshold - 0.15,
      minSpeechFrames: 4,
      preSpeechPadFrames: 3,
      redemptionFrames: 8,
      onSpeechStart: () => {
        this.isSpeaking = true;
        this.onSpeechStart();
      },
      onSpeechEnd: (audio) => {
        this.isSpeaking = false;
        this.onSpeechEnd(audio);
      },
    });
  }

  start() { this.vad.start(); }
  pause() { this.vad.pause(); }
  destroy() { this.vad.destroy(); }
}

// Usage
const vad = new BrowserVAD({
  threshold: 0.5,
  onSpeechStart: () => console.log('Speech detected — open STT stream'),
  onSpeechEnd: (audio) => {
    console.log('Speech ended — send audio to server');
    sendAudioToServer(audio);
  },
});
await vad.init();
vad.start();

Tuning VAD Sensitivity

The key parameters to tune are the speech probability threshold, minimum speech duration, and silence timeout.

  • Threshold too low (0.3): More false positives — background noise triggers speech detection
  • Threshold too high (0.8): More false negatives — quiet or soft speech is missed
  • Silence timeout too short (200ms): Cuts off speech during natural pauses
  • Silence timeout too long (1500ms): Agent waits too long before responding

A good starting point is a threshold of 0.5, minimum speech of 150ms, and silence timeout of 600-800ms. From there, tune based on your specific environment and user feedback.
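These starting values can be bundled into a small config object so that sensitivity lives in one place rather than scattered across magic numbers. This is a hypothetical sketch (the class and field names are not from any library) that also converts the millisecond settings into frame counts for 32ms Silero chunks:

```python
from dataclasses import dataclass

@dataclass
class VADConfig:
    # Hypothetical config bundle using the starting values suggested above
    speech_threshold: float = 0.5   # probability cutoff for a speech frame
    min_speech_ms: int = 150        # speech required before "speech_start"
    silence_timeout_ms: int = 700   # middle of the 600-800ms range

    def min_speech_frames(self, frame_ms: int = 32) -> int:
        # consecutive speech chunks needed (32ms Silero chunks by default)
        return max(1, self.min_speech_ms // frame_ms)

    def silence_frames(self, frame_ms: int = 32) -> int:
        # consecutive silent chunks before declaring "speech_end"
        return max(1, self.silence_timeout_ms // frame_ms)

config = VADConfig()
print(config.min_speech_frames())  # 4
print(config.silence_frames())     # 21
```

Keeping the thresholds in milliseconds and deriving frame counts means you can change the chunk size without retuning every number.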

FAQ

Should I run VAD on the client or the server?

Running VAD on the client is ideal for bandwidth optimization — you only send audio to the server when speech is detected. This can reduce bandwidth by 60-80% in typical conversations. However, server-side VAD gives you more control and consistency. Many production systems run VAD on both sides: client-side for bandwidth savings and server-side for reliable endpointing.
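The bandwidth claim is easy to sanity-check with back-of-envelope arithmetic. Assuming 16kHz 16-bit mono PCM and a user who speaks about a quarter of the call (both assumptions, not measurements), client-side gating lands in the quoted savings range:

```python
# Back-of-envelope bandwidth saving from client-side VAD gating
sample_rate = 16000
bytes_per_sample = 2
raw_bps = sample_rate * bytes_per_sample    # 32,000 B/s when always streaming
speech_fraction = 0.25                      # assumed talk-time share
gated_bps = raw_bps * speech_fraction       # only speech segments are uploaded
saving = 1 - gated_bps / raw_bps
print(f"{raw_bps} B/s -> {gated_bps:.0f} B/s ({saving:.0%} saved)")
```

A chattier user or generous pre/post-speech padding pushes the saving toward the lower end of the 60-80% range.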

How does VAD interact with echo cancellation?

Without echo cancellation, VAD will detect the agent's own speech playing through the speakers as user speech, creating a feedback loop. WebRTC's built-in AEC (Acoustic Echo Cancellation) handles this automatically. If you are using raw audio streams without WebRTC, you need to implement echo cancellation before VAD, or use a reference signal to suppress the agent's output from the input stream.

Can VAD distinguish between speech and non-speech sounds like coughing or typing?

ML-based VAD models like Silero are specifically trained to detect human speech patterns, so they handle most non-speech sounds well. However, they can still be triggered by sounds that resemble speech patterns, such as music with vocals or TV audio in the background. For these edge cases, combining VAD with a short STT verification step — checking if the transcription is meaningful — provides an additional layer of filtering.
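The STT verification step can be as simple as a transcript sanity filter. This is a hypothetical sketch (the function name and noise-token list are illustrative, not from any STT SDK) of rejecting segments whose transcription is empty or consists only of filler/noise tokens:

```python
import re

def looks_like_real_speech(transcript: str, min_words: int = 1) -> bool:
    """Hypothetical post-VAD filter: reject transcripts that are empty or
    contain only filler/noise tokens some STT engines emit for non-speech."""
    NOISE_TOKENS = {"uh", "um", "hmm", "[noise]", "[music]", "(laughs)"}
    words = [w for w in re.split(r"\s+", transcript.strip().lower()) if w]
    meaningful = [w for w in words if w not in NOISE_TOKENS]
    return len(meaningful) >= min_words

print(looks_like_real_speech("turn off the lights"))  # True
print(looks_like_real_speech("[noise]"))              # False
print(looks_like_real_speech(""))                     # False
```

In practice the token list depends on your STT engine's conventions, and you may also want to check the engine's confidence score before treating a segment as a real user turn.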


#VoiceActivityDetection #VAD #SileroVAD #VoiceAI #AudioProcessing #SpeechDetection #AgenticAI #LearnAI #AIEngineering

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
