Audio Preprocessing for Voice Agents: Noise Reduction, Echo Cancellation, and Normalization
Build a complete audio preprocessing pipeline for voice AI agents — covering noise reduction, echo cancellation, gain normalization, and both client-side Web Audio API and server-side Python implementations.
Why Preprocessing Matters
Raw microphone audio is messy. It contains background noise (fans, traffic, other conversations), echo from the agent's own speech playing through speakers, volume inconsistencies (some users speak quietly, others shout), and room reverberation. Feeding raw audio directly to your STT engine degrades transcription accuracy and produces unreliable results.
A well-designed preprocessing pipeline cleans the audio before it reaches the STT engine, dramatically improving word accuracy and reducing hallucinated transcriptions. The goal is to deliver clean, normalized speech at a consistent volume level.
Client-Side Preprocessing with Web Audio API
The browser's Web Audio API lets you process audio in real time before sending it to the server. This reduces bandwidth and offloads processing from your backend.
class AudioPreprocessor {
  constructor() {
    this.audioContext = null;
    this.sourceNode = null;
  }

  async init(stream) {
    this.audioContext = new AudioContext({ sampleRate: 16000 });
    this.sourceNode = this.audioContext.createMediaStreamSource(stream);

    // High-pass filter to remove low-frequency rumble (below 80 Hz)
    const highPass = this.audioContext.createBiquadFilter();
    highPass.type = 'highpass';
    highPass.frequency.value = 80;
    highPass.Q.value = 0.7;

    // Low-pass filter to remove high-frequency hiss (above 8 kHz)
    const lowPass = this.audioContext.createBiquadFilter();
    lowPass.type = 'lowpass';
    lowPass.frequency.value = 8000;
    lowPass.Q.value = 0.7;

    // Compressor for volume normalization
    const compressor = this.audioContext.createDynamicsCompressor();
    compressor.threshold.value = -30; // Start compressing at -30 dB
    compressor.knee.value = 10;
    compressor.ratio.value = 4; // 4:1 compression ratio
    compressor.attack.value = 0.005; // 5 ms attack
    compressor.release.value = 0.1; // 100 ms release

    // Make-up gain to restore level after compression
    const gainNode = this.audioContext.createGain();
    gainNode.gain.value = 1.5;

    // Connect the chain
    this.sourceNode
      .connect(highPass)
      .connect(lowPass)
      .connect(compressor)
      .connect(gainNode);

    return gainNode;
  }

  getProcessedStream(gainNode) {
    const destination = this.audioContext.createMediaStreamDestination();
    gainNode.connect(destination);
    return destination.stream;
  }
}
// Usage
const rawStream = await navigator.mediaDevices.getUserMedia({ audio: true });
const preprocessor = new AudioPreprocessor();
const outputNode = await preprocessor.init(rawStream);
const cleanStream = preprocessor.getProcessedStream(outputNode);
// Use cleanStream for WebRTC or recording
AudioWorklet for Advanced Processing
For more sophisticated processing, such as adaptive noise suppression, use an AudioWorklet. Worklets run on the audio rendering thread, so the processing never blocks the main UI.
// noise-suppressor-worklet.js
class NoiseSuppressorProcessor extends AudioWorkletProcessor {
  constructor() {
    super();
    // One noise-floor estimate per position in the 128-sample render quantum
    this.noiseFloor = new Float32Array(128).fill(0.001);
    this.alpha = 0.98; // Smoothing factor for noise estimation
  }

  process(inputs, outputs) {
    const input = inputs[0][0];
    const output = outputs[0][0];
    if (!input) return true;

    for (let i = 0; i < input.length; i++) {
      const magnitude = Math.abs(input[i]);

      // Update the noise-floor estimate during silence
      if (magnitude < this.noiseFloor[i % 128] * 3) {
        this.noiseFloor[i % 128] =
          this.alpha * this.noiseFloor[i % 128] +
          (1 - this.alpha) * magnitude;
      }

      // Subtract the scaled noise estimate — a time-domain
      // analogue of spectral subtraction
      const noiseEst = this.noiseFloor[i % 128] * 2;
      if (magnitude > noiseEst) {
        output[i] = input[i] * (1 - noiseEst / magnitude);
      } else {
        output[i] = input[i] * 0.05; // Soft gate; don't zero out
      }
    }
    return true;
  }
}

registerProcessor('noise-suppressor', NoiseSuppressorProcessor);
Register and use the worklet in your main code:
await audioContext.audioWorklet.addModule('noise-suppressor-worklet.js');
const suppressorNode = new AudioWorkletNode(audioContext, 'noise-suppressor');
// Insert into the processing chain
sourceNode.connect(suppressorNode).connect(compressor);
Server-Side Preprocessing with Python
When you need more powerful noise reduction than what the browser can provide, process audio on the server using libraries like noisereduce and scipy.
import numpy as np
import noisereduce as nr
from scipy.signal import butter, sosfilt
from scipy.io import wavfile
class ServerAudioPreprocessor:
    def __init__(self, sample_rate: int = 16000):
        self.sample_rate = sample_rate
        self.target_rms = 0.1  # Target RMS for normalization

    def preprocess(self, audio: np.ndarray) -> np.ndarray:
        """Full preprocessing pipeline."""
        if np.issubdtype(audio.dtype, np.integer):
            # Convert int16 PCM to float in [-1, 1)
            audio = audio.astype(np.float32) / 32768.0
        else:
            audio = audio.astype(np.float32)
        audio = self._bandpass_filter(audio, low=80, high=8000)
        audio = self._reduce_noise(audio)
        audio = self._normalize(audio)
        audio = self._trim_silence(audio)
        return audio

    def _bandpass_filter(
        self, audio: np.ndarray, low: int, high: int
    ) -> np.ndarray:
        sos = butter(
            5, [low, high], btype='band',
            fs=self.sample_rate, output='sos',
        )
        return sosfilt(sos, audio)

    def _reduce_noise(self, audio: np.ndarray) -> np.ndarray:
        return nr.reduce_noise(
            y=audio,
            sr=self.sample_rate,
            stationary=False,   # Non-stationary mode suits real-world noise
            prop_decrease=0.8,  # Reduce noise by 80%
            n_fft=512,
            hop_length=128,
        )

    def _normalize(self, audio: np.ndarray) -> np.ndarray:
        rms = np.sqrt(np.mean(audio ** 2))
        if rms > 0:
            audio = audio * (self.target_rms / rms)
        return np.clip(audio, -1.0, 1.0)

    def _trim_silence(
        self, audio: np.ndarray, threshold: float = 0.01
    ) -> np.ndarray:
        mask = np.abs(audio) > threshold
        if not mask.any():
            return audio
        first = mask.argmax()
        last = len(mask) - mask[::-1].argmax()
        # Keep a small padding around the speech
        pad = int(0.05 * self.sample_rate)
        return audio[max(0, first - pad):min(len(audio), last + pad)]
# Usage
preprocessor = ServerAudioPreprocessor(sample_rate=16000)
sample_rate, raw_audio = wavfile.read("recording.wav")
clean_audio = preprocessor.preprocess(raw_audio)
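Most STT engines expect 16 kHz, 16-bit PCM input, so after preprocessing you typically convert the float samples back before sending them on. A minimal sketch — the `save_for_stt` helper name and the file path are illustrative, not part of the pipeline above:

```python
import numpy as np
from scipy.io import wavfile

def save_for_stt(audio: np.ndarray, path: str, sample_rate: int = 16000) -> None:
    """Convert float audio in [-1, 1] back to int16 PCM and write a WAV file."""
    pcm = (np.clip(audio, -1.0, 1.0) * 32767).astype(np.int16)
    wavfile.write(path, sample_rate, pcm)

# Example: round-trip a short sine burst
tone = 0.5 * np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
save_for_stt(tone, "clean.wav")
rate, data = wavfile.read("clean.wav")
print(rate, data.dtype)  # 16000 int16
```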
Echo Cancellation
Echo cancellation removes the agent's own voice from the user's microphone input. The browser handles this when you enable echoCancellation: true in getUserMedia. For server-side echo cancellation, you need the agent's output audio as a reference signal.
import numpy as np

class SimpleAEC:
    """Simplified acoustic echo cancellation using an NLMS adaptive filter."""

    def __init__(self, filter_length: int = 4096):
        self.filter_length = filter_length
        self.filter_coeffs = np.zeros(filter_length)
        self.mu = 0.01  # Learning rate

    def cancel_echo(
        self, mic_signal: np.ndarray, ref_signal: np.ndarray
    ) -> np.ndarray:
        """Remove the echo of ref_signal from mic_signal."""
        n = len(mic_signal)
        output = np.zeros(n)
        # Pass through the first samples, which the filter cannot cover yet
        output[:self.filter_length] = mic_signal[:self.filter_length]
        for i in range(self.filter_length, n):
            ref_chunk = ref_signal[i - self.filter_length:i][::-1]
            echo_estimate = np.dot(self.filter_coeffs, ref_chunk)
            error = mic_signal[i] - echo_estimate
            output[i] = error
            # Adaptive filter update (NLMS)
            power = np.dot(ref_chunk, ref_chunk) + 1e-10
            self.filter_coeffs += self.mu * error * ref_chunk / power
        return output
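A useful sanity check for this kind of adaptive filter is to feed it a synthetic echo and confirm the residual shrinks as the filter converges. The sketch below inlines a minimal NLMS loop mirroring `cancel_echo` so it runs standalone; `mu=0.5` and `filter_length=128` are illustrative choices for fast convergence on white noise, not production values:

```python
import numpy as np

def nlms_echo_cancel(mic, ref, filter_length=128, mu=0.5):
    """Minimal standalone NLMS echo canceller for testing."""
    w = np.zeros(filter_length)
    out = np.zeros_like(mic)
    out[:filter_length] = mic[:filter_length]
    for i in range(filter_length, len(mic)):
        x = ref[i - filter_length:i][::-1]
        e = mic[i] - w @ x
        out[i] = e
        w += mu * e * x / (x @ x + 1e-10)
    return out

# Synthetic echo: the reference leaks into the mic attenuated and delayed
rng = np.random.default_rng(0)
ref = rng.standard_normal(8000)
mic = np.zeros_like(ref)
mic[32:] = 0.5 * ref[:-32]  # pure echo, no near-end speech

out = nlms_echo_cancel(mic, ref)
print(np.mean(out[4000:] ** 2) < 0.01 * np.mean(mic[4000:] ** 2))  # True
```

With the echo path fully inside the filter span and no near-end speech, the residual in the second half of the signal drops orders of magnitude below the input echo energy.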
In practice, WebRTC's built-in AEC is far more sophisticated and handles non-linear echo, double-talk, and dynamic room conditions. Use it whenever possible.
FAQ
Should I preprocess audio on the client or the server?
Do both. Client-side preprocessing (filtering, compression, gain) reduces bandwidth and gives the server cleaner input. Server-side preprocessing (noise reduction, echo cancellation) handles the heavy lifting. This layered approach is standard in production voice systems. The browser's built-in audio constraints (echoCancellation, noiseSuppression, autoGainControl) provide a solid baseline that covers most everyday conditions.
Does preprocessing degrade STT accuracy?
It can: heavy-handed noise reduction or an overly narrow bandpass filter can strip speech content along with the noise. The key is to tune your preprocessing parameters on representative audio samples and measure the STT word error rate before and after. In most cases, well-tuned preprocessing reduces word error rate by 10-30% relative to raw audio.
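That before/after comparison needs a word error rate metric. A minimal WER implementation — edit distance over words, with text normalization and punctuation handling left out for brevity:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[-1][-1] / max(len(ref), 1)

print(word_error_rate("turn the lights off", "turn lights off"))  # one deletion -> 0.25
```

Run the same recordings through STT with and without preprocessing, then compare the two WER numbers; a lower score after preprocessing confirms your parameters help rather than hurt.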
How do I handle audio from different microphone types?
Different microphones (laptop built-in, USB headset, phone) have vastly different frequency responses and sensitivity levels. Normalization is the key — apply automatic gain control to bring all inputs to a consistent RMS level. The compressor in the Web Audio API chain handles this well. Additionally, the bandpass filter removes frequencies that are outside the speech range regardless of microphone type.
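If you need that gain control outside the browser, a frame-by-frame AGC on the server achieves the same effect as the compressor. A sketch under stated assumptions — the frame size, smoothing factor, and `max_gain` cap are illustrative, and a real AGC would also gate on speech activity so silence is not amplified:

```python
import numpy as np

def streaming_agc(audio, sample_rate=16000, target_rms=0.1,
                  frame_ms=20, max_gain=10.0):
    """Frame-by-frame automatic gain control: smooth the per-frame RMS
    and scale each frame toward target_rms."""
    frame = int(sample_rate * frame_ms / 1000)
    out = np.empty_like(audio, dtype=np.float32)
    smoothed = target_rms  # start at the target to avoid a loud first frame
    for start in range(0, len(audio), frame):
        chunk = audio[start:start + frame].astype(np.float32)
        rms = np.sqrt(np.mean(chunk ** 2)) + 1e-8
        smoothed = 0.9 * smoothed + 0.1 * rms  # smooth to avoid pumping
        gain = min(target_rms / smoothed, max_gain)
        out[start:start + frame] = np.clip(chunk * gain, -1.0, 1.0)
    return out
```

Because the gain tracks a smoothed RMS rather than the instantaneous level, a quiet headset mic and a hot USB mic both settle at roughly the same output level within a second or so.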
CallSphere Team
Expert insights on AI voice agents and customer communication automation.