Audio Preprocessing for Voice Agents: Noise Reduction, Echo Cancellation, and Normalization
Build a complete audio preprocessing pipeline for voice AI agents — covering noise reduction, echo cancellation, gain normalization, and both client-side Web Audio API and server-side Python implementations.
Why Preprocessing Matters
Raw microphone audio is messy. It contains background noise (fans, traffic, other conversations), echo from the agent's own speech playing through speakers, volume inconsistencies (some users speak quietly, others shout), and room reverberation. Feeding raw audio directly to your STT engine degrades transcription accuracy and produces unreliable results.
A well-designed preprocessing pipeline cleans the audio before it reaches the STT engine, dramatically improving word accuracy and reducing hallucinated transcriptions. The goal is to deliver clean, normalized speech at a consistent volume level.
Client-Side Preprocessing with Web Audio API
The browser's Web Audio API lets you process audio in real time before sending it to the server. This reduces bandwidth and offloads processing from your backend.
class AudioPreprocessor {
  constructor() {
    this.audioContext = null;
    this.sourceNode = null;
  }

  async init(stream) {
    this.audioContext = new AudioContext({ sampleRate: 16000 });
    this.sourceNode = this.audioContext.createMediaStreamSource(stream);

    // High-pass filter to remove low-frequency rumble (below 80 Hz)
    const highPass = this.audioContext.createBiquadFilter();
    highPass.type = 'highpass';
    highPass.frequency.value = 80;
    highPass.Q.value = 0.7;

    // Low-pass filter to remove high-frequency hiss (above 8 kHz)
    const lowPass = this.audioContext.createBiquadFilter();
    lowPass.type = 'lowpass';
    lowPass.frequency.value = 8000;
    lowPass.Q.value = 0.7;

    // Compressor for volume normalization
    const compressor = this.audioContext.createDynamicsCompressor();
    compressor.threshold.value = -30; // Start compressing at -30 dB
    compressor.knee.value = 10;
    compressor.ratio.value = 4; // 4:1 compression ratio
    compressor.attack.value = 0.005; // 5 ms attack
    compressor.release.value = 0.1; // 100 ms release

    // Make-up gain to restore level after compression
    const gainNode = this.audioContext.createGain();
    gainNode.gain.value = 1.5;

    // Connect the chain
    this.sourceNode
      .connect(highPass)
      .connect(lowPass)
      .connect(compressor)
      .connect(gainNode);

    return gainNode;
  }

  getProcessedStream(gainNode) {
    const destination = this.audioContext.createMediaStreamDestination();
    gainNode.connect(destination);
    return destination.stream;
  }
}
// Usage
const rawStream = await navigator.mediaDevices.getUserMedia({ audio: true });
const preprocessor = new AudioPreprocessor();
const outputNode = await preprocessor.init(rawStream);
const cleanStream = preprocessor.getProcessedStream(outputNode);
// Use cleanStream for WebRTC or recording
AudioWorklet for Advanced Processing
For more sophisticated processing, such as adaptive noise suppression, use an AudioWorklet. Worklets run on the audio rendering thread, so the processing never blocks the main UI.
// noise-suppressor-worklet.js
class NoiseSuppressorProcessor extends AudioWorkletProcessor {
  constructor() {
    super();
    // One noise-floor estimate per position in the 128-sample render quantum
    this.noiseFloor = new Float32Array(128).fill(0.001);
    this.alpha = 0.98; // Smoothing factor for noise estimation
  }

  process(inputs, outputs) {
    const input = inputs[0][0];
    const output = outputs[0][0];
    if (!input) return true;

    for (let i = 0; i < input.length; i++) {
      const magnitude = Math.abs(input[i]);

      // Update the noise-floor estimate during silence
      if (magnitude < this.noiseFloor[i % 128] * 3) {
        this.noiseFloor[i % 128] =
          this.alpha * this.noiseFloor[i % 128] +
          (1 - this.alpha) * magnitude;
      }

      // Subtract the scaled noise estimate — a time-domain
      // analogue of spectral subtraction
      const noiseEst = this.noiseFloor[i % 128] * 2;
      if (magnitude > noiseEst) {
        output[i] = input[i] * (1 - noiseEst / magnitude);
      } else {
        output[i] = input[i] * 0.05; // Soft gate; don't zero out
      }
    }
    return true;
  }
}

registerProcessor('noise-suppressor', NoiseSuppressorProcessor);
Register and use the worklet in your main code:
await audioContext.audioWorklet.addModule('noise-suppressor-worklet.js');
const suppressorNode = new AudioWorkletNode(audioContext, 'noise-suppressor');
// Insert into the processing chain
sourceNode.connect(suppressorNode).connect(compressor);
Server-Side Preprocessing with Python
When you need more powerful noise reduction than what the browser can provide, process audio on the server using libraries like noisereduce and scipy.
import numpy as np
import noisereduce as nr
from scipy.signal import butter, sosfilt
from scipy.io import wavfile
class ServerAudioPreprocessor:
    def __init__(self, sample_rate: int = 16000):
        self.sample_rate = sample_rate
        self.target_rms = 0.1  # Target RMS for normalization

    def preprocess(self, audio: np.ndarray) -> np.ndarray:
        """Full preprocessing pipeline."""
        if np.issubdtype(audio.dtype, np.integer):
            # Convert int16 PCM to float in [-1, 1)
            audio = audio.astype(np.float32) / 32768.0
        else:
            audio = audio.astype(np.float32)
        audio = self._bandpass_filter(audio, low=80, high=8000)
        audio = self._reduce_noise(audio)
        audio = self._normalize(audio)
        audio = self._trim_silence(audio)
        return audio

    def _bandpass_filter(
        self, audio: np.ndarray, low: int, high: int
    ) -> np.ndarray:
        sos = butter(
            5, [low, high], btype='band',
            fs=self.sample_rate, output='sos',
        )
        return sosfilt(sos, audio)

    def _reduce_noise(self, audio: np.ndarray) -> np.ndarray:
        return nr.reduce_noise(
            y=audio,
            sr=self.sample_rate,
            stationary=False,   # Non-stationary mode suits real-world noise
            prop_decrease=0.8,  # Reduce noise by 80%
            n_fft=512,
            hop_length=128,
        )

    def _normalize(self, audio: np.ndarray) -> np.ndarray:
        rms = np.sqrt(np.mean(audio ** 2))
        if rms > 0:
            audio = audio * (self.target_rms / rms)
        return np.clip(audio, -1.0, 1.0)

    def _trim_silence(
        self, audio: np.ndarray, threshold: float = 0.01
    ) -> np.ndarray:
        mask = np.abs(audio) > threshold
        if not mask.any():
            return audio
        first = mask.argmax()
        last = len(mask) - mask[::-1].argmax()
        # Keep a small padding around the speech
        pad = int(0.05 * self.sample_rate)
        return audio[max(0, first - pad):min(len(audio), last + pad)]
# Usage
preprocessor = ServerAudioPreprocessor(sample_rate=16000)
sample_rate, raw_audio = wavfile.read("recording.wav")
clean_audio = preprocessor.preprocess(raw_audio)
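Most STT engines expect 16 kHz, 16-bit PCM input, so after preprocessing you typically convert the float samples back before sending them on. A minimal sketch — the `save_for_stt` helper name and the file path are illustrative, not part of the pipeline above:

```python
import numpy as np
from scipy.io import wavfile

def save_for_stt(audio: np.ndarray, path: str, sample_rate: int = 16000) -> None:
    """Convert float audio in [-1, 1] back to int16 PCM and write a WAV file."""
    pcm = (np.clip(audio, -1.0, 1.0) * 32767).astype(np.int16)
    wavfile.write(path, sample_rate, pcm)

# Example: round-trip a short sine burst
tone = 0.5 * np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
save_for_stt(tone, "clean.wav")
rate, data = wavfile.read("clean.wav")
print(rate, data.dtype)  # 16000 int16
```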
Echo Cancellation
Echo cancellation removes the agent's own voice from the user's microphone input. The browser handles this when you enable echoCancellation: true in getUserMedia. For server-side echo cancellation, you need the agent's output audio as a reference signal.
import numpy as np

class SimpleAEC:
    """Simplified acoustic echo cancellation using an NLMS adaptive filter."""

    def __init__(self, filter_length: int = 4096):
        self.filter_length = filter_length
        self.filter_coeffs = np.zeros(filter_length)
        self.mu = 0.01  # Learning rate

    def cancel_echo(
        self, mic_signal: np.ndarray, ref_signal: np.ndarray
    ) -> np.ndarray:
        """Remove the echo of ref_signal from mic_signal."""
        n = len(mic_signal)
        output = np.zeros(n)
        # Pass through the first samples, which the filter cannot cover yet
        output[:self.filter_length] = mic_signal[:self.filter_length]
        for i in range(self.filter_length, n):
            ref_chunk = ref_signal[i - self.filter_length:i][::-1]
            echo_estimate = np.dot(self.filter_coeffs, ref_chunk)
            error = mic_signal[i] - echo_estimate
            output[i] = error
            # Adaptive filter update (NLMS)
            power = np.dot(ref_chunk, ref_chunk) + 1e-10
            self.filter_coeffs += self.mu * error * ref_chunk / power
        return output
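A useful sanity check for this kind of adaptive filter is to feed it a synthetic echo and confirm the residual shrinks as the filter converges. The sketch below inlines a minimal NLMS loop mirroring `cancel_echo` so it runs standalone; `mu=0.5` and `filter_length=128` are illustrative choices for fast convergence on white noise, not production values:

```python
import numpy as np

def nlms_echo_cancel(mic, ref, filter_length=128, mu=0.5):
    """Minimal standalone NLMS echo canceller for testing."""
    w = np.zeros(filter_length)
    out = np.zeros_like(mic)
    out[:filter_length] = mic[:filter_length]
    for i in range(filter_length, len(mic)):
        x = ref[i - filter_length:i][::-1]
        e = mic[i] - w @ x
        out[i] = e
        w += mu * e * x / (x @ x + 1e-10)
    return out

# Synthetic echo: the reference leaks into the mic attenuated and delayed
rng = np.random.default_rng(0)
ref = rng.standard_normal(8000)
mic = np.zeros_like(ref)
mic[32:] = 0.5 * ref[:-32]  # pure echo, no near-end speech

out = nlms_echo_cancel(mic, ref)
print(np.mean(out[4000:] ** 2) < 0.01 * np.mean(mic[4000:] ** 2))  # True
```

With the echo path fully inside the filter span and no near-end speech, the residual in the second half of the signal drops orders of magnitude below the input echo energy.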
In practice, WebRTC's built-in AEC is far more sophisticated and handles non-linear echo, double-talk, and dynamic room conditions. Use it whenever possible.
FAQ
Should I preprocess audio on the client or the server?
Do both. Client-side preprocessing (filtering, compression, gain) reduces bandwidth and gives the server cleaner input. Server-side preprocessing (noise reduction, echo cancellation) handles the heavy lifting. This layered approach is standard in production voice systems. The browser's built-in audio constraints (echoCancellation, noiseSuppression, autoGainControl) provide a solid baseline that covers most everyday conditions.
Does preprocessing degrade STT accuracy?
It can: heavy-handed noise reduction or an overly narrow bandpass filter can strip speech content along with the noise. The key is to tune your preprocessing parameters on representative audio samples and measure the STT word error rate before and after. In most cases, well-tuned preprocessing reduces word error rate by 10-30% relative to raw audio.
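That before/after comparison needs a word error rate metric. A minimal WER implementation — edit distance over words, with text normalization and punctuation handling left out for brevity:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[-1][-1] / max(len(ref), 1)

print(word_error_rate("turn the lights off", "turn lights off"))  # one deletion -> 0.25
```

Run the same recordings through STT with and without preprocessing, then compare the two WER numbers; a lower score after preprocessing confirms your parameters help rather than hurt.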
How do I handle audio from different microphone types?
Different microphones (laptop built-in, USB headset, phone) have vastly different frequency responses and sensitivity levels. Normalization is the key — apply automatic gain control to bring all inputs to a consistent RMS level. The compressor in the Web Audio API chain handles this well. Additionally, the bandpass filter removes frequencies that are outside the speech range regardless of microphone type.
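If you need that gain control outside the browser, a frame-by-frame AGC on the server achieves the same effect as the compressor. A sketch under stated assumptions — the frame size, smoothing factor, and `max_gain` cap are illustrative, and a real AGC would also gate on speech activity so silence is not amplified:

```python
import numpy as np

def streaming_agc(audio, sample_rate=16000, target_rms=0.1,
                  frame_ms=20, max_gain=10.0):
    """Frame-by-frame automatic gain control: smooth the per-frame RMS
    and scale each frame toward target_rms."""
    frame = int(sample_rate * frame_ms / 1000)
    out = np.empty_like(audio, dtype=np.float32)
    smoothed = target_rms  # start at the target to avoid a loud first frame
    for start in range(0, len(audio), frame):
        chunk = audio[start:start + frame].astype(np.float32)
        rms = np.sqrt(np.mean(chunk ** 2)) + 1e-8
        smoothed = 0.9 * smoothed + 0.1 * rms  # smooth to avoid pumping
        gain = min(target_rms / smoothed, max_gain)
        out[start:start + frame] = np.clip(chunk * gain, -1.0, 1.0)
    return out
```

Because the gain tracks a smoothed RMS rather than the instantaneous level, a quiet headset mic and a hot USB mic both settle at roughly the same output level within a second or so.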
CallSphere Team
Expert insights on AI voice agents and customer communication automation.