Building a Voice UI for AI Agents: Microphone Input, Waveform Visualization, and Playback
Implement a voice interface for AI agents using the MediaRecorder API, real-time audio waveform visualization with Canvas, and audio playback controls in React.
Why Voice Interfaces for Agents
Voice interaction removes the typing bottleneck. Users can describe complex problems, provide context, and issue multi-step instructions faster through speech than text. Building a voice UI for an AI agent requires three capabilities: capturing microphone input, visualizing audio in real-time, and playing back agent audio responses.
Requesting Microphone Access
Microphone capture goes through navigator.mediaDevices.getUserMedia, which requires explicit user permission and a secure context (HTTPS or localhost). Wrap the permission request in a hook that tracks the microphone state.
import { useState, useCallback, useRef } from "react";

type MicStatus = "idle" | "requesting" | "active" | "denied" | "error";

function useMicrophone() {
  const [status, setStatus] = useState<MicStatus>("idle");
  const streamRef = useRef<MediaStream | null>(null);

  const requestAccess = useCallback(async () => {
    setStatus("requesting");
    try {
      const stream = await navigator.mediaDevices.getUserMedia({
        audio: {
          echoCancellation: true,
          noiseSuppression: true,
          sampleRate: 16000,
        },
      });
      streamRef.current = stream;
      setStatus("active");
      return stream;
    } catch (err) {
      const name = (err as DOMException).name;
      setStatus(name === "NotAllowedError" ? "denied" : "error");
      return null;
    }
  }, []);

  const stopMic = useCallback(() => {
    streamRef.current?.getTracks().forEach((t) => t.stop());
    streamRef.current = null;
    setStatus("idle");
  }, []);

  return { status, requestAccess, stopMic, stream: streamRef };
}
The sampleRate: 16000 constraint matters: most speech-to-text APIs expect 16 kHz audio, and capturing at that rate avoids client-side resampling. Note that browsers treat sampleRate as an ideal constraint rather than a guarantee, so some will still record at the hardware rate; check the track's actual settings if your backend is strict about the rate.
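Because the sample-rate constraint is best-effort, it is worth verifying what the track actually delivers. A minimal sketch: getSettings() is the real browser API, while needsResampling is a hypothetical helper added here for the comparison.

```typescript
// Hypothetical helper: decide whether the captured rate matches the target.
function needsResampling(
  actualRate: number | undefined,
  targetRate = 16000,
): boolean {
  // getSettings().sampleRate can be undefined in some browsers; assume a
  // mismatch so the backend resamples defensively.
  if (actualRate === undefined) return true;
  return actualRate !== targetRate;
}

// In the browser, after getUserMedia resolves:
// const settings = stream.getAudioTracks()[0].getSettings();
// if (needsResampling(settings.sampleRate)) {
//   // resample client-side, or let the server handle it
// }
```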
Recording Audio with MediaRecorder
The MediaRecorder API captures audio chunks from the microphone stream. Collect chunks in an array and assemble them into a Blob when recording stops.
function useAudioRecorder() {
  const [isRecording, setIsRecording] = useState(false);
  const recorderRef = useRef<MediaRecorder | null>(null);
  const chunksRef = useRef<Blob[]>([]);

  const startRecording = useCallback((stream: MediaStream) => {
    chunksRef.current = [];
    const recorder = new MediaRecorder(stream, {
      mimeType: "audio/webm;codecs=opus",
    });
    recorder.ondataavailable = (e) => {
      if (e.data.size > 0) chunksRef.current.push(e.data);
    };
    recorder.start(250); // Collect data every 250ms
    recorderRef.current = recorder;
    setIsRecording(true);
  }, []);

  const stopRecording = useCallback((): Promise<Blob> => {
    return new Promise((resolve, reject) => {
      const recorder = recorderRef.current;
      if (!recorder) {
        // Reject instead of returning early so the promise never hangs
        reject(new Error("No active recording"));
        return;
      }
      recorder.onstop = () => {
        const blob = new Blob(chunksRef.current, {
          type: recorder.mimeType || "audio/webm",
        });
        resolve(blob);
      };
      recorder.stop();
      setIsRecording(false);
    });
  }, []);

  return { isRecording, startRecording, stopRecording };
}
The 250ms interval in recorder.start(250) provides a good balance between responsiveness and efficiency. Smaller intervals create more chunks but allow for lower-latency streaming to the server.
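One caveat with the hard-coded mimeType above: container support varies across browsers, and Safari has historically lacked audio/webm, recording audio/mp4 instead. A hedged sketch of a picker that returns the first supported type — the probe function is injected so the logic can run outside a browser; in practice you would pass MediaRecorder.isTypeSupported:

```typescript
// Return the first MIME type the probe reports as supported, or
// undefined to let the browser choose its default container.
function pickMimeType(
  candidates: string[],
  isSupported: (type: string) => boolean,
): string | undefined {
  return candidates.find(isSupported);
}

// Browser usage (candidate list is an assumption, most preferred first):
// const mimeType = pickMimeType(
//   ["audio/webm;codecs=opus", "audio/webm", "audio/mp4"],
//   (t) => MediaRecorder.isTypeSupported(t),
// );
// const recorder = new MediaRecorder(stream, mimeType ? { mimeType } : undefined);
```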
Real-Time Waveform Visualization
A waveform gives visual feedback that audio is being captured. Use an AnalyserNode from the Web Audio API and draw the waveform on a Canvas element.
import { useEffect, useRef } from "react";

function WaveformVisualizer({
  stream,
  isActive,
}: {
  stream: MediaStream | null;
  isActive: boolean;
}) {
  const canvasRef = useRef<HTMLCanvasElement>(null);

  useEffect(() => {
    if (!stream || !isActive || !canvasRef.current) return;
    const audioCtx = new AudioContext();
    const analyser = audioCtx.createAnalyser();
    analyser.fftSize = 256;
    const source = audioCtx.createMediaStreamSource(stream);
    source.connect(analyser);

    const canvas = canvasRef.current;
    const ctx = canvas.getContext("2d")!;
    const bufferLength = analyser.frequencyBinCount;
    const dataArray = new Uint8Array(bufferLength);
    let animId: number;

    function draw() {
      animId = requestAnimationFrame(draw);
      analyser.getByteTimeDomainData(dataArray);
      ctx.fillStyle = "#f9fafb";
      ctx.fillRect(0, 0, canvas.width, canvas.height);
      ctx.lineWidth = 2;
      ctx.strokeStyle = "#3b82f6";
      ctx.beginPath();
      const sliceWidth = canvas.width / bufferLength;
      let x = 0;
      for (let i = 0; i < bufferLength; i++) {
        // Samples are unsigned bytes with silence at 128; this maps the
        // trace onto the canvas with silence on the vertical midline.
        const v = dataArray[i] / 128.0;
        const y = (v * canvas.height) / 2;
        if (i === 0) ctx.moveTo(x, y);
        else ctx.lineTo(x, y);
        x += sliceWidth;
      }
      ctx.lineTo(canvas.width, canvas.height / 2);
      ctx.stroke();
    }
    draw();

    return () => {
      cancelAnimationFrame(animId);
      source.disconnect();
      audioCtx.close();
    };
  }, [stream, isActive]);

  return (
    <canvas
      ref={canvasRef}
      width={300}
      height={80}
      className="rounded-lg border"
    />
  );
}
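The normalization in the draw loop is easy to misread: getByteTimeDomainData fills the array with unsigned bytes where 128 means silence, and dividing by 128 then scaling by half the canvas height puts silence on the vertical center. Extracted as a hypothetical pure helper to make the mapping explicit:

```typescript
// Map one unsigned 8-bit sample (0-255, silence at 128) to a y pixel
// position on a canvas of the given height, mirroring the draw loop.
function sampleToY(byteSample: number, canvasHeight: number): number {
  const v = byteSample / 128.0; // 0.0 .. ~2.0, silence at 1.0
  return (v * canvasHeight) / 2; // silence lands on the vertical center
}
```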
Audio Playback for Agent Responses
When the agent returns an audio response, create an Audio element and manage playback state.
function useAudioPlayback() {
  const [isPlaying, setIsPlaying] = useState(false);
  const audioRef = useRef<HTMLAudioElement | null>(null);

  const play = useCallback((audioUrl: string) => {
    const audio = new Audio(audioUrl);
    audioRef.current = audio;
    audio.onended = () => setIsPlaying(false);
    setIsPlaying(true);
    // play() returns a promise that rejects if autoplay is blocked
    audio.play().catch(() => setIsPlaying(false));
  }, []);

  const stop = useCallback(() => {
    audioRef.current?.pause();
    audioRef.current = null;
    setIsPlaying(false);
  }, []);

  return { isPlaying, play, stop };
}
Putting It All Together
Combine the hooks into a voice interaction component with record, send, and playback controls.
function VoiceAgentUI() {
  const mic = useMicrophone();
  const recorder = useAudioRecorder();
  const playback = useAudioPlayback();

  const handleRecord = async () => {
    const stream = await mic.requestAccess();
    if (stream) recorder.startRecording(stream);
  };

  const handleStop = async () => {
    const blob = await recorder.stopRecording();
    mic.stopMic();
    // Send blob to your agent API
    const formData = new FormData();
    formData.append("audio", blob, "recording.webm");
    const res = await fetch("/api/agent/voice", {
      method: "POST",
      body: formData,
    });
    if (!res.ok) throw new Error(`Agent request failed: ${res.status}`);
    const { audioUrl } = await res.json();
    playback.play(audioUrl);
  };

  return (
    <div className="flex flex-col items-center gap-4 p-6">
      <WaveformVisualizer
        stream={mic.stream.current}
        isActive={recorder.isRecording}
      />
      <button
        onClick={recorder.isRecording ? handleStop : handleRecord}
        className={`w-16 h-16 rounded-full ${
          recorder.isRecording ? "bg-red-500" : "bg-blue-600"
        } text-white`}
      >
        {recorder.isRecording ? "Stop" : "Mic"}
      </button>
    </div>
  );
}
FAQ
What audio format should I send to the speech-to-text API?
Most APIs accept audio/webm with the Opus codec, which is what MediaRecorder produces by default in Chrome and Firefox; Safari records audio/mp4 instead, so check MediaRecorder.isTypeSupported before assuming a container. If your API requires WAV or PCM, convert the recorded blob before sending, either with a small encoder or a library such as audiobuffer-to-wav.
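If you would rather not pull in a library, the WAV conversion is small enough to inline. A minimal sketch of a 16-bit mono PCM WAV encoder, assuming you have already decoded the recording to Float32 samples (for example via AudioContext.decodeAudioData in the browser):

```typescript
// Encode Float32 samples (-1.0..1.0) as a 16-bit mono PCM WAV buffer.
function encodeWav(samples: Float32Array, sampleRate: number): ArrayBuffer {
  const buffer = new ArrayBuffer(44 + samples.length * 2);
  const view = new DataView(buffer);
  const writeString = (offset: number, s: string) => {
    for (let i = 0; i < s.length; i++) view.setUint8(offset + i, s.charCodeAt(i));
  };
  writeString(0, "RIFF");
  view.setUint32(4, 36 + samples.length * 2, true); // remaining chunk size
  writeString(8, "WAVE");
  writeString(12, "fmt ");
  view.setUint32(16, 16, true); // fmt sub-chunk size
  view.setUint16(20, 1, true); // audio format: PCM
  view.setUint16(22, 1, true); // channels: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * 2, true); // byte rate
  view.setUint16(32, 2, true); // block align
  view.setUint16(34, 16, true); // bits per sample
  writeString(36, "data");
  view.setUint32(40, samples.length * 2, true);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i])); // clamp out-of-range input
    view.setInt16(44 + i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return buffer;
}
```

Wrap the result in `new Blob([encodeWav(samples, 16000)], { type: "audio/wav" })` before appending it to your form data.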
How do I handle the microphone permission prompt appearing multiple times?
The browser remembers permission grants per origin. If you serve your app over HTTPS, the user sees the prompt only once unless they explicitly revoke access. On localhost during development the prompt may reappear. In browsers that support it, navigator.permissions.query({ name: "microphone" }) reports the current permission state before you call getUserMedia.
Can I stream audio to the agent in real-time instead of recording first?
Yes. Use the ondataavailable callback with a short interval (100-250ms) and send each chunk to a WebSocket endpoint as it arrives. This enables real-time speech-to-text and reduces perceived latency because the agent starts processing before the user finishes speaking.
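One practical wrinkle when streaming: ondataavailable can fire before the WebSocket finishes connecting. A hedged sketch of a small sender that queues chunks until the transport is ready — the send function is injected so the queueing logic can run without a browser; in practice you would pass (data) => ws.send(data) and call markReady from ws.onopen:

```typescript
// Queue chunks until the transport is ready, then flush in arrival order.
class ChunkSender<T> {
  private queue: T[] = [];
  private ready = false;

  constructor(private send: (chunk: T) => void) {}

  // Call from the transport's open handler (e.g. ws.onopen).
  markReady(): void {
    this.ready = true;
    for (const chunk of this.queue) this.send(chunk);
    this.queue = [];
  }

  // Call from recorder.ondataavailable with each chunk.
  push(chunk: T): void {
    if (this.ready) this.send(chunk);
    else this.queue.push(chunk);
  }
}
```

Pair it with recorder.start(250) so chunks arrive every 250 ms, and close the socket only after recorder.onstop fires so the final chunk is delivered.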
CallSphere Team
Expert insights on AI voice agents and customer communication automation.