Building a Voice UI for AI Agents: Microphone Input, Waveform Visualization, and Playback
Implement a voice interface for AI agents using the MediaRecorder API, real-time audio waveform visualization with Canvas, and audio playback controls in React.
Why Voice Interfaces for Agents
Voice interaction removes the typing bottleneck. Users can describe complex problems, provide context, and issue multi-step instructions faster through speech than text. Building a voice UI for an AI agent requires three capabilities: capturing microphone input, visualizing audio in real-time, and playing back agent audio responses.
Requesting Microphone Access
Microphone capture goes through navigator.mediaDevices.getUserMedia, which requires explicit user permission and a secure context (HTTPS or localhost). Wrap the permission request in a hook that tracks the microphone state.
import { useState, useCallback, useRef } from "react";

type MicStatus = "idle" | "requesting" | "active" | "denied" | "error";

function useMicrophone() {
  const [status, setStatus] = useState<MicStatus>("idle");
  const streamRef = useRef<MediaStream | null>(null);

  const requestAccess = useCallback(async () => {
    setStatus("requesting");
    try {
      const stream = await navigator.mediaDevices.getUserMedia({
        audio: {
          echoCancellation: true,
          noiseSuppression: true,
          sampleRate: 16000,
        },
      });
      streamRef.current = stream;
      setStatus("active");
      return stream;
    } catch (err) {
      const name = (err as DOMException).name;
      setStatus(name === "NotAllowedError" ? "denied" : "error");
      return null;
    }
  }, []);

  const stopMic = useCallback(() => {
    streamRef.current?.getTracks().forEach((t) => t.stop());
    streamRef.current = null;
    setStatus("idle");
  }, []);

  return { status, requestAccess, stopMic, stream: streamRef };
}
The sampleRate: 16000 constraint matters: most speech-to-text APIs expect 16 kHz audio, and capturing at that rate avoids client-side resampling. Note that browsers treat sampleRate as an ideal constraint rather than a guarantee, so some will still record at the hardware rate; check the track's actual settings if your backend is strict about the rate.
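Because the sample-rate constraint is best-effort, it is worth verifying what the track actually delivers. A minimal sketch: getSettings() is the real browser API, while needsResampling is a hypothetical helper added here for the comparison.

```typescript
// Hypothetical helper: decide whether the captured rate matches the target.
function needsResampling(
  actualRate: number | undefined,
  targetRate = 16000,
): boolean {
  // getSettings().sampleRate can be undefined in some browsers; assume a
  // mismatch so the backend resamples defensively.
  if (actualRate === undefined) return true;
  return actualRate !== targetRate;
}

// In the browser, after getUserMedia resolves:
// const settings = stream.getAudioTracks()[0].getSettings();
// if (needsResampling(settings.sampleRate)) {
//   // resample client-side, or let the server handle it
// }
```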
Recording Audio with MediaRecorder
The MediaRecorder API captures audio chunks from the microphone stream. Collect chunks in an array and assemble them into a Blob when recording stops.
function useAudioRecorder() {
  const [isRecording, setIsRecording] = useState(false);
  const recorderRef = useRef<MediaRecorder | null>(null);
  const chunksRef = useRef<Blob[]>([]);

  const startRecording = useCallback((stream: MediaStream) => {
    chunksRef.current = [];
    const recorder = new MediaRecorder(stream, {
      mimeType: "audio/webm;codecs=opus",
    });
    recorder.ondataavailable = (e) => {
      if (e.data.size > 0) chunksRef.current.push(e.data);
    };
    recorder.start(250); // Collect data every 250ms
    recorderRef.current = recorder;
    setIsRecording(true);
  }, []);

  const stopRecording = useCallback((): Promise<Blob> => {
    return new Promise((resolve, reject) => {
      const recorder = recorderRef.current;
      if (!recorder) {
        // Reject instead of returning early so the promise never hangs
        reject(new Error("No active recording"));
        return;
      }
      recorder.onstop = () => {
        const blob = new Blob(chunksRef.current, {
          type: recorder.mimeType || "audio/webm",
        });
        resolve(blob);
      };
      recorder.stop();
      setIsRecording(false);
    });
  }, []);

  return { isRecording, startRecording, stopRecording };
}
The 250ms interval in recorder.start(250) provides a good balance between responsiveness and efficiency. Smaller intervals create more chunks but allow for lower-latency streaming to the server.
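One caveat with the hard-coded mimeType above: container support varies across browsers, and Safari has historically lacked audio/webm, recording audio/mp4 instead. A hedged sketch of a picker that returns the first supported type — the probe function is injected so the logic can run outside a browser; in practice you would pass MediaRecorder.isTypeSupported:

```typescript
// Return the first MIME type the probe reports as supported, or
// undefined to let the browser choose its default container.
function pickMimeType(
  candidates: string[],
  isSupported: (type: string) => boolean,
): string | undefined {
  return candidates.find(isSupported);
}

// Browser usage (candidate list is an assumption, most preferred first):
// const mimeType = pickMimeType(
//   ["audio/webm;codecs=opus", "audio/webm", "audio/mp4"],
//   (t) => MediaRecorder.isTypeSupported(t),
// );
// const recorder = new MediaRecorder(stream, mimeType ? { mimeType } : undefined);
```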
Real-Time Waveform Visualization
A waveform gives visual feedback that audio is being captured. Use an AnalyserNode from the Web Audio API and draw the waveform on a Canvas element.
import { useEffect, useRef } from "react";

function WaveformVisualizer({
  stream,
  isActive,
}: {
  stream: MediaStream | null;
  isActive: boolean;
}) {
  const canvasRef = useRef<HTMLCanvasElement>(null);

  useEffect(() => {
    if (!stream || !isActive || !canvasRef.current) return;
    const audioCtx = new AudioContext();
    const analyser = audioCtx.createAnalyser();
    analyser.fftSize = 256;
    const source = audioCtx.createMediaStreamSource(stream);
    source.connect(analyser);

    const canvas = canvasRef.current;
    const ctx = canvas.getContext("2d")!;
    const bufferLength = analyser.frequencyBinCount;
    const dataArray = new Uint8Array(bufferLength);
    let animId: number;

    function draw() {
      animId = requestAnimationFrame(draw);
      analyser.getByteTimeDomainData(dataArray);
      ctx.fillStyle = "#f9fafb";
      ctx.fillRect(0, 0, canvas.width, canvas.height);
      ctx.lineWidth = 2;
      ctx.strokeStyle = "#3b82f6";
      ctx.beginPath();
      const sliceWidth = canvas.width / bufferLength;
      let x = 0;
      for (let i = 0; i < bufferLength; i++) {
        // Samples are unsigned bytes with silence at 128; this maps the
        // trace onto the canvas with silence on the vertical midline.
        const v = dataArray[i] / 128.0;
        const y = (v * canvas.height) / 2;
        if (i === 0) ctx.moveTo(x, y);
        else ctx.lineTo(x, y);
        x += sliceWidth;
      }
      ctx.lineTo(canvas.width, canvas.height / 2);
      ctx.stroke();
    }
    draw();

    return () => {
      cancelAnimationFrame(animId);
      source.disconnect();
      audioCtx.close();
    };
  }, [stream, isActive]);

  return (
    <canvas
      ref={canvasRef}
      width={300}
      height={80}
      className="rounded-lg border"
    />
  );
}
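The normalization in the draw loop is easy to misread: getByteTimeDomainData fills the array with unsigned bytes where 128 means silence, and dividing by 128 then scaling by half the canvas height puts silence on the vertical center. Extracted as a hypothetical pure helper to make the mapping explicit:

```typescript
// Map one unsigned 8-bit sample (0-255, silence at 128) to a y pixel
// position on a canvas of the given height, mirroring the draw loop.
function sampleToY(byteSample: number, canvasHeight: number): number {
  const v = byteSample / 128.0; // 0.0 .. ~2.0, silence at 1.0
  return (v * canvasHeight) / 2; // silence lands on the vertical center
}
```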
Audio Playback for Agent Responses
When the agent returns an audio response, create an Audio element and manage playback state.
function useAudioPlayback() {
  const [isPlaying, setIsPlaying] = useState(false);
  const audioRef = useRef<HTMLAudioElement | null>(null);

  const play = useCallback((audioUrl: string) => {
    const audio = new Audio(audioUrl);
    audioRef.current = audio;
    audio.onended = () => setIsPlaying(false);
    setIsPlaying(true);
    // play() returns a promise that rejects if autoplay is blocked
    audio.play().catch(() => setIsPlaying(false));
  }, []);

  const stop = useCallback(() => {
    audioRef.current?.pause();
    audioRef.current = null;
    setIsPlaying(false);
  }, []);

  return { isPlaying, play, stop };
}
Putting It All Together
Combine the hooks into a voice interaction component with record, send, and playback controls.
function VoiceAgentUI() {
  const mic = useMicrophone();
  const recorder = useAudioRecorder();
  const playback = useAudioPlayback();

  const handleRecord = async () => {
    const stream = await mic.requestAccess();
    if (stream) recorder.startRecording(stream);
  };

  const handleStop = async () => {
    const blob = await recorder.stopRecording();
    mic.stopMic();
    // Send blob to your agent API
    const formData = new FormData();
    formData.append("audio", blob, "recording.webm");
    const res = await fetch("/api/agent/voice", {
      method: "POST",
      body: formData,
    });
    if (!res.ok) throw new Error(`Agent request failed: ${res.status}`);
    const { audioUrl } = await res.json();
    playback.play(audioUrl);
  };

  return (
    <div className="flex flex-col items-center gap-4 p-6">
      <WaveformVisualizer
        stream={mic.stream.current}
        isActive={recorder.isRecording}
      />
      <button
        onClick={recorder.isRecording ? handleStop : handleRecord}
        className={`w-16 h-16 rounded-full ${
          recorder.isRecording ? "bg-red-500" : "bg-blue-600"
        } text-white`}
      >
        {recorder.isRecording ? "Stop" : "Mic"}
      </button>
    </div>
  );
}
FAQ
What audio format should I send to the speech-to-text API?
Most APIs accept audio/webm with the Opus codec, which is what MediaRecorder produces by default in Chrome and Firefox; Safari records audio/mp4 instead, so check MediaRecorder.isTypeSupported before assuming a container. If your API requires WAV or PCM, convert the recorded blob before sending, either with a small encoder or a library such as audiobuffer-to-wav.
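If you would rather not pull in a library, the WAV conversion is small enough to inline. A minimal sketch of a 16-bit mono PCM WAV encoder, assuming you have already decoded the recording to Float32 samples (for example via AudioContext.decodeAudioData in the browser):

```typescript
// Encode Float32 samples (-1.0..1.0) as a 16-bit mono PCM WAV buffer.
function encodeWav(samples: Float32Array, sampleRate: number): ArrayBuffer {
  const buffer = new ArrayBuffer(44 + samples.length * 2);
  const view = new DataView(buffer);
  const writeString = (offset: number, s: string) => {
    for (let i = 0; i < s.length; i++) view.setUint8(offset + i, s.charCodeAt(i));
  };
  writeString(0, "RIFF");
  view.setUint32(4, 36 + samples.length * 2, true); // remaining chunk size
  writeString(8, "WAVE");
  writeString(12, "fmt ");
  view.setUint32(16, 16, true); // fmt sub-chunk size
  view.setUint16(20, 1, true); // audio format: PCM
  view.setUint16(22, 1, true); // channels: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * 2, true); // byte rate
  view.setUint16(32, 2, true); // block align
  view.setUint16(34, 16, true); // bits per sample
  writeString(36, "data");
  view.setUint32(40, samples.length * 2, true);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i])); // clamp out-of-range input
    view.setInt16(44 + i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return buffer;
}
```

Wrap the result in `new Blob([encodeWav(samples, 16000)], { type: "audio/wav" })` before appending it to your form data.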
How do I handle the microphone permission prompt appearing multiple times?
The browser remembers permission grants per origin. If you serve your app over HTTPS, the user sees the prompt only once unless they explicitly revoke access. On localhost during development the prompt may reappear. In browsers that support it, navigator.permissions.query({ name: "microphone" }) reports the current permission state before you call getUserMedia.
Can I stream audio to the agent in real-time instead of recording first?
Yes. Use the ondataavailable callback with a short interval (100-250ms) and send each chunk to a WebSocket endpoint as it arrives. This enables real-time speech-to-text and reduces perceived latency because the agent starts processing before the user finishes speaking.
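One practical wrinkle when streaming: ondataavailable can fire before the WebSocket finishes connecting. A hedged sketch of a small sender that queues chunks until the transport is ready — the send function is injected so the queueing logic can run without a browser; in practice you would pass (data) => ws.send(data) and call markReady from ws.onopen:

```typescript
// Queue chunks until the transport is ready, then flush in arrival order.
class ChunkSender<T> {
  private queue: T[] = [];
  private ready = false;

  constructor(private send: (chunk: T) => void) {}

  // Call from the transport's open handler (e.g. ws.onopen).
  markReady(): void {
    this.ready = true;
    for (const chunk of this.queue) this.send(chunk);
    this.queue = [];
  }

  // Call from recorder.ondataavailable with each chunk.
  push(chunk: T): void {
    if (this.ready) this.send(chunk);
    else this.queue.push(chunk);
  }
}
```

Pair it with recorder.start(250) so chunks arrive every 250 ms, and close the socket only after recorder.onstop fires so the final chunk is delivered.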
CallSphere Team
Expert insights on AI voice agents and customer communication automation.