
Raspberry Pi AI Agent: Building a Hardware-Based Voice Assistant

Build a complete voice-controlled AI agent on a Raspberry Pi, covering hardware setup, model selection, audio input/output, wake word detection, and tool integration for home automation.

What You Will Build

In this guide, you will build a standalone AI voice assistant running entirely on a Raspberry Pi 5. It listens for a wake word, transcribes your speech locally, processes the request through a small language model, and responds with synthesized speech — all without cloud API calls.

Hardware Requirements

  • Raspberry Pi 5 (8 GB RAM recommended, 4 GB minimum)
  • USB microphone or a ReSpeaker HAT for higher quality audio
  • Speaker connected via 3.5mm jack or USB
  • MicroSD card (64 GB or larger for model storage)
  • Power supply (USB-C, 5V 5A for Pi 5)

Software Setup

Start with a clean Raspberry Pi OS (64-bit) installation:

# Update system
sudo apt update && sudo apt upgrade -y

# Install audio dependencies
sudo apt install -y portaudio19-dev python3-pyaudio espeak-ng libespeak-ng-dev

# Install Python dependencies (Raspberry Pi OS Bookworm blocks system-wide
# pip installs, so use a virtual environment)
python3 -m venv ~/agent-env && source ~/agent-env/bin/activate
pip install numpy onnxruntime openwakeword faster-whisper piper-tts

# Verify audio devices
python3 -c "import pyaudio; p = pyaudio.PyAudio(); print(p.get_default_input_device_info())"

Wake Word Detection

The agent needs to listen continuously but only process speech after hearing a wake word. The openwakeword library provides lightweight wake word models:

import pyaudio
import numpy as np
from openwakeword.model import Model as WakeModel

class WakeWordListener:
    """Listens for a wake word using openwakeword."""

    CHUNK = 1280  # 80ms at 16kHz
    RATE = 16000
    FORMAT = pyaudio.paInt16

    def __init__(self, threshold: float = 0.5):
        self.model = WakeModel(
            wakeword_models=["hey_jarvis"],
            inference_framework="onnx",
        )
        self.threshold = threshold
        self.audio = pyaudio.PyAudio()

    def listen_for_wake_word(self) -> bool:
        """Block until wake word is detected."""
        stream = self.audio.open(
            format=self.FORMAT,
            channels=1,
            rate=self.RATE,
            input=True,
            frames_per_buffer=self.CHUNK,
        )

        print("Listening for wake word...")
        try:
            while True:
                audio_data = stream.read(self.CHUNK, exception_on_overflow=False)
                audio_array = np.frombuffer(audio_data, dtype=np.int16)
                prediction = self.model.predict(audio_array)

                for model_name, score in prediction.items():
                    if score > self.threshold:
                        print(f"Wake word detected! (score: {score:.2f})")
                        return True
        finally:
            stream.stop_stream()
            stream.close()

Speech-to-Text with Faster Whisper

After the wake word triggers, capture and transcribe the user's speech:

import numpy as np
import pyaudio
from faster_whisper import WhisperModel

class SpeechRecognizer:
    """Transcribes speech using Whisper running locally."""

    def __init__(self, model_size: str = "tiny.en"):
        # tiny.en is ~75MB, runs fast on Pi 5
        # Use "base.en" (~140MB) for better accuracy
        self.model = WhisperModel(
            model_size,
            device="cpu",
            compute_type="int8",
        )
        self.audio = pyaudio.PyAudio()

    def record_until_silence(
        self,
        silence_threshold: int = 500,
        silence_duration: float = 1.5,
        max_duration: float = 15.0,
    ) -> np.ndarray:
        """Record audio until silence is detected."""
        stream = self.audio.open(
            format=pyaudio.paInt16,
            channels=1,
            rate=16000,
            input=True,
            frames_per_buffer=1024,
        )

        frames = []
        silent_chunks = 0
        max_chunks = int(max_duration * 16000 / 1024)
        silence_limit = int(silence_duration * 16000 / 1024)

        print("Recording... speak now")
        for _ in range(max_chunks):
            data = stream.read(1024, exception_on_overflow=False)
            frames.append(np.frombuffer(data, dtype=np.int16))

            amplitude = np.abs(frames[-1]).mean()
            if amplitude < silence_threshold:
                silent_chunks += 1
            else:
                silent_chunks = 0

            if silent_chunks >= silence_limit and len(frames) > 10:
                break

        stream.stop_stream()
        stream.close()
        print("Recording complete")
        return np.concatenate(frames).astype(np.float32) / 32768.0

    def transcribe(self, audio: np.ndarray) -> str:
        segments, _ = self.model.transcribe(audio, language="en")
        return " ".join(seg.text.strip() for seg in segments)
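The silence gate in record_until_silence can be sanity-checked without a microphone. The sketch below (silence_gate is a hypothetical helper that mirrors the loop's silent-chunk counter; it is not part of the class) runs the same threshold test on synthetic chunks:

```python
import numpy as np

def silence_gate(chunks, threshold=500, silence_limit=23):
    """Mirror record_until_silence's stop logic: return the number of
    chunks consumed before the silent-chunk counter trips."""
    silent = 0
    for i, chunk in enumerate(chunks):
        if np.abs(chunk).mean() < threshold:
            silent += 1
        else:
            silent = 0
        if silent >= silence_limit and i + 1 > 10:
            return i + 1
    return len(chunks)

# Synthetic audio: 20 loud chunks (a 3000-amplitude tone), then 30 near-silent ones
rng = np.random.default_rng(0)
loud = [(3000 * np.sin(np.linspace(0, 50, 1024))).astype(np.int16) for _ in range(20)]
quiet = [rng.integers(-50, 50, 1024).astype(np.int16) for _ in range(30)]
stop = silence_gate(loud + quiet)
print(stop)  # 43: the gate trips after 23 consecutive quiet chunks
```

The default silence_limit of 23 matches int(1.5 * 16000 / 1024) from the class; tune threshold against your microphone's noise floor before deploying.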

Agent Brain: Local Language Model

For the reasoning engine, use a small ONNX-optimized model. On a Pi 5 with 8 GB RAM, a 1 to 2 billion parameter quantized model fits comfortably alongside the STT and TTS models.
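Before downloading anything, it helps to sanity-check the memory budget. A back-of-envelope sketch (the 20% overhead factor is an assumption; real runtimes vary):

```python
def quantized_model_size_gb(params_billion: float, bits_per_weight: int = 4,
                            overhead: float = 1.2) -> float:
    """Rough RAM footprint of a quantized model: raw weight bytes plus
    ~20% for embeddings and runtime buffers (a crude assumption)."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

# A 1.5B-parameter model at 4-bit quantization
print(f"{quantized_model_size_gb(1.5):.2f} GB")  # 0.90 GB
```

Under a gigabyte for the language model leaves plenty of headroom for Whisper tiny.en (~75 MB), Piper, and the OS on an 8 GB board.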


import json

class PiAgent:
    """Simple agent with tool-calling capabilities."""

    def __init__(self, model_runner):
        self.model = model_runner
        self.tools = {}
        self.conversation = []

    def register_tool(self, name: str, description: str, handler):
        self.tools[name] = {"description": description, "handler": handler}

    def process(self, user_text: str) -> str:
        self.conversation.append({"role": "user", "content": user_text})

        tools_desc = "\n".join(
            f"- {name}: {t['description']}" for name, t in self.tools.items()
        )
        system_prompt = (
            "You are a helpful voice assistant running on a Raspberry Pi. "
            "Keep responses short — under 2 sentences. "
            f"Available tools:\n{tools_desc}\n"
            "To use a tool, respond with: TOOL:tool_name:argument"
        )

        response = self.model.generate(system_prompt, self.conversation)

        # Check whether the model wants to use a tool
        if response.startswith("TOOL:"):
            parts = response.split(":", 2)
            tool_name = parts[1]
            tool_arg = parts[2] if len(parts) > 2 else ""

            if tool_name in self.tools:
                tool_result = self.tools[tool_name]["handler"](tool_arg)
                reply = f"Done. {tool_result}"
            else:
                reply = f"I do not have a tool called {tool_name}."
            # Record the spoken reply so follow-up turns have full context
            self.conversation.append({"role": "assistant", "content": reply})
            return reply

        self.conversation.append({"role": "assistant", "content": response})
        return response
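The TOOL:tool_name:argument convention is easy to verify offline. A small standalone parser (parse_tool_call is illustrative, not part of PiAgent) mirrors the split logic:

```python
def parse_tool_call(response: str):
    """Parse the TOOL:name:argument convention used by the agent.
    Returns (tool_name, argument), or None for a plain spoken reply."""
    if not response.startswith("TOOL:"):
        return None
    parts = response.split(":", 2)
    return parts[1], parts[2] if len(parts) > 2 else ""

print(parse_tool_call("TOOL:lights:on"))     # ('lights', 'on')
print(parse_tool_call("It is 72 degrees."))  # None
print(parse_tool_call("TOOL:weather:"))      # ('weather', '')
```

Splitting with maxsplit=2 matters: it keeps colons inside the argument intact, so a payload like "TOOL:timer:10:30" still parses as one argument.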

Text-to-Speech with Piper

Convert the agent's response back to audio:

import os
import subprocess
import tempfile

class TextToSpeech:
    """Synthesize speech using Piper TTS (runs locally on Pi)."""

    def __init__(self, model: str = "en_US-lessac-medium"):
        self.model = model

    def speak(self, text: str):
        """Generate speech to a temporary WAV file, play it, then clean up."""
        with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
            temp_path = f.name

        try:
            subprocess.run(
                ["piper", "--model", self.model, "--output_file", temp_path],
                input=text.encode(),
                capture_output=True,
                check=True,
            )
            subprocess.run(["aplay", temp_path], check=True)
        finally:
            os.remove(temp_path)

    def speak_streaming(self, text: str):
        """Stream raw audio from Piper straight into aplay (no temp file)."""
        piper = subprocess.Popen(
            ["piper", "--model", self.model, "--output-raw"],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
        )
        # Piper's raw output is headerless 16-bit mono at 22050 Hz, so aplay
        # needs the format spelled out and "-" to read from stdin
        aplay = subprocess.Popen(
            ["aplay", "-r", "22050", "-f", "S16_LE", "-c", "1", "-t", "raw", "-"],
            stdin=piper.stdout,
        )
        piper.stdin.write(text.encode())
        piper.stdin.close()
        aplay.wait()

Putting It All Together

def main():
    wake = WakeWordListener(threshold=0.6)
    stt = SpeechRecognizer(model_size="tiny.en")
    tts = TextToSpeech()
    # LocalONNXModel, toggle_lights, and get_weather_cached are placeholders
    # for your own model runner and smart-home integrations
    agent = PiAgent(model_runner=LocalONNXModel("phi2_q4.onnx"))

    # Register tools
    agent.register_tool(
        "lights", "Control smart lights (on/off)",
        lambda arg: toggle_lights(arg),
    )
    agent.register_tool(
        "weather", "Get current weather",
        lambda arg: get_weather_cached(),
    )

    print("Pi Agent ready!")
    while True:
        wake.listen_for_wake_word()
        tts.speak("Yes?")
        audio = stt.record_until_silence()
        user_text = stt.transcribe(audio)
        print(f"User said: {user_text}")

        response = agent.process(user_text)
        print(f"Agent: {response}")
        tts.speak(response)


if __name__ == "__main__":
    main()

FAQ

Which Raspberry Pi model should I use for a voice agent?

The Raspberry Pi 5 with 8 GB RAM is the recommended choice. It has enough memory for a quantized 1 to 2 billion parameter model plus the STT and TTS models running concurrently. The Pi 4 with 8 GB works for smaller models but inference is noticeably slower. The Pi 4 with 4 GB can handle only the lightest models (Whisper tiny plus a classifier).

How fast is speech-to-text on a Raspberry Pi?

Whisper tiny.en transcribes a 5-second audio clip in about 1 to 2 seconds on a Pi 5. Whisper base.en takes 3 to 5 seconds for the same clip. For real-time conversational feel, stick with tiny.en and accept slightly lower accuracy, or use Whisper small.en if you can tolerate 5 to 8 second transcription times.

Can the Raspberry Pi agent work without any internet connection?

Yes, that is the entire point of this design. The wake word model, Whisper STT, the language model, and Piper TTS all run locally. The only internet dependency would be for tools that call external APIs (like weather). You can pre-cache data for those tools or provide a graceful "I am offline" response.
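One way to get that graceful fallback is to wrap a network-dependent handler so it serves the last good value when the live call fails. A minimal sketch (with_offline_fallback and fake_weather are illustrative names, not part of the code above):

```python
import time

def with_offline_fallback(fetch, cache_ttl: float = 3600.0,
                          offline_reply: str = "I am offline right now."):
    """Wrap a network-dependent tool handler: serve a cached value when
    the live call fails, apologize if the cache is empty or stale."""
    cache = {"value": None, "time": 0.0}

    def handler(arg: str) -> str:
        try:
            cache["value"] = fetch(arg)
            cache["time"] = time.monotonic()
            return cache["value"]
        except OSError:
            fresh_enough = time.monotonic() - cache["time"] < cache_ttl
            if cache["value"] is not None and fresh_enough:
                return f"(cached) {cache['value']}"
            return offline_reply

    return handler

# Simulate: first call succeeds, second call hits a dead network
calls = {"n": 0}
def fake_weather(arg):
    calls["n"] += 1
    if calls["n"] > 1:
        raise OSError("network unreachable")
    return "Sunny, 21 C"

weather_tool = with_offline_fallback(fake_weather)
print(weather_tool(""))  # Sunny, 21 C
print(weather_tool(""))  # (cached) Sunny, 21 C
```

Registered via agent.register_tool, the wrapped handler keeps the voice loop responsive whether or not the network is up.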


#RaspberryPi #VoiceAssistant #HardwareAI #EdgeAI #HomeAutomation #Python #AgenticAI #LearnAI #AIEngineering


CallSphere Team

Expert insights on AI voice agents and customer communication automation.
