Raspberry Pi AI Agent: Building a Hardware-Based Voice Assistant
Build a complete voice-controlled AI agent on a Raspberry Pi, covering hardware setup, model selection, audio input/output, wake word detection, and tool integration for home automation.
What You Will Build
In this guide, you will build a standalone AI voice assistant running entirely on a Raspberry Pi 5. It listens for a wake word, transcribes your speech locally, processes the request through a small language model, and responds with synthesized speech — all without cloud API calls.
Hardware Requirements
- Raspberry Pi 5 (8 GB RAM recommended, 4 GB minimum)
- USB microphone or a ReSpeaker HAT for higher quality audio
- Speaker connected via 3.5mm jack or USB
- MicroSD card (64 GB or larger for model storage)
- Power supply (USB-C, 5V 5A for Pi 5)
Software Setup
Start with a clean Raspberry Pi OS (64-bit) installation:
# Update system
sudo apt update && sudo apt upgrade -y
# Install audio dependencies
sudo apt install -y portaudio19-dev python3-pyaudio espeak-ng libespeak-ng-dev
# Install Python dependencies
pip install numpy onnxruntime openwakeword faster-whisper piper-tts
# Verify audio devices
python3 -c "import pyaudio; p = pyaudio.PyAudio(); print(p.get_default_input_device_info())"
Wake Word Detection
The agent needs to listen continuously but only process speech after hearing a wake word. The openwakeword library provides lightweight wake word models:
import pyaudio
import numpy as np
from openwakeword.model import Model as WakeModel

class WakeWordListener:
    """Listens for a wake word using openwakeword."""

    CHUNK = 1280  # 80 ms at 16 kHz
    RATE = 16000
    FORMAT = pyaudio.paInt16

    def __init__(self, threshold: float = 0.5):
        self.model = WakeModel(
            wakeword_models=["hey_jarvis"],
            inference_framework="onnx",
        )
        self.threshold = threshold
        self.audio = pyaudio.PyAudio()

    def listen_for_wake_word(self) -> bool:
        """Block until the wake word is detected."""
        stream = self.audio.open(
            format=self.FORMAT,
            channels=1,
            rate=self.RATE,
            input=True,
            frames_per_buffer=self.CHUNK,
        )
        print("Listening for wake word...")
        try:
            while True:
                audio_data = stream.read(self.CHUNK, exception_on_overflow=False)
                audio_array = np.frombuffer(audio_data, dtype=np.int16)
                prediction = self.model.predict(audio_array)
                for model_name, score in prediction.items():
                    if score > self.threshold:
                        print(f"Wake word detected! (score: {score:.2f})")
                        return True
        finally:
            stream.stop_stream()
            stream.close()
Speech-to-Text with Faster Whisper
After the wake word triggers, capture and transcribe the user's speech:
import pyaudio
import numpy as np
from faster_whisper import WhisperModel

class SpeechRecognizer:
    """Transcribes speech using Whisper running locally."""

    def __init__(self, model_size: str = "tiny.en"):
        # tiny.en is ~75 MB and runs fast on a Pi 5;
        # use "base.en" (~140 MB) for better accuracy.
        self.model = WhisperModel(
            model_size,
            device="cpu",
            compute_type="int8",
        )
        self.audio = pyaudio.PyAudio()

    def record_until_silence(
        self,
        silence_threshold: int = 500,
        silence_duration: float = 1.5,
        max_duration: float = 15.0,
    ) -> np.ndarray:
        """Record audio until silence is detected."""
        stream = self.audio.open(
            format=pyaudio.paInt16,
            channels=1,
            rate=16000,
            input=True,
            frames_per_buffer=1024,
        )
        frames = []
        silent_chunks = 0
        max_chunks = int(max_duration * 16000 / 1024)
        silence_limit = int(silence_duration * 16000 / 1024)
        print("Recording... speak now")
        for _ in range(max_chunks):
            data = stream.read(1024, exception_on_overflow=False)
            frames.append(np.frombuffer(data, dtype=np.int16))
            amplitude = np.abs(frames[-1]).mean()
            if amplitude < silence_threshold:
                silent_chunks += 1
            else:
                silent_chunks = 0
            if silent_chunks >= silence_limit and len(frames) > 10:
                break
        stream.stop_stream()
        stream.close()
        print("Recording complete")
        # Normalize int16 samples to float32 in [-1, 1], the format Whisper expects
        return np.concatenate(frames).astype(np.float32) / 32768.0

    def transcribe(self, audio: np.ndarray) -> str:
        segments, _ = self.model.transcribe(audio, language="en")
        return " ".join(seg.text.strip() for seg in segments)
Agent Brain: Local Language Model
For the reasoning engine, use a small ONNX-optimized model. On a Pi 5 with 8 GB RAM, a 1 to 2 billion parameter quantized model fits:
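The agent code below takes a `model_runner` object. `LocalONNXModel` is not a library class; it is a name used here for illustration. This sketch shows the interface the agent expects and an assumed chat template, with the actual backend (onnxruntime-genai, llama-cpp-python, or similar) left as a stub:

```python
class LocalONNXModel:
    """Minimal model-runner interface the agent expects (illustrative sketch)."""

    def __init__(self, model_path: str):
        self.model_path = model_path
        # A real build would load the quantized model here, e.g. via
        # onnxruntime-genai or llama-cpp-python; omitted to keep the sketch runnable.

    def format_prompt(self, system_prompt: str, conversation: list) -> str:
        """Flatten the chat history into a single prompt string (template is an assumption)."""
        lines = [f"<|system|>\n{system_prompt}"]
        for turn in conversation:
            lines.append(f"<|{turn['role']}|>\n{turn['content']}")
        lines.append("<|assistant|>")
        return "\n".join(lines)

    def generate(self, system_prompt: str, conversation: list) -> str:
        prompt = self.format_prompt(system_prompt, conversation)
        # Real token generation would go here; this stub only proves the interface.
        return f"(stub response after {len(conversation)} turns)"
```

Any runner works as long as it exposes `generate(system_prompt, conversation) -> str`; the chat-template markers above should be replaced with whatever format your chosen model was trained on.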
class PiAgent:
    """Simple agent with tool-calling capabilities."""

    def __init__(self, model_runner):
        self.model = model_runner
        self.tools = {}
        self.conversation = []

    def register_tool(self, name: str, description: str, handler):
        self.tools[name] = {"description": description, "handler": handler}

    def process(self, user_text: str) -> str:
        self.conversation.append({"role": "user", "content": user_text})
        tools_desc = "\n".join(
            f"- {name}: {t['description']}" for name, t in self.tools.items()
        )
        system_prompt = (
            "You are a helpful voice assistant running on a Raspberry Pi. "
            "Keep responses short — under 2 sentences. "
            f"Available tools:\n{tools_desc}\n"
            "To use a tool, respond with: TOOL:tool_name:argument"
        )
        response = self.model.generate(system_prompt, self.conversation)
        # Check whether the model wants to use a tool
        if response.startswith("TOOL:"):
            parts = response.split(":", 2)
            tool_name = parts[1]
            tool_arg = parts[2] if len(parts) > 2 else ""
            if tool_name in self.tools:
                tool_result = self.tools[tool_name]["handler"](tool_arg)
                reply = f"Done. {tool_result}"
            else:
                reply = f"I do not have a tool called {tool_name}."
            # Keep the spoken reply in history so follow-up turns have context
            self.conversation.append({"role": "assistant", "content": reply})
            return reply
        self.conversation.append({"role": "assistant", "content": response})
        return response
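The `TOOL:name:argument` convention is deliberately capped at two splits so that arguments containing colons (like times or URLs) survive intact. The parsing can be exercised in isolation:

```python
def parse_tool_call(response: str):
    """Parse the TOOL:name:argument convention; return (name, arg) or None."""
    if not response.startswith("TOOL:"):
        return None
    # maxsplit=2 keeps any colons inside the argument itself
    parts = response.split(":", 2)
    name = parts[1]
    arg = parts[2] if len(parts) > 2 else ""
    return name, arg

print(parse_tool_call("TOOL:lights:on"))        # ('lights', 'on')
print(parse_tool_call("TOOL:timer:7:30 pm"))    # ('timer', '7:30 pm')
print(parse_tool_call("Just chatting."))        # None
```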
Text-to-Speech with Piper
Convert the agent's response back to audio:
import os
import subprocess
import tempfile

class TextToSpeech:
    """Synthesize speech using Piper TTS (runs locally on the Pi)."""

    def __init__(self, model: str = "en_US-lessac-medium"):
        self.model = model

    def speak(self, text: str):
        """Generate speech to a temp WAV file, play it, then clean up."""
        with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
            temp_path = f.name
        try:
            subprocess.run(
                [
                    "piper",
                    "--model", self.model,
                    "--output_file", temp_path,
                ],
                input=text.encode(),
                capture_output=True,
            )
            subprocess.run(["aplay", temp_path])
        finally:
            os.remove(temp_path)

    def speak_streaming(self, text: str):
        """Stream raw audio from Piper straight into aplay for lower latency."""
        piper = subprocess.Popen(
            ["piper", "--model", self.model, "--output-raw"],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
        )
        aplay = subprocess.Popen(
            ["aplay", "-r", "22050", "-f", "S16_LE", "-c", "1"],
            stdin=piper.stdout,
        )
        piper.stdout.close()  # let aplay see EOF when piper exits
        piper.stdin.write(text.encode())
        piper.stdin.close()
        aplay.wait()
Putting It All Together
def main():
    wake = WakeWordListener(threshold=0.6)
    stt = SpeechRecognizer(model_size="tiny.en")
    tts = TextToSpeech()
    # LocalONNXModel, toggle_lights, and get_weather_cached are placeholders
    # for your own model runner and tool handlers.
    agent = PiAgent(model_runner=LocalONNXModel("phi2_q4.onnx"))

    # Register tools
    agent.register_tool(
        "lights", "Control smart lights (on/off)",
        lambda arg: toggle_lights(arg),
    )
    agent.register_tool(
        "weather", "Get current weather",
        lambda arg: get_weather_cached(),
    )

    print("Pi Agent ready!")
    while True:
        wake.listen_for_wake_word()
        tts.speak("Yes?")
        audio = stt.record_until_silence()
        user_text = stt.transcribe(audio)
        print(f"User said: {user_text}")
        response = agent.process(user_text)
        print(f"Agent: {response}")
        tts.speak(response)

if __name__ == "__main__":
    main()
FAQ
Which Raspberry Pi model should I use for a voice agent?
The Raspberry Pi 5 with 8 GB RAM is the recommended choice. It has enough memory for a quantized 1 to 2 billion parameter model plus the STT and TTS models running concurrently. The Pi 4 with 8 GB works for smaller models but inference is noticeably slower. The Pi 4 with 4 GB can handle only the lightest models (Whisper tiny plus a classifier).
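A back-of-envelope memory budget makes the 8 GB recommendation concrete. All figures below are ballpark assumptions (4-bit weights at roughly 0.5 bytes per parameter, plus runtime overhead), not measured numbers:

```python
# Rough memory budget for the full local stack; every size is an assumption.
components_mb = {
    "1.5B LLM, 4-bit quantized": 1500 * 0.5 + 300,  # ~0.5 B/param weights + runtime overhead
    "Whisper tiny.en (int8)": 75,
    "Piper TTS voice (medium)": 60,
    "openwakeword model": 5,
    "Python runtime + OS headroom": 1000,
}
total_gb = sum(components_mb.values()) / 1024
print(f"Estimated total: {total_gb:.1f} GB")  # comfortably inside 8 GB, tight on 4 GB
```

Even with generous overhead the stack fits in an 8 GB Pi 5 with room to spare; on a 4 GB board the language model dominates the budget, which is why only the lightest configurations work there.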
How fast is speech-to-text on a Raspberry Pi?
Whisper tiny.en transcribes a 5-second audio clip in about 1 to 2 seconds on a Pi 5. Whisper base.en takes 3 to 5 seconds for the same clip. For real-time conversational feel, stick with tiny.en and accept slightly lower accuracy, or use Whisper small.en if you can tolerate 5 to 8 second transcription times.
Can the Raspberry Pi agent work without any internet connection?
Yes, that is the entire point of this design. The wake word model, Whisper STT, the language model, and Piper TTS all run locally. The only internet dependency would be for tools that call external APIs (like weather). You can pre-cache data for those tools or provide a graceful "I am offline" response.
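One way to implement that graceful fallback is a cheap connectivity probe inside the tool handler. This sketch uses a TCP connection attempt to a public DNS resolver; the handler name matches the `get_weather_cached` placeholder used earlier, and `fetch_weather` is a stand-in for your real API call:

```python
import socket

def online(timeout: float = 1.0) -> bool:
    """Cheap connectivity probe: try a TCP connection to a public DNS resolver."""
    try:
        socket.create_connection(("1.1.1.1", 53), timeout=timeout).close()
        return True
    except OSError:
        return False

def fetch_weather() -> str:
    # Stand-in for a real weather API call, so the sketch stays runnable offline.
    return "It is 18 degrees and cloudy."

def get_weather_cached() -> str:
    """Weather tool handler with a graceful offline fallback (sketch)."""
    if not online():
        return "I am offline right now, so I cannot check the weather."
    return fetch_weather()
```

Because the handler always returns a sentence, the TTS stage can speak the result whether or not the network is up.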
CallSphere Team
Expert insights on AI voice agents and customer communication automation.