Building Agents with Gemma and Phi: Small Language Models for Edge Deployment
Deploy AI agents on edge devices using Google's Gemma and Microsoft's Phi small language models. This guide covers resource requirements, agent patterns for constrained environments, and mobile deployment strategies.
The Case for Small Language Models
Not every agent needs a 70B parameter model. Many practical agent tasks — classification, extraction, simple Q&A, form filling, and basic tool calling — can be handled by models with 2-4 billion parameters. Small Language Models (SLMs) open up deployment scenarios that large models cannot reach: mobile phones, IoT devices, laptops without GPUs, and environments with no internet connectivity.
Google's Gemma and Microsoft's Phi families lead the SLM space. Both deliver surprisingly strong performance relative to their size, often matching models 3-5x larger on targeted benchmarks.
Model Overview
Gemma 2 2B — Google's smallest model. 2.6B parameters, trained on 2 trillion tokens of web data. Excels at summarization, classification, and code generation for its size. Released under the Gemma license, which permits commercial use.
Gemma 2 9B — The mid-range option. Outperforms Llama 3.1 8B on several benchmarks while being slightly more efficient to serve.
Phi-3.5-mini — Microsoft's 3.8B model. Trained on a mix of filtered web data and synthetic data generated by larger models. Remarkably strong at reasoning and code generation.
Phi-3-small — 7B parameters with a focus on reasoning. Competes with larger models on math and logic benchmarks.
Running Gemma Locally
Using Ollama is the quickest way to get started:
# Pull Gemma 2B (1.6 GB)
ollama pull gemma2:2b
# Test it
ollama run gemma2:2b "Classify this as positive or negative: The product is excellent"
For Python integration:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",
)

def classify_sentiment(text: str) -> str:
    response = client.chat.completions.create(
        model="gemma2:2b",
        messages=[
            {"role": "user", "content":
                f"Classify the sentiment as positive, negative, or neutral. "
                f"Respond with one word only.\n\nText: {text}"},
        ],
        temperature=0.0,
        max_tokens=5,
    )
    return response.choices[0].message.content.strip().lower()

print(classify_sentiment("This product exceeded my expectations!"))           # positive
print(classify_sentiment("The delivery was late and the item was damaged."))  # negative
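Even at temperature 0, a small model occasionally drifts off-format ("Positive.", "The sentiment is positive"). A defensive wrapper that normalizes and validates the raw completion before downstream code trusts it is cheap insurance — a minimal sketch, with the fallback behavior being our own design choice:

```python
ALLOWED_LABELS = {"positive", "negative", "neutral"}

def validate_label(raw: str, fallback: str = "neutral") -> str:
    """Normalize an SLM classification; fall back if it drifts off-format."""
    cleaned = raw.strip().lower().rstrip(".")
    if cleaned in ALLOWED_LABELS:
        return cleaned
    # Salvage verbose responses like "the sentiment is positive"
    for label in ALLOWED_LABELS:
        if label in cleaned:
            return label
    return fallback

print(validate_label("Positive."))                  # positive
print(validate_label("The sentiment is negative"))  # negative
print(validate_label("I cannot determine that"))    # neutral
```

Wrapping every classification call this way turns silent format drift into a predictable fallback path.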
Running Phi on Edge Devices
Microsoft publishes ONNX-optimized builds of the Phi models for ONNX Runtime, which makes them deployable across a wide range of hardware, including CPUs and mobile NPUs. For quick prototyping, the Hugging Face transformers API is the simplest path:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "microsoft/Phi-3.5-mini-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a concise assistant that extracts structured data."},
    {"role": "user", "content": "Extract the name, date, and amount from: "
        "Invoice from John Smith dated March 15, 2026 for $2,500."},
]

inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=100, temperature=0.1, do_sample=True)
result = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
print(result)
Agent Patterns for Constrained Environments
SLMs require different agent design patterns than large models. The key principle is to simplify the task structure so the model can handle each step reliably.
Pattern 1: Single-Purpose Agents — Instead of one general agent, deploy multiple specialized micro-agents:
class EdgeAgentRouter:
    def __init__(self, client):
        self.client = client

    def route(self, user_input: str) -> str:
        # Step 1: Classify intent with the SLM
        intent = self._classify_intent(user_input)
        # Step 2: Route to specialized handler
        handlers = {
            "weather": self._handle_weather,
            "reminder": self._handle_reminder,
            "question": self._handle_question,
        }
        handler = handlers.get(intent, self._handle_question)
        return handler(user_input)

    def _classify_intent(self, text: str) -> str:
        response = self.client.chat.completions.create(
            model="gemma2:2b",
            messages=[{"role": "user", "content":
                f"Classify this user request into one category: "
                f"weather, reminder, question.\n"
                f"Respond with the category only.\nRequest: {text}"}],
            temperature=0.0,
            max_tokens=10,
        )
        return response.choices[0].message.content.strip().lower()

    def _handle_weather(self, text: str) -> str:
        # Extract city, call weather API
        return "Weather handler triggered"

    def _handle_reminder(self, text: str) -> str:
        # Extract time and message, set reminder
        return "Reminder handler triggered"

    def _handle_question(self, text: str) -> str:
        response = self.client.chat.completions.create(
            model="gemma2:2b",
            messages=[{"role": "user", "content": text}],
            temperature=0.3,
            max_tokens=200,
        )
        return response.choices[0].message.content
Pattern 2: Structured Output with Constrained Generation — Use explicit, line-oriented output formats to compensate for smaller models' weaker instruction-following on free-form output:
def extract_entities(client, text: str) -> dict:
    response = client.chat.completions.create(
        model="phi3.5:latest",
        messages=[{"role": "user", "content":
            f"Extract entities from this text. Respond in exactly this format:\n"
            f"NAME: <name or NONE>\n"
            f"DATE: <date or NONE>\n"
            f"AMOUNT: <amount or NONE>\n\n"
            f"Text: {text}"}],
        temperature=0.0,
        max_tokens=50,
    )
    result = {}
    for line in response.choices[0].message.content.strip().split("\n"):
        if ": " in line:
            key, value = line.split(": ", 1)
            if value.strip() != "NONE":
                result[key.strip()] = value.strip()
    return result
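A side benefit of this pattern is that the parsing step can be unit-tested offline, with no model in the loop. Factoring the parse logic into a pure function and feeding it a canned response (a hypothetical example of what the model might return) looks like this:

```python
def parse_entity_block(raw: str) -> dict:
    """Parse KEY: value lines, skipping NONE fields."""
    result = {}
    for line in raw.strip().split("\n"):
        if ": " in line:
            key, value = line.split(": ", 1)
            if value.strip() != "NONE":
                result[key.strip()] = value.strip()
    return result

# Canned model response for an offline test
sample = "NAME: John Smith\nDATE: March 15, 2026\nAMOUNT: $2,500"
print(parse_entity_block(sample))
# {'NAME': 'John Smith', 'DATE': 'March 15, 2026', 'AMOUNT': '$2,500'}
```

This lets you catch format-handling bugs in CI without spending any inference time.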
Memory and Performance Benchmarks
| Model | Parameters | RAM (Q4) | Tokens/sec (CPU) | Tokens/sec (GPU) |
|---|---|---|---|---|
| Gemma 2 2B | 2.6B | 1.8 GB | 15-25 | 80-120 |
| Phi-3.5-mini | 3.8B | 2.5 GB | 10-20 | 60-100 |
| Gemma 2 9B | 9.2B | 5.5 GB | 5-10 | 40-70 |
| Phi-3-small | 7B | 4.5 GB | 5-12 | 35-60 |
CPU token rates were measured on a modern laptop (Apple M2 / 13th-gen Intel Core i7); GPU rates on an RTX 3060 12 GB.
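The Q4 RAM column follows roughly from parameter count times bits per weight, plus runtime and KV-cache overhead. A back-of-the-envelope estimator — the 4.5 effective bits/weight for Q4-style quantization and the ~20% overhead factor are rough assumptions, not measured values, so expect the result to land near but not exactly on the table:

```python
def estimate_q4_ram_gb(params_billion: float,
                       bits_per_weight: float = 4.5,
                       overhead: float = 1.2) -> float:
    """Rough RAM estimate for a Q4-quantized model: weights plus ~20% overhead."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

for name, params in [("Gemma 2 2B", 2.6), ("Phi-3.5-mini", 3.8), ("Gemma 2 9B", 9.2)]:
    print(f"{name}: ~{estimate_q4_ram_gb(params):.1f} GB")
```

For Gemma 2 2B this gives about 1.8 GB, in line with the table; the formula is useful for sizing devices before you download anything.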
FAQ
Can a 2B model really handle agent tasks reliably?
For narrowly scoped tasks like classification, entity extraction, and template-based responses, yes. A Gemma 2B model fine-tuned on your specific task can be remarkably reliable. For open-ended reasoning or complex multi-step tool calling, you will generally want at least a 7B model.
How do I deploy an SLM on a mobile phone?
Use the GGUF format with llama.cpp compiled for ARM. On Android, libraries like android-llama.cpp provide JNI bindings. On iOS, use llama.cpp with Metal for GPU acceleration. Expect 5-15 tokens/second on flagship phones with quantized 2-3B models.
Is fine-tuning necessary for SLMs in agent applications?
Fine-tuning is more impactful for SLMs than for large models. A generic 2B model may struggle with your specific output format, but a fine-tuned version can match larger models on that narrow task. Use LoRA fine-tuning with 500-2000 examples of your expected input/output pairs for the best results.
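To see why LoRA makes SLM fine-tuning cheap, count the trainable parameters: each adapted weight matrix gains two low-rank factors, A (rank × d_in) and B (d_out × rank), so only rank · (d_in + d_out) parameters train per matrix while the base weights stay frozen. A sketch with illustrative numbers — the hidden size, layer count, and choice of adapted projections are assumptions for a generic ~2B model, not exact Gemma internals:

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds A (rank x d_in) and B (d_out x rank) per adapted matrix."""
    return rank * (d_in + d_out)

# Illustrative: adapt two attention projections in every layer of a ~2B model
hidden = 2304   # assumed hidden size
layers = 26     # assumed layer count
rank = 16
per_layer = 2 * lora_trainable_params(hidden, hidden, rank)
total = layers * per_layer
print(f"Trainable LoRA params: {total:,} (~{total / 2.6e9:.3%} of 2.6B)")
```

Under these assumptions only a few million parameters train — a small fraction of a percent of the model — which is why a laptop-class GPU and a few hundred examples are enough.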
#Gemma #Phi #SmallLanguageModels #EdgeAI #MobileDeployment #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.