Building Agents with Gemma and Phi: Small Language Models for Edge Deployment
Deploy AI agents on edge devices using Google's Gemma and Microsoft's Phi small language models. This guide covers resource requirements, agent patterns for constrained environments, and mobile deployment strategies.
The Case for Small Language Models
Not every agent needs a 70B parameter model. Many practical agent tasks — classification, extraction, simple Q&A, form filling, and basic tool calling — can be handled by models with 2-4 billion parameters. Small Language Models (SLMs) open up deployment scenarios that large models cannot reach: mobile phones, IoT devices, laptops without GPUs, and environments with no internet connectivity.
Google's Gemma and Microsoft's Phi families lead the SLM space. Both deliver surprisingly strong performance relative to their size, often matching models 3-5x larger on targeted benchmarks.
Model Overview
Gemma 2 2B — Google's smallest model. 2.6B parameters, trained on 2 trillion tokens of web data. Excels at summarization, classification, and code generation for its size. Released under the Gemma license, which permits commercial use.
Gemma 2 9B — The mid-range option. Outperforms Llama 3.1 8B on several benchmarks while being slightly more efficient to serve.
Phi-3.5-mini — Microsoft's 3.8B model. Trained on a mix of filtered web data and synthetic data generated by larger models. Remarkably strong at reasoning and code generation.
Phi-3-small — 7B parameters with a focus on reasoning. Competes with larger models on math and logic benchmarks.
Running Gemma Locally
Using Ollama is the quickest way to get started:
# Pull Gemma 2B (1.6 GB)
ollama pull gemma2:2b
# Test it
ollama run gemma2:2b "Classify this as positive or negative: The product is excellent"
For Python integration:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",
)

def classify_sentiment(text: str) -> str:
    response = client.chat.completions.create(
        model="gemma2:2b",
        messages=[
            {"role": "user", "content":
                f"Classify the sentiment as positive, negative, or neutral. "
                f"Respond with one word only.\n\nText: {text}"},
        ],
        temperature=0.0,
        max_tokens=5,
    )
    return response.choices[0].message.content.strip().lower()

print(classify_sentiment("This product exceeded my expectations!"))           # positive
print(classify_sentiment("The delivery was late and the item was damaged."))  # negative
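Even at temperature 0, a small model occasionally drifts off-format ("Positive.", "The sentiment is positive"). A defensive wrapper that normalizes and validates the raw completion before downstream code trusts it is cheap insurance — a minimal sketch, with the fallback behavior being our own design choice:

```python
ALLOWED_LABELS = {"positive", "negative", "neutral"}

def validate_label(raw: str, fallback: str = "neutral") -> str:
    """Normalize an SLM classification; fall back if it drifts off-format."""
    cleaned = raw.strip().lower().rstrip(".")
    if cleaned in ALLOWED_LABELS:
        return cleaned
    # Salvage verbose responses like "the sentiment is positive"
    for label in ALLOWED_LABELS:
        if label in cleaned:
            return label
    return fallback

print(validate_label("Positive."))                  # positive
print(validate_label("The sentiment is negative"))  # negative
print(validate_label("I cannot determine that"))    # neutral
```

Wrapping every classification call this way turns silent format drift into a predictable fallback path.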
Running Phi on Edge Devices
Microsoft publishes ONNX-optimized builds of the Phi models for ONNX Runtime, which makes them deployable across a wide range of hardware, including CPUs and mobile NPUs. For quick prototyping, the Hugging Face transformers API is the simplest path:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "microsoft/Phi-3.5-mini-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a concise assistant that extracts structured data."},
    {"role": "user", "content": "Extract the name, date, and amount from: "
        "Invoice from John Smith dated March 15, 2026 for $2,500."},
]

inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=100, temperature=0.1, do_sample=True)
result = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
print(result)
Agent Patterns for Constrained Environments
SLMs require different agent design patterns than large models. The key principle is to simplify the task structure so the model can handle each step reliably.
Pattern 1: Single-Purpose Agents — Instead of one general agent, deploy multiple specialized micro-agents:
class EdgeAgentRouter:
    def __init__(self, client):
        self.client = client

    def route(self, user_input: str) -> str:
        # Step 1: Classify intent with the SLM
        intent = self._classify_intent(user_input)
        # Step 2: Route to specialized handler
        handlers = {
            "weather": self._handle_weather,
            "reminder": self._handle_reminder,
            "question": self._handle_question,
        }
        handler = handlers.get(intent, self._handle_question)
        return handler(user_input)

    def _classify_intent(self, text: str) -> str:
        response = self.client.chat.completions.create(
            model="gemma2:2b",
            messages=[{"role": "user", "content":
                f"Classify this user request into one category: "
                f"weather, reminder, question.\n"
                f"Respond with the category only.\nRequest: {text}"}],
            temperature=0.0,
            max_tokens=10,
        )
        return response.choices[0].message.content.strip().lower()

    def _handle_weather(self, text: str) -> str:
        # Extract city, call weather API
        return "Weather handler triggered"

    def _handle_reminder(self, text: str) -> str:
        # Extract time and message, set reminder
        return "Reminder handler triggered"

    def _handle_question(self, text: str) -> str:
        response = self.client.chat.completions.create(
            model="gemma2:2b",
            messages=[{"role": "user", "content": text}],
            temperature=0.3,
            max_tokens=200,
        )
        return response.choices[0].message.content
Pattern 2: Structured Output with Constrained Generation — Use explicit, line-oriented output formats to compensate for smaller models' weaker instruction-following on free-form output:
def extract_entities(client, text: str) -> dict:
    response = client.chat.completions.create(
        model="phi3.5:latest",
        messages=[{"role": "user", "content":
            f"Extract entities from this text. Respond in exactly this format:\n"
            f"NAME: <name or NONE>\n"
            f"DATE: <date or NONE>\n"
            f"AMOUNT: <amount or NONE>\n\n"
            f"Text: {text}"}],
        temperature=0.0,
        max_tokens=50,
    )
    result = {}
    for line in response.choices[0].message.content.strip().split("\n"):
        if ": " in line:
            key, value = line.split(": ", 1)
            if value.strip() != "NONE":
                result[key.strip()] = value.strip()
    return result
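A side benefit of this pattern is that the parsing step can be unit-tested offline, with no model in the loop. Factoring the parse logic into a pure function and feeding it a canned response (a hypothetical example of what the model might return) looks like this:

```python
def parse_entity_block(raw: str) -> dict:
    """Parse KEY: value lines, skipping NONE fields."""
    result = {}
    for line in raw.strip().split("\n"):
        if ": " in line:
            key, value = line.split(": ", 1)
            if value.strip() != "NONE":
                result[key.strip()] = value.strip()
    return result

# Canned model response for an offline test
sample = "NAME: John Smith\nDATE: March 15, 2026\nAMOUNT: $2,500"
print(parse_entity_block(sample))
# {'NAME': 'John Smith', 'DATE': 'March 15, 2026', 'AMOUNT': '$2,500'}
```

This lets you catch format-handling bugs in CI without spending any inference time.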
Memory and Performance Benchmarks
| Model | Parameters | RAM (Q4) | Tokens/sec (CPU) | Tokens/sec (GPU) |
|---|---|---|---|---|
| Gemma 2 2B | 2.6B | 1.8 GB | 15-25 | 80-120 |
| Phi-3.5-mini | 3.8B | 2.5 GB | 10-20 | 60-100 |
| Gemma 2 9B | 9.2B | 5.5 GB | 5-10 | 40-70 |
| Phi-3-small | 7B | 4.5 GB | 5-12 | 35-60 |
CPU token rates were measured on a modern laptop (Apple M2 / 13th-gen Intel Core i7); GPU rates on an RTX 3060 12 GB.
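The Q4 RAM column follows roughly from parameter count times bits per weight, plus runtime and KV-cache overhead. A back-of-the-envelope estimator — the 4.5 effective bits/weight for Q4-style quantization and the ~20% overhead factor are rough assumptions, not measured values, so expect the result to land near but not exactly on the table:

```python
def estimate_q4_ram_gb(params_billion: float,
                       bits_per_weight: float = 4.5,
                       overhead: float = 1.2) -> float:
    """Rough RAM estimate for a Q4-quantized model: weights plus ~20% overhead."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

for name, params in [("Gemma 2 2B", 2.6), ("Phi-3.5-mini", 3.8), ("Gemma 2 9B", 9.2)]:
    print(f"{name}: ~{estimate_q4_ram_gb(params):.1f} GB")
```

For Gemma 2 2B this gives about 1.8 GB, in line with the table; the formula is useful for sizing devices before you download anything.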
FAQ
Can a 2B model really handle agent tasks reliably?
For narrowly scoped tasks like classification, entity extraction, and template-based responses, yes. A Gemma 2B model fine-tuned on your specific task can be remarkably reliable. For open-ended reasoning or complex multi-step tool calling, you will generally want at least a 7B model.
How do I deploy an SLM on a mobile phone?
Use the GGUF format with llama.cpp compiled for ARM. On Android, libraries like android-llama.cpp provide JNI bindings. On iOS, use llama.cpp with Metal for GPU acceleration. Expect 5-15 tokens/second on flagship phones with quantized 2-3B models.
Is fine-tuning necessary for SLMs in agent applications?
Fine-tuning is more impactful for SLMs than for large models. A generic 2B model may struggle with your specific output format, but a fine-tuned version can match larger models on that narrow task. Use LoRA fine-tuning with 500-2000 examples of your expected input/output pairs for the best results.
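To see why LoRA makes SLM fine-tuning cheap, count the trainable parameters: each adapted weight matrix gains two low-rank factors, A (rank × d_in) and B (d_out × rank), so only rank · (d_in + d_out) parameters train per matrix while the base weights stay frozen. A sketch with illustrative numbers — the hidden size, layer count, and choice of adapted projections are assumptions for a generic ~2B model, not exact Gemma internals:

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds A (rank x d_in) and B (d_out x rank) per adapted matrix."""
    return rank * (d_in + d_out)

# Illustrative: adapt two attention projections in every layer of a ~2B model
hidden = 2304   # assumed hidden size
layers = 26     # assumed layer count
rank = 16
per_layer = 2 * lora_trainable_params(hidden, hidden, rank)
total = layers * per_layer
print(f"Trainable LoRA params: {total:,} (~{total / 2.6e9:.3%} of 2.6B)")
```

Under these assumptions only a few million parameters train — a small fraction of a percent of the model — which is why a laptop-class GPU and a few hundred examples are enough.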
#Gemma #Phi #SmallLanguageModels #EdgeAI #MobileDeployment #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.