Running Llama Models Locally with Ollama: Setup and Agent Integration
Learn how to install Ollama, pull Llama models, serve them through OpenAI-compatible endpoints, and integrate local LLMs into your AI agent pipelines for private, cost-free inference.
Why Run Models Locally?
Cloud-hosted LLM APIs are convenient, but they come with per-token costs, network latency, rate limits, and data privacy concerns. Running models locally eliminates all four. Ollama makes local inference as simple as pulling a Docker image — you download a model once, and it runs on your hardware with zero API fees.
For agent development specifically, local models let you iterate rapidly without worrying about billing. You can run thousands of test invocations during development, experiment with different system prompts, and debug tool-calling behavior without watching a usage meter tick upward.
Installing Ollama
Ollama runs on macOS, Linux, and Windows. The installation is a single command on most systems.
macOS and Linux:
curl -fsSL https://ollama.com/install.sh | sh
Verify the installation:
ollama --version
After installation, Ollama runs a background server that listens on http://localhost:11434 by default. This server exposes both a native API and an OpenAI-compatible endpoint.
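You can confirm the server is reachable programmatically. The sketch below queries Ollama's native /api/tags endpoint (which lists downloaded models) and falls back to an empty list if the server is down. It assumes the default port; adjust the base URL if you changed OLLAMA_HOST.

```python
import json
from urllib.request import urlopen
from urllib.error import URLError

def list_local_models(base_url: str = "http://localhost:11434") -> list[str]:
    """Return names of locally downloaded models, or [] if the server is unreachable."""
    try:
        with urlopen(f"{base_url}/api/tags", timeout=2) as resp:
            data = json.load(resp)
        return [m["name"] for m in data.get("models", [])]
    except (URLError, OSError):
        return []

print(list_local_models())  # e.g. ['llama3.1:8b'] once you have pulled a model
```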
Pulling and Running Llama Models
Ollama hosts a registry of pre-quantized models. Pulling a model downloads it to your local cache:
# Pull Llama 3.1 8B (4.7 GB)
ollama pull llama3.1:8b
# Pull a smaller quantized variant
ollama pull llama3.1:8b-q4_0
# List all downloaded models
ollama list
Run an interactive chat to verify the model works:
ollama run llama3.1:8b "Explain what an AI agent is in two sentences."
Using the OpenAI-Compatible API
Ollama exposes an endpoint at /v1/chat/completions that mirrors the OpenAI Chat Completions API. This means you can use the standard openai Python package with zero code changes — just point it at your local server:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Required by the client but not validated
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What are the benefits of agentic AI?"},
    ],
    temperature=0.7,
)

print(response.choices[0].message.content)
This compatibility layer is critical for agent frameworks. Any framework that supports OpenAI-format requests — including LangChain, CrewAI, and the OpenAI Agents SDK via LiteLLM — can use Ollama as a backend.
Building a Simple Agent with Ollama
Here is a complete agent with tool calling that runs entirely on your local machine:
from openai import OpenAI
import json

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"}
                },
                "required": ["city"],
            },
        },
    }
]

def get_weather(city: str) -> str:
    # Simulated weather lookup
    return json.dumps({"city": city, "temp": "22C", "condition": "sunny"})

def run_agent(user_message: str):
    messages = [
        {"role": "system", "content": "You are a weather assistant. Use tools when needed."},
        {"role": "user", "content": user_message},
    ]
    response = client.chat.completions.create(
        model="llama3.1:8b",
        messages=messages,
        tools=tools,
    )
    msg = response.choices[0].message
    if msg.tool_calls:
        # Append the assistant turn once, then one tool result per call
        messages.append(msg)
        for call in msg.tool_calls:
            result = get_weather(**json.loads(call.function.arguments))
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": result,
            })
        final = client.chat.completions.create(
            model="llama3.1:8b",
            messages=messages,
        )
        return final.choices[0].message.content
    return msg.content

print(run_agent("What is the weather in Tokyo?"))
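With more than one tool, hard-coding get_weather in the loop stops scaling. A common pattern (a sketch, not an Ollama feature) is a registry that maps tool names to Python functions and decodes the JSON argument string the model returns; get_time here is a hypothetical second tool added only to illustrate dispatch:

```python
import json
from typing import Callable

def get_weather(city: str) -> str:
    # Simulated weather lookup, as in the agent above
    return json.dumps({"city": city, "temp": "22C", "condition": "sunny"})

def get_time(city: str) -> str:
    # Hypothetical second tool to demonstrate dispatch
    return json.dumps({"city": city, "time": "14:00"})

TOOL_REGISTRY: dict[str, Callable[..., str]] = {
    "get_weather": get_weather,
    "get_time": get_time,
}

def dispatch_tool(name: str, arguments_json: str) -> str:
    """Look up a tool by name and call it with the model-supplied arguments."""
    fn = TOOL_REGISTRY.get(name)
    if fn is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        return fn(**json.loads(arguments_json))
    except (TypeError, json.JSONDecodeError) as exc:
        # Bad argument names or malformed JSON from the model
        return json.dumps({"error": str(exc)})
```

In the agent loop this replaces the direct call: result = dispatch_tool(call.function.name, call.function.arguments). Returning an error payload instead of raising lets the model see the failure and retry.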
Performance Tips for Local Inference
GPU acceleration makes an enormous difference. On a MacBook with an M-series chip, Ollama automatically uses the GPU via Metal. On Linux, make sure the NVIDIA driver is installed; the Ollama installer bundles the CUDA libraries it needs:
# Check if Ollama detects your GPU
ollama ps
For agents that make many sequential calls, keep the model loaded in memory by setting OLLAMA_KEEP_ALIVE:
export OLLAMA_KEEP_ALIVE=30m # Keep model loaded for 30 minutes
This avoids the 2-5 second cold-start penalty on each request.
FAQ
Does Ollama support function calling with all models?
Not all models support structured tool calling. Llama 3.1, Mistral, and command-r models have native tool-calling support in Ollama. Smaller or older models may require manual prompt engineering to simulate tool use.
How much RAM do I need to run Llama 3.1 8B?
The 4-bit quantized version requires approximately 5-6 GB of RAM. The full-precision version needs around 16 GB. For the 70B variant, expect 40+ GB for the quantized version, making it practical only on high-end workstations or servers.
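These figures follow a rough rule of thumb: memory ≈ parameter count × bytes per weight, plus some fixed overhead for the KV cache and runtime. A back-of-the-envelope helper (the 1.5 GB overhead constant is an assumption for illustration, not an Ollama figure):

```python
def estimated_ram_gb(params_billion: float, bits_per_weight: int,
                     overhead_gb: float = 1.5) -> float:
    """Rough RAM estimate: weight storage plus fixed overhead (KV cache, runtime)."""
    weight_gb = params_billion * bits_per_weight / 8  # billions of params -> GB
    return round(weight_gb + overhead_gb, 1)

print(estimated_ram_gb(8, 4))    # 4-bit Llama 3.1 8B
print(estimated_ram_gb(8, 16))   # fp16 Llama 3.1 8B
print(estimated_ram_gb(70, 4))   # 4-bit Llama 3.1 70B
```

The estimates (about 5.5, 17.5, and 36.5 GB) land close to the observed figures above; real usage grows with context length, since the KV cache scales with it.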
Can I run Ollama on a machine without a GPU?
Yes, Ollama falls back to CPU inference automatically. Performance will be significantly slower — expect 5-15 tokens per second on a modern CPU versus 40-80 tokens per second on a mid-range GPU — but it works for development and testing.
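Throughput translates directly into response latency: generation time ≈ tokens ÷ tokens per second, ignoring prompt processing. A quick comparison using the CPU and GPU rates quoted above:

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Seconds to generate num_tokens at a given decode rate (prompt processing ignored)."""
    return round(num_tokens / tokens_per_second, 1)

# A 300-token reply at the rates quoted above:
print(generation_seconds(300, 10))  # CPU at ~10 tok/s -> 30.0 s
print(generation_seconds(300, 60))  # mid-range GPU at ~60 tok/s -> 5.0 s
```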
#Ollama #Llama #LocalLLM #OpenSourceAI #AgentDevelopment #AgenticAI #LearnAI #AIEngineering
CallSphere Team