
Running Llama Models Locally with Ollama: Setup and Agent Integration

Learn how to install Ollama, pull Llama models, serve them through OpenAI-compatible endpoints, and integrate local LLMs into your AI agent pipelines for private, cost-free inference.

Why Run Models Locally?

Cloud-hosted LLM APIs are convenient, but they come with per-token costs, network latency, rate limits, and data privacy concerns. Running models locally eliminates all four. Ollama makes local inference as simple as pulling a Docker image — you download a model once, and it runs on your hardware with zero API fees.

For agent development specifically, local models let you iterate rapidly without worrying about billing. You can run thousands of test invocations during development, experiment with different system prompts, and debug tool-calling behavior without watching a usage meter tick upward.

Installing Ollama

Ollama runs on macOS, Linux, and Windows. On macOS and Linux, installation is a single shell command; on Windows, you download a standard installer from ollama.com.

macOS and Linux:

curl -fsSL https://ollama.com/install.sh | sh

Verify the installation:

ollama --version

After installation, Ollama runs a background server that listens on http://localhost:11434 by default. This server exposes both a native API and an OpenAI-compatible endpoint.
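As a quick sanity check, you can probe the server from Python using only the standard library. The helper below is a small sketch (not part of Ollama itself) that hits the native /api/tags route, a cheap GET that lists locally installed models:

```python
import urllib.error
import urllib.request


def ollama_is_running(base_url: str = "http://localhost:11434") -> bool:
    """Return True if an Ollama server answers on the given address."""
    try:
        # /api/tags is a lightweight GET that lists installed models
        with urllib.request.urlopen(base_url + "/api/tags", timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


if __name__ == "__main__":
    print(ollama_is_running())
```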

Pulling and Running Llama Models

Ollama hosts a registry of pre-quantized models. Pulling a model downloads it to your local cache:

# Pull Llama 3.1 8B (4.7 GB)
ollama pull llama3.1:8b

# Pull a smaller quantized variant
ollama pull llama3.1:8b-q4_0

# List all downloaded models
ollama list

Run an interactive chat to verify the model works:


ollama run llama3.1:8b "Explain what an AI agent is in two sentences."
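The same one-shot generation is available programmatically through the native /api/generate endpoint. This sketch uses only the standard library and assumes the default server address; with "stream": False the server returns a single JSON object whose "response" field holds the full completion:

```python
import json
import urllib.request


def ollama_generate(prompt: str,
                    model: str = "llama3.1:8b",
                    base: str = "http://localhost:11434") -> str:
    """One-shot generation via Ollama's native /api/generate endpoint."""
    req = urllib.request.Request(
        base + "/api/generate",
        data=json.dumps({"model": model, "prompt": prompt,
                         "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read())["response"]
```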

Using the OpenAI-Compatible API

Ollama exposes an endpoint at /v1/chat/completions that mirrors the OpenAI Chat Completions API. This means you can use the standard openai Python package with zero code changes — just point it at your local server:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Required by the client but not validated
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What are the benefits of agentic AI?"},
    ],
    temperature=0.7,
)

print(response.choices[0].message.content)

This compatibility layer is critical for agent frameworks. Any framework that supports OpenAI-format requests — including LangChain, CrewAI, and the OpenAI Agents SDK via LiteLLM — can use Ollama as a backend.
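Streaming also passes through the compatibility layer: set stream=True on the same create() call and concatenate the per-chunk deltas. The helper below is a sketch of that loop; the live call is shown commented out so the snippet stands alone:

```python
def collect_stream(chunks) -> str:
    """Join the content deltas from an OpenAI-format streaming response."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta.content
        if delta:  # Role-only or empty chunks carry no content
            parts.append(delta)
    return "".join(parts)


# Usage against a live server (requires the client from above):
# stream = client.chat.completions.create(
#     model="llama3.1:8b",
#     messages=[{"role": "user", "content": "Say hi"}],
#     stream=True,
# )
# print(collect_stream(stream))
```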

Building a Simple Agent with Ollama

Here is a complete agent with tool calling that runs entirely on your local machine:

from openai import OpenAI
import json

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"}
                },
                "required": ["city"],
            },
        },
    }
]

def get_weather(city: str) -> str:
    # Simulated weather lookup
    return json.dumps({"city": city, "temp": "22C", "condition": "sunny"})

def run_agent(user_message: str):
    messages = [
        {"role": "system", "content": "You are a weather assistant. Use tools when needed."},
        {"role": "user", "content": user_message},
    ]

    response = client.chat.completions.create(
        model="llama3.1:8b",
        messages=messages,
        tools=tools,
    )

    msg = response.choices[0].message

    if msg.tool_calls:
        # Append the assistant turn once, before the tool results it requested
        messages.append(msg)
        for call in msg.tool_calls:
            result = get_weather(**json.loads(call.function.arguments))
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": result,
            })

        final = client.chat.completions.create(
            model="llama3.1:8b",
            messages=messages,
        )
        return final.choices[0].message.content

    return msg.content

print(run_agent("What is the weather in Tokyo?"))

Performance Tips for Local Inference

GPU acceleration makes an enormous difference. On a MacBook with an M-series chip, Ollama automatically uses the GPU via Metal. On Linux, ensure you have NVIDIA drivers and the CUDA toolkit installed:

# Check if Ollama detects your GPU
ollama ps

For agents that make many sequential calls, keep the model loaded in memory by setting OLLAMA_KEEP_ALIVE:

export OLLAMA_KEEP_ALIVE=30m  # Keep model loaded for 30 minutes

By default, Ollama unloads a model after about five minutes of inactivity; extending the keep-alive window avoids the 2-5 second reload penalty between bursts of requests.

FAQ

Does Ollama support function calling with all models?

Not all models support structured tool calling. Llama 3.1, Mistral, and command-r models have native tool-calling support in Ollama. Smaller or older models may require manual prompt engineering to simulate tool use.
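For models without native tool support, one common fallback is to instruct the model to reply with a JSON object describing the tool call and parse it yourself. The extractor below is a sketch of that pattern (the regex-plus-json.loads approach is a generic technique, not an Ollama feature):

```python
import json
import re


def extract_tool_call(reply: str):
    """Return the first JSON object embedded in a model reply, or None."""
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
```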

How much RAM do I need to run Llama 3.1 8B?

The 4-bit quantized version requires approximately 5-6 GB of RAM. The full-precision version needs around 16 GB. For the 70B variant, expect 40+ GB for the quantized version, making it practical only on high-end workstations or servers.
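These figures follow from simple arithmetic: weight memory is roughly parameter count times bits per weight, plus runtime overhead for the KV cache and buffers. A back-of-the-envelope estimator (the 20% overhead factor is an assumption, not an Ollama constant):

```python
def estimate_ram_gb(params_billion: float,
                    bits_per_weight: int,
                    overhead: float = 1.2) -> float:
    """Rough RAM estimate: weights (params * bits / 8) plus ~20% overhead."""
    return params_billion * bits_per_weight / 8 * overhead


print(round(estimate_ram_gb(8, 4), 1))   # 8B at 4-bit  -> 4.8
print(round(estimate_ram_gb(70, 4), 1))  # 70B at 4-bit -> 42.0
```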

Can I run Ollama on a machine without a GPU?

Yes, Ollama falls back to CPU inference automatically. Performance will be significantly slower — expect 5-15 tokens per second on a modern CPU versus 40-80 tokens per second on a mid-range GPU — but it works for development and testing.


#Ollama #Llama #LocalLLM #OpenSourceAI #AgentDevelopment #AgenticAI #LearnAI #AIEngineering

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
