Running Llama Models Locally with Ollama: Setup and Agent Integration
Learn how to install Ollama, pull Llama models, serve them through OpenAI-compatible endpoints, and integrate local LLMs into your AI agent pipelines for private, cost-free inference.
Why Run Models Locally?
Cloud-hosted LLM APIs are convenient, but they come with per-token costs, network latency, rate limits, and data privacy concerns. Running models locally eliminates all four. Ollama makes local inference as simple as pulling a Docker image — you download a model once, and it runs on your hardware with zero API fees.
For agent development specifically, local models let you iterate rapidly without worrying about billing. You can run thousands of test invocations during development, experiment with different system prompts, and debug tool-calling behavior without watching a usage meter tick upward.
Installing Ollama
Ollama runs on macOS, Linux, and Windows. The installation is a single command on most systems.
macOS and Linux:
curl -fsSL https://ollama.com/install.sh | sh
Verify the installation:
ollama --version
After installation, Ollama runs a background server that listens on http://localhost:11434 by default. This server exposes both a native API and an OpenAI-compatible endpoint.
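You can confirm the server is reachable programmatically. The sketch below queries Ollama's native /api/tags endpoint (which lists downloaded models) and falls back to an empty list if the server is down. It assumes the default port; adjust the base URL if you changed OLLAMA_HOST.

```python
import json
from urllib.request import urlopen
from urllib.error import URLError

def list_local_models(base_url: str = "http://localhost:11434") -> list[str]:
    """Return names of locally downloaded models, or [] if the server is unreachable."""
    try:
        with urlopen(f"{base_url}/api/tags", timeout=2) as resp:
            data = json.load(resp)
        return [m["name"] for m in data.get("models", [])]
    except (URLError, OSError):
        return []

print(list_local_models())  # e.g. ['llama3.1:8b'] once you have pulled a model
```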
Pulling and Running Llama Models
Ollama hosts a registry of pre-quantized models. Pulling a model downloads it to your local cache:
# Pull Llama 3.1 8B (4.7 GB)
ollama pull llama3.1:8b
# Pull a smaller quantized variant
ollama pull llama3.1:8b-q4_0
# List all downloaded models
ollama list
Run an interactive chat to verify the model works:
ollama run llama3.1:8b "Explain what an AI agent is in two sentences."
Using the OpenAI-Compatible API
Ollama exposes an endpoint at /v1/chat/completions that mirrors the OpenAI Chat Completions API. This means you can use the standard openai Python package with zero code changes — just point it at your local server:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Required by the client but not validated
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What are the benefits of agentic AI?"},
    ],
    temperature=0.7,
)

print(response.choices[0].message.content)
This compatibility layer is critical for agent frameworks. Any framework that supports OpenAI-format requests — including LangChain, CrewAI, and the OpenAI Agents SDK via LiteLLM — can use Ollama as a backend.
Building a Simple Agent with Ollama
Here is a complete agent with tool calling that runs entirely on your local machine:
from openai import OpenAI
import json

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"}
                },
                "required": ["city"],
            },
        },
    }
]

def get_weather(city: str) -> str:
    # Simulated weather lookup
    return json.dumps({"city": city, "temp": "22C", "condition": "sunny"})

def run_agent(user_message: str):
    messages = [
        {"role": "system", "content": "You are a weather assistant. Use tools when needed."},
        {"role": "user", "content": user_message},
    ]
    response = client.chat.completions.create(
        model="llama3.1:8b",
        messages=messages,
        tools=tools,
    )
    msg = response.choices[0].message
    if msg.tool_calls:
        # Append the assistant turn once, then one tool result per call
        messages.append(msg)
        for call in msg.tool_calls:
            result = get_weather(**json.loads(call.function.arguments))
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": result,
            })
        final = client.chat.completions.create(
            model="llama3.1:8b",
            messages=messages,
        )
        return final.choices[0].message.content
    return msg.content

print(run_agent("What is the weather in Tokyo?"))
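With more than one tool, hard-coding get_weather in the loop stops scaling. A common pattern (a sketch, not an Ollama feature) is a registry that maps tool names to Python functions and decodes the JSON argument string the model returns; get_time here is a hypothetical second tool added only to illustrate dispatch:

```python
import json
from typing import Callable

def get_weather(city: str) -> str:
    # Simulated weather lookup, as in the agent above
    return json.dumps({"city": city, "temp": "22C", "condition": "sunny"})

def get_time(city: str) -> str:
    # Hypothetical second tool to demonstrate dispatch
    return json.dumps({"city": city, "time": "14:00"})

TOOL_REGISTRY: dict[str, Callable[..., str]] = {
    "get_weather": get_weather,
    "get_time": get_time,
}

def dispatch_tool(name: str, arguments_json: str) -> str:
    """Look up a tool by name and call it with the model-supplied arguments."""
    fn = TOOL_REGISTRY.get(name)
    if fn is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        return fn(**json.loads(arguments_json))
    except (TypeError, json.JSONDecodeError) as exc:
        # Bad argument names or malformed JSON from the model
        return json.dumps({"error": str(exc)})
```

In the agent loop this replaces the direct call: result = dispatch_tool(call.function.name, call.function.arguments). Returning an error payload instead of raising lets the model see the failure and retry.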
Performance Tips for Local Inference
GPU acceleration makes an enormous difference. On a MacBook with an M-series chip, Ollama automatically uses the GPU via Metal. On Linux, make sure the NVIDIA driver is installed; the Ollama installer bundles the CUDA libraries it needs:
# Check if Ollama detects your GPU
ollama ps
For agents that make many sequential calls, keep the model loaded in memory by setting OLLAMA_KEEP_ALIVE:
export OLLAMA_KEEP_ALIVE=30m # Keep model loaded for 30 minutes
This avoids the 2-5 second cold-start penalty on each request.
FAQ
Does Ollama support function calling with all models?
Not all models support structured tool calling. Llama 3.1, Mistral, and command-r models have native tool-calling support in Ollama. Smaller or older models may require manual prompt engineering to simulate tool use.
How much RAM do I need to run Llama 3.1 8B?
The 4-bit quantized version requires approximately 5-6 GB of RAM. The full-precision version needs around 16 GB. For the 70B variant, expect 40+ GB for the quantized version, making it practical only on high-end workstations or servers.
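These figures follow a rough rule of thumb: memory ≈ parameter count × bytes per weight, plus some fixed overhead for the KV cache and runtime. A back-of-the-envelope helper (the 1.5 GB overhead constant is an assumption for illustration, not an Ollama figure):

```python
def estimated_ram_gb(params_billion: float, bits_per_weight: int,
                     overhead_gb: float = 1.5) -> float:
    """Rough RAM estimate: weight storage plus fixed overhead (KV cache, runtime)."""
    weight_gb = params_billion * bits_per_weight / 8  # billions of params -> GB
    return round(weight_gb + overhead_gb, 1)

print(estimated_ram_gb(8, 4))    # 4-bit Llama 3.1 8B
print(estimated_ram_gb(8, 16))   # fp16 Llama 3.1 8B
print(estimated_ram_gb(70, 4))   # 4-bit Llama 3.1 70B
```

The estimates (about 5.5, 17.5, and 36.5 GB) land close to the observed figures above; real usage grows with context length, since the KV cache scales with it.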
Can I run Ollama on a machine without a GPU?
Yes, Ollama falls back to CPU inference automatically. Performance will be significantly slower — expect 5-15 tokens per second on a modern CPU versus 40-80 tokens per second on a mid-range GPU — but it works for development and testing.
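Throughput translates directly into response latency: generation time ≈ tokens ÷ tokens per second, ignoring prompt processing. A quick comparison using the CPU and GPU rates quoted above:

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Seconds to generate num_tokens at a given decode rate (prompt processing ignored)."""
    return round(num_tokens / tokens_per_second, 1)

# A 300-token reply at the rates quoted above:
print(generation_seconds(300, 10))  # CPU at ~10 tok/s -> 30.0 s
print(generation_seconds(300, 60))  # mid-range GPU at ~60 tok/s -> 5.0 s
```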
#Ollama #Llama #LocalLLM #OpenSourceAI #AgentDevelopment #AgenticAI #LearnAI #AIEngineering
CallSphere Team