
Running Open-Source LLMs Locally: Ollama, vLLM, and llama.cpp Setup Guide

A practical guide to running open-source language models on your own hardware using Ollama, vLLM, and llama.cpp, covering installation, model management, API compatibility, and performance optimization.

Why Run LLMs Locally

Running language models on your own hardware gives you data privacy, zero per-token costs, full control over the model, and no rate limits. The tradeoff is that you need to manage hardware, handle scaling, and accept that smaller local models will not match the quality of frontier cloud models like GPT-4o or Claude.

Three tools dominate the local LLM ecosystem. Ollama is the easiest to set up and best for development. vLLM delivers the highest throughput for production serving. llama.cpp provides maximum flexibility and runs on CPU-only machines.

Ollama: The Easiest Path

Ollama packages model downloading, quantization, and serving into a single binary. It runs on macOS, Linux, and Windows.

# After installing Ollama (curl -fsSL https://ollama.com/install.sh | sh)

# Pull a model
# ollama pull llama3.1:8b

# Ollama exposes an OpenAI-compatible API at http://localhost:11434
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Required but not validated
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function to merge two sorted lists."},
    ],
    temperature=0.0,
)
print(response.choices[0].message.content)
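For interactive use you can stream tokens as they arrive instead of waiting for the full reply; `stream=True` works against Ollama's OpenAI-compatible endpoint. A minimal sketch (`stream_and_collect` is a hypothetical helper, not part of any library):

```python
def stream_and_collect(chunks) -> str:
    """Accumulate the text deltas of a streamed chat completion into one string."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)  # show tokens as they arrive
            parts.append(delta)
    return "".join(parts)

# Against a running Ollama server, reusing the client from above:
#   stream = client.chat.completions.create(
#       model="llama3.1:8b",
#       messages=[{"role": "user", "content": "Count to five."}],
#       stream=True,
#   )
#   full_text = stream_and_collect(stream)
```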

Managing models with Ollama:

import subprocess
import json

def list_ollama_models() -> list[dict]:
    """List all downloaded Ollama models."""
    result = subprocess.run(
        ["ollama", "list"], capture_output=True, text=True
    )
    lines = result.stdout.strip().split("\n")[1:]  # Skip header
    models = []
    for line in lines:
        parts = line.split()
        if len(parts) >= 4:
            models.append({
                "name": parts[0],
                "id": parts[1],
                "size": parts[2] + " " + parts[3],
            })
    return models

def create_custom_model(
    name: str,
    base_model: str,
    system_prompt: str,
    temperature: float = 0.7,
) -> str:
    """Create a custom Ollama model with a Modelfile."""
    modelfile = f"""FROM {base_model}
SYSTEM {json.dumps(system_prompt)}
PARAMETER temperature {temperature}
PARAMETER num_ctx 4096
"""
    modelfile_path = f"/tmp/{name}.Modelfile"
    with open(modelfile_path, "w") as f:
        f.write(modelfile)

    result = subprocess.run(
        ["ollama", "create", name, "-f", modelfile_path],
        capture_output=True, text=True,
    )
    return result.stdout
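Shelling out to `ollama list` works, but Ollama also exposes a native REST API on the same port. A sketch that reads the model list from the `GET /api/tags` endpoint (the `size` field is assumed to be in bytes; `parse_tags` is a hypothetical helper):

```python
import json
import urllib.request

def parse_tags(payload: dict) -> list[dict]:
    """Extract name and approximate size in GB from an /api/tags response."""
    return [
        {"name": m["name"], "size_gb": round(m["size"] / 1e9, 1)}
        for m in payload.get("models", [])
    ]

# Against a running Ollama server:
#   with urllib.request.urlopen("http://localhost:11434/api/tags") as resp:
#       print(parse_tags(json.load(resp)))
```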

vLLM: Production-Grade Serving

vLLM is an inference engine designed for high throughput. It uses PagedAttention to manage GPU memory efficiently, supports continuous batching, and delivers several-fold higher throughput than naive Hugging Face Transformers inference.
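Continuous batching is what separates vLLM from request-at-a-time serving: new requests join the running batch as soon as a slot frees up, rather than waiting for the whole batch to drain. A toy illustration of the idea (not vLLM's actual scheduler):

```python
from collections import deque

def continuous_batching(requests, max_batch: int):
    """Toy continuous-batching loop.

    Each request is (id, decode_steps_needed). Waiting requests are admitted
    into the batch the moment a slot frees, instead of waiting for the whole
    batch to finish as static batching would.
    """
    queue = deque(requests)
    active: dict[str, int] = {}   # request id -> steps remaining
    steps = 0
    completed: list[str] = []
    while queue or active:
        # Admit waiting requests into free slots before every decode step.
        while queue and len(active) < max_batch:
            rid, n = queue.popleft()
            active[rid] = n
        # One decode step advances every active request by one token.
        steps += 1
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]
                completed.append(rid)
    return steps, completed
```

With requests of 3, 1, and 2 decode steps and a batch size of 2, the loop finishes in 3 steps; static batching would need 5, because the 2-step request could not start until the first batch fully drained.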

# Install: pip install vllm

# Start vLLM server (OpenAI-compatible)
# python -m vllm.entrypoints.openai.api_server \
#   --model meta-llama/Llama-3.1-8B-Instruct \
#   --dtype bfloat16 \
#   --max-model-len 8192 \
#   --gpu-memory-utilization 0.9 \
#   --port 8000

# Use exactly like OpenAI API
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",
)

# Synchronous request
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "Explain gradient descent in three sentences."},
    ],
    temperature=0.0,
    max_tokens=256,
)
print(response.choices[0].message.content)

Benchmarking vLLM throughput:

import asyncio
import time
from openai import AsyncOpenAI

async def benchmark_throughput(
    base_url: str,
    model: str,
    num_requests: int = 100,
    max_concurrent: int = 10,
) -> dict:
    """Benchmark inference throughput with concurrent requests."""
    client = AsyncOpenAI(base_url=base_url, api_key="x")
    semaphore = asyncio.Semaphore(max_concurrent)
    latencies = []

    async def single_request(prompt: str):
        async with semaphore:
            start = time.perf_counter()
            response = await client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                max_tokens=128,
                temperature=0.0,
            )
            elapsed = time.perf_counter() - start
            latencies.append(elapsed)
            tokens = response.usage.completion_tokens
            return tokens

    prompts = [f"What is the square root of {i * 17}?" for i in range(num_requests)]
    start_total = time.perf_counter()
    results = await asyncio.gather(*[single_request(p) for p in prompts])
    total_time = time.perf_counter() - start_total

    total_tokens = sum(results)
    return {
        "total_requests": num_requests,
        "total_time_s": round(total_time, 2),
        "requests_per_second": round(num_requests / total_time, 1),
        "tokens_per_second": round(total_tokens / total_time, 1),
        "avg_latency_ms": round(sum(latencies) / len(latencies) * 1000, 0),
        "p99_latency_ms": round(sorted(latencies)[int(0.99 * len(latencies))] * 1000, 0),
    }
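A small hypothetical formatter for the stats dict returned above, plus how you might invoke the benchmark against a local vLLM server:

```python
def format_report(stats: dict) -> str:
    """Render the benchmark stats dict as a one-line summary."""
    return (f"{stats['total_requests']} reqs in {stats['total_time_s']}s | "
            f"{stats['requests_per_second']} req/s | "
            f"{stats['tokens_per_second']} tok/s | "
            f"p99 {stats['p99_latency_ms']} ms")

# Against a running vLLM server (see the server command above):
#   import asyncio
#   stats = asyncio.run(benchmark_throughput(
#       "http://localhost:8000/v1", "meta-llama/Llama-3.1-8B-Instruct"))
#   print(format_report(stats))
```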

llama.cpp: Maximum Flexibility

llama.cpp runs models on CPU, Apple Silicon, CUDA GPUs, and even mobile devices. It uses GGUF quantized models for efficient memory usage.

# Install Python bindings: pip install llama-cpp-python

# For GPU acceleration:
# CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python

from llama_cpp import Llama

# Load a GGUF model
llm = Llama(
    model_path="./models/llama-3.1-8b-instruct-q4_k_m.gguf",
    n_ctx=4096,         # Context window
    n_gpu_layers=-1,    # -1 = offload all layers to GPU
    n_threads=8,        # CPU threads for non-GPU layers
    verbose=False,
)

# Chat completion
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What are the SOLID principles?"},
    ],
    temperature=0.0,
    max_tokens=512,
)
print(response["choices"][0]["message"]["content"])
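llama-cpp-python also supports streaming: with `stream=True`, `create_chat_completion` yields incremental dict chunks instead of a single response. A sketch with a hypothetical collector that joins the text deltas:

```python
def collect_stream(chunks) -> str:
    """Join the text deltas from a streamed llama.cpp chat completion."""
    parts = []
    for chunk in chunks:
        delta = chunk["choices"][0]["delta"]
        text = delta.get("content")  # first chunk may carry only the role
        if text:
            parts.append(text)
    return "".join(parts)

# With the `llm` instance from the example above:
#   stream = llm.create_chat_completion(
#       messages=[{"role": "user", "content": "Name three sorting algorithms."}],
#       stream=True,
#       max_tokens=128,
#   )
#   print(collect_stream(stream))
```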

Performance Comparison

Feature | Ollama | vLLM | llama.cpp
Setup difficulty | Very easy | Moderate | Moderate
GPU required | No (but helps) | Yes | No
Throughput | Good | Best | Good
Concurrent requests | Limited | Excellent | Limited
Model format | Ollama/GGUF | HF Transformers | GGUF
OpenAI-compatible API | Yes | Yes | Yes (server mode)
Best for | Development | Production serving | Edge/CPU deployment

FAQ

Which tool should I use for local development and prototyping?

Ollama is the clear choice for development. It installs with a single command, downloads models automatically, and runs with no configuration. The OpenAI-compatible API means you can develop against Ollama and switch to a cloud API for production by changing only the base URL. Use vLLM only when you need production-level throughput or concurrent request handling.
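The swap-by-base-URL pattern can be made explicit with a small configuration helper. A sketch, assuming hypothetical `LLM_BASE_URL` and `LLM_API_KEY` environment variables of your own choosing:

```python
import os

def llm_client_config() -> dict:
    """Pick an OpenAI-compatible endpoint from the environment.

    With LLM_BASE_URL unset, this points at a local Ollama server;
    set it to a cloud endpoint (plus a real key) in production.
    """
    return {
        "base_url": os.environ.get("LLM_BASE_URL", "http://localhost:11434/v1"),
        "api_key": os.environ.get("LLM_API_KEY", "ollama"),
    }

# Usage:
#   from openai import OpenAI
#   client = OpenAI(**llm_client_config())
```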

How much VRAM do I need to run different model sizes locally?

For quantized models (Q4_K_M, the most common quantization): 7-8B parameter models need 4-6 GB VRAM, 13B models need 8-10 GB, and 70B models need 36-40 GB. Full-precision (bf16) models require roughly 2x the parameter count in bytes — so 8B parameters need 16 GB. Consumer GPUs like RTX 4090 (24 GB) can comfortably run 8-13B quantized models.
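These figures follow from simple arithmetic. A rough, hypothetical estimator, assuming ~4.5 bits per weight for Q4_K_M, 16 for bf16, and ~20% overhead for KV cache and activations (actual usage varies with context length and batch size):

```python
def estimate_vram_gb(params_b: float, bits_per_weight: float,
                     overhead: float = 1.2) -> float:
    """Rough VRAM estimate in GB: weight bytes plus ~20% overhead.

    params_b: model size in billions of parameters.
    bits_per_weight: ~16 for bf16, ~8 for Q8_0, ~4.5 for Q4_K_M (approximate).
    """
    weight_bytes = params_b * 1e9 * bits_per_weight / 8
    return round(weight_bytes * overhead / 1e9, 1)

# An 8B model: ~19.2 GB at bf16, ~5.4 GB at Q4_K_M --
# consistent with the 4-6 GB range quoted above.
```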

Can I serve fine-tuned LoRA adapters with these tools?

Yes, all three support LoRA adapters. Ollama can import adapters through Modelfiles. vLLM supports loading LoRA adapters at runtime and even serving multiple adapters simultaneously with the same base model. llama.cpp supports GGUF-format adapters that can be applied on top of a base model. For vLLM, this is especially powerful because you can A/B test multiple fine-tuned variants without duplicating the base model in memory.


#Ollama #VLLM #Llamacpp #LocalLLM #OpenSource #SelfHosted #AgenticAI #LearnAI #AIEngineering

CallSphere Team
