
Running Open-Source LLMs Locally: Ollama, vLLM, and llama.cpp Setup Guide

A practical guide to running open-source language models on your own hardware using Ollama, vLLM, and llama.cpp, covering installation, model management, API compatibility, and performance optimization.

Why Run LLMs Locally

Running language models on your own hardware gives you data privacy, zero per-token costs, full control over the model, and no rate limits. The tradeoff is that you need to manage hardware, handle scaling, and accept that smaller local models will not match the quality of frontier cloud models like GPT-4o or Claude.

Three tools dominate the local LLM ecosystem. Ollama is the easiest to set up and best for development. vLLM delivers the highest throughput for production serving. llama.cpp provides maximum flexibility and runs on CPU-only machines.

Ollama: The Easiest Path

Ollama packages model downloading, quantization, and serving into a single binary. It runs on macOS, Linux, and Windows.

# After installing Ollama (curl -fsSL https://ollama.com/install.sh | sh)

# Pull a model
# ollama pull llama3.1:8b

# Ollama exposes an OpenAI-compatible API at http://localhost:11434
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Required but not validated
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function to merge two sorted lists."},
    ],
    temperature=0.0,
)
print(response.choices[0].message.content)
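For interactive use you can stream tokens as they arrive instead of waiting for the full reply; `stream=True` works against Ollama's OpenAI-compatible endpoint. A minimal sketch (`stream_and_collect` is a hypothetical helper, not part of any library):

```python
def stream_and_collect(chunks) -> str:
    """Accumulate the text deltas of a streamed chat completion into one string."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)  # show tokens as they arrive
            parts.append(delta)
    return "".join(parts)

# Against a running Ollama server, reusing the client from above:
#   stream = client.chat.completions.create(
#       model="llama3.1:8b",
#       messages=[{"role": "user", "content": "Count to five."}],
#       stream=True,
#   )
#   full_text = stream_and_collect(stream)
```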

Managing models with Ollama:

import subprocess
import json

def list_ollama_models() -> list[dict]:
    """List all downloaded Ollama models."""
    result = subprocess.run(
        ["ollama", "list"], capture_output=True, text=True
    )
    lines = result.stdout.strip().split("\n")[1:]  # Skip header
    models = []
    for line in lines:
        parts = line.split()
        if len(parts) >= 4:
            models.append({
                "name": parts[0],
                "id": parts[1],
                "size": parts[2] + " " + parts[3],
            })
    return models

def create_custom_model(
    name: str,
    base_model: str,
    system_prompt: str,
    temperature: float = 0.7,
) -> str:
    """Create a custom Ollama model with a Modelfile."""
    modelfile = f"""FROM {base_model}
SYSTEM {json.dumps(system_prompt)}
PARAMETER temperature {temperature}
PARAMETER num_ctx 4096
"""
    modelfile_path = f"/tmp/{name}.Modelfile"
    with open(modelfile_path, "w") as f:
        f.write(modelfile)

    result = subprocess.run(
        ["ollama", "create", name, "-f", modelfile_path],
        capture_output=True, text=True,
    )
    return result.stdout
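Shelling out to `ollama list` works, but Ollama also exposes a native REST API on the same port. A sketch that reads the model list from the `GET /api/tags` endpoint (the `size` field is assumed to be in bytes; `parse_tags` is a hypothetical helper):

```python
import json
import urllib.request

def parse_tags(payload: dict) -> list[dict]:
    """Extract name and approximate size in GB from an /api/tags response."""
    return [
        {"name": m["name"], "size_gb": round(m["size"] / 1e9, 1)}
        for m in payload.get("models", [])
    ]

# Against a running Ollama server:
#   with urllib.request.urlopen("http://localhost:11434/api/tags") as resp:
#       print(parse_tags(json.load(resp)))
```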

vLLM: Production-Grade Serving

vLLM is an inference engine designed for high throughput. It uses PagedAttention to manage GPU memory efficiently, supports continuous batching, and delivers several-fold higher throughput than naive Hugging Face Transformers inference.
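Continuous batching is what separates vLLM from request-at-a-time serving: new requests join the running batch as soon as a slot frees up, rather than waiting for the whole batch to drain. A toy illustration of the idea (not vLLM's actual scheduler):

```python
from collections import deque

def continuous_batching(requests, max_batch: int):
    """Toy continuous-batching loop.

    Each request is (id, decode_steps_needed). Waiting requests are admitted
    into the batch the moment a slot frees, instead of waiting for the whole
    batch to finish as static batching would.
    """
    queue = deque(requests)
    active: dict[str, int] = {}   # request id -> steps remaining
    steps = 0
    completed: list[str] = []
    while queue or active:
        # Admit waiting requests into free slots before every decode step.
        while queue and len(active) < max_batch:
            rid, n = queue.popleft()
            active[rid] = n
        # One decode step advances every active request by one token.
        steps += 1
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]
                completed.append(rid)
    return steps, completed
```

With requests of 3, 1, and 2 decode steps and a batch size of 2, the loop finishes in 3 steps; static batching would need 5, because the 2-step request could not start until the first batch fully drained.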

# Install: pip install vllm

# Start vLLM server (OpenAI-compatible)
# python -m vllm.entrypoints.openai.api_server \
#   --model meta-llama/Llama-3.1-8B-Instruct \
#   --dtype bfloat16 \
#   --max-model-len 8192 \
#   --gpu-memory-utilization 0.9 \
#   --port 8000

# Use exactly like OpenAI API
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",
)

# Synchronous request
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "Explain gradient descent in three sentences."},
    ],
    temperature=0.0,
    max_tokens=256,
)
print(response.choices[0].message.content)

Benchmarking vLLM throughput:

import asyncio
import time
from openai import AsyncOpenAI

async def benchmark_throughput(
    base_url: str,
    model: str,
    num_requests: int = 100,
    max_concurrent: int = 10,
) -> dict:
    """Benchmark inference throughput with concurrent requests."""
    client = AsyncOpenAI(base_url=base_url, api_key="x")
    semaphore = asyncio.Semaphore(max_concurrent)
    latencies = []

    async def single_request(prompt: str):
        async with semaphore:
            start = time.perf_counter()
            response = await client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                max_tokens=128,
                temperature=0.0,
            )
            elapsed = time.perf_counter() - start
            latencies.append(elapsed)
            tokens = response.usage.completion_tokens
            return tokens

    prompts = [f"What is the square root of {i * 17}?" for i in range(num_requests)]
    start_total = time.perf_counter()
    results = await asyncio.gather(*[single_request(p) for p in prompts])
    total_time = time.perf_counter() - start_total

    total_tokens = sum(results)
    return {
        "total_requests": num_requests,
        "total_time_s": round(total_time, 2),
        "requests_per_second": round(num_requests / total_time, 1),
        "tokens_per_second": round(total_tokens / total_time, 1),
        "avg_latency_ms": round(sum(latencies) / len(latencies) * 1000, 0),
        "p99_latency_ms": round(sorted(latencies)[int(0.99 * len(latencies))] * 1000, 0),
    }
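A small hypothetical formatter for the stats dict returned above, plus how you might invoke the benchmark against a local vLLM server:

```python
def format_report(stats: dict) -> str:
    """Render the benchmark stats dict as a one-line summary."""
    return (f"{stats['total_requests']} reqs in {stats['total_time_s']}s | "
            f"{stats['requests_per_second']} req/s | "
            f"{stats['tokens_per_second']} tok/s | "
            f"p99 {stats['p99_latency_ms']} ms")

# Against a running vLLM server (see the server command above):
#   import asyncio
#   stats = asyncio.run(benchmark_throughput(
#       "http://localhost:8000/v1", "meta-llama/Llama-3.1-8B-Instruct"))
#   print(format_report(stats))
```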

llama.cpp: Maximum Flexibility

llama.cpp runs models on CPU, Apple Silicon, CUDA GPUs, and even mobile devices. It uses GGUF quantized models for efficient memory usage.

# Install Python bindings: pip install llama-cpp-python

# For GPU acceleration:
# CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python

from llama_cpp import Llama

# Load a GGUF model
llm = Llama(
    model_path="./models/llama-3.1-8b-instruct-q4_k_m.gguf",
    n_ctx=4096,         # Context window
    n_gpu_layers=-1,    # -1 = offload all layers to GPU
    n_threads=8,        # CPU threads for non-GPU layers
    verbose=False,
)

# Chat completion
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What are the SOLID principles?"},
    ],
    temperature=0.0,
    max_tokens=512,
)
print(response["choices"][0]["message"]["content"])
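llama-cpp-python also supports streaming: with `stream=True`, `create_chat_completion` yields incremental dict chunks instead of a single response. A sketch with a hypothetical collector that joins the text deltas:

```python
def collect_stream(chunks) -> str:
    """Join the text deltas from a streamed llama.cpp chat completion."""
    parts = []
    for chunk in chunks:
        delta = chunk["choices"][0]["delta"]
        text = delta.get("content")  # first chunk may carry only the role
        if text:
            parts.append(text)
    return "".join(parts)

# With the `llm` instance from the example above:
#   stream = llm.create_chat_completion(
#       messages=[{"role": "user", "content": "Name three sorting algorithms."}],
#       stream=True,
#       max_tokens=128,
#   )
#   print(collect_stream(stream))
```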

Performance Comparison

Feature | Ollama | vLLM | llama.cpp
Setup difficulty | Very easy | Moderate | Moderate
GPU required | No (but helps) | Yes | No
Throughput | Good | Best | Good
Concurrent requests | Limited | Excellent | Limited
Model format | Ollama/GGUF | HF Transformers | GGUF
OpenAI-compatible API | Yes | Yes | Yes (server mode)
Best for | Development | Production serving | Edge/CPU deployment

FAQ

Which tool should I use for local development and prototyping?

Ollama is the clear choice for development. It installs with a single command, downloads models automatically, and runs with no configuration. The OpenAI-compatible API means you can develop against Ollama and switch to a cloud API for production by changing only the base URL. Use vLLM only when you need production-level throughput or concurrent request handling.
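The swap-by-base-URL pattern can be made explicit with a small configuration helper. A sketch, assuming hypothetical `LLM_BASE_URL` and `LLM_API_KEY` environment variables of your own choosing:

```python
import os

def llm_client_config() -> dict:
    """Pick an OpenAI-compatible endpoint from the environment.

    With LLM_BASE_URL unset, this points at a local Ollama server;
    set it to a cloud endpoint (plus a real key) in production.
    """
    return {
        "base_url": os.environ.get("LLM_BASE_URL", "http://localhost:11434/v1"),
        "api_key": os.environ.get("LLM_API_KEY", "ollama"),
    }

# Usage:
#   from openai import OpenAI
#   client = OpenAI(**llm_client_config())
```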

How much VRAM do I need to run different model sizes locally?

For quantized models (Q4_K_M, the most common quantization): 7-8B parameter models need 4-6 GB VRAM, 13B models need 8-10 GB, and 70B models need 36-40 GB. Full-precision (bf16) models require roughly 2x the parameter count in bytes — so 8B parameters need 16 GB. Consumer GPUs like RTX 4090 (24 GB) can comfortably run 8-13B quantized models.
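These figures follow from simple arithmetic. A rough, hypothetical estimator, assuming ~4.5 bits per weight for Q4_K_M, 16 for bf16, and ~20% overhead for KV cache and activations (actual usage varies with context length and batch size):

```python
def estimate_vram_gb(params_b: float, bits_per_weight: float,
                     overhead: float = 1.2) -> float:
    """Rough VRAM estimate in GB: weight bytes plus ~20% overhead.

    params_b: model size in billions of parameters.
    bits_per_weight: ~16 for bf16, ~8 for Q8_0, ~4.5 for Q4_K_M (approximate).
    """
    weight_bytes = params_b * 1e9 * bits_per_weight / 8
    return round(weight_bytes * overhead / 1e9, 1)

# An 8B model: ~19.2 GB at bf16, ~5.4 GB at Q4_K_M --
# consistent with the 4-6 GB range quoted above.
```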

Can I serve fine-tuned LoRA adapters with these tools?

Yes, all three support LoRA adapters. Ollama can import adapters through Modelfiles. vLLM supports loading LoRA adapters at runtime and even serving multiple adapters simultaneously with the same base model. llama.cpp supports GGUF-format adapters that can be applied on top of a base model. For vLLM, this is especially powerful because you can A/B test multiple fine-tuned variants without duplicating the base model in memory.


#Ollama #VLLM #Llamacpp #LocalLLM #OpenSource #SelfHosted #AgenticAI #LearnAI #AIEngineering

CallSphere Team
