
vLLM for High-Throughput LLM Serving: Running Open-Source Models in Production

Set up vLLM for production-grade LLM inference with PagedAttention, continuous batching, and OpenAI-compatible APIs. Learn performance tuning for serving open-source models at scale.

The Problem with Naive LLM Serving

When you load a model with Hugging Face Transformers and call model.generate(), each request is processed one at a time: the prefill phase (processing the prompt) runs first, then the decode phase generates tokens one by one, leaving most of the GPU's compute idle. With multiple concurrent users, requests queue up and latency becomes unacceptable.

vLLM solves this with two key innovations: PagedAttention for memory-efficient KV-cache management, and continuous batching that dynamically groups requests to maximize GPU utilization. The result is 2-24x higher throughput compared to naive serving, depending on the workload.

Installing vLLM

vLLM requires a CUDA-capable GPU. Install it with pip:

pip install vllm

For a specific CUDA version:

pip install vllm --extra-index-url https://download.pytorch.org/whl/cu121

Verify GPU detection:

import vllm
from vllm import LLM

# Loading a model verifies the CUDA setup; vLLM logs the detected GPU(s)
# and the memory it reserves during initialization
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

Launching the OpenAI-Compatible Server

The fastest path to production is vLLM's built-in API server, which exposes OpenAI-compatible endpoints:

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --port 8000 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.90

This gives you /v1/chat/completions, /v1/completions, and /v1/models endpoints that any OpenAI-compatible client can consume immediately.
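Since the endpoints follow the OpenAI wire format, any HTTP client works. As a sketch of what a request looks like (the model name must match whatever you passed to --model):

```python
import json

# Request body for POST http://localhost:8000/v1/chat/completions
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is PagedAttention?"},
    ],
    "max_tokens": 256,
    "temperature": 0.7,
}

# Serialize exactly as an OpenAI-compatible client would
body = json.dumps(payload)
print(body[:60])
```

Send this body with a Content-Type: application/json header and you get back a standard chat completion object with a choices list.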

How PagedAttention Works

Traditional LLM serving pre-allocates a contiguous memory block for each request's KV-cache, sized for the maximum possible sequence length. This wastes enormous amounts of GPU memory: a request that generates only 50 tokens still reserves space for the full maximum, say 4096 tokens.


PagedAttention borrows the concept of virtual memory paging from operating systems. The KV-cache is divided into fixed-size blocks (pages) that are allocated on demand as tokens are generated. This reduces memory waste from 60-80% to under 4%, enabling far more concurrent requests.
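A toy calculation makes the difference concrete (this illustrates the allocation idea only, not vLLM's internals):

```python
# Contiguous pre-allocation reserves the worst case up front, while
# paging allocates fixed-size blocks only as tokens are generated.
MAX_SEQ_LEN = 4096   # worst-case sequence length
BLOCK_SIZE = 16      # tokens per KV-cache block (vLLM's default)

def contiguous_slots(generated_tokens: int) -> int:
    # One contiguous region sized for the worst case, regardless of use
    return MAX_SEQ_LEN

def paged_slots(generated_tokens: int) -> int:
    # Whole blocks on demand; waste is at most BLOCK_SIZE - 1 slots per
    # sequence (the tail of the last, partially filled block)
    blocks = -(-generated_tokens // BLOCK_SIZE)  # ceiling division
    return blocks * BLOCK_SIZE

# A 50-token response: 4096 slots reserved vs. 64 (4 blocks of 16)
print(contiguous_slots(50), paged_slots(50))
```

That freed memory is what lets vLLM pack far more concurrent sequences onto the same GPU.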

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    gpu_memory_utilization=0.90,  # Use 90% of GPU memory
    max_model_len=8192,
    block_size=16,  # KV-cache block size (default: 16)
)

# Process a batch of prompts simultaneously
prompts = [
    "Explain quantum computing to a 10-year-old.",
    "Write a Python function for binary search.",
    "What caused the 2008 financial crisis?",
    "Summarize the theory of relativity.",
]

params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(prompts, params)

for output in outputs:
    print(f"Prompt: {output.prompt[:50]}...")
    print(f"Response: {output.outputs[0].text[:100]}...")
    print("---")

Continuous Batching for Agent Workloads

Agent systems generate bursty, variable-length requests. One agent call might produce 20 tokens (a tool call), while another generates 500 tokens (a detailed explanation). Continuous batching handles this gracefully by adding new requests to the batch as soon as existing requests finish, rather than waiting for the entire batch to complete.
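The scheduling idea can be sketched in a few lines. This is a deliberately simplified simulation, not vLLM's actual scheduler: each iteration decodes one token per running sequence, and freed slots are refilled immediately from the waiting queue.

```python
from collections import deque

def simulate(remaining_tokens, max_seqs, waiting):
    """Decode iterations needed to finish all requests under
    continuous batching (one token per sequence per iteration)."""
    running = list(remaining_tokens[:max_seqs])
    queue = deque(remaining_tokens[max_seqs:] + waiting)
    steps = 0
    while running or queue:
        steps += 1
        # Decode one token each; drop sequences that just finished
        running = [t - 1 for t in running if t > 1]
        # Refill freed slots immediately instead of draining the batch
        while len(running) < max_seqs and queue:
            running.append(queue.popleft())
    return steps

# Two slots, a 20-token tool call alongside a 500-token explanation,
# plus another 20-token request waiting: everything finishes within
# the long request's 500 iterations.
print(simulate([20, 500], 2, [20]))
```

Under static batching the same workload would take roughly 520 iterations (the first batch waits for its longest member, then the queued request runs alone).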

Configure batching parameters for agent workloads:

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --max-num-batched-tokens 32768 \
    --max-num-seqs 256 \
    --enable-chunked-prefill

The --enable-chunked-prefill flag allows long prompts to be split across iterations, preventing a single large prompt from blocking the entire batch.
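The splitting idea itself is simple. A sketch (the chunk size here is illustrative; vLLM derives its budget from --max-num-batched-tokens):

```python
def prefill_chunks(prompt_len: int, chunk_size: int = 512) -> list[int]:
    """Split a prompt of prompt_len tokens into per-iteration chunks,
    so decode steps for other requests can run between the chunks."""
    chunks = []
    remaining = prompt_len
    while remaining > 0:
        take = min(chunk_size, remaining)
        chunks.append(take)
        remaining -= take
    return chunks

# A 1300-token prompt becomes three prefill slices
print(prefill_chunks(1300))
```

Without chunking, that 1300-token prefill would occupy a whole iteration by itself, stalling every in-flight decode.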

Connecting Agents to vLLM

Since vLLM exposes an OpenAI-compatible API, your agent code remains identical — just change the base URL:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",
)

def agent_step(messages: list) -> str:
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=messages,
        temperature=0.1,  # Lower temperature for agent reliability
        max_tokens=1024,
    )
    return response.choices[0].message.content

# Agent loop
messages = [{"role": "system", "content": "You are an analytical agent."}]
messages.append({"role": "user", "content": "Analyze recent trends in AI."})

result = agent_step(messages)
print(result)

Performance Tuning Checklist

Maximize throughput with these settings:

# Tensor parallelism across multiple GPUs
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.95 \
    --max-model-len 16384

Key tuning levers: raise gpu-memory-utilization so more KV-cache blocks fit and more requests run concurrently, use tensor-parallel-size to shard large models across GPUs, and serve a pre-quantized checkpoint (for example an AWQ or GPTQ variant, paired with the matching --quantization flag) to shrink the memory footprint without significant quality loss. Note that --quantization awq expects a checkpoint that was already quantized with AWQ; it does not quantize a full-precision model on the fly.
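To see why max-model-len is such a strong lever, here is a back-of-envelope KV-cache calculation. The architecture numbers are Llama-3.1-8B's published config (32 layers, grouped-query attention with 8 KV heads, head dimension 128); check your model's config before reusing them:

```python
num_layers = 32
num_kv_heads = 8        # grouped-query attention
head_dim = 128
bytes_per_param = 2     # fp16 / bf16

# KV-cache bytes per token = 2 (K and V) * layers * kv_heads * head_dim * bytes
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_param
print(kv_bytes_per_token)  # 131072 bytes = 128 KiB per token

# A single sequence at the full 8192-token context then needs:
seq_bytes = kv_bytes_per_token * 8192
print(f"{seq_bytes / 2**30:.1f} GiB per full-length sequence")
```

At 128 KiB per token, doubling max-model-len halves the number of worst-case sequences that fit in the KV-cache budget, which is why it should match your actual workload rather than the model's maximum.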

FAQ

How does vLLM compare to Ollama for production use?

Ollama is designed for single-user local inference with a focus on ease of use. vLLM is built for multi-user production serving with high concurrency. If you need to serve 50+ concurrent agent requests, vLLM is the right choice. For local development with one or two concurrent requests, Ollama is simpler.

Can vLLM serve multiple models simultaneously?

A single vLLM server instance serves one model. To serve multiple models, run multiple vLLM instances on different ports or GPUs, then use a router or load balancer to direct requests to the appropriate instance.
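A minimal routing sketch (the ports and second model are illustrative assumptions; a real deployment would more likely use a reverse proxy such as nginx or a dedicated LLM gateway):

```python
# One vLLM instance per model, each on its own port
INSTANCES = {
    "meta-llama/Llama-3.1-8B-Instruct": "http://localhost:8000/v1",
    "mistralai/Mistral-7B-Instruct-v0.3": "http://localhost:8001/v1",
}

def base_url_for(model: str) -> str:
    """Pick the vLLM instance serving the requested model."""
    try:
        return INSTANCES[model]
    except KeyError:
        raise ValueError(f"No vLLM instance serves {model!r}") from None

print(base_url_for("meta-llama/Llama-3.1-8B-Instruct"))
```

Client code then constructs its OpenAI client with base_url_for(model) and otherwise stays unchanged.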

What GPU do I need for vLLM?

vLLM requires an NVIDIA GPU with CUDA support. For 7-8B parameter models, a single GPU with 16+ GB VRAM (RTX 4090, A10G, or L4) works well. For 70B models, you need multiple GPUs totaling 80+ GB VRAM or use quantized variants.
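Those numbers follow from simple arithmetic on weight storage alone; KV-cache and activations add overhead on top, so treat this as a lower bound rather than a sizing tool:

```python
def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    """VRAM for model weights only: parameters x bytes per parameter.
    The 1e9 factors in 'billion' and 'GB' cancel out."""
    return params_billion * bytes_per_param

print(weight_gb(8, 2))     # 8B model at fp16: 16 GB of weights
print(weight_gb(70, 2))    # 70B at fp16: 140 GB -> multiple GPUs
print(weight_gb(70, 0.5))  # 70B at 4-bit: 35 GB of weights
```

This is why an 8B model is comfortable on a 24 GB card but tight on 16 GB once KV-cache is added, and why 70B needs either tensor parallelism or a quantized variant.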


#VLLM #LLMServing #ProductionAI #PagedAttention #OpenSource #AgenticAI #LearnAI #AIEngineering
