vLLM for High-Throughput LLM Serving: Running Open-Source Models in Production
Set up vLLM for production-grade LLM inference with PagedAttention, continuous batching, and OpenAI-compatible APIs. Learn performance tuning for serving open-source models at scale.
The Problem with Naive LLM Serving
When you load a model with Hugging Face Transformers and call model.generate(), each request is processed one at a time. During the decode phase the GPU runs one small forward pass per generated token and sits largely underutilized, and with multiple concurrent users, requests queue up until latency becomes unacceptable.
vLLM solves this with two key innovations: PagedAttention for memory-efficient KV-cache management, and continuous batching that dynamically groups requests to maximize GPU utilization. The result is 2-24x higher throughput compared to naive serving, depending on the workload.
Installing vLLM
vLLM requires a CUDA-capable GPU. Install it with pip:
pip install vllm
For a specific CUDA version:
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu121
Verify GPU detection:
from vllm import LLM

# Loading a model initializes CUDA; vLLM logs the detected GPU(s)
# and raises an error if no compatible device is found.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
Launching the OpenAI-Compatible Server
The fastest path to production is vLLM's built-in API server, which exposes OpenAI-compatible endpoints:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--port 8000 \
--max-model-len 8192 \
--gpu-memory-utilization 0.90
This gives you /v1/chat/completions, /v1/completions, and /v1/models endpoints that any OpenAI-compatible client can consume immediately.
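To sanity-check the endpoint without installing a client library, a stdlib-only request works. This sketch assumes the server launched above is running on localhost:8000; the `build_payload` and `chat` helpers are illustrative, not part of vLLM:

```python
import json
import urllib.request

API_URL = "http://localhost:8000/v1/chat/completions"  # matches the server above

def build_payload(messages, model="meta-llama/Llama-3.1-8B-Instruct", max_tokens=64):
    """Assemble the JSON body for an OpenAI-style chat completion request."""
    return {"model": model, "messages": messages, "max_tokens": max_tokens}

def chat(messages):
    """POST the request and return the assistant's reply text."""
    body = json.dumps(build_payload(messages)).encode("utf-8")
    req = urllib.request.Request(
        API_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# With the server running:
# print(chat([{"role": "user", "content": "Say hello in one word."}]))
```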
How PagedAttention Works
Traditional LLM serving pre-allocates contiguous memory blocks for the KV-cache of each request, based on the maximum possible sequence length. This wastes enormous amounts of GPU memory — a request that generates only 50 tokens still reserves memory for 4096 tokens.
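To see how large that waste gets, here is a back-of-the-envelope calculation using Llama-3.1-8B's published architecture (32 layers, 8 KV heads via grouped-query attention, head dimension 128) with an fp16 cache; the figures are for illustration:

```python
# Per-token KV-cache cost for Llama-3.1-8B in fp16.
LAYERS, KV_HEADS, HEAD_DIM, BYTES = 32, 8, 128, 2

# Keys and values are both cached, hence the leading factor of 2.
per_token_bytes = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES
print(per_token_bytes // 1024, "KiB per token")

# Pre-allocating for a 4096-token maximum when a request emits 50 tokens:
reserved = 4096 * per_token_bytes
used = 50 * per_token_bytes
print(f"wasted: {1 - used / reserved:.1%}")
```

At 128 KiB per token, the unused reservation for that request approaches 99% of its allocation, which is why per-request worst-case pre-allocation collapses under concurrency.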
PagedAttention borrows the concept of virtual memory paging from operating systems. The KV-cache is divided into fixed-size blocks (pages) that are allocated on demand as tokens are generated. This reduces memory waste from 60-80% to under 4%, enabling far more concurrent requests.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    gpu_memory_utilization=0.90,  # Use 90% of GPU memory
    max_model_len=8192,
    block_size=16,  # KV-cache block size (default: 16)
)

# Process a batch of prompts simultaneously
prompts = [
    "Explain quantum computing to a 10-year-old.",
    "Write a Python function for binary search.",
    "What caused the 2008 financial crisis?",
    "Summarize the theory of relativity.",
]
params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(prompts, params)

for output in outputs:
    print(f"Prompt: {output.prompt[:50]}...")
    print(f"Response: {output.outputs[0].text[:100]}...")
    print("---")
Continuous Batching for Agent Workloads
Agent systems generate bursty, variable-length requests. One agent call might produce 20 tokens (a tool call), while another generates 500 tokens (a detailed explanation). Continuous batching handles this gracefully by adding new requests to the batch as soon as existing requests finish, rather than waiting for the entire batch to complete.
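A toy scheduler makes the difference concrete. The decode lengths below are invented to mimic a bursty agent workload, and the model ignores prefill cost and per-step timing, so treat the numbers as qualitative:

```python
# Compare static batching (wait for the whole batch to finish) with
# continuous batching (refill a slot as soon as a request completes).
lengths = [20, 500, 20, 480, 30, 500, 25, 490]  # tokens to generate per request

def static_batch_steps(lengths, batch_size):
    """Each batch of requests runs until its longest member finishes."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))

def continuous_batch_steps(lengths, batch_size):
    """A finished request's slot is handed to a waiting request immediately."""
    pending, active, steps = list(lengths), [], 0
    while pending or active:
        while pending and len(active) < batch_size:
            active.append(pending.pop(0))      # refill free slots
        steps += 1                             # one decode iteration
        active = [t - 1 for t in active if t > 1]  # drop finished requests
    return steps

print(static_batch_steps(lengths, 4))      # 1000 iterations
print(continuous_batch_steps(lengths, 4))  # 565 iterations
```

In this toy run, short tool-call requests no longer wait for the 500-token responses in their batch, cutting total decode iterations nearly in half.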
Configure batching parameters for agent workloads:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--max-num-batched-tokens 32768 \
--max-num-seqs 256 \
--enable-chunked-prefill
The --enable-chunked-prefill flag allows long prompts to be split across iterations, preventing a single large prompt from blocking the entire batch.
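The splitting itself is simple to picture. This hypothetical chunk_prefill helper (not a vLLM API) shows how a long prompt would be spread across iterations under a fixed per-iteration token budget:

```python
def chunk_prefill(prompt_len: int, token_budget: int) -> list[int]:
    """Split a prompt of prompt_len tokens into per-iteration chunk sizes."""
    chunks = []
    remaining = prompt_len
    while remaining > 0:
        take = min(remaining, token_budget)  # never exceed the iteration budget
        chunks.append(take)
        remaining -= take
    return chunks

# A 10k-token prompt under a 4096-token budget spans three iterations,
# leaving budget in each one for other requests' decode steps.
print(chunk_prefill(10_000, 4096))  # [4096, 4096, 1808]
```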
Connecting Agents to vLLM
Since vLLM exposes an OpenAI-compatible API, your agent code remains identical — just change the base URL:
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",
)

def agent_step(messages: list) -> str:
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=messages,
        temperature=0.1,  # Lower temperature for agent reliability
        max_tokens=1024,
    )
    return response.choices[0].message.content
# Agent loop
messages = [{"role": "system", "content": "You are an analytical agent."}]
messages.append({"role": "user", "content": "Analyze recent trends in AI."})
result = agent_step(messages)
print(result)
Performance Tuning Checklist
Maximize throughput with these settings:
# Tensor parallelism across multiple GPUs
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.95 \
--max-model-len 16384 \
--quantization awq # Requires an AWQ-quantized checkpoint of the model
Key tuning levers: increase gpu-memory-utilization to allow more concurrent requests, use tensor-parallel-size to split large models across GPUs, and enable quantization to reduce memory footprint without significant quality loss.
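To reason about how many concurrent requests a configuration can hold, a rough KV-cache budget helps. Everything below is an assumption for illustration: four 80 GB GPUs, roughly 0.5 bytes per parameter for 4-bit AWQ weights, and Llama-3.1-70B's architecture (80 layers, 8 KV heads, head dimension 128), ignoring activations and framework overhead:

```python
GIB = 1024 ** 3
budget = 4 * 80 * GIB * 95 // 100       # 4x 80 GiB at 0.95 utilization
weights = 70_000_000_000 // 2           # ~0.5 bytes/param with 4-bit AWQ
kv_per_token = 2 * 80 * 8 * 128 * 2     # K+V caches per token, fp16

tokens = (budget - weights) // kv_per_token
print(f"~{tokens:,} cacheable tokens")
print(f"~{tokens // 16384} full-length 16k sequences at once")
```

Under these assumptions the cache holds on the order of fifty full 16k-token sequences concurrently; real capacity is lower once activation memory and fragmentation are accounted for, but the exercise shows why freeing weight memory via quantization directly buys concurrency.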
FAQ
How does vLLM compare to Ollama for production use?
Ollama is designed for single-user local inference with a focus on ease of use. vLLM is built for multi-user production serving with high concurrency. If you need to serve 50+ concurrent agent requests, vLLM is the right choice. For local development with one or two concurrent requests, Ollama is simpler.
Can vLLM serve multiple models simultaneously?
A single vLLM server instance serves one model. To serve multiple models, run multiple vLLM instances on different ports or GPUs, then use a router or load balancer to direct requests to the appropriate instance.
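A minimal client-side router can be as simple as a lookup table. The ports here are hypothetical, and base_url_for is a sketch rather than a vLLM API; each entry corresponds to a separate server instance launched with its own --model flag:

```python
# Map each model to the vLLM instance that serves it.
ROUTES = {
    "meta-llama/Llama-3.1-8B-Instruct": "http://localhost:8000/v1",
    "meta-llama/Llama-3.1-70B-Instruct": "http://localhost:8001/v1",
}

def base_url_for(model: str) -> str:
    """Pick the vLLM instance that serves the requested model."""
    try:
        return ROUTES[model]
    except KeyError:
        raise ValueError(f"no vLLM instance serves {model!r}") from None

# Usage with the OpenAI client:
# client = OpenAI(base_url=base_url_for("meta-llama/Llama-3.1-8B-Instruct"),
#                 api_key="not-needed")
```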
What GPU do I need for vLLM?
vLLM requires an NVIDIA GPU with CUDA support. For 7-8B parameter models, a single GPU with 16+ GB VRAM (RTX 4090, A10G, or L4) works well. For 70B models, you need multiple GPUs totaling 80+ GB VRAM or use quantized variants.
CallSphere Team
Expert insights on AI voice agents and customer communication automation.