LLM Inference Optimization: Quantization, Speculative Decoding, and Beyond
A technical guide to modern LLM inference optimization techniques — quantization, speculative decoding, KV-cache optimization, continuous batching, and PagedAttention. Make models faster and cheaper.
Why Inference Optimization Matters
Training a large language model is a one-time cost. Inference — serving predictions to users — is the ongoing expense that determines whether a model is economically viable in production. A model that costs $10 million to train but $0.001 per query can generate billions of responses profitably. The same model at $0.10 per query may be commercially unviable.
Inference optimization is the discipline of making models faster, cheaper, and more memory-efficient without sacrificing output quality. Here are the techniques that matter most in 2026.
Quantization: Trading Precision for Speed
Quantization reduces the numerical precision of model weights from 16-bit or 32-bit floating point to lower bit widths (8-bit, 4-bit, or even 2-bit integers).
Why it works: Most model weights cluster around small values. The difference between representing a weight as 0.0234375 (FP16) versus 0.023 (INT8) is negligible for output quality but halves memory usage.
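To make the round-trip concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization (the function names are illustrative, not from any particular library). It maps floats onto the integer range [-127, 127] with a single scale factor, which is the basic idea behind the W8A8 schemes in the table below:

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor INT8 quantization: one scale maps floats to [-127, 127]."""
    scale = float(np.max(np.abs(weights))) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the stored integers."""
    return q.astype(np.float32) * scale

w = np.array([0.0234375, -0.151, 0.002, 0.42], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
# Round-trip error is bounded by half the scale step, tiny relative to the weights
print(float(np.max(np.abs(w - w_hat))))
```

Production quantizers (GPTQ, AWQ) are more sophisticated — they use per-channel or per-group scales and calibration data — but the storage saving comes from exactly this float-to-integer mapping.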
Common quantization methods:
| Method | Bits | Quality Loss | Speed Gain | Memory Reduction |
|---|---|---|---|---|
| FP16 (baseline) | 16 | None | 1x | 1x |
| INT8 (W8A8) | 8 | Minimal | 1.5-2x | 2x |
| GPTQ (W4A16) | 4 | Small | 2-3x | 4x |
| AWQ | 4 | Small | 2-3x | 4x |
| GGUF Q4_K_M | 4 | Small | 2-3x | 4x |
| QuIP# | 2 | Moderate | 4-5x | 8x |
Practical example: A 70B parameter model requires ~140GB in FP16, needing 2x A100 80GB GPUs. With 4-bit quantization, it fits on a single A100 or even a consumer RTX 4090 (24GB).
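The memory arithmetic behind that example is simple enough to sketch (ignoring the small overhead that real 4-bit formats add for scales and zero-points):

```python
def model_memory_gb(n_params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight memory: params × bits / 8, reported in GB (1e9 bytes).
    Ignores quantization metadata (scales, zero-points), which adds a few percent."""
    return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(model_memory_gb(70, 16))  # 140.0 — FP16 needs two 80GB A100s
print(model_memory_gb(70, 4))   # 35.0  — 4-bit fits a single A100 with room for KV-cache
```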
```shell
# Quantizing with llama.cpp
./quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
```

```shell
# Serving with vLLM and AWQ quantization
python -m vllm.entrypoints.openai.api_server \
    --model TheBloke/Llama-3.3-70B-AWQ \
    --quantization awq \
    --tensor-parallel-size 1
```
Speculative Decoding: Draft and Verify
LLM inference is bottlenecked by sequential token generation — each token requires a full forward pass. Speculative decoding breaks this bottleneck by using a small, fast "draft" model to generate candidate tokens, then verifying them in parallel with the large model.
How it works:
- The draft model (e.g., Llama 3.1 8B) generates K candidate tokens quickly
- The target model (e.g., Llama 3.3 70B) verifies all K tokens in a single forward pass
- Accepted tokens are kept; the first rejected token is replaced with the target model's choice
- The process repeats
Speedup: When the draft model's predictions match the target model (which happens 70-90% of the time for well-chosen pairs), you get K tokens for the cost of ~1 forward pass of the large model. Typical speedups: 2-3x for well-matched model pairs.
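The draft-and-verify loop can be sketched with toy stand-ins for the two models (`draft_next` and `target_next` are hypothetical functions invented for this example; a real system verifies all K positions in a single batched forward pass of the target model, which this sketch only emulates token by token):

```python
import random

random.seed(0)

def draft_next(ctx):
    # Hypothetical stand-in for the small, fast draft model (deterministic toy)
    return (sum(ctx) * 31 + len(ctx)) % 100

def target_next(ctx):
    # Hypothetical stand-in for the large target model; by construction it
    # agrees with the draft ~80% of the time, mimicking a well-matched pair
    d = draft_next(ctx)
    return d if random.random() < 0.8 else (d + 1) % 100

def speculative_step(ctx, k=4):
    """One draft-and-verify round: draft K candidates, keep the accepted
    prefix, and replace the first rejected token with the target's choice."""
    drafted, c = [], list(ctx)
    for _ in range(k):
        t = draft_next(c)
        drafted.append(t)
        c.append(t)
    accepted, c = [], list(ctx)
    for t in drafted:
        expected = target_next(c)
        accepted.append(expected)
        c.append(expected)
        if expected != t:  # first mismatch ends the round
            break
    return accepted

print(speculative_step([1, 2, 3]))  # up to 4 tokens for ~1 target-model pass
```

Each round therefore emits between 1 and K+0 tokens while charging roughly one large-model forward pass, which is where the 2-3x speedup comes from.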
KV-Cache Optimization
During autoregressive generation, the Key-Value cache stores computed attention states for all previous tokens. This cache grows linearly with sequence length and can consume more memory than the model weights for long contexts.
Techniques:
- Multi-Query Attention (MQA): Share key/value heads across attention heads, reducing KV-cache by 8-32x
- Grouped-Query Attention (GQA): A middle ground — share KV heads in groups rather than fully
- KV-cache quantization: Compress cached key/value tensors to INT8, halving cache memory
- Sliding window attention: Limit attention to recent tokens plus landmark tokens, capping cache size
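The savings from MQA and GQA fall directly out of the KV-cache size formula. A quick sketch, using Llama-70B-like shapes (80 layers, head dimension 128, 64 query heads — illustrative numbers, not a spec):

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_el=2):
    """KV-cache size: 2 (K and V) × layers × KV heads × head_dim × seq × batch.
    bytes_per_el=2 assumes FP16; INT8 KV-cache quantization would halve it."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_el / 1e9

print(kv_cache_gb(80, 64, 128, 8192, 1))  # full MHA: ~21.5 GB per 8K sequence
print(kv_cache_gb(80, 8, 128, 8192, 1))   # GQA, 8 KV heads: ~2.7 GB
print(kv_cache_gb(80, 1, 128, 8192, 1))   # MQA, 1 KV head: ~0.34 GB
```

At batch sizes of dozens of concurrent long-context requests, the MHA figure quickly dwarfs the weights themselves, which is why reduced-KV attention variants are now standard.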
PagedAttention and vLLM
PagedAttention, the innovation behind vLLM, manages KV-cache memory the way operating systems manage virtual memory — in non-contiguous pages.
Problem solved: Traditional KV-cache allocation pre-allocates memory based on maximum sequence length, wasting memory for shorter sequences. With batch sizes of 100+ concurrent requests, this waste becomes the primary bottleneck.
How PagedAttention helps:
- Allocates KV-cache in small blocks (pages) on demand
- Eliminates memory waste from pre-allocation
- Enables sharing KV-cache pages across requests using the same prefix (prompt caching)
- Increases throughput by 2-4x compared to naive implementations
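The page-table idea can be illustrated with a toy allocator (a sketch invented for this article, not vLLM's actual implementation): each request holds a list of physical block IDs, and a new block is reserved only when the current one fills up.

```python
class PagedKVCache:
    """Toy paged KV-cache allocator: fixed-size blocks handed out on demand."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.tables = {}   # request id -> list of physical block ids ("page table")
        self.lengths = {}  # request id -> tokens stored so far

    def append_token(self, req: str):
        """Reserve a new block only when the current one is full."""
        n = self.lengths.get(req, 0)
        if n % self.block_size == 0:  # first token, or current block is full
            if not self.free:
                raise MemoryError("out of KV-cache blocks")
            self.tables.setdefault(req, []).append(self.free.pop())
        self.lengths[req] = n + 1

    def release(self, req: str):
        """Return a finished request's blocks to the free pool immediately."""
        self.free.extend(self.tables.pop(req, []))
        self.lengths.pop(req, None)

cache = PagedKVCache(num_blocks=64)
for _ in range(20):
    cache.append_token("req-1")
print(len(cache.tables["req-1"]))  # → 2 (two 16-token blocks cover 20 tokens)
```

Worst-case waste per request is one partially filled block, versus an entire max-length reservation under contiguous pre-allocation.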
```python
# vLLM automatically uses PagedAttention
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    tensor_parallel_size=2,
    max_model_len=32768,
    gpu_memory_utilization=0.90,
)
outputs = llm.generate(
    prompts=["Explain quantum computing" for _ in range(100)],
    sampling_params=SamplingParams(temperature=0.7, max_tokens=512),
)
```
Continuous Batching
Traditional static batching waits for a full batch before processing and waits for the longest sequence to finish before returning any results. Continuous batching (also called iteration-level batching) inserts new requests and returns completed requests at every generation step.
Impact: Reduces average latency by 50-80% under load and increases throughput by 2-3x compared to static batching. All modern serving frameworks (vLLM, TGI, TensorRT-LLM) implement continuous batching by default.
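The scheduling difference can be sketched in a few lines (a simplified simulation, not any framework's real scheduler): new requests are admitted and finished ones retired at every decode step, so short requests never wait behind long ones.

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Iteration-level batching simulation: each (id, n) request needs n decode
    steps; admission and retirement happen every step, not per whole batch."""
    queue = deque(requests)
    active, done = {}, []
    while queue or active:
        # Admit new requests whenever a batch slot is free
        while queue and len(active) < max_batch:
            rid, n = queue.popleft()
            active[rid] = n
        # One decode step produces one token for every active request
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:  # retire immediately; frees a slot this step
                del active[rid]
                done.append(rid)
    return done

print(continuous_batching([("a", 2), ("b", 5), ("c", 1), ("d", 3), ("e", 2)]))
# → ['c', 'a', 'd', 'e', 'b'] — short requests finish first; "e" starts as
#   soon as "c" retires, instead of waiting for the whole batch to drain
```

Under static batching, every request in the example would return only after the 5-step request "b" finished; here completion order tracks each request's own length.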
Putting It All Together
A production-optimized inference stack combines multiple techniques:
```
Request → Continuous Batching Engine
├── PagedAttention (memory efficiency)
├── Quantized Model (INT8/INT4)
├── GQA/MQA (reduced KV-cache)
├── Speculative Decoding (speed)
└── Prefix Caching (shared prompts)
```
The compound effect of these optimizations is dramatic: a well-optimized serving stack can serve 10-50x more requests per GPU compared to a naive implementation, reducing per-query costs proportionally.
Sources: vLLM — PagedAttention Paper, Hugging Face — Quantization Guide, DeepSpeed — Inference Optimization