Quantization Techniques: Running Large Models on Smaller Hardware Without Losing Accuracy | CallSphere Blog
Quantization enables deploying large language models on constrained hardware by reducing numerical precision. Learn about FP4, FP8, INT8, and GPTQ techniques with practical accuracy trade-off analysis.
Why Quantization Matters
A 70-billion parameter model stored in standard FP16 precision requires approximately 140 GB of GPU memory just for the weights — before accounting for the KV cache, activations, and framework overhead. That exceeds the capacity of any single consumer GPU and requires multiple enterprise-grade GPUs.
Quantization reduces the numerical precision of model weights (and sometimes activations) from 16-bit floating point to lower-precision formats like 8-bit integers or 4-bit floats. The result: a 70B model that required 140 GB in FP16 fits in 35 GB at INT4 — runnable on a single high-end consumer GPU.
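The arithmetic behind those numbers is just parameter count times bytes per weight. A quick sanity check (the helper name is ours, for illustration only):

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight-only memory footprint in gigabytes
    (ignores KV cache, activations, and framework overhead)."""
    return num_params * bits_per_weight / 8 / 1e9

print(weight_memory_gb(70e9, 16))  # FP16: 140.0 GB
print(weight_memory_gb(70e9, 4))   # INT4: 35.0 GB
```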
The engineering challenge is doing this without meaningful quality degradation. Modern quantization techniques have gotten remarkably good at this trade-off.
Numerical Formats Explained
Understanding the available formats is the foundation for choosing a quantization strategy.
FP16 (16-bit Floating Point)
The standard training and serving precision for most models. Provides a good balance between range and precision with 1 sign bit, 5 exponent bits, and 10 mantissa bits.
BF16 (Brain Floating Point 16)
Same total bits as FP16 but with 8 exponent bits and 7 mantissa bits. Larger dynamic range at the cost of precision. Preferred for training because gradient values span a wide range.
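BF16's relationship to FP32 is visible directly in the bits: it is simply the top 16 bits of an FP32 value. A small sketch (truncation shown for clarity; real hardware typically rounds to nearest even):

```python
import struct

def fp32_to_bf16_bits(x: float) -> int:
    """Keep the top 16 bits of an FP32 value: sign, 8 exponent bits, 7 mantissa bits."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def bf16_bits_to_fp32(bits: int) -> float:
    (x,) = struct.unpack("<f", struct.pack("<I", bits << 16))
    return x

# The 8-bit exponent survives, so very large FP32 values stay representable...
print(bf16_bits_to_fp32(fp32_to_bf16_bits(3.0e38)))
# ...but with only 7 mantissa bits, nearby values collapse together.
print(bf16_bits_to_fp32(fp32_to_bf16_bits(1.0001)))  # -> 1.0
```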
FP8 (8-bit Floating Point)
Two variants: E4M3 (4 exponent, 3 mantissa) for forward pass and E5M2 (5 exponent, 2 mantissa) for gradients. Halves memory compared to FP16 with minimal quality loss — typically less than 0.5% degradation on standard benchmarks.
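A tiny decoder makes E4M3's trade-off concrete. This follows the common OCP FP8 convention (exponent bias 7, no infinities, a single all-ones pattern reserved for NaN) and is written from scratch for illustration:

```python
def e4m3_to_float(bits: int) -> float:
    """Decode one OCP-style E4M3 byte: 1 sign, 4 exponent, 3 mantissa bits."""
    sign = -1.0 if (bits >> 7) & 1 else 1.0
    exp = (bits >> 3) & 0xF
    man = bits & 0x7
    if exp == 0xF and man == 0x7:
        return float("nan")              # E4M3 spends only this one pattern on NaN
    if exp == 0:
        val = (man / 8) * 2.0 ** (1 - 7)  # subnormals
    else:
        val = (1 + man / 8) * 2.0 ** (exp - 7)
    return sign * val

print(e4m3_to_float(0b0_1111_110))  # largest normal value: 448.0
```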
INT8 (8-bit Integer)
Maps floating-point values to 256 integer levels. Requires calibration to determine the scaling factor that maps the float range to integers. Highly hardware-efficient — most modern GPUs have dedicated INT8 compute units.
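The scaling step for symmetric ("absmax") INT8 is short enough to sketch in NumPy:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric absmax quantization: map [-max|w|, +max|w|] onto [-127, 127]."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.array([0.1, -0.5, 0.25, 1.0], dtype=np.float32)
q, scale = quantize_int8(w)
print(q)                     # integer codes
print(dequantize(q, scale))  # within half a quantization step of the original
```

In production the scale comes from calibration data (activation statistics over a few hundred examples) rather than a single tensor's max, and per-channel scales beat one per-tensor scale.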
INT4 / FP4 (4-bit)
Extreme compression: each weight uses only 4 bits. Quality preservation depends heavily on the quantization algorithm. Naive INT4 quantization is unusable; advanced methods like GPTQ and AWQ make it practical.
Quantization Methods
Post-Training Quantization (PTQ)
PTQ quantizes a pre-trained model without additional training. It is fast and requires only a small calibration dataset (typically 128 to 512 examples).
```python
# Example: quantizing a model with bitsandbytes
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantization configuration
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4: optimal for normally distributed weights
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in BF16 for accuracy
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants too
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B",
    quantization_config=quantization_config,
    device_map="auto",
)
# The 70B model now fits in ~35 GB of VRAM
```
GPTQ (Generative Pre-trained Transformer Quantization)
GPTQ is a one-shot weight quantization method that minimizes the layer-wise reconstruction error. For each layer, it finds the quantized weights that produce the most similar output to the original FP16 weights when given calibration data.
Key advantages:
- Produces high-quality INT4 quantized models
- One-time cost: quantization takes hours, but the resulting model serves indefinitely
- Broad hardware compatibility
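GPTQ's Hessian-guided solver is beyond a blog snippet, but the quantity it minimizes is easy to state and measure. A toy sketch that evaluates that layer-wise objective for naive round-to-nearest (the baseline GPTQ improves on); the sizes and random data are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64)).astype(np.float32)   # one linear layer's weights
X = rng.normal(size=(64, 128)).astype(np.float32)  # calibration activations

def rtn_int4(w: np.ndarray) -> np.ndarray:
    """Naive per-tensor round-to-nearest 4-bit quantization (codes in [-8, 7])."""
    scale = float(np.abs(w).max()) / 7.0
    return np.clip(np.round(w / scale), -8, 7) * scale

W_q = rtn_int4(W)

# GPTQ's objective: the layer-wise output reconstruction error ||W X - W_q X||,
# which it reduces by adjusting not-yet-quantized weights as each column is rounded.
err = float(np.linalg.norm(W @ X - W_q @ X) / np.linalg.norm(W @ X))
print(f"relative reconstruction error of naive RTN: {err:.3f}")
```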
AWQ (Activation-Aware Weight Quantization)
AWQ observes that not all weights are equally important: weights that multiply large activations contribute disproportionately to the output. AWQ protects these salient weight channels, in practice by scaling them up before quantization so their relative rounding error shrinks (the inverse scale is folded into the adjacent computation), while quantizing the rest aggressively.
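The effect behind AWQ's design is easy to reproduce: when a few input channels carry outsized activations, protecting just the weight columns they multiply removes most of the quantized layer's output error. A toy demonstration with invented sizes and data (real AWQ achieves the protection through channel scaling rather than storing mixed precision):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64)).astype(np.float32)
X = rng.normal(size=(64, 128)).astype(np.float32)
X[:4] *= 50.0  # a handful of input channels carry much larger activations

def rtn_int4(w: np.ndarray) -> np.ndarray:
    scale = float(np.abs(w).max()) / 7.0
    return np.clip(np.round(w / scale), -8, 7) * scale

salience = np.abs(X).mean(axis=1)               # per-input-channel activation size
keep = salience >= np.quantile(salience, 0.95)  # top ~5% salient channels

W_rtn = rtn_int4(W)
W_mixed = W_rtn.copy()
W_mixed[:, keep] = W[:, keep]  # restore salient weight columns to full precision

ref = W @ X
err_rtn = float(np.linalg.norm(ref - W_rtn @ X) / np.linalg.norm(ref))
err_mixed = float(np.linalg.norm(ref - W_mixed @ X) / np.linalg.norm(ref))
print(f"naive RTN: {err_rtn:.4f}, salient channels protected: {err_mixed:.4f}")
```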
GGUF / llama.cpp Quantization
The GGUF format (used by llama.cpp) supports a variety of quantization levels from Q2_K (2-bit) through Q8_0 (8-bit). It uses a block-wise quantization scheme where each block of weights gets its own scaling factor.
Common GGUF quantization levels and their trade-offs:

| Level | Bits per weight | Quality retention | Notes |
|---|---|---|---|
| Q2_K | 2.63 | ~60% | extreme compression |
| Q3_K_M | 3.07 | ~75% | aggressive but usable |
| Q4_K_M | 4.83 | ~92% | best balance for most use cases |
| Q5_K_M | 5.69 | ~96% | high quality |
| Q6_K | 6.56 | ~99% | near-lossless |
| Q8_0 | 8.50 | ~99.5% | minimal compression |
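Block-wise quantization in the spirit of Q8_0 is short to sketch; the block size of 32 matches what llama.cpp uses for Q8_0, but everything else here is simplified:

```python
import numpy as np

BLOCK = 32  # weights per block; each block gets its own scale

def quantize_q8_blocks(w: np.ndarray):
    """Block-wise symmetric INT8: one FP16 scale per 32-weight block."""
    blocks = w.reshape(-1, BLOCK)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    scales = np.where(scales == 0, 1.0, scales)
    q = np.clip(np.round(blocks / scales), -127, 127).astype(np.int8)
    return q, scales.astype(np.float16)

def dequantize_blocks(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scales.astype(np.float32)).reshape(-1)

w = np.random.default_rng(0).normal(size=256).astype(np.float32)
q, scales = quantize_q8_blocks(w)
err = float(np.abs(dequantize_blocks(q, scales) - w).max())
print(f"max reconstruction error: {err:.5f}")
```

Storing one 16-bit scale per 32 weights is exactly where the table's 8.50 bits per weight for Q8_0 comes from: 8 + 16/32. The K-quants get their fractional bit widths the same way, from smaller codes plus per-block scales.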
Accuracy Trade-offs in Practice
The theoretical information loss from quantization does not always translate into meaningful quality degradation. Here are measured results from a representative 70B model:
| Precision | Memory (GB) | MMLU | HumanEval | MT-Bench | Throughput vs FP16 |
|---|---|---|---|---|---|
| FP16 | 140 | 82.1% | 81.7% | 8.9 | 1.0x |
| FP8 | 70 | 81.8% | 81.5% | 8.9 | 1.4x |
| INT8 | 70 | 81.5% | 80.9% | 8.8 | 1.6x |
| INT4 (GPTQ) | 35 | 80.3% | 79.2% | 8.6 | 1.8x |
| INT4 (AWQ) | 35 | 80.7% | 79.8% | 8.7 | 1.8x |
| Q4_K_M (GGUF) | 38 | 80.1% | 78.5% | 8.5 | 1.5x |
The pattern is clear: FP8 and INT8 quantization are nearly lossless for most applications. INT4 introduces measurable but often acceptable degradation.
Mixed-Precision Strategies
The most sophisticated deployments do not apply uniform quantization. Instead, they use different precision for different components:
- Attention layers: Keep at FP8 or higher — these are critical for quality
- FFN layers: Quantize more aggressively to INT4 — these tolerate compression better
- Embedding layers: Keep at FP16 — quantization here disproportionately hurts quality
- KV cache: Quantize to FP8 — saves memory at long context with minimal impact
```python
# Mixed-precision quantization configuration example
layer_quant_config = {
    "attention.q_proj": "fp8",
    "attention.k_proj": "fp8",
    "attention.v_proj": "fp8",
    "attention.o_proj": "fp8",
    "mlp.gate_proj": "int4",
    "mlp.up_proj": "int4",
    "mlp.down_proj": "int4",
    "embed_tokens": "fp16",
    "lm_head": "fp16",
}
```
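Applying such a table means resolving each parameter's full dotted name to a precision. One simple way, sketched here (the suffix-matching helper and the fallback default are our own invention, not a library API):

```python
layer_quant_config = {
    "attention.q_proj": "fp8",
    "attention.k_proj": "fp8",
    "attention.v_proj": "fp8",
    "attention.o_proj": "fp8",
    "mlp.gate_proj": "int4",
    "mlp.up_proj": "int4",
    "mlp.down_proj": "int4",
    "embed_tokens": "fp16",
    "lm_head": "fp16",
}
DEFAULT_PRECISION = "fp8"  # fallback for layers the table does not mention

def precision_for(param_name: str) -> str:
    """Match the parameter's dotted name against the configured suffixes."""
    for suffix, precision in layer_quant_config.items():
        if param_name.endswith(suffix):
            return precision
    return DEFAULT_PRECISION

print(precision_for("model.layers.12.attention.q_proj"))  # fp8
print(precision_for("model.layers.12.mlp.down_proj"))     # int4
print(precision_for("model.embed_tokens"))                # fp16
```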
Quantization-Aware Training (QAT)
For teams willing to invest in retraining, QAT simulates quantization during the training process, allowing the model to adapt its weights to perform well at lower precision. QAT models consistently outperform post-training quantized models at the same bit width, typically by 1-3 percentage points.
The cost is significant — QAT requires a full or partial training run — but for models being deployed at massive scale, the per-query savings from serving a QAT INT4 model vs a PTQ INT4 model can justify the upfront investment.
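The core trick in QAT is a fake-quantize forward pass with a straight-through estimator: the forward pass sees rounded weights, but the gradient updates the underlying full-precision weights as if rounding were the identity. A toy end-to-end example on a linear model (data, sizes, and learning rate all invented for illustration):

```python
import numpy as np

def fake_quant(w: np.ndarray, bits: int = 4) -> np.ndarray:
    """Quantize-dequantize so the forward pass sees 4-bit rounding error."""
    qmax = 2 ** (bits - 1) - 1
    scale = float(np.abs(w).max()) / qmax
    if scale == 0.0:
        return w
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 8)).astype(np.float32)
w_true = rng.normal(size=8).astype(np.float32)
y = X @ w_true                         # targets from a full-precision "teacher"

w = (rng.normal(size=8) * 0.1).astype(np.float32)
loss_init = float(np.mean((X @ fake_quant(w) - y) ** 2))
for _ in range(500):
    w_q = fake_quant(w)                  # forward pass uses the quantized weights
    grad = X.T @ (X @ w_q - y) / len(X)  # squared-error gradient w.r.t. w_q ...
    w -= 0.1 * grad                      # ... applied straight through to the FP w
loss = float(np.mean((X @ fake_quant(w) - y) ** 2))
print(f"MSE with 4-bit forward pass: {loss_init:.3f} -> {loss:.3f}")
```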
Practical Deployment Recommendations
Start with FP8: It is nearly lossless, halves memory, and is natively supported on modern GPU architectures. This should be the default for production serving.
Use INT4 for cost-constrained or edge deployments: When GPU budget is limited, GPTQ or AWQ INT4 quantization provides the best quality at 4-bit precision.
Benchmark on your actual task: Academic benchmarks may not reflect your specific use case. Always evaluate quantized models on representative examples from your production workload.
Quantize the KV cache separately: Even if you serve weights in FP8, quantizing the KV cache to FP8 saves substantial memory at long context lengths with minimal quality impact.
Consider the full serving stack: Quantization interacts with other optimizations (batching, speculative decoding, paged attention). Test the complete pipeline, not just isolated components.
Quantization is not a compromise — at FP8, it is essentially free performance. At INT4, it is an engineering trade-off that, when done correctly, enables deployments that would otherwise require 4x the hardware budget.
Frequently Asked Questions
What is model quantization in AI?
Quantization reduces the numerical precision of model weights and activations from higher-precision formats like FP16 to lower-precision formats like INT8, FP8, or INT4. A 70-billion parameter model that requires approximately 140 GB of GPU memory in FP16 can fit in just 35 GB at INT4 precision. Modern quantization techniques achieve this compression with minimal quality degradation, making large models deployable on significantly less expensive hardware.
What is the difference between FP8, INT8, and INT4 quantization?
FP8 retains floating-point representation at 8 bits and is widely considered the new default serving precision, delivering near-zero quality loss with 2x memory savings. INT8 uses integer representation and reduces memory by 2x with slightly more quality risk than FP8. INT4 achieves 4x memory reduction but requires calibration-based techniques like GPTQ or AWQ to maintain acceptable output quality, with typical quality degradation of 1 to 3 percentage points on benchmarks.
How does quantization affect model performance and accuracy?
At FP8 precision, quantization is essentially free performance with quality indistinguishable from FP16 on most tasks. At INT8, quality loss is under 1 percentage point for well-calibrated models. At INT4, quality degradation ranges from 1 to 5 percentage points depending on the technique used and model architecture. Post-training quantization methods like GPTQ and AWQ minimize this loss by calibrating on representative data, and mixing precision levels across different layers can further optimize the accuracy-efficiency trade-off.
CallSphere Team
Expert insights on AI voice agents and customer communication automation.