Running LLMs on Consumer GPUs: Quantization with GPTQ, AWQ, and GGUF
Understand how GPTQ, AWQ, and GGUF quantization compress large language models to fit consumer GPUs. Compare quality tradeoffs, memory requirements, and practical deployment strategies.
Why Quantization Matters for Agent Developers
A full-precision (FP16) Llama 3.1 70B model requires approximately 140 GB of GPU VRAM — far beyond the 24 GB available on a high-end consumer GPU like the RTX 4090. Quantization compresses model weights from 16-bit floating point to 4-bit or 8-bit integers, reducing memory requirements by 2-4x with surprisingly small quality losses.
For agent developers, quantization is the difference between needing a $15,000 multi-GPU server and running a capable model on a single consumer card. A 4-bit quantized 70B model fits in approximately 35 GB — still too much for one GPU, but manageable with two 24 GB cards.
Quantization Methods Compared
GPTQ (Post-Training Quantization for Generative Pre-trained Transformers)
GPTQ quantizes weights to 4-bit integers using a calibration dataset to minimize the quantization error layer by layer. It was one of the first practical methods for 4-bit quantization and remains widely supported.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Pre-quantized GPTQ checkpoint (loading requires the optimum and auto-gptq packages)
model_id = "TheBloke/Llama-2-7B-Chat-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",   # Place layers on available GPUs automatically
    torch_dtype="auto",
)

inputs = tokenizer("Explain AI agents in simple terms:", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Strengths: Wide ecosystem support, good quality at 4-bit, fast GPU inference via CUDA kernels. Weaknesses: Slow quantization process (hours), GPU-only inference.
AWQ (Activation-Aware Weight Quantization)
AWQ improves on GPTQ by recognizing that not all weights matter equally. It identifies the 1% of "salient" weight channels that have the largest impact on model activations and preserves them at higher precision, while aggressively quantizing the rest.
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Pre-quantized AWQ checkpoint (loading requires the autoawq package)
model_path = "TheBloke/Llama-2-7B-Chat-AWQ"

model = AutoAWQForCausalLM.from_quantized(
    model_path,
    fuse_layers=True,   # Kernel fusion for faster inference
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

prompt = "What is retrieval-augmented generation?"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Strengths: Better quality than GPTQ at the same bit-width, faster quantization, excellent with vLLM. Weaknesses: GPU-only, slightly newer ecosystem.
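The salient-channel intuition is easy to demonstrate with a toy experiment. The sketch below is purely illustrative, not the actual AWQ algorithm (real AWQ rescales salient channels rather than keeping them in full precision, and measures saliency from real activation statistics): it quantizes a random weight vector with round-to-nearest, weights each channel's error by its activation magnitude, and shows that protecting just the few high-activation channels cuts the weighted error.

```python
import random

def quantize(w: float, step: float = 0.25) -> float:
    """Round-to-nearest uniform quantization with a fixed step size."""
    return round(w / step) * step

random.seed(0)
n = 256
weights = [random.uniform(-1, 1) for _ in range(n)]
# A handful of "salient" channels see much larger activations than the rest.
acts = [10.0 if i < n // 100 + 1 else 1.0 for i in range(n)]

def weighted_error(keep_salient: bool) -> float:
    """Sum of |w - q(w)| * activation, optionally skipping salient channels."""
    err = 0.0
    for w, a in zip(weights, acts):
        q = w if (keep_salient and a > 1.0) else quantize(w)
        err += abs(w - q) * a
    return err

naive = weighted_error(keep_salient=False)
aware = weighted_error(keep_salient=True)
print(f"naive: {naive:.2f}, activation-aware: {aware:.2f}")
```

Protecting roughly 1% of the channels removes the most expensive error terms, which is exactly why AWQ beats uniform 4-bit quantization at the same average bit-width.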
GGUF (GPT-Generated Unified Format)
GGUF is the format used by llama.cpp and Ollama. Unlike GPTQ and AWQ, which target GPU-only inference, GGUF supports CPU, GPU, and hybrid CPU+GPU execution. This makes it uniquely suited for machines with limited VRAM.
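Planning the CPU/GPU split comes down to simple arithmetic. This helper is a rough sketch under the assumption that weight memory is spread evenly across transformer layers, which ignores the embedding matrix, the output head, and the KV cache; the 4.9 GB / 32-layer figures for an 8B Q4_K_M model are ballpark assumptions, not measured values.

```python
def split_memory_gb(model_gb: float, n_layers: int, n_gpu_layers: int):
    """Estimate VRAM vs. system-RAM usage for partial GPU offload.

    Assumes weights are distributed evenly across layers, which ignores
    embeddings, the output head, and the KV cache.
    """
    per_layer = model_gb / n_layers
    vram = per_layer * n_gpu_layers
    ram = model_gb - vram
    return vram, ram

# An 8B model at Q4_K_M is roughly 4.9 GB with 32 transformer layers (assumed).
vram, ram = split_memory_gb(4.9, 32, 20)
print(f"GPU: {vram:.1f} GB, CPU: {ram:.1f} GB")
```

In practice you tune the layer count empirically: raise it until you run out of VRAM, then back off a couple of layers to leave headroom for the KV cache.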
```bash
# Using llama.cpp directly
./llama-cli -m models/llama-3.1-8b-q4_K_M.gguf \
  -p "Describe the role of an AI agent:" \
  -n 256 \
  --n-gpu-layers 20   # Offload 20 layers to GPU, rest on CPU
```
In Python, use the llama-cpp-python package:
```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-3.1-8b-q4_K_M.gguf",
    n_gpu_layers=25,   # Partial GPU offload; set to 0 for CPU-only
    n_ctx=4096,        # Context window size
    verbose=False,
)

output = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful agent."},
        {"role": "user", "content": "Explain quantization in one paragraph."},
    ],
    max_tokens=200,
)
print(output["choices"][0]["message"]["content"])
```
Strengths: CPU+GPU hybrid, runs on any hardware, many quantization variants (q2_K through q8_0), Ollama ecosystem. Weaknesses: Slower than GPTQ/AWQ on pure GPU workloads.
Memory Requirements Quick Reference
| Model | FP16 | GPTQ 4-bit | AWQ 4-bit | GGUF Q4_K_M |
|---|---|---|---|---|
| 7B | 14 GB | 4 GB | 4 GB | 4.4 GB |
| 13B | 26 GB | 7.5 GB | 7.5 GB | 8 GB |
| 34B | 68 GB | 18 GB | 18 GB | 20 GB |
| 70B | 140 GB | 35 GB | 35 GB | 40 GB |
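The table values follow from back-of-the-envelope arithmetic: parameter count times bytes per weight. A minimal estimator (a sketch that covers weights only; real usage adds KV cache and activation memory that grows with context length and batch size, so treat these as lower bounds):

```python
def estimate_vram_gb(params_billions: float, bits_per_weight: float,
                     overhead: float = 1.0) -> float:
    """Rough VRAM estimate: weight memory times an overhead multiplier.

    params_billions: model size in billions of parameters (e.g. 7, 70)
    bits_per_weight: 16 for FP16, 4 for GPTQ/AWQ, ~4.5 for GGUF Q4_K_M
    overhead: multiplier for KV cache / activations (1.0 = weights only)
    """
    bytes_per_weight = bits_per_weight / 8
    return params_billions * bytes_per_weight * overhead

# Weights-only figures, matching the table above:
print(estimate_vram_gb(7, 16))   # FP16 7B   -> 14.0
print(estimate_vram_gb(70, 4))   # 4-bit 70B -> 35.0
```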
Quality Tradeoffs
At 4-bit quantization, expect a 1-3% degradation on benchmarks like MMLU and HumanEval compared to FP16. In practice, for agent tasks involving tool calling, classification, and extraction, this difference is often imperceptible. Where quality loss becomes noticeable is in creative writing, nuanced reasoning, and tasks requiring precise numerical computation.
In published comparisons, AWQ typically outperforms GPTQ by roughly 0.5-1 percentage points on benchmarks at the same bit-width, thanks to its activation-aware scaling strategy.
Choosing the Right Method for Your Agent
- Pure GPU, production serving with vLLM: Use AWQ. Best quality-per-bit and natively supported by vLLM.
- Local development with Ollama: Use GGUF. Ollama handles everything automatically.
- Mixed CPU+GPU or CPU-only: Use GGUF. Only format that supports hybrid execution.
- Legacy compatibility: Use GPTQ. Broadest ecosystem support.
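The decision rules above can be condensed into a tiny lookup. This is purely illustrative; the function and parameter names are mine, not any library's:

```python
def pick_format(has_gpu: bool, fits_in_vram: bool, need_legacy: bool = False) -> str:
    """Condense the format-selection rules into one function."""
    if not has_gpu or not fits_in_vram:
        return "GGUF"   # only format with CPU-only or hybrid CPU+GPU execution
    if need_legacy:
        return "GPTQ"   # broadest ecosystem support
    return "AWQ"        # best quality-per-bit on pure GPU, native in vLLM

print(pick_format(has_gpu=True, fits_in_vram=False))  # GGUF: spill layers to CPU
```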
FAQ
Does quantization affect tool-calling reliability?
For structured outputs like tool calls, 4-bit quantization has minimal impact on reliable models like Llama 3.1 and Mistral. The model still follows the expected JSON format. Where you might see degradation is in complex multi-step reasoning about which tool to call — if this happens, try 5-bit or 6-bit quantization as a middle ground.
Can I quantize a model myself?
Yes. AutoGPTQ and AutoAWQ both provide scripts for quantization. You need the full-precision model and a small calibration dataset (128-256 samples). Quantization takes 1-4 hours on a single GPU for a 7B model.
Is 2-bit quantization usable?
2-bit quantization (q2_K in GGUF) significantly degrades quality and is generally not recommended for agent workloads. You lose too much precision for reliable instruction following and tool calling. 4-bit is the sweet spot for balancing size and quality.
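The cliff between 2-bit and 4-bit follows from the number of representable levels alone: 4 versus 16. A toy round-trip with uniform quantization makes the scaling visible (real k-quant schemes are block-wise with per-block scales and do better than this, but the trend is the same):

```python
def roundtrip_error(values, bits):
    """Mean absolute error after uniform quantization to 2**bits levels on [-1, 1]."""
    levels = 2 ** bits
    step = 2.0 / (levels - 1)   # spread the levels evenly across [-1, 1]
    err = 0.0
    for v in values:
        q = round((v + 1.0) / step) * step - 1.0   # snap to nearest level
        err += abs(v - q)
    return err / len(values)

values = [i / 500 - 1.0 for i in range(1001)]  # dense grid on [-1, 1]
for bits in (2, 4, 8):
    print(f"{bits}-bit: {2**bits:3d} levels, mean error {roundtrip_error(values, bits):.4f}")
```

Each bit halves the step size, so the mean error drops by roughly 4x going from 2-bit to 4-bit, which is the difference between a broken model and a usable one.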
#Quantization #GPTQ #AWQ #GGUF #GPUOptimization #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.