
LoRA and QLoRA: Parameter-Efficient Fine-Tuning for Open-Source LLMs

Understand how LoRA and QLoRA enable fine-tuning of large language models on consumer hardware by training only a small fraction of parameters, with practical examples using Hugging Face and PEFT.

The Problem with Full Fine-Tuning

Full fine-tuning updates every parameter in a model. For a 7-billion-parameter model, that means storing a gradient and two Adam optimizer states for every one of those 7 billion parameters, on top of the weights themselves. A single full fine-tuning run for Llama 3 8B requires roughly 60-80 GB of GPU memory, well beyond a single consumer GPU.
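The arithmetic behind that figure is worth making explicit. A minimal back-of-envelope sketch, assuming bf16 weights and gradients (2 bytes per parameter each) and fp32 AdamW moment estimates (8 bytes per parameter); lower-precision or paged optimizer states bring the total down toward the cited range, and activations add more on top:

```python
def full_finetune_gib(num_params: int,
                      weight_bytes: float = 2,    # bf16 weights
                      grad_bytes: float = 2,      # bf16 gradients
                      optim_bytes: float = 8) -> float:  # fp32 Adam m and v
    """Rough lower bound on GPU memory for full fine-tuning, in GiB.

    Ignores activations and framework overhead.
    """
    total_bytes = num_params * (weight_bytes + grad_bytes + optim_bytes)
    return total_bytes / 1024**3

print(f"{full_finetune_gib(8_000_000_000):.0f} GiB")  # ~89 GiB for 8B params
```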

LoRA (Low-Rank Adaptation) solves this by freezing the original model weights and injecting small trainable matrices into specific layers. Instead of updating 7 billion parameters, you train a few million to a few tens of millions. QLoRA goes further by quantizing the frozen base model to 4-bit precision, shrinking the base weights' memory footprint by roughly 4x compared to 16-bit.

How LoRA Works

LoRA decomposes the weight update into two small matrices. Instead of learning a full update matrix with the same shape as the weight W (for a d x d layer, d-squared parameters, potentially millions), LoRA learns A (r x d) and B (d x r), where the rank r is much smaller than d — typically 8, 16, or 32.

The effective weight update is the product B * A, which has the same shape as W but is parameterized by only 2 * d * r values. A is initialized with small random values and B with zeros, so training starts from the unmodified base model. After training, the low-rank update can be merged back into the base weights, so there is zero additional inference latency.

import torch
import torch.nn as nn

class LoRALayer(nn.Module):
    def __init__(self, original_layer: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.original = original_layer
        self.scaling = alpha / rank
        # Freeze the original layer entirely (weight and bias, if present)
        for param in self.original.parameters():
            param.requires_grad = False
        d_in, d_out = original_layer.in_features, original_layer.out_features
        # A starts small and random, B starts at zero, so the initial update is zero
        self.lora_A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(d_out, rank))

    def forward(self, x):
        # Base output plus the scaled low-rank update: x @ (B @ A).T
        return self.original(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

# A 4096x4096 layer = 16.7M params. LoRA rank 16 = 131K params (0.78%)
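The parameter counts in the comment above can be checked directly; this is pure arithmetic, assuming only a square 4096x4096 layer and rank 16:

```python
d, r = 4096, 16
full = d * d            # parameters in the frozen weight matrix
lora = r * d + d * r    # lora_A (r x d) plus lora_B (d x r)
print(full, lora, f"{100 * lora / full:.2f}%")
# 16777216 131072 0.78%
```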

QLoRA: Adding Quantization

QLoRA combines LoRA with 4-bit quantization of the base model. The frozen weights are stored in NF4 (NormalFloat4) format, which is specifically designed for normally distributed neural network weights. This reduces the base model memory footprint by roughly 4x compared to 16-bit.
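The "extra savings" from double quantization can be quantified. A sketch using the block sizes from the QLoRA paper (64 weights per quantization block with an fp32 absmax constant; double quantization compresses those constants to 8-bit in second-level blocks of 256):

```python
def nf4_bits_per_param(double_quant: bool) -> float:
    """Average storage cost per weight for block-wise NF4 quantization."""
    bits = 4.0                   # 4-bit NF4 code per weight
    if double_quant:
        bits += 8 / 64           # 8-bit quantized constant per 64-weight block
        bits += 32 / (64 * 256)  # fp32 constant per 256 first-level constants
    else:
        bits += 32 / 64          # fp32 absmax constant per 64-weight block
    return bits

print(nf4_bits_per_param(False))  # 4.5
print(nf4_bits_per_param(True))   # ~4.127, saving ~0.37 bits per parameter
```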

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch

# QLoRA configuration: 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,  # Nested quantization for extra savings
)

model_name = "meta-llama/Llama-3.1-8B-Instruct"

# Load model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Prepare model for k-bit training (handles gradient checkpointing)
model = prepare_model_for_kbit_training(model)

# LoRA configuration
lora_config = LoraConfig(
    r=16,                          # Rank
    lora_alpha=32,                 # Scaling factor
    target_modules=[               # Which layers to apply LoRA to
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Apply LoRA adapters
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Example output: trainable params ~42M || all params ~8.07B || trainable% ~0.52
# (exact counts vary with model revision; under GQA the k/v projections are
# smaller than q/o, and the MLP projections are wider than hidden_size)

Choosing the Right Rank

The rank (r) controls the capacity of the LoRA adaptation. Higher ranks can learn more complex transformations but use more memory and risk overfitting.


def estimate_lora_params(
    hidden_size: int,
    num_layers: int,
    rank: int,
    num_target_modules: int = 7,  # q, k, v, o, gate, up, down
) -> dict:
    """Rough estimate of trainable parameters for different LoRA ranks.

    Treats every target projection as hidden_size x hidden_size; real models
    deviate (GQA shrinks k/v, the MLP projections are wider), so treat these
    numbers as order-of-magnitude guides rather than exact counts.
    """
    params_per_layer = num_target_modules * 2 * hidden_size * rank
    total_params = params_per_layer * num_layers

    return {
        "rank": rank,
        "params_per_layer": f"{params_per_layer:,}",
        "total_trainable": f"{total_params:,}",
        "total_mb": f"{total_params * 2 / 1024**2:.1f} MB",  # bf16
    }

# Llama 3.1 8B: hidden_size=4096, 32 layers
for r in [4, 8, 16, 32, 64]:
    result = estimate_lora_params(4096, 32, r)
    print(f"Rank {r:2d}: {result['total_trainable']:>12s} params ({result['total_mb']})")

# Rank  4:    7,340,032 params (14.0 MB)
# Rank  8:   14,680,064 params (28.0 MB)
# Rank 16:   29,360,128 params (56.0 MB)
# Rank 32:   58,720,256 params (112.0 MB)
# Rank 64:  117,440,512 params (224.0 MB)

Practical guidelines: Use rank 8 for simple style and format tasks. Use rank 16-32 for moderate domain adaptation. Use rank 64 only for complex tasks with abundant training data.

Memory Requirements Comparison

| Configuration | Base Model | Adapters | Optimizer | Total GPU RAM |
|---|---|---|---|---|
| Full fine-tune (bf16) | 16 GB | n/a | 48 GB | ~64 GB |
| LoRA (bf16 base) | 16 GB | 56 MB | 168 MB | ~18 GB |
| QLoRA (4-bit base) | 4.5 GB | 56 MB | 168 MB | ~6 GB |

QLoRA makes it possible to fine-tune an 8B model on a single consumer GPU with 8 GB of VRAM, such as an RTX 3070, or on a free Google Colab T4, provided the batch size and sequence length stay small.
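The table's numbers follow from simple accounting. A sketch, assuming the ~29.4M trainable parameters of the rank-16 estimate above, bf16 adapter weights, and an optimizer that keeps gradients plus two Adam moments at the same 2-byte precision (one common accounting; exact figures vary by optimizer implementation):

```python
def lora_memory(trainable_params: int, base_params: int, base_bits: int) -> dict:
    """Back-of-envelope memory breakdown for a LoRA/QLoRA run."""
    adapter_mb = trainable_params * 2 / 1024**2   # bf16 adapter weights
    optimizer_mb = adapter_mb * 3                 # gradients + Adam m + v
    base_gb = base_params * base_bits / 8 / 1e9   # decimal GB, as in the table
    return {"base_gb": base_gb, "adapter_mb": adapter_mb, "optimizer_mb": optimizer_mb}

print(lora_memory(29_360_128, 8_030_000_000, 16))  # LoRA row: ~16 GB / 56 MB / 168 MB
print(lora_memory(29_360_128, 8_030_000_000, 4))   # QLoRA row: ~4 GB base, before
                                                   # quantization constants (~4.5 GB)
```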

Merging and Deploying LoRA Adapters

After training, merge the LoRA weights back into the base model for deployment with zero overhead.

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base model in full precision (CPU is fine for a one-off merge)
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="cpu",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Load and merge LoRA adapter
model = PeftModel.from_pretrained(base_model, "./my-lora-adapter")
merged_model = model.merge_and_unload()

# Save the merged model alongside its tokenizer
merged_model.save_pretrained("./merged-model")
tokenizer.save_pretrained("./merged-model")

FAQ

What is the difference between LoRA rank and LoRA alpha?

Rank (r) determines the size of the low-rank matrices and thus the capacity of the adaptation. Alpha controls the scaling factor applied to the LoRA output. The effective scaling is alpha/rank. A common pattern is to set alpha to 2x the rank (e.g., r=16, alpha=32). Higher alpha amplifies the LoRA contribution relative to the base model.
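Because the effective multiplier is alpha divided by rank, doubling both leaves the LoRA contribution's scale unchanged while doubling capacity; a quick check:

```python
def effective_scale(alpha: float, rank: int) -> float:
    """Scaling factor applied to the LoRA update, as in LoRALayer above."""
    return alpha / rank

print(effective_scale(32, 16))  # 2.0
print(effective_scale(16, 8))   # 2.0 -- same scale, half the capacity
print(effective_scale(64, 16))  # 4.0 -- stronger LoRA contribution
```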

Can I apply multiple LoRA adapters to the same model?

Yes. You can train separate LoRA adapters for different tasks and switch between them at inference time without reloading the base model. Libraries like PEFT support loading multiple adapters and selecting which one is active. You can even merge multiple adapters, though this requires care to avoid conflicting weight updates.

Is QLoRA quality worse than full LoRA due to the 4-bit quantization?

Research shows that QLoRA matches full-precision LoRA quality in most benchmarks. The key insight is that quantization only affects the frozen base weights, not the trainable LoRA parameters, which remain in bfloat16. The double quantization technique in QLoRA further reduces the quantization error. In practice, the quality difference is negligible for most fine-tuning tasks.


#LoRA #QLoRA #PEFT #FineTuning #OpenSourceLLMs #HuggingFace #AgenticAI #LearnAI #AIEngineering

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
