
Fine-Tuning with Hugging Face Transformers and PEFT: Complete Tutorial

A hands-on tutorial for fine-tuning open-source LLMs using Hugging Face Transformers, PEFT, and TRL libraries, covering setup, training configuration, evaluation, and pushing to the Hugging Face Hub.

The Hugging Face Fine-Tuning Stack

Hugging Face provides a complete stack for fine-tuning open-source models. The core libraries are:

  • transformers — model loading, tokenization, and inference
  • peft — parameter-efficient fine-tuning (LoRA, QLoRA)
  • trl — training utilities specifically for LLMs, including SFTTrainer
  • datasets — data loading and preprocessing
  • bitsandbytes — quantization support for QLoRA

Together, these libraries handle everything from data loading to model deployment. This tutorial walks through a complete fine-tuning workflow from start to finish.
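Before installing anything, it can help to see which pieces of the stack are already present. A minimal sketch using only the standard library:

```python
# Check which of the required packages are installed and at what version.
from importlib.metadata import version, PackageNotFoundError

for pkg in ["transformers", "peft", "trl", "datasets", "bitsandbytes", "accelerate"]:
    try:
        print(f"{pkg}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: NOT INSTALLED")
```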

Environment Setup

# Install required packages
# pip install torch transformers peft trl datasets bitsandbytes accelerate

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)
from peft import LoraConfig, TaskType
from trl import SFTConfig, SFTTrainer
from datasets import load_dataset

# Verify GPU availability
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

Loading the Base Model with QLoRA

model_name = "meta-llama/Llama-3.1-8B-Instruct"

# 4-bit quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation="flash_attention_2",  # requires the flash-attn package; omit if not installed
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

Preparing the Dataset

The SFTTrainer works best with datasets in conversational format — a messages column containing lists of role/content dicts.
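Concretely, each line of the JSONL file is one JSON object with a messages list. A small sketch of what one record looks like (the content below is invented for illustration):

```python
import json

# One training record in the conversational format SFTTrainer expects:
# a "messages" list of role/content dicts (the system message is optional).
example = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What does ICD-10 code E11.9 mean?"},
        {"role": "assistant", "content": "Type 2 diabetes mellitus without complications."},
    ]
}

# Each line of the JSONL file is one such record, serialized to JSON.
line = json.dumps(example)
print(line)

# A light validation pass: every message needs a known role and string content.
for msg in example["messages"]:
    assert msg["role"] in {"system", "user", "assistant"}
    assert isinstance(msg["content"], str)
```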

from datasets import Dataset
import json

def load_training_data(filepath: str) -> Dataset:
    """Load JSONL training data into a Hugging Face Dataset."""
    examples = []
    with open(filepath, "r") as f:
        for line in f:
            data = json.loads(line)
            examples.append({"messages": data["messages"]})
    return Dataset.from_list(examples)

# Load and split dataset
full_dataset = load_training_data("training_data.jsonl")
split = full_dataset.train_test_split(test_size=0.1, seed=42)

train_dataset = split["train"]
eval_dataset = split["test"]

print(f"Training examples: {len(train_dataset)}")
print(f"Evaluation examples: {len(eval_dataset)}")

# Inspect one example
print(json.dumps(train_dataset[0]["messages"], indent=2))

Configuring LoRA

# LoRA configuration
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
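Each matrix listed in target_modules gains two small low-rank factors: A with shape (r, d_in) and B with shape (d_out, r), so the adapter adds r * (d_in + d_out) parameters per matrix. The back-of-the-envelope calculation below uses the published Llama-3.1-8B layer shapes (treat those dimensions as assumptions; trust print_trainable_parameters() for the authoritative count):

```python
# Rough trainable-parameter estimate for LoRA with r=16 on all linear
# projections of a Llama-3.1-8B-shaped model.
r = 16
hidden, intermediate, kv_dim, n_layers = 4096, 14336, 1024, 32

# (d_out, d_in) for each targeted projection in one decoder layer
module_shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (kv_dim, hidden),   # smaller due to grouped-query attention
    "v_proj": (kv_dim, hidden),
    "o_proj": (hidden, hidden),
    "gate_proj": (intermediate, hidden),
    "up_proj": (intermediate, hidden),
    "down_proj": (hidden, intermediate),
}

per_layer = sum(r * (d_in + d_out) for d_out, d_in in module_shapes.values())
total = per_layer * n_layers
print(f"~{total / 1e6:.1f}M trainable parameters")  # ~41.9M
```

Roughly 42M trainable parameters against 8B frozen ones, which is why LoRA fits on a single GPU.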

Setting Up the SFT Trainer

The SFTTrainer from TRL handles chat template formatting, packing, and training loop management.


# Training configuration
training_args = SFTConfig(
    output_dir="./llama3-finetune",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,    # Effective batch size: 4 * 4 = 16
    gradient_checkpointing=True,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    weight_decay=0.01,
    bf16=True,
    logging_steps=10,
    eval_strategy="steps",
    eval_steps=50,
    save_strategy="steps",
    save_steps=50,
    save_total_limit=3,
    max_seq_length=2048,
    packing=False,                    # Set True to pack multiple examples
    report_to="none",                 # Use "wandb" for experiment tracking
)
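The batch-size, epoch, and warmup settings above interact; a quick sketch of the resulting schedule (the dataset size of 1,000 examples is assumed for illustration):

```python
import math

# How the training schedule works out for the configuration above.
num_examples = 1000          # assumed dataset size
per_device_batch = 4
grad_accum = 4
epochs = 3
warmup_ratio = 0.1

effective_batch = per_device_batch * grad_accum              # 16
steps_per_epoch = math.ceil(num_examples / effective_batch)  # 63
total_steps = steps_per_epoch * epochs                       # 189
warmup_steps = int(total_steps * warmup_ratio)               # 18

print(effective_batch, steps_per_epoch, total_steps, warmup_steps)
```

With eval_steps=50 and save_steps=50, this run would evaluate and checkpoint three times before finishing.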

# Initialize trainer
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    processing_class=tokenizer,
    peft_config=peft_config,
)

# Check trainable parameters
trainer.model.print_trainable_parameters()

Training

# Start training
train_result = trainer.train()

# Print training metrics
print(f"Training loss: {train_result.training_loss:.4f}")
print(f"Training runtime: {train_result.metrics['train_runtime']:.0f}s")
print(f"Samples per second: {train_result.metrics['train_samples_per_second']:.1f}")

# Save the LoRA adapter
trainer.save_model("./llama3-finetune/final")
tokenizer.save_pretrained("./llama3-finetune/final")

Evaluation

from transformers import pipeline

# Load the fine-tuned model for inference. The trainer's model already
# carries the LoRA adapter, quantization config, and device placement,
# so no extra dtype or device arguments are needed here.
pipe = pipeline(
    "text-generation",
    model=trainer.model,
    tokenizer=tokenizer,
)

def evaluate_on_test(pipe, test_data, num_samples=20):
    """Run model on test examples and collect results."""
    results = []
    for i in range(min(num_samples, len(test_data))):
        example = test_data[i]
        messages = example["messages"]

        # Use all messages except the last (assistant response) as input
        prompt_messages = messages[:-1]
        expected = messages[-1]["content"]

        output = pipe(
            prompt_messages,
            max_new_tokens=512,
            temperature=0.1,
            do_sample=True,
        )
        generated = output[0]["generated_text"][-1]["content"]

        results.append({
            "input": messages[-2]["content"][:100],
            "expected": expected[:100],
            "generated": generated[:100],
        })

    return results

results = evaluate_on_test(pipe, eval_dataset)
for r in results[:5]:
    print(f"Input:    {r['input']}")
    print(f"Expected: {r['expected']}")
    print(f"Got:      {r['generated']}")
    print("---")

Pushing to Hugging Face Hub

# Login to Hugging Face (run once)
# huggingface-cli login --token hf_YOUR_TOKEN

# Push the LoRA adapter to Hub
trainer.model.push_to_hub(
    "your-username/llama3-medical-coder-lora",
    private=True,
)
tokenizer.push_to_hub(
    "your-username/llama3-medical-coder-lora",
    private=True,
)

# To merge and push the full model, reload the adapter onto a
# full-precision base model first; merging directly into the 4-bit
# quantized weights degrades quality:
from peft import AutoPeftModelForCausalLM

merged_model = AutoPeftModelForCausalLM.from_pretrained(
    "./llama3-finetune/final",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
merged = merged_model.merge_and_unload()
merged.push_to_hub(
    "your-username/llama3-medical-coder-merged",
    private=True,
)

FAQ

What is the difference between SFTTrainer and the standard Trainer?

SFTTrainer (Supervised Fine-Tuning Trainer) from TRL is specifically designed for LLM fine-tuning. It automatically handles chat template formatting, supports packing multiple short examples into a single sequence for efficiency, and integrates seamlessly with PEFT adapters. The standard Trainer from transformers works for general training but requires you to handle tokenization, padding, and label masking manually for language model fine-tuning.
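The label masking that SFTTrainer automates is worth seeing once. A toy sketch with invented token ids (not real tokenizer output): loss should only be computed on the assistant response, so prompt positions get the ignore index -100.

```python
# Manual label masking, the step SFTTrainer handles for you.
IGNORE_INDEX = -100  # positions with this label are skipped by the loss

prompt_ids = [101, 2054, 2003, 1029]   # hypothetical prompt tokens
response_ids = [3231, 4248, 102]       # hypothetical response tokens

input_ids = prompt_ids + response_ids
labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids

print(input_ids)
print(labels)
```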

How do I choose between packing=True and packing=False?

Packing concatenates multiple training examples into a single sequence to maximize GPU utilization. Enable packing when your examples are short (under 25% of max_seq_length) and you want faster training. Disable packing when example boundaries matter — for instance, if your system prompts vary between examples, packing can create confusing boundaries. Start with packing disabled and enable it only if training is slow due to short sequences.
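A toy illustration of the idea (a greedy sketch only; real implementations also insert EOS tokens between examples and handle attention boundaries):

```python
# Greedily concatenate short tokenized examples into fixed-length sequences.
def pack_examples(examples, max_len):
    packed, current = [], []
    for ex in examples:
        if len(current) + len(ex) > max_len and current:
            packed.append(current)
            current = []
        current.extend(ex)
    if current:
        packed.append(current)
    return packed

# Four short "tokenized examples" packed into length-8 sequences.
examples = [[1, 2, 3], [4, 5], [6, 7, 8], [9]]
print(pack_examples(examples, max_len=8))
# [[1, 2, 3, 4, 5, 6, 7, 8], [9]]
```

The first three examples fill one sequence exactly; the fourth starts a new one. This is why packing pays off most when examples are far shorter than max_seq_length.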

How do I resume training from a checkpoint if it gets interrupted?

SFTTrainer saves checkpoints automatically based on your save_strategy configuration. To resume, pass the checkpoint directory to the resume_from_checkpoint parameter: trainer.train(resume_from_checkpoint="./llama3-finetune/checkpoint-150"). The trainer restores the model weights, optimizer state, learning rate schedule, and data loader position so training continues exactly where it left off.
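To avoid hard-coding the step number, a small helper (a sketch using only the standard library) can locate the most recent checkpoint directory:

```python
import os
import re

def latest_checkpoint(output_dir):
    """Return the path of the highest-numbered checkpoint-N directory, or None."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    steps = []
    for name in os.listdir(output_dir):
        m = pattern.match(name)
        if m and os.path.isdir(os.path.join(output_dir, name)):
            steps.append(int(m.group(1)))
    if not steps:
        return None
    return os.path.join(output_dir, f"checkpoint-{max(steps)}")

# Usage:
# trainer.train(resume_from_checkpoint=latest_checkpoint("./llama3-finetune"))
```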


#HuggingFace #PEFT #Transformers #TRL #FineTuning #SFT #AgenticAI #LearnAI #AIEngineering

CallSphere Team