
Understanding LLM Training: Pre-training, Fine-tuning, and RLHF

Learn the complete LLM training pipeline from pre-training on internet-scale data through supervised fine-tuning and RLHF alignment, with practical code examples at each stage.

The Three Stages of Building an LLM

Creating a useful LLM is not a single training run — it is a three-stage pipeline. Each stage transforms the model's behavior in a distinct way:

  1. Pre-training: Teach the model language by predicting the next token on trillions of words
  2. Supervised Fine-tuning (SFT): Teach the model to follow instructions using curated examples
  3. Reinforcement Learning from Human Feedback (RLHF): Align the model with human preferences

Understanding this pipeline explains why LLMs behave the way they do and gives you the knowledge to customize them for your applications.

Stage 1: Pre-training — Learning Language Itself

Pre-training is the most expensive and foundational stage. The model learns grammar, facts, reasoning patterns, code, and multilingual capabilities by processing massive text datasets.

The objective is simple: given a sequence of tokens, predict the next one. The model reads billions of web pages, books, code repositories, and articles, always trying to predict what comes next:

# Conceptual pre-training loop (simplified)
import torch
import torch.nn as nn

def pretraining_step(model, batch, optimizer):
    """
    One step of next-token prediction training.

    batch contains:
    - input_ids: token sequences [batch_size, seq_len]
    - labels: same as input_ids shifted by 1
    """
    input_ids = batch["input_ids"]        # e.g., [The, cat, sat, on]
    labels = batch["labels"]              # e.g., [cat, sat, on, the]

    # Forward pass: model predicts probability of each next token
    logits = model(input_ids)  # [batch_size, seq_len, vocab_size]

    # Cross-entropy loss: how wrong were the predictions?
    loss_fn = nn.CrossEntropyLoss()
    loss = loss_fn(
        logits.view(-1, logits.size(-1)),  # Flatten to [batch*seq, vocab]
        labels.view(-1),                    # Flatten to [batch*seq]
    )

    # Backward pass: compute gradients
    loss.backward()

    # Update parameters
    optimizer.step()
    optimizer.zero_grad()

    return loss.item()
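The shift between `input_ids` and `labels` is the heart of next-token prediction. A minimal sketch (with made-up token IDs, purely for illustration) of how one training pair is constructed:

```python
# Toy example: build an (input_ids, labels) pair for next-token prediction.
# The token IDs here are hypothetical, not from any real tokenizer.
tokens = [101, 7592, 2088, 2003, 2307, 102]

input_ids = tokens[:-1]   # all tokens except the last
labels = tokens[1:]       # the same sequence shifted left by one

# At every position, the model sees input_ids[i] and must predict labels[i]
for x, y in zip(input_ids, labels):
    print(f"given {x} -> predict {y}")
```

Every position in the sequence becomes a training signal, which is why a single document yields thousands of prediction targets.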

The scale of pre-training is staggering. Training Llama 3.1 405B required 30.8 million GPU-hours on NVIDIA H100 GPUs. The training data was over 15 trillion tokens. The compute cost is estimated at tens of millions of dollars.

What pre-training teaches:

  • Grammar and syntax of hundreds of languages
  • World knowledge (facts, relationships, geography, history)
  • Code patterns across dozens of programming languages
  • Reasoning patterns (math, logic, common sense)
  • Writing styles (formal, informal, technical, creative)

What pre-training does not teach:

  • Following instructions
  • Being helpful, harmless, and honest
  • Refusing dangerous requests
  • Producing structured outputs on demand

A pre-trained model is a powerful text completion engine, but it is not a useful assistant. It might complete "How do I make a bomb?" with an actual recipe from its training data. This is where the next stages come in.

Stage 2: Supervised Fine-tuning (SFT) — Learning to Follow Instructions

SFT takes the pre-trained model and trains it on curated instruction-response pairs. This transforms the model from a text completer into an instruction follower:

# Example SFT training data
sft_examples = [
    {
        "instruction": "Summarize the following article in 3 bullet points.",
        "input": "Artificial intelligence has transformed healthcare...",
        "output": "- AI enables earlier disease detection through medical imaging analysis\n"
                  "- Machine learning models predict patient outcomes more accurately\n"
                  "- Automated systems reduce administrative burden on healthcare workers",
    },
    {
        "instruction": "Write a Python function that reverses a string.",
        "input": "",
        "output": "def reverse_string(s: str) -> str:\n"
                  "    return s[::-1]",
    },
    {
        "instruction": "Is the following statement true or false? Explain your reasoning.",
        "input": "The Great Wall of China is visible from space with the naked eye.",
        "output": "False. This is a common misconception. The Great Wall is very long "
                  "but only about 6 meters wide, making it too narrow to see from orbit "
                  "without magnification.",
    },
]
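Before training, instruction-style records like these are typically converted into the chat-message JSONL format that fine-tuning APIs expect. A minimal sketch of that conversion (the field names mirror the example records above; the exact target schema depends on your provider):

```python
import json

def to_chat_jsonl(example: dict) -> str:
    """Convert an instruction/input/output record into one chat-format JSONL line."""
    user_content = example["instruction"]
    if example["input"]:
        user_content += "\n\n" + example["input"]
    record = {
        "messages": [
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": example["output"]},
        ]
    }
    return json.dumps(record)

line = to_chat_jsonl({
    "instruction": "Write a Python function that reverses a string.",
    "input": "",
    "output": "def reverse_string(s: str) -> str:\n    return s[::-1]",
})
print(line)
```

Writing one such line per example produces the `training_data.jsonl` file used in the next step.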

You can fine-tune models using the OpenAI fine-tuning API:

from openai import OpenAI

client = OpenAI()

# Step 1: Prepare training data in JSONL format
# Each line: {"messages": [{"role": "system", ...}, {"role": "user", ...}, {"role": "assistant", ...}]}

# Step 2: Upload training file
training_file = client.files.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune",
)

# Step 3: Create fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",
    hyperparameters={
        "n_epochs": 3,
        "learning_rate_multiplier": 1.8,
        "batch_size": 4,
    },
)

# Step 4: Poll for completion (fine-tuning jobs take minutes to hours)
import time

while True:
    status = client.fine_tuning.jobs.retrieve(job.id)
    print(f"Status: {status.status}")
    if status.status in ("succeeded", "failed", "cancelled"):
        break
    time.sleep(30)  # poll periodically instead of busy-looping

# Step 5: Use the fine-tuned model
if status.status == "succeeded":
    response = client.chat.completions.create(
        model=status.fine_tuned_model,
        messages=[{"role": "user", "content": "Your prompt here"}],
    )

SFT typically uses a few thousand to a few hundred thousand examples. The quality of these examples matters more than the quantity — noisy or contradictory training data degrades model performance.
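Because data quality matters so much, it is worth sanity-checking the JSONL file before uploading it. A minimal sketch of a validation pass that drops malformed lines and exact duplicates (the required `"messages"` key matches the chat format shown earlier; adapt the checks to your schema):

```python
import json

def clean_jsonl(lines):
    """Drop unparseable lines, records missing the chat field, and exact duplicates."""
    seen = set()
    cleaned = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue                      # drop unparseable lines
        if "messages" not in record:
            continue                      # drop records missing the chat field
        key = json.dumps(record, sort_keys=True)
        if key in seen:
            continue                      # drop exact duplicates
        seen.add(key)
        cleaned.append(record)
    return cleaned

raw = [
    '{"messages": [{"role": "user", "content": "hi"}]}',
    '{"messages": [{"role": "user", "content": "hi"}]}',   # duplicate
    'not json at all',                                      # malformed
]
print(len(clean_jsonl(raw)))  # 1
```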

Stage 3: RLHF — Aligning with Human Preferences

RLHF is what turns an instruction-following model into a model that is helpful, honest, and harmless. It uses human preferences to further refine the model's behavior.


The RLHF pipeline has two sub-stages:

Step 3a: Train a Reward Model. Human labelers compare pairs of model outputs and indicate which one is better. These preferences train a reward model that can score any output:

# Conceptual reward model training
reward_training_data = [
    {
        "prompt": "Explain quantum computing to a 10-year-old.",
        "chosen": "Imagine you have a magic coin that can be both heads AND tails "
                  "at the same time until you look at it...",
        "rejected": "Quantum computing leverages quantum mechanical phenomena such as "
                    "superposition and entanglement to perform computations...",
    },
    {
        "prompt": "How do I pick a lock?",
        "chosen": "I cannot provide instructions on picking locks, as this could "
                  "facilitate illegal entry. If you are locked out of your own "
                  "property, I recommend contacting a licensed locksmith.",
        "rejected": "To pick a pin tumbler lock, you will need a tension wrench "
                    "and a pick. Insert the tension wrench...",
    },
]

# The reward model learns:
# - Simpler explanations are preferred for simple questions
# - Refusing dangerous requests is preferred
# - Helpful, accurate responses are preferred over vague ones

Step 3b: Optimize the LLM Using the Reward Model. The model generates responses, the reward model scores them, and the LLM is updated to produce higher-scoring responses using Proximal Policy Optimization (PPO):

# Conceptual PPO training loop for RLHF
def rlhf_training_step(policy_model, reward_model, reference_model, prompt):
    """
    One step of RLHF:
    1. Generate a response
    2. Score it with the reward model
    3. Update the policy to increase reward
    4. Penalize divergence from the reference model (KL penalty)
    """
    # Generate response from current policy
    response = policy_model.generate(prompt)

    # Score the response
    reward = reward_model.score(prompt, response)

    # KL divergence penalty: prevent the model from diverging too far
    # from the SFT checkpoint (the reference model)
    kl_penalty = compute_kl_divergence(policy_model, reference_model, prompt, response)

    # Total objective: maximize reward while staying close to reference
    # (beta is a hyperparameter controlling the strength of the KL penalty)
    objective = reward - beta * kl_penalty

    # Update policy model using PPO
    ppo_update(policy_model, objective)

The KL divergence penalty is crucial. Without it, the model would learn to exploit the reward model — generating outputs that score highly but are degenerate or nonsensical (a phenomenon called reward hacking).
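A common way to estimate that penalty in practice is from the log-probabilities each model assigns to the tokens the policy actually sampled. A minimal sketch (the log-probability lists here are illustrative stand-ins for per-token model outputs):

```python
import math

def kl_estimate(policy_logprobs, reference_logprobs):
    """Simple per-token KL estimate: E[log p_policy - log p_reference]
    over tokens sampled from the policy."""
    diffs = [p - r for p, r in zip(policy_logprobs, reference_logprobs)]
    return sum(diffs) / len(diffs)

policy_lp = [-1.2, -0.8, -2.0]     # log-probs the policy gave its own tokens
reference_lp = [-1.5, -1.4, -2.1]  # log-probs the SFT reference gave the same tokens
print(kl_estimate(policy_lp, reference_lp))  # positive: policy has drifted
```

The more confident the policy becomes relative to the reference on its own samples, the larger this estimate grows, and the larger the penalty subtracted from the reward.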

Modern Alternatives: DPO and Constitutional AI

RLHF is effective but complex. Two alternatives have emerged:

Direct Preference Optimization (DPO) eliminates the reward model entirely. It directly optimizes the policy model using preference pairs, which is simpler and more stable:

# DPO directly uses preference pairs without a reward model
# It implicitly defines a reward function through the policy itself

# Training data is the same as reward model data:
dpo_data = [
    {"prompt": "...", "chosen": "...", "rejected": "..."},
]

# But the optimization is a single supervised learning step
# rather than the complex PPO loop
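The DPO objective itself is compact enough to sketch directly. For a single preference pair, it compares how much the policy favors the chosen response over the rejected one, relative to the frozen reference model (the sequence log-probabilities here are illustrative inputs; real training sums per-token log-probs from both models):

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, from total sequence log-probabilities.
    beta controls how sharply the preference constraint binds."""
    # Implicit reward margin: how much more the policy favors "chosen"
    # over "rejected", relative to the reference model.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# When the policy matches the reference, the margin is zero and the
# loss sits at log(2); widening the margin drives the loss down.
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))  # log(2) ~ 0.693
```

Because this is an ordinary differentiable loss over static preference data, DPO trains like supervised learning, with no sampling loop and no separate reward network.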

Constitutional AI (CAI), developed by Anthropic, uses a set of principles (a "constitution") to generate preference data automatically. The model critiques and revises its own outputs based on the constitution, reducing the need for human labelers.

What This Means for Application Developers

Understanding the training pipeline helps you work with LLMs more effectively:

  1. Pre-training knowledge has a cutoff date. The model does not know about events after its training data was collected. Use RAG to provide current information.

  2. SFT explains format sensitivity. The model was trained on specific instruction formats. Following the expected chat format (system/user/assistant roles) produces better results because it matches the fine-tuning data.

  3. RLHF explains safety behavior. When a model refuses a request, it is because RLHF taught it that refusal is preferred in that context. This is not a bug — it is a design choice.

  4. Fine-tuning is accessible. You can fine-tune models on your own data to specialize behavior for your domain without the cost of pre-training.

FAQ

How much does it cost to pre-train an LLM from scratch?

Pre-training a frontier model costs tens to hundreds of millions of dollars in compute alone, not counting data preparation, researcher salaries, or infrastructure. However, you almost never need to pre-train from scratch. Fine-tuning an existing model costs between a few dollars (for small datasets on GPT-4o-mini) and a few thousand dollars (for large datasets on capable models). Most applications should use fine-tuning or prompt engineering rather than pre-training.

What is the difference between fine-tuning and prompt engineering?

Prompt engineering changes the model's behavior through the instructions you provide at inference time — no training is involved. Fine-tuning actually modifies the model's weights using your training data. Fine-tuning is better when you need consistent behavior across many requests, domain-specific knowledge, or a particular output format. Prompt engineering is faster to iterate on and requires no training data. Start with prompt engineering and move to fine-tuning only when you hit its limits.

Can RLHF make a model worse?

Yes. Poor-quality preference data, reward model misalignment, or insufficient KL penalty can all degrade model performance. A phenomenon called "alignment tax" describes cases where RLHF improves safety at the cost of capability. This is why the balance between helpfulness and safety is an active area of research and why different model providers make different trade-offs.


#LLMTraining #Finetuning #RLHF #Pretraining #Alignment #AgenticAI #LearnAI #AIEngineering

CallSphere Team
