Preparing Fine-Tuning Datasets: Data Collection, Cleaning, and Formatting
Master the art of building high-quality fine-tuning datasets with practical techniques for data collection, cleaning, deduplication, format validation, and diversity analysis.
Data Quality Determines Model Quality
The most common reason fine-tuning fails is poor training data. A model trained on 200 high-quality examples will typically outperform one trained on 5,000 noisy, inconsistent examples. The principle is simple: your fine-tuned model will replicate whatever patterns exist in your training data, including mistakes, inconsistencies, and formatting errors.
This guide covers the full pipeline from raw data collection to a validated, production-ready training dataset.
Collecting Training Examples
The best training examples come from real production interactions that were reviewed and corrected by domain experts. There are several reliable sources.
Production logs. If you already have an LLM-powered application, filter logs for interactions where the model performed well. Have a domain expert verify each one.
Expert annotation. Give domain experts input prompts and have them write ideal responses. This is expensive but produces the highest quality data.
Existing documentation. Convert FAQs, knowledge base articles, or support tickets into prompt-response pairs.
The snippet below defines a minimal container for training examples, in the chat-messages format most fine-tuning APIs expect, along with a collector that filters production logs for well-rated interactions; a sketch for converting existing documentation follows it.
import json
from dataclasses import dataclass, asdict
from typing import Optional
@dataclass
class TrainingExample:
system_prompt: str
user_message: str
assistant_response: str
source: str
quality_score: Optional[float] = None
def to_jsonl_format(self) -> dict:
return {
"messages": [
{"role": "system", "content": self.system_prompt},
{"role": "user", "content": self.user_message},
{"role": "assistant", "content": self.assistant_response},
]
}
def collect_from_production_logs(
logs: list[dict],
min_rating: float = 4.0,
system_prompt: str = "",
) -> list[TrainingExample]:
"""Filter production logs for high-quality interactions."""
examples = []
for log in logs:
if log.get("user_rating", 0) >= min_rating:
examples.append(TrainingExample(
system_prompt=system_prompt,
user_message=log["user_input"],
assistant_response=log["assistant_output"],
source="production_logs",
quality_score=log["user_rating"],
))
return examples
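Converting existing documentation works the same way. Here is a minimal sketch, assuming your FAQ export is a list of question/answer dicts; the faq_entries structure and its field names are illustrative, so adapt them to your actual export format.

def collect_from_faq(
    faq_entries: list[dict],
    system_prompt: str = "",
) -> list[TrainingExample]:
    """Convert FAQ entries into prompt-response training pairs.

    Assumes each entry is a dict with 'question' and 'answer' keys
    (an assumption about your export format, not a requirement).
    """
    return [
        TrainingExample(
            system_prompt=system_prompt,
            user_message=entry["question"],
            assistant_response=entry["answer"],
            source="faq",
        )
        for entry in faq_entries
        # Skip entries where either side is missing or empty
        if entry.get("question") and entry.get("answer")
    ]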
Cleaning and Normalizing
Raw data is messy. Before it becomes training data, it needs to be cleaned.
import re
import unicodedata
def clean_text(text: str) -> str:
"""Normalize and clean a text string for training."""
# Normalize Unicode characters
text = unicodedata.normalize("NFKC", text)
    # Remove zero-width and other invisible formatting characters
    text = re.sub(r"[\u200b-\u200f\u2028-\u202f\ufeff]", "", text)
    # Replace non-breaking spaces from copy-paste (NFKC already converts
    # U+00A0 to a regular space, so this is a safety net)
    text = text.replace("\xa0", " ")
    # Normalize whitespace: collapse runs of spaces/tabs, cap blank lines,
    # and strip leading/trailing whitespace
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
def clean_example(example: TrainingExample) -> TrainingExample:
"""Apply cleaning to all text fields."""
return TrainingExample(
system_prompt=clean_text(example.system_prompt),
user_message=clean_text(example.user_message),
assistant_response=clean_text(example.assistant_response),
source=example.source,
quality_score=example.quality_score,
)
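In practice you apply the cleaner to the whole collection and drop anything that comes out empty. A minimal filtering pass, where the three-word floor is an illustrative choice rather than a rule:

def clean_and_filter(examples: list[TrainingExample]) -> list[TrainingExample]:
    """Clean every example, then drop ones with empty or trivial fields."""
    cleaned = [clean_example(ex) for ex in examples]
    return [
        ex for ex in cleaned
        # Both sides must survive cleaning; 3 words is an arbitrary floor
        if ex.user_message and len(ex.assistant_response.split()) >= 3
    ]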
Deduplication
Duplicate or near-duplicate examples bias the model and waste training budget. Use both exact deduplication and fuzzy matching.
import hashlib
from difflib import SequenceMatcher
def exact_dedup(examples: list[TrainingExample]) -> list[TrainingExample]:
"""Remove exact duplicates based on user+assistant content hash."""
seen = set()
unique = []
for ex in examples:
content = ex.user_message + "|||" + ex.assistant_response
content_hash = hashlib.sha256(content.encode()).hexdigest()
if content_hash not in seen:
seen.add(content_hash)
unique.append(ex)
return unique
def fuzzy_dedup(
examples: list[TrainingExample],
similarity_threshold: float = 0.85,
) -> list[TrainingExample]:
"""Remove near-duplicates using sequence similarity."""
unique = []
for ex in examples:
is_duplicate = False
for kept in unique:
sim = SequenceMatcher(
None, ex.user_message, kept.user_message
).ratio()
if sim > similarity_threshold:
is_duplicate = True
break
if not is_duplicate:
unique.append(ex)
return unique
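Run the cheap exact pass first so the quadratic fuzzy pass has fewer candidates to compare:

examples = exact_dedup(examples)
examples = fuzzy_dedup(examples, similarity_threshold=0.85)
print(f"{len(examples)} examples remain after deduplication")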
Diversity Analysis
A good training dataset covers the full range of inputs your model will encounter. Analyze the distribution of topics, lengths, and complexity.
from collections import Counter
def analyze_diversity(examples: list[TrainingExample]) -> dict:
"""Analyze dataset diversity across multiple dimensions."""
user_lengths = [len(ex.user_message.split()) for ex in examples]
assistant_lengths = [len(ex.assistant_response.split()) for ex in examples]
# Simple keyword-based topic detection
topic_keywords = {
"billing": ["invoice", "payment", "charge", "refund", "bill"],
"technical": ["error", "bug", "crash", "install", "update"],
"account": ["password", "login", "account", "profile", "settings"],
}
topic_counts = Counter()
for ex in examples:
text = ex.user_message.lower()
matched = False
        # An example mentioning several topics is counted once per topic
        for topic, keywords in topic_keywords.items():
            if any(kw in text for kw in keywords):
                topic_counts[topic] += 1
                matched = True
if not matched:
topic_counts["other"] += 1
return {
"total_examples": len(examples),
"avg_user_length": sum(user_lengths) / len(user_lengths),
"avg_assistant_length": sum(assistant_lengths) / len(assistant_lengths),
"min_user_length": min(user_lengths),
"max_user_length": max(user_lengths),
"topic_distribution": dict(topic_counts),
}
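A quick usage sketch that prints the report before training, so gaps are visible early (variable names here are illustrative):

report = analyze_diversity(examples)
print(f"Total examples: {report['total_examples']}")
print(f"Avg user / assistant length: "
      f"{report['avg_user_length']:.0f} / {report['avg_assistant_length']:.0f} words")
for topic, count in sorted(report["topic_distribution"].items()):
    print(f"  {topic}: {count} ({count / report['total_examples']:.0%})")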
Building the Final JSONL File
Once your data is collected, cleaned, deduplicated, and analyzed, assemble the final training and validation files.
import json
import random
def build_dataset(
examples: list[TrainingExample],
train_path: str = "train.jsonl",
val_path: str = "val.jsonl",
val_split: float = 0.1,
seed: int = 42,
) -> dict:
"""Split examples and write JSONL files."""
random.seed(seed)
shuffled = examples.copy()
random.shuffle(shuffled)
split_idx = int(len(shuffled) * (1 - val_split))
train = shuffled[:split_idx]
val = shuffled[split_idx:]
for path, data in [(train_path, train), (val_path, val)]:
        with open(path, "w", encoding="utf-8") as f:
            for ex in data:
                f.write(json.dumps(ex.to_jsonl_format(), ensure_ascii=False) + "\n")
return {
"train_count": len(train),
"val_count": len(val),
"train_path": train_path,
"val_path": val_path,
}
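Before uploading, validate that every line parses as JSON and follows the expected chat-message schema. A minimal sketch; exact schema requirements vary by provider, so treat these checks as a baseline rather than a complete validator.

import json

def validate_jsonl(path: str) -> list[str]:
    """Check that each line is valid JSON with well-formed messages."""
    errors = []
    valid_roles = {"system", "user", "assistant"}
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError as e:
                errors.append(f"line {i}: invalid JSON ({e})")
                continue
            messages = record.get("messages")
            if not isinstance(messages, list) or not messages:
                errors.append(f"line {i}: missing or empty 'messages' list")
                continue
            for msg in messages:
                if not isinstance(msg, dict):
                    errors.append(f"line {i}: message is not an object")
                    continue
                if msg.get("role") not in valid_roles:
                    errors.append(f"line {i}: unknown role {msg.get('role')!r}")
                if not msg.get("content"):
                    errors.append(f"line {i}: empty message content")
            if messages[-1].get("role") != "assistant":
                errors.append(f"line {i}: last message must be from assistant")
    return errors

Run it on both the train and validation files and fix every reported line before starting a training job.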
FAQ
How many training examples do I need for a good fine-tuned model?
There is no universal minimum, but practical results follow a pattern. With 50-100 examples you get noticeable formatting and style improvements. With 200-500 examples you get reliable domain-specific behavior. Beyond 1,000 examples, gains diminish unless you are teaching genuinely complex reasoning. Always start small, evaluate, and add more data only where the model is weakest.
Should the system prompt be the same across all training examples?
Keeping a consistent system prompt across all examples is recommended when fine-tuning for a single task. The model learns the association between that system prompt and the expected behavior. If you need the model to handle multiple tasks, you can vary the system prompt — but make sure each variant has enough examples for the model to learn the pattern.
How do I handle imbalanced topic distributions in my dataset?
Undersample over-represented topics and manually create or augment examples for under-represented ones. If 80% of your examples are about billing and 5% are about technical issues, the model will handle billing well but struggle with technical queries. Aim for a distribution that roughly matches your production traffic, with slight oversampling of rare but important categories.
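A minimal undersampling sketch, assuming a topic_of function that labels each example (hypothetical here; the keyword matcher from the diversity section is one way to implement it) and a max_per_topic cap that is an illustrative default:

import random
from collections import defaultdict
from typing import Callable

def undersample_topics(
    examples: list[TrainingExample],
    topic_of: Callable[[TrainingExample], str],  # hypothetical topic labeler
    max_per_topic: int = 500,
    seed: int = 42,
) -> list[TrainingExample]:
    """Randomly cap each topic at max_per_topic examples."""
    random.seed(seed)
    grouped = defaultdict(list)
    for ex in examples:
        grouped[topic_of(ex)].append(ex)
    balanced = []
    for group in grouped.values():
        # Sample down only the topics that exceed the cap
        if len(group) > max_per_topic:
            group = random.sample(group, max_per_topic)
        balanced.extend(group)
    random.shuffle(balanced)
    return balanced

For the under-represented side, pair this with freshly written examples rather than duplicating what you already have.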