Preparing Fine-Tuning Datasets: Data Collection, Cleaning, and Formatting
Master the art of building high-quality fine-tuning datasets with practical techniques for data collection, cleaning, deduplication, format validation, and diversity analysis.
Data Quality Determines Model Quality
The most common reason fine-tuning fails is poor training data. A model trained on 200 high-quality examples will typically outperform one trained on 5,000 noisy, inconsistent examples. The principle is simple: your fine-tuned model will replicate whatever patterns exist in your training data, including mistakes, inconsistencies, and formatting errors.
This guide covers the full pipeline from raw data collection to a validated, production-ready training dataset.
Collecting Training Examples
The best training examples come from real production interactions that were reviewed and corrected by domain experts. There are several reliable sources.
Production logs. If you already have an LLM-powered application, filter logs for interactions where the model performed well. Have a domain expert verify each one.
Expert annotation. Give domain experts input prompts and have them write ideal responses. This is expensive but produces the highest quality data.
Existing documentation. Convert FAQs, knowledge base articles, or support tickets into prompt-response pairs.
The snippet below defines a minimal container for training examples, in the chat-messages format most fine-tuning APIs expect, along with a collector that filters production logs for well-rated interactions; a sketch for converting existing documentation follows it.
import json
from dataclasses import dataclass, asdict
from typing import Optional
@dataclass
class TrainingExample:
system_prompt: str
user_message: str
assistant_response: str
source: str
quality_score: Optional[float] = None
def to_jsonl_format(self) -> dict:
return {
"messages": [
{"role": "system", "content": self.system_prompt},
{"role": "user", "content": self.user_message},
{"role": "assistant", "content": self.assistant_response},
]
}
def collect_from_production_logs(
logs: list[dict],
min_rating: float = 4.0,
system_prompt: str = "",
) -> list[TrainingExample]:
"""Filter production logs for high-quality interactions."""
examples = []
for log in logs:
if log.get("user_rating", 0) >= min_rating:
examples.append(TrainingExample(
system_prompt=system_prompt,
user_message=log["user_input"],
assistant_response=log["assistant_output"],
source="production_logs",
quality_score=log["user_rating"],
))
return examples
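Converting existing documentation works the same way. Here is a minimal sketch, assuming your FAQ export is a list of question/answer dicts; the faq_entries structure and its field names are illustrative, so adapt them to your actual export format.

def collect_from_faq(
    faq_entries: list[dict],
    system_prompt: str = "",
) -> list[TrainingExample]:
    """Convert FAQ entries into prompt-response training pairs.

    Assumes each entry is a dict with 'question' and 'answer' keys
    (an assumption about your export format, not a requirement).
    """
    return [
        TrainingExample(
            system_prompt=system_prompt,
            user_message=entry["question"],
            assistant_response=entry["answer"],
            source="faq",
        )
        for entry in faq_entries
        # Skip entries where either side is missing or empty
        if entry.get("question") and entry.get("answer")
    ]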
Cleaning and Normalizing
Raw data is messy. Before it becomes training data, it needs to be cleaned.
import re
import unicodedata
def clean_text(text: str) -> str:
"""Normalize and clean a text string for training."""
# Normalize Unicode characters
text = unicodedata.normalize("NFKC", text)
    # Remove zero-width and other invisible formatting characters
    text = re.sub(r"[\u200b-\u200f\u2028-\u202f\ufeff]", "", text)
    # Replace non-breaking spaces from copy-paste (NFKC already converts
    # U+00A0 to a regular space, so this is a safety net)
    text = text.replace("\xa0", " ")
    # Normalize whitespace: collapse runs of spaces/tabs, cap blank lines,
    # and strip leading/trailing whitespace
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
def clean_example(example: TrainingExample) -> TrainingExample:
"""Apply cleaning to all text fields."""
return TrainingExample(
system_prompt=clean_text(example.system_prompt),
user_message=clean_text(example.user_message),
assistant_response=clean_text(example.assistant_response),
source=example.source,
quality_score=example.quality_score,
)
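In practice you apply the cleaner to the whole collection and drop anything that comes out empty. A minimal filtering pass, where the three-word floor is an illustrative choice rather than a rule:

def clean_and_filter(examples: list[TrainingExample]) -> list[TrainingExample]:
    """Clean every example, then drop ones with empty or trivial fields."""
    cleaned = [clean_example(ex) for ex in examples]
    return [
        ex for ex in cleaned
        # Both sides must survive cleaning; 3 words is an arbitrary floor
        if ex.user_message and len(ex.assistant_response.split()) >= 3
    ]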
Deduplication
Duplicate or near-duplicate examples bias the model and waste training budget. Use both exact deduplication and fuzzy matching.
import hashlib
from difflib import SequenceMatcher
def exact_dedup(examples: list[TrainingExample]) -> list[TrainingExample]:
"""Remove exact duplicates based on user+assistant content hash."""
seen = set()
unique = []
for ex in examples:
content = ex.user_message + "|||" + ex.assistant_response
content_hash = hashlib.sha256(content.encode()).hexdigest()
if content_hash not in seen:
seen.add(content_hash)
unique.append(ex)
return unique
def fuzzy_dedup(
examples: list[TrainingExample],
similarity_threshold: float = 0.85,
) -> list[TrainingExample]:
"""Remove near-duplicates using sequence similarity."""
unique = []
for ex in examples:
is_duplicate = False
for kept in unique:
sim = SequenceMatcher(
None, ex.user_message, kept.user_message
).ratio()
if sim > similarity_threshold:
is_duplicate = True
break
if not is_duplicate:
unique.append(ex)
return unique
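Run the cheap exact pass first so the quadratic fuzzy pass has fewer candidates to compare:

examples = exact_dedup(examples)
examples = fuzzy_dedup(examples, similarity_threshold=0.85)
print(f"{len(examples)} examples remain after deduplication")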
Diversity Analysis
A good training dataset covers the full range of inputs your model will encounter. Analyze the distribution of topics, lengths, and complexity.
from collections import Counter
def analyze_diversity(examples: list[TrainingExample]) -> dict:
"""Analyze dataset diversity across multiple dimensions."""
user_lengths = [len(ex.user_message.split()) for ex in examples]
assistant_lengths = [len(ex.assistant_response.split()) for ex in examples]
# Simple keyword-based topic detection
topic_keywords = {
"billing": ["invoice", "payment", "charge", "refund", "bill"],
"technical": ["error", "bug", "crash", "install", "update"],
"account": ["password", "login", "account", "profile", "settings"],
}
topic_counts = Counter()
for ex in examples:
text = ex.user_message.lower()
matched = False
        # An example mentioning several topics is counted once per topic
        for topic, keywords in topic_keywords.items():
            if any(kw in text for kw in keywords):
                topic_counts[topic] += 1
                matched = True
if not matched:
topic_counts["other"] += 1
return {
"total_examples": len(examples),
"avg_user_length": sum(user_lengths) / len(user_lengths),
"avg_assistant_length": sum(assistant_lengths) / len(assistant_lengths),
"min_user_length": min(user_lengths),
"max_user_length": max(user_lengths),
"topic_distribution": dict(topic_counts),
}
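A quick usage sketch that prints the report before training, so gaps are visible early (variable names here are illustrative):

report = analyze_diversity(examples)
print(f"Total examples: {report['total_examples']}")
print(f"Avg user / assistant length: "
      f"{report['avg_user_length']:.0f} / {report['avg_assistant_length']:.0f} words")
for topic, count in sorted(report["topic_distribution"].items()):
    print(f"  {topic}: {count} ({count / report['total_examples']:.0%})")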
Building the Final JSONL File
Once your data is collected, cleaned, deduplicated, and analyzed, assemble the final training and validation files.
import json
import random
def build_dataset(
examples: list[TrainingExample],
train_path: str = "train.jsonl",
val_path: str = "val.jsonl",
val_split: float = 0.1,
seed: int = 42,
) -> dict:
"""Split examples and write JSONL files."""
random.seed(seed)
shuffled = examples.copy()
random.shuffle(shuffled)
split_idx = int(len(shuffled) * (1 - val_split))
train = shuffled[:split_idx]
val = shuffled[split_idx:]
for path, data in [(train_path, train), (val_path, val)]:
        with open(path, "w", encoding="utf-8") as f:
            for ex in data:
                f.write(json.dumps(ex.to_jsonl_format(), ensure_ascii=False) + "\n")
return {
"train_count": len(train),
"val_count": len(val),
"train_path": train_path,
"val_path": val_path,
}
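Before uploading, validate that every line parses as JSON and follows the expected chat-message schema. A minimal sketch; exact schema requirements vary by provider, so treat these checks as a baseline rather than a complete validator.

import json

def validate_jsonl(path: str) -> list[str]:
    """Check that each line is valid JSON with well-formed messages."""
    errors = []
    valid_roles = {"system", "user", "assistant"}
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError as e:
                errors.append(f"line {i}: invalid JSON ({e})")
                continue
            messages = record.get("messages")
            if not isinstance(messages, list) or not messages:
                errors.append(f"line {i}: missing or empty 'messages' list")
                continue
            for msg in messages:
                if not isinstance(msg, dict):
                    errors.append(f"line {i}: message is not an object")
                    continue
                if msg.get("role") not in valid_roles:
                    errors.append(f"line {i}: unknown role {msg.get('role')!r}")
                if not msg.get("content"):
                    errors.append(f"line {i}: empty message content")
            if messages[-1].get("role") != "assistant":
                errors.append(f"line {i}: last message must be from assistant")
    return errors

Run it on both the train and validation files and fix every reported line before starting a training job.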
FAQ
How many training examples do I need for a good fine-tuned model?
There is no universal minimum, but practical results follow a pattern. With 50-100 examples you get noticeable formatting and style improvements. With 200-500 examples you get reliable domain-specific behavior. Beyond 1,000 examples, gains diminish unless you are teaching genuinely complex reasoning. Always start small, evaluate, and add more data only where the model is weakest.
Should the system prompt be the same across all training examples?
Keeping a consistent system prompt across all examples is recommended when fine-tuning for a single task. The model learns the association between that system prompt and the expected behavior. If you need the model to handle multiple tasks, you can vary the system prompt — but make sure each variant has enough examples for the model to learn the pattern.
How do I handle imbalanced topic distributions in my dataset?
Undersample over-represented topics and manually create or augment examples for under-represented ones. If 80% of your examples are about billing and 5% are about technical issues, the model will handle billing well but struggle with technical queries. Aim for a distribution that roughly matches your production traffic, with slight oversampling of rare but important categories.
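A minimal undersampling sketch, assuming a topic_of function that labels each example (hypothetical here; the keyword matcher from the diversity section is one way to implement it) and a max_per_topic cap that is an illustrative default:

import random
from collections import defaultdict
from typing import Callable

def undersample_topics(
    examples: list[TrainingExample],
    topic_of: Callable[[TrainingExample], str],  # hypothetical topic labeler
    max_per_topic: int = 500,
    seed: int = 42,
) -> list[TrainingExample]:
    """Randomly cap each topic at max_per_topic examples."""
    random.seed(seed)
    grouped = defaultdict(list)
    for ex in examples:
        grouped[topic_of(ex)].append(ex)
    balanced = []
    for group in grouped.values():
        # Sample down only the topics that exceed the cap
        if len(group) > max_per_topic:
            group = random.sample(group, max_per_topic)
        balanced.extend(group)
    random.shuffle(balanced)
    return balanced

For the under-represented side, pair this with freshly written examples rather than duplicating what you already have.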