Synthetic Data Generation for Fine-Tuning: Using LLMs to Create Training Data
Learn how to use large language models to generate, filter, and validate synthetic training data for fine-tuning smaller models, with techniques for ensuring quality, diversity, and deduplication.
Why Generate Synthetic Training Data
The biggest bottleneck in fine-tuning is not compute or infrastructure — it is high-quality training data. Expert annotation is expensive and slow. Production logs may not cover edge cases. Synthetic data generation uses a capable LLM (the "teacher") to create training examples for a smaller model (the "student").
This approach is used extensively in production. Many of the best open-source models were trained partly on synthetic data generated by larger models. The key is quality control — raw LLM output is not training-ready. It requires filtering, validation, and deduplication.
The Generation Pipeline
A robust synthetic data pipeline has four stages: seed creation, generation, filtering, and deduplication.
```python
from openai import OpenAI
import json
from typing import Optional

client = OpenAI()

def generate_seed_topics(domain: str, count: int = 50) -> list[str]:
    """Generate diverse seed topics for a domain."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "Generate diverse, specific topics. Output one topic per line, no numbering.",
            },
            {
                "role": "user",
                "content": f"List {count} diverse topics for a {domain} assistant. "
                f"Cover common cases, edge cases, and tricky scenarios.",
            },
        ],
        temperature=1.0,  # High temperature for diversity
    )
    topics = [
        line.strip()
        for line in response.choices[0].message.content.strip().split("\n")
        if line.strip()
    ]
    return topics

# Generate seeds
topics = generate_seed_topics("customer support for a SaaS billing platform")
print(f"Generated {len(topics)} seed topics")
```
Generating Training Examples
For each seed topic, generate a complete conversation. Use detailed system prompts to control the format and quality of the output.
```python
GENERATION_PROMPT = """You are generating training data for a customer support AI.

Given a topic, create a realistic customer support interaction.

Requirements:
- The customer message should sound natural, as if written by a real person
- Include relevant details (account numbers, dates, specific issues)
- The assistant response should be helpful, accurate, and follow company policy
- Keep responses concise but complete
- Vary the tone: some customers are frustrated, some are polite, some are confused

Output EXACTLY this JSON format:
{
    "user_message": "the customer's message",
    "assistant_response": "the support agent's response"
}"""
```
```python
def generate_example(
    topic: str,
    system_prompt: str,
    model: str = "gpt-4o",
) -> Optional[dict]:
    """Generate a single training example from a seed topic."""
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": GENERATION_PROMPT},
                {"role": "user", "content": f"Topic: {topic}"},
            ],
            temperature=0.8,
            response_format={"type": "json_object"},
        )
        data = json.loads(response.choices[0].message.content)
        if "user_message" not in data or "assistant_response" not in data:
            return None
        return {
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": data["user_message"]},
                {"role": "assistant", "content": data["assistant_response"]},
            ]
        }
    except (json.JSONDecodeError, KeyError):
        return None

# Generate examples in batch
SYSTEM_PROMPT = "You are a helpful customer support agent for BillingPro, a SaaS billing platform."

def generate_batch(
    topics: list[str],
    system_prompt: str,
    examples_per_topic: int = 3,
) -> list[dict]:
    """Generate multiple examples per topic."""
    all_examples = []
    for topic in topics:
        for _ in range(examples_per_topic):
            example = generate_example(topic, system_prompt)
            if example:
                all_examples.append(example)
    print(f"Generated {len(all_examples)} examples from {len(topics)} topics")
    return all_examples
```
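A batch run makes one API call per example, and some of those calls will fail with transient errors or rate limits. A minimal sketch of a retry wrapper with exponential backoff and jitter (the `with_retries` helper and its parameters are illustrative, not part of the OpenAI SDK):

```python
import random
import time

def with_retries(fn, max_attempts: int = 4, base_delay: float = 1.0):
    """Call fn(), retrying on exception with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the error
            # Delays grow 1x, 2x, 4x... with a little jitter to avoid thundering herds
            time.sleep(base_delay * 2 ** attempt + random.random() * base_delay)

# Example: a call that fails twice, then succeeds
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient error")
    return "ok"

print(with_retries(flaky, base_delay=0.01))  # prints "ok" after two retries
```

In the pipeline, you would wrap the `client.chat.completions.create` call (for example, `with_retries(lambda: generate_example(topic, system_prompt))`) so one failed request does not cost you the whole batch.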
Quality Filtering
Not all generated examples are good enough for training. Filter by length, coherence, and content quality.
```python
def quality_filter(examples: list[dict]) -> list[dict]:
    """Filter examples based on quality heuristics."""
    filtered = []
    for ex in examples:
        messages = ex["messages"]
        user_msg = messages[1]["content"]
        assistant_msg = messages[2]["content"]

        # Length checks
        user_words = len(user_msg.split())
        assistant_words = len(assistant_msg.split())
        if user_words < 5 or user_words > 500:
            continue
        if assistant_words < 10 or assistant_words > 1000:
            continue

        # Content checks: drop refusals
        if assistant_msg.strip().startswith("I'm sorry, I can't"):
            continue

        # Check for placeholder text
        placeholders = ["[insert", "[your", "xxx", "placeholder"]
        if any(p in assistant_msg.lower() for p in placeholders):
            continue

        # Check the assistant actually addresses the user's question
        if len(assistant_msg) < len(user_msg) * 0.3:
            continue

        filtered.append(ex)
    print(f"Quality filter: {len(filtered)}/{len(examples)} passed")
    return filtered
```
Deduplication for Synthetic Data
LLMs tend to generate similar outputs even with different seeds. Aggressive deduplication is essential.
```python
import hashlib
from difflib import SequenceMatcher

def dedup_synthetic(examples: list[dict], threshold: float = 0.80) -> list[dict]:
    """Remove near-duplicate synthetic examples."""
    unique = []
    seen_hashes = set()
    for ex in examples:
        user_msg = ex["messages"][1]["content"]
        assistant_msg = ex["messages"][2]["content"]
        combined = user_msg + assistant_msg

        # Exact dedup via content hash
        content_hash = hashlib.md5(combined.encode()).hexdigest()
        if content_hash in seen_hashes:
            continue
        seen_hashes.add(content_hash)

        # Fuzzy dedup against all kept examples
        # (O(n^2) pairwise comparison -- fine up to a few thousand examples)
        is_dup = False
        for kept in unique:
            kept_combined = kept["messages"][1]["content"] + kept["messages"][2]["content"]
            similarity = SequenceMatcher(None, combined, kept_combined).ratio()
            if similarity > threshold:
                is_dup = True
                break
        if not is_dup:
            unique.append(ex)
    print(f"Dedup: {len(unique)}/{len(examples)} unique")
    return unique
```
Full Pipeline
```python
def synthetic_data_pipeline(
    domain: str,
    system_prompt: str,
    target_count: int = 500,
) -> list[dict]:
    """End-to-end synthetic data generation pipeline."""
    topics = generate_seed_topics(domain, count=target_count // 2)
    raw = generate_batch(topics, system_prompt, examples_per_topic=3)
    cleaned = quality_filter(raw)
    scored = filter_by_score(cleaned, min_score=4.0)  # LLM-as-judge scoring
    final = dedup_synthetic(scored, threshold=0.80)

    # Write to JSONL
    with open("synthetic_training_data.jsonl", "w") as f:
        for ex in final:
            f.write(json.dumps(ex) + "\n")
    return final
```
FAQ
Is it legal and ethical to use LLM-generated data for fine-tuning?
OpenAI's terms allow using their API outputs to train models, including fine-tuning. However, some model licenses restrict using outputs to train competing models. Always check the terms of service for the specific API you use for generation. Ethically, be transparent about synthetic data usage and validate that generated data does not contain harmful biases or fabricated facts.
How do I ensure diversity in synthetic data so the model does not just learn one pattern?
Use three techniques: vary seed topics broadly, use high temperature (0.7-1.0) during generation, and explicitly prompt for different customer personas and scenarios. After generation, analyze the distribution of topics, tones, and response styles. If any category is under-represented, generate additional targeted examples for that category.
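One concrete way to vary personas is to sample one per generation call and pin it in the prompt. A minimal sketch; the persona list here is illustrative and should be tailored to your domain:

```python
import random

PERSONAS = [
    "a frustrated customer whose invoice was double-charged",
    "a polite first-time user confused about proration",
    "a terse finance manager reconciling monthly statements",
    "a non-native English speaker asking about a refund",
]

def persona_prompt(topic: str) -> str:
    """Build a generation prompt that pins both the topic and a random persona."""
    persona = random.choice(PERSONAS)
    return f"Topic: {topic}\nWrite the customer as {persona}."

print(persona_prompt("disputed charge"))
```

Passing this string as the user message (in place of the bare `f"Topic: {topic}"`) forces the teacher model to spread its outputs across tones and customer types instead of converging on one default voice.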
What ratio of synthetic to real data should I use?
Start with 100% synthetic data if you have no real data, then gradually replace synthetic examples with real ones as you collect production data. A common production ratio is 30-50% real data mixed with 50-70% synthetic data. Real data anchors the model to actual user patterns while synthetic data provides coverage for edge cases and rare scenarios.
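The mixing described above can be sketched as a small sampler that caps the synthetic pool so real data hits the target fraction (the 40% default is illustrative):

```python
import random

def mix_datasets(real: list, synthetic: list, real_fraction: float = 0.4) -> list:
    """Mix real and synthetic examples so real data makes up real_fraction of the result."""
    if real:
        # Number of synthetic examples that keeps real at real_fraction of the mix
        max_synth = int(len(real) * (1 - real_fraction) / real_fraction)
        synthetic = random.sample(synthetic, min(max_synth, len(synthetic)))
    mixed = real + synthetic
    random.shuffle(mixed)  # Avoid ordering effects during training
    return mixed

# 40 real examples at a 40% target allows up to 60 synthetic examples
mixed = mix_datasets(list(range(40)), list(range(100, 300)), real_fraction=0.4)
print(len(mixed))  # 100
```

When the synthetic pool is smaller than the cap, the real fraction simply ends up higher than the target, which is usually the direction you want to err in.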
#SyntheticData #FineTuning #DataGeneration #LLM #TrainingData #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.