Synthetic Data Generation for Fine-Tuning: Using LLMs to Create Training Data
Learn how to use large language models to generate, filter, and validate synthetic training data for fine-tuning smaller models, with techniques for ensuring quality, diversity, and deduplication.
Why Generate Synthetic Training Data
The biggest bottleneck in fine-tuning is not compute or infrastructure — it is high-quality training data. Expert annotation is expensive and slow. Production logs may not cover edge cases. Synthetic data generation uses a capable LLM (the "teacher") to create training examples for a smaller model (the "student").
This approach is used extensively in production. Many of the best open-source models were trained partly on synthetic data generated by larger models. The key is quality control — raw LLM output is not training-ready. It requires filtering, validation, and deduplication.
The Generation Pipeline
A robust synthetic data pipeline has four stages: seed creation, generation, filtering, and deduplication.
```python
from openai import OpenAI
import json
from typing import Optional

client = OpenAI()

def generate_seed_topics(domain: str, count: int = 50) -> list[str]:
    """Generate diverse seed topics for a domain."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "Generate diverse, specific topics. Output one topic per line, no numbering.",
            },
            {
                "role": "user",
                "content": f"List {count} diverse topics for a {domain} assistant. "
                f"Cover common cases, edge cases, and tricky scenarios.",
            },
        ],
        temperature=1.0,  # High temperature for diversity
    )
    topics = [
        line.strip()
        for line in response.choices[0].message.content.strip().split("\n")
        if line.strip()
    ]
    return topics

# Generate seeds
topics = generate_seed_topics("customer support for a SaaS billing platform")
print(f"Generated {len(topics)} seed topics")
```
Generating Training Examples
For each seed topic, generate a complete conversation. Use detailed system prompts to control the format and quality of the output.
```python
GENERATION_PROMPT = """You are generating training data for a customer support AI.

Given a topic, create a realistic customer support interaction.

Requirements:
- The customer message should sound natural, as if written by a real person
- Include relevant details (account numbers, dates, specific issues)
- The assistant response should be helpful, accurate, and follow company policy
- Keep responses concise but complete
- Vary the tone: some customers are frustrated, some are polite, some are confused

Output EXACTLY this JSON format:
{
    "user_message": "the customer's message",
    "assistant_response": "the support agent's response"
}"""
```
```python
def generate_example(
    topic: str,
    system_prompt: str,
    model: str = "gpt-4o",
) -> Optional[dict]:
    """Generate a single training example from a seed topic."""
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": GENERATION_PROMPT},
                {"role": "user", "content": f"Topic: {topic}"},
            ],
            temperature=0.8,
            response_format={"type": "json_object"},
        )
        data = json.loads(response.choices[0].message.content)
        if "user_message" not in data or "assistant_response" not in data:
            return None
        return {
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": data["user_message"]},
                {"role": "assistant", "content": data["assistant_response"]},
            ]
        }
    except (json.JSONDecodeError, KeyError):
        return None

# Generate examples in batch
SYSTEM_PROMPT = "You are a helpful customer support agent for BillingPro, a SaaS billing platform."

def generate_batch(
    topics: list[str],
    system_prompt: str,
    examples_per_topic: int = 3,
) -> list[dict]:
    """Generate multiple examples per topic."""
    all_examples = []
    for topic in topics:
        for _ in range(examples_per_topic):
            example = generate_example(topic, system_prompt)
            if example:
                all_examples.append(example)
    print(f"Generated {len(all_examples)} examples from {len(topics)} topics")
    return all_examples
```
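A batch run makes one API call per example, and some of those calls will fail with transient errors or rate limits. A minimal sketch of a retry wrapper with exponential backoff and jitter (the `with_retries` helper and its parameters are illustrative, not part of the OpenAI SDK):

```python
import random
import time

def with_retries(fn, max_attempts: int = 4, base_delay: float = 1.0):
    """Call fn(), retrying on exception with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the error
            # Delays grow 1x, 2x, 4x... with a little jitter to avoid thundering herds
            time.sleep(base_delay * 2 ** attempt + random.random() * base_delay)

# Example: a call that fails twice, then succeeds
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient error")
    return "ok"

print(with_retries(flaky, base_delay=0.01))  # prints "ok" after two retries
```

In the pipeline, you would wrap the `client.chat.completions.create` call (for example, `with_retries(lambda: generate_example(topic, system_prompt))`) so one failed request does not cost you the whole batch.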
Quality Filtering
Not all generated examples are good enough for training. Filter by length, coherence, and content quality.
```python
def quality_filter(examples: list[dict]) -> list[dict]:
    """Filter examples based on quality heuristics."""
    filtered = []
    for ex in examples:
        messages = ex["messages"]
        user_msg = messages[1]["content"]
        assistant_msg = messages[2]["content"]

        # Length checks
        user_words = len(user_msg.split())
        assistant_words = len(assistant_msg.split())
        if user_words < 5 or user_words > 500:
            continue
        if assistant_words < 10 or assistant_words > 1000:
            continue

        # Content checks: drop refusals
        if assistant_msg.strip().startswith("I'm sorry, I can't"):
            continue

        # Check for placeholder text
        placeholders = ["[insert", "[your", "xxx", "placeholder"]
        if any(p in assistant_msg.lower() for p in placeholders):
            continue

        # Check the assistant actually addresses the user's question
        if len(assistant_msg) < len(user_msg) * 0.3:
            continue

        filtered.append(ex)
    print(f"Quality filter: {len(filtered)}/{len(examples)} passed")
    return filtered
```
Deduplication for Synthetic Data
LLMs tend to generate similar outputs even with different seeds. Aggressive deduplication is essential.
```python
import hashlib
from difflib import SequenceMatcher

def dedup_synthetic(examples: list[dict], threshold: float = 0.80) -> list[dict]:
    """Remove near-duplicate synthetic examples."""
    unique = []
    seen_hashes = set()
    for ex in examples:
        user_msg = ex["messages"][1]["content"]
        assistant_msg = ex["messages"][2]["content"]
        combined = user_msg + assistant_msg

        # Exact dedup via content hash
        content_hash = hashlib.md5(combined.encode()).hexdigest()
        if content_hash in seen_hashes:
            continue
        seen_hashes.add(content_hash)

        # Fuzzy dedup against all kept examples
        # (O(n^2) pairwise comparison -- fine up to a few thousand examples)
        is_dup = False
        for kept in unique:
            kept_combined = kept["messages"][1]["content"] + kept["messages"][2]["content"]
            similarity = SequenceMatcher(None, combined, kept_combined).ratio()
            if similarity > threshold:
                is_dup = True
                break
        if not is_dup:
            unique.append(ex)
    print(f"Dedup: {len(unique)}/{len(examples)} unique")
    return unique
```
Full Pipeline
```python
def synthetic_data_pipeline(
    domain: str,
    system_prompt: str,
    target_count: int = 500,
) -> list[dict]:
    """End-to-end synthetic data generation pipeline."""
    topics = generate_seed_topics(domain, count=target_count // 2)
    raw = generate_batch(topics, system_prompt, examples_per_topic=3)
    cleaned = quality_filter(raw)
    scored = filter_by_score(cleaned, min_score=4.0)  # LLM-as-judge scoring
    final = dedup_synthetic(scored, threshold=0.80)

    # Write to JSONL
    with open("synthetic_training_data.jsonl", "w") as f:
        for ex in final:
            f.write(json.dumps(ex) + "\n")
    return final
```
FAQ
Is it legal and ethical to use LLM-generated data for fine-tuning?
OpenAI's terms allow using their API outputs to train models, including fine-tuning. However, some model licenses restrict using outputs to train competing models. Always check the terms of service for the specific API you use for generation. Ethically, be transparent about synthetic data usage and validate that generated data does not contain harmful biases or fabricated facts.
How do I ensure diversity in synthetic data so the model does not just learn one pattern?
Use three techniques: vary seed topics broadly, use high temperature (0.7-1.0) during generation, and explicitly prompt for different customer personas and scenarios. After generation, analyze the distribution of topics, tones, and response styles. If any category is under-represented, generate additional targeted examples for that category.
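One concrete way to vary personas is to sample one per generation call and pin it in the prompt. A minimal sketch; the persona list here is illustrative and should be tailored to your domain:

```python
import random

PERSONAS = [
    "a frustrated customer whose invoice was double-charged",
    "a polite first-time user confused about proration",
    "a terse finance manager reconciling monthly statements",
    "a non-native English speaker asking about a refund",
]

def persona_prompt(topic: str) -> str:
    """Build a generation prompt that pins both the topic and a random persona."""
    persona = random.choice(PERSONAS)
    return f"Topic: {topic}\nWrite the customer as {persona}."

print(persona_prompt("disputed charge"))
```

Passing this string as the user message (in place of the bare `f"Topic: {topic}"`) forces the teacher model to spread its outputs across tones and customer types instead of converging on one default voice.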
What ratio of synthetic to real data should I use?
Start with 100% synthetic data if you have no real data, then gradually replace synthetic examples with real ones as you collect production data. A common production ratio is 30-50% real data mixed with 50-70% synthetic data. Real data anchors the model to actual user patterns while synthetic data provides coverage for edge cases and rare scenarios.
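The mixing described above can be sketched as a small sampler that caps the synthetic pool so real data hits the target fraction (the 40% default is illustrative):

```python
import random

def mix_datasets(real: list, synthetic: list, real_fraction: float = 0.4) -> list:
    """Mix real and synthetic examples so real data makes up real_fraction of the result."""
    if real:
        # Number of synthetic examples that keeps real at real_fraction of the mix
        max_synth = int(len(real) * (1 - real_fraction) / real_fraction)
        synthetic = random.sample(synthetic, min(max_synth, len(synthetic)))
    mixed = real + synthetic
    random.shuffle(mixed)  # Avoid ordering effects during training
    return mixed

# 40 real examples at a 40% target allows up to 60 synthetic examples
mixed = mix_datasets(list(range(40)), list(range(100, 300)), real_fraction=0.4)
print(len(mixed))  # 100
```

When the synthetic pool is smaller than the cap, the real fraction simply ends up higher than the target, which is usually the direction you want to err in.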
#SyntheticData #FineTuning #DataGeneration #LLM #TrainingData #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.