
How to Create Synthetic Data for LLM Training with NeMo Curator: Pipelines and APIs

NeMo Curator provides GPU-accelerated synthetic data generation pipelines for LLM training. Learn the Open QA, Writing, Math, and Coding pipelines with practical examples.

Why Generate Synthetic Data for LLM Training?

Synthetic data generation addresses a fundamental challenge in LLM development: high-quality training data is expensive, time-consuming, and difficult to obtain at scale. Manually curated datasets take months to build, and publicly available data often lacks the quality, diversity, or domain specificity that production models require.

NVIDIA NeMo Curator provides tools for synthetic data generation useful in pretraining, fine-tuning, and evaluation of large language models. Synthetically generated data is particularly valuable for adapting LLMs to low-resource languages or domains, and for performing knowledge distillation from larger models into smaller, more efficient ones.

Connecting to LLM Services

NeMo Curator supports two primary approaches for connecting to the LLM that generates synthetic data:

OpenAI API Compatible Services

NeMo Curator integrates with any OpenAI API-compatible service, including NVIDIA's build.nvidia.com endpoints. You initialize an OpenAI-compatible client and query models with standard parameters like temperature, top_p, and max_tokens. This is the simplest setup for getting started.
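The wire format involved is the standard OpenAI chat-completions convention, which can be sketched with nothing but the standard library. The base URL below points at build.nvidia.com's compatible endpoint as an illustration, and `build_chat_request`/`query_model` are hypothetical helper names, not NeMo Curator APIs:

```python
import json
from urllib import request

# Illustrative endpoint; any OpenAI API-compatible service works the same way.
BASE_URL = "https://integrate.api.nvidia.com/v1"

def build_chat_request(model, messages, temperature=0.2, top_p=0.7, max_tokens=1024):
    """Build the JSON body for an OpenAI-compatible /chat/completions call."""
    return {
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "top_p": top_p,
        "max_tokens": max_tokens,
    }

def query_model(api_key, model, messages, **sampling):
    """POST the request and return the first completion's text."""
    body = json.dumps(build_chat_request(model, messages, **sampling)).encode()
    req = request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

In practice you would use an OpenAI SDK client (or NeMo Curator's wrapper around one) rather than raw HTTP, but the payload fields are the same.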

Self-Hosted Inference with NeMo Deploy

For organizations generating large volumes of synthetic data, self-hosted deployment avoids rate limiting issues that occur with cloud APIs. Deploy models locally using NeMo's Export and Deploy module, then point NeMo Curator at your local endpoint. Self-hosted inference requires explicit conversation formatting using formatters like MixtralFormatter, whereas cloud APIs handle formatting automatically on the backend.
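What "explicit conversation formatting" means can be sketched as a pure function that flattens a role-tagged conversation into a single prompt string. The template below roughly follows Mixtral's instruction format; it is an illustration of what a formatter like MixtralFormatter does, not NeMo Curator's actual implementation:

```python
def format_mixtral(conversation):
    """Flatten [{'role', 'content'}, ...] turns into one Mixtral-style prompt,
    ending after a user turn so the model continues as the assistant."""
    prompt = "<s> [INST] "
    for turn in conversation:
        if turn["role"] == "user":
            prompt += turn["content"] + " [/INST]"
        else:  # assistant turn: close it and open the next instruction block
            prompt += " " + turn["content"] + "</s>[INST] "
    return prompt
```

Cloud APIs apply an equivalent template server-side, which is why this step only appears in self-hosted setups.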

The Five Synthetic Data Pipelines

NeMo Curator's NemotronGenerator class encapsulates five distinct pipelines, originally developed for Nemotron-4 340B training data generation.

1. Open QA Pipeline

Generates general knowledge question-answer pairs through a four-step process:

Step 1: Macro Topic Generation. The system generates broad topics about the world, such as "Climate Change and Sustainable Living" or "Quantum Computing Fundamentals."

Step 2: Subtopic Generation. Each macro topic is expanded into specific subtopics. "Climate Change" might produce subtopics like "Carbon Capture Technologies" or "Ocean Acidification Impacts."

Step 3: Question Creation. Questions are generated relating to each subtopic, ensuring coverage across different angles and difficulty levels.

Step 4: Question Revision. Generated questions are revised for greater detail and specificity, transforming generic questions into ones that require deeper reasoning.

The pipeline accepts parameters for n_macro_topics, n_subtopics, n_openlines, and n_revisions, giving precise control over dataset size and diversity.
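The four steps above can be sketched as nested loops driven by a caller-supplied `llm(prompt) -> list[str]` callable. The prompt wording and function name here are illustrative, not NeMo Curator's actual templates or API:

```python
def open_qa_pipeline(llm, n_macro_topics, n_subtopics, n_openlines, n_revisions):
    """Sketch of the four-step Open QA flow described above."""
    questions = []
    # Step 1: broad topics about the world
    for topic in llm(f"List {n_macro_topics} broad topics about the world."):
        # Step 2: expand each macro topic into specific subtopics
        for sub in llm(f"List {n_subtopics} specific subtopics of '{topic}'."):
            # Step 3: draft questions for each subtopic
            for draft in llm(f"Write {n_openlines} questions about '{sub}'."):
                # Step 4: revise each draft for detail and specificity
                questions += llm(
                    f"Write {n_revisions} more detailed revisions of: {draft}"
                )
    return questions
```

Note how the four size parameters multiply: the final dataset has roughly n_macro_topics × n_subtopics × n_openlines × n_revisions questions.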

2. Writing Pipeline

Generates diverse writing prompts across formats including emails, essays, poems, technical documentation, and creative fiction. The two-step process generates writing tasks about specified topics, then revises them for greater detail and specificity. Example output: "Write a poem about the most effective sources of renewable energy, focusing on solar and wind energy adoption in developing countries."
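The two-step flow can be sketched the same way, crossing topics with text formats and then revising each draft task. Again, the prompt wording is illustrative:

```python
def writing_pipeline(llm, topics, text_material_types, n_openlines):
    """Sketch of the two-step writing flow: draft tasks per (topic, format),
    then revise each for detail. llm(prompt) -> list[str]."""
    revised = []
    for topic in topics:
        for fmt in text_material_types:
            # Step 1: draft writing tasks for this topic/format pair
            drafts = llm(f"Write {n_openlines} prompts asking for a {fmt} about {topic}.")
            # Step 2: revise each task for greater detail and specificity
            for task in drafts:
                revised += llm(f"Rewrite with more detail and specificity: {task}")
    return revised
```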

3. Closed QA Pipeline

The simplest pipeline, requiring only one step: generating questions about provided documents. This is essential for building retrieval-augmented generation (RAG) evaluation datasets. The pipeline returns tuples pairing each question with its source document index, enabling traceability from generated question back to source material.
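The index-pairing behavior is the key detail, and is easy to sketch; the prompt string below is illustrative:

```python
def closed_qa_pipeline(llm, documents, n_openlines):
    """Generate questions about each document, keeping (doc_index, question)
    tuples so every question traces back to its source document."""
    pairs = []
    for idx, doc in enumerate(documents):
        questions = llm(f"Write {n_openlines} questions answerable from:\n{doc}")
        pairs.extend((idx, q) for q in questions)
    return pairs
```

For RAG evaluation, the index lets you check whether retrieval actually surfaces the document each question was generated from.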

4. Math Pipeline

Generates mathematical problems targeted at specific educational levels (elementary, middle school, university). The three-step process generates macro topics, subtopics, and then math problems for each combination. This produces structured datasets for mathematical reasoning evaluation and training.
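The three-step structure mirrors Open QA minus the revision step, with the school level threaded through each prompt. As before, this is a sketch with illustrative prompt wording:

```python
def math_pipeline(llm, n_macro_topics, n_subtopics, n_openlines, school_level):
    """Sketch of the three-step math flow: level-appropriate macro topics,
    subtopics, then problems per subtopic. llm(prompt) -> list[str]."""
    problems = []
    for topic in llm(f"List {n_macro_topics} math topics for {school_level} students."):
        for sub in llm(f"List {n_subtopics} subtopics of {topic}."):
            problems += llm(f"Write {n_openlines} {school_level} problems about {sub}.")
    return problems
```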

5. Coding Pipeline

Mirrors the math approach but focused on Python programming problems. The pipeline supports both beginner and advanced difficulty levels through swappable prompt templates, enabling generation of coding challenges at appropriate complexity levels.
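"Swappable prompt templates" amounts to selecting a different template string per difficulty level before filling it. The templates below are hypothetical stand-ins for NeMo Curator's prebuilt ones:

```python
# Hypothetical templates; NeMo Curator ships its own prebuilt prompt strings.
PYTHON_PROBLEM_TEMPLATES = {
    "beginner": ("Write {n} beginner Python exercises about {topic}. "
                 "Each should need only basic syntax and the standard library."),
    "advanced": ("Write {n} advanced Python challenges about {topic}, "
                 "involving data structures, edge cases, and performance."),
}

def coding_prompt(topic, n, difficulty="beginner"):
    """Select and fill the template for the requested difficulty level."""
    return PYTHON_PROBLEM_TEMPLATES[difficulty].format(n=n, topic=topic)
```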

Scoring with Reward Models

NeMo Curator can query reward models to score the quality of generated synthetic data. The Nemotron-4 340B reward model evaluates conversations across five quality dimensions:

  • Helpfulness: How well the response addresses the user's need
  • Correctness: Factual accuracy of the information
  • Coherence: Logical flow and clarity of the response
  • Complexity: Depth and sophistication of the content
  • Verbosity: Appropriate level of detail

Reward model scoring enables automated quality filtering, keeping only synthetic samples that meet quality thresholds across all dimensions.
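The thresholding logic itself is simple; a minimal sketch, assuming each sample carries a dict of per-dimension scores from the reward model:

```python
def passes_thresholds(scores, thresholds):
    """True only if every required dimension meets its minimum score."""
    return all(scores.get(dim, 0.0) >= minimum for dim, minimum in thresholds.items())

def filter_by_reward(samples, thresholds):
    """Keep samples whose reward-model scores clear every threshold.
    Each sample: {'text': ..., 'scores': {dimension: value}}."""
    return [s for s in samples if passes_thresholds(s["scores"], thresholds)]
```

In practice you might weight dimensions differently, for example tolerating low verbosity scores while being strict about correctness.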

Dialogue and Multi-Turn Generation

Dialogue Generation

The generate_dialogue method enables an LLM to play both the user and assistant roles in a conversation. The n_user_turns parameter specifies the number of user turns, each followed by an assistant turn, producing conversations of 2 × n_user_turns messages. A special prompt template helps the model impersonate users realistically by providing conversation-history context.

Two-Turn Preference Data

Two-turn prompts generate preference data containing three turns: initial user request, assistant response, and follow-up user request. This format is essential for training models with Direct Preference Optimization (DPO) and Reinforcement Learning from Human Feedback (RLHF).

Prompt Template Customization

Every pipeline step uses a prompt template populated with parameters. Users can access prebuilt templates from NeMo Curator, swap templates for different difficulty levels, or supply entirely custom templates with additional placeholders. This flexibility allows adapting synthetic data generation to domain-specific requirements.
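A custom template is just a format string with extra placeholders, filled with matching keyword arguments at generation time. Both templates below are hypothetical examples, not NeMo Curator's prebuilt ones:

```python
# A default-style template with the standard placeholders...
DEFAULT_TEMPLATE = "Write {n_openlines} questions about {topic}."

# ...and a custom variant adding an {audience} placeholder for domain targeting.
CUSTOM_TEMPLATE = ("Write {n_openlines} questions about {topic} "
                   "suitable for {audience}, citing real-world examples.")

def fill(template, **params):
    """Populate a prompt template; raises KeyError if a placeholder is missing."""
    return template.format(**params)
```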

Integration with NeMo Curator Data Processing

Synthetic data generation operates independently of Dask, since synthetic datasets are typically hundreds of thousands of samples versus the billions handled by NeMo Curator's other modules. Users transition between workflows using DocumentDataset.from_pandas() and DocumentDataset.to_pandas(), enabling seamless movement from generation into quality filtering, deduplication, and other NeMo Curator processing stages.

Frequently Asked Questions

What is synthetic data generation for LLM training?

Synthetic data generation uses existing LLMs to create new training samples programmatically. Instead of manually collecting and labeling data, you use models to generate question-answer pairs, writing prompts, coding challenges, and dialogue conversations at scale. NeMo Curator provides GPU-accelerated pipelines that automate this process across five distinct data types.

How does NeMo Curator generate synthetic data?

NeMo Curator uses five specialized pipelines: Open QA (multi-step topic expansion to questions), Writing (writing prompts across formats), Closed QA (questions from documents), Math (educational math problems), and Coding (Python programming challenges). Each pipeline connects to an LLM service (cloud API or self-hosted) and uses customizable prompt templates to control output quality and diversity.

Can I use custom models for synthetic data generation?

Yes. NeMo Curator supports any OpenAI API-compatible service and self-hosted models via NeMo Deploy. You can use NVIDIA models through build.nvidia.com, OpenAI models, or open-source models deployed locally. For large-scale generation, self-hosted deployment avoids rate limiting and reduces per-token costs.

How do you ensure synthetic data quality?

Quality is ensured through reward model scoring. The Nemotron-4 340B reward model evaluates generated data across helpfulness, correctness, coherence, complexity, and verbosity. Samples below quality thresholds are filtered out. Additionally, generated questions go through revision steps that improve specificity and depth before inclusion in the final dataset.

What is the difference between synthetic data for pretraining and fine-tuning?

Pretraining synthetic data focuses on broad coverage across topics and formats to build general knowledge. Fine-tuning synthetic data targets specific domains, task types, or instruction-following patterns. NeMo Curator's pipelines support both use cases through customizable topic selection, difficulty levels, and output formats.
