Why Data Curation for LLM Training Takes So Long: Text, Image, and Video Processing Bottlenecks
Traditional data curation pipelines for LLM training face critical bottlenecks in synthetic data generation, quality filtering, and semantic deduplication across text, image, and video modalities.
Why Traditional Data Curation Is Slow
Building an LLM from scratch requires curating massive datasets — often terabytes of text, millions of images, and thousands of hours of video. Traditional data curation pipelines consistently take longer than expected because they encounter bottlenecks at multiple stages. Understanding these bottlenecks is essential for teams planning LLM development timelines and infrastructure investments.
The core problem is that most curation tools were designed for datasets measured in gigabytes, not terabytes. When these tools are applied to LLM-scale data, they hit scaling limits, run out of memory, or process data so slowly that curation timelines extend from days to weeks.
Text Processing Bottlenecks
The text processing pipeline follows six stages: Data Download, Cleaning and Preprocessing, Synthetic Data Generation, Quality Filtering, Deduplication, and Blending/Shuffling.
Lack of Tooling for Synthetic Data Generation
Most organizations lack efficient, automated frameworks for synthetic data generation. Teams either build custom pipelines from scratch or rely on manual processes that do not scale. Rate limiting from cloud LLM APIs further constrains throughput: generating millions of synthetic samples through API calls can take weeks when limited to thousands of requests per minute.
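To see why API quotas dominate, the wall-clock estimate is simple arithmetic, and client-side pacing is typically done with a token bucket. The sketch below is illustrative, assuming one sample per request; the limiter is generic and not tied to any specific provider's API:

```python
import time

def estimated_days(num_samples: int, requests_per_minute: int) -> float:
    """Back-of-envelope wall-clock days to generate num_samples,
    one sample per API request, under a fixed per-minute quota."""
    return num_samples / requests_per_minute / 60 / 24

class TokenBucket:
    """Client-side pacer that keeps the request rate under a provider quota.
    `rate` is requests per second; `capacity` caps short bursts.
    Hypothetical example, not any real client library."""
    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def acquire(self) -> None:
        # Refill proportionally to elapsed time, then take one token,
        # sleeping just long enough when the bucket is empty.
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return
            time.sleep((1.0 - self.tokens) / self.rate)
```

At 1,000 requests per minute, five million samples already take about three and a half days of continuous generation, before retries or failures.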
Scaling Bottlenecks in Quality Filtering
Quality filtering algorithms that work on 10,000 documents may fail or run unacceptably slowly on 10 billion documents. Many quality classifiers are CPU-bound and cannot leverage GPU acceleration. As datasets grow to terabyte scale, quality filtering becomes the longest single step in the pipeline.
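Many production pipelines run cheap rule-based heuristics before any model-based classifier touches the data, since rules are orders of magnitude faster per document. The thresholds below are assumptions in the spirit of common web-text heuristics, not a published standard:

```python
def passes_quality_heuristics(doc: str) -> bool:
    """Cheap rule-based pre-filter run before any model-based classifier.
    All thresholds here are illustrative assumptions."""
    words = doc.split()
    # Reject documents that are too short or absurdly long.
    if not 50 <= len(words) <= 100_000:
        return False
    # Reject documents whose mean word length falls outside the
    # range typical of natural language.
    mean_len = sum(len(w) for w in words) / len(words)
    if not 3 <= mean_len <= 10:
        return False
    # Reject documents dominated by markup-like symbols.
    symbol_ratio = sum(doc.count(s) for s in ("#", "...")) / len(words)
    return symbol_ratio < 0.1
```

Because these checks are pure string operations, they parallelize trivially across CPU cores, leaving the slower model-based classifiers a much smaller candidate set.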
Deduplication at Scale
Deduplication — identifying and removing duplicate or near-duplicate documents — is computationally expensive because it requires comparing every document against every other document. Naive approaches have quadratic time complexity. Even optimized approaches using MinHash or locality-sensitive hashing require careful tuning to balance speed against deduplication accuracy.
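A minimal MinHash sketch shows how fixed-size signatures approximate Jaccard similarity without comparing full documents pairwise. This is a generic illustration, assuming character shingles and a seeded 64-bit hash; production systems add LSH banding on top to avoid even the signature-level all-pairs comparison:

```python
import hashlib

def shingles(text: str, k: int = 5) -> set:
    """Character k-shingles of a document."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(text: str, num_hashes: int = 64) -> list:
    """One seeded 64-bit hash per signature slot; each slot keeps
    the minimum hash value over the document's shingles."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(),
                                digest_size=8).digest(), "big")
            for s in shingles(text)))
    return sig

def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """The fraction of matching slots estimates Jaccard similarity
    between the two shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

The tuning trade-off mentioned above lives in `num_hashes` and the shingle size `k`: more hashes tighten the similarity estimate but cost proportionally more compute per document.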
Result
Longer curation times and inconsistent quality when preparing text datasets. Teams frequently underestimate the time required by 3-5x because they benchmark on small samples that do not expose scaling bottlenecks.
Image Processing Bottlenecks
The image processing pipeline follows five stages: Data Download, Cleaning and Preprocessing, Quality Filtering, Semantic Deduplication, and Captioning.
Unoptimized Models
Existing models for cleaning, filtering, and captioning images were not designed for large-scale GPU or distributed execution. Most image quality classifiers process one image at a time rather than batching across GPUs. Captioning models generate descriptions sequentially, making it impractical to caption millions of images without distributed infrastructure.
Semantic Deduplication
Finding semantically similar (not just pixel-identical) images is computationally intensive. The process requires generating embeddings for every image and then performing nearest-neighbor search across millions of vectors. This does not scale linearly — doubling the dataset more than doubles the deduplication time due to the increased search space.
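The embedding-plus-threshold idea can be sketched with a brute-force cosine comparison. This is illustrative only: real pipelines replace the O(n²) loop with an approximate nearest-neighbor index, which is exactly where the super-linear scaling cost comes from:

```python
import numpy as np

def semantic_duplicates(embeddings: np.ndarray,
                        threshold: float = 0.95) -> np.ndarray:
    """Flag items whose embedding has cosine similarity above `threshold`
    with an earlier, kept item. Brute-force O(n^2) for clarity."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    n = len(embeddings)
    dup = np.zeros(n, dtype=bool)
    for i in range(n):
        if dup[i]:
            continue  # already marked a duplicate of an earlier item
        for j in range(i + 1, n):
            if sims[i, j] > threshold:
                dup[j] = True
    return dup
```

The threshold is the key knob: too high and near-duplicates slip through; too low and genuinely distinct images are discarded.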
Result
Slower preparation of image-text datasets and reduced throughput. Teams building multimodal models often discover that image curation is the bottleneck, not text curation, because image processing tools are less mature.
Video Processing Bottlenecks
The video processing pipeline follows five stages: Splitting and Transcoding, Quality Filtering, Annotation, Semantic Deduplication, and Dataset Creation.
Unoptimized Models
Quality filtering and annotation models for video use non-parallelized or outdated architectures. Many were designed for real-time inference on single videos rather than batch processing of thousands of videos. Annotation models that label video content (actions, objects, scenes) are particularly slow because they must process multiple frames per video.
Semantic Deduplication Across Frames
Video deduplication is the most resource-intensive curation step across all modalities. Each video contains thousands of frames, and deduplication must consider both spatial similarity (individual frames) and temporal similarity (sequences of frames). This multi-dimensional comparison is extremely compute-heavy and does not parallelize easily.
Result
Long runtimes and high compute costs for building large-scale video datasets. Video curation can take 10-50x longer than text curation for equivalent dataset sizes.
The Root Causes
Three systemic issues cause these bottlenecks across all modalities:
1. Lack of Automated Tooling
Most data curation steps require manual configuration, custom scripts, or tools that were not designed for LLM-scale datasets. There is no unified framework that handles all curation stages from download through blending.
2. Poor Scaling with Dataset Size
Tools that work well on small datasets fail on large ones. This is not a linear degradation — many tools hit memory limits, timeout thresholds, or algorithmic complexity walls that cause catastrophic slowdowns at scale.
3. Inefficient or Unoptimized Models
Models used for quality filtering, classification, captioning, and annotation were often trained for accuracy on benchmarks, not for throughput in production pipelines. They lack GPU optimization, batch processing support, and distributed execution capabilities.
How NeMo Curator Addresses These Bottlenecks
NVIDIA NeMo Curator was built specifically to address these three root causes:
- Automated tooling: Provides end-to-end pipelines for text curation, from download through quality filtering, deduplication, and blending
- GPU-accelerated scaling: Uses RAPIDS and Dask for distributed processing that scales linearly across multiple GPUs and nodes
- Optimized models: Ships with lightweight classifiers (Domain Classifier, Quality Classifier) optimized for high-throughput batch inference
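The hash-partitioning pattern behind that near-linear scaling can be sketched in plain Python. This is a toy stand-in for what a Dask cluster does across worker nodes (exact-match dedup only, with threads substituting for distributed workers):

```python
from concurrent.futures import ThreadPoolExecutor
import hashlib

def partition_key(doc: str, num_partitions: int) -> int:
    """Hash-partition documents so every copy of a duplicate lands in
    the same shard, letting shards be deduplicated independently."""
    return int(hashlib.md5(doc.encode()).hexdigest(), 16) % num_partitions

def dedup_shard(shard: list) -> list:
    """Exact-match dedup within one shard, keeping first occurrences."""
    seen, kept = set(), []
    for doc in shard:
        if doc not in seen:
            seen.add(doc)
            kept.append(doc)
    return kept

def distributed_exact_dedup(docs: list, num_partitions: int = 4) -> list:
    shards = [[] for _ in range(num_partitions)]
    for doc in docs:
        shards[partition_key(doc, num_partitions)].append(doc)
    # Threads stand in for the worker processes/nodes a cluster would use.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        results = pool.map(dedup_shard, shards)
    return [doc for shard in results for doc in shard]
```

Because no shard ever needs to see another shard's documents, adding workers divides the per-worker load, which is the property that lets GPU-accelerated frameworks scale across nodes.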
Teams using NeMo Curator report 5-10x faster curation timelines compared to custom pipelines, with more consistent quality outcomes.
Frequently Asked Questions
Why does LLM data curation take so long?
Data curation for LLMs is slow because traditional tools were designed for gigabyte-scale datasets, not the terabyte-scale datasets that LLMs require. Three systemic bottlenecks — lack of automated tooling, poor scaling with dataset size, and unoptimized models — compound to extend curation timelines from days to weeks across text, image, and video processing.
What is the hardest part of data curation for LLMs?
Deduplication is typically the hardest and most time-consuming step. It requires comparing every document or image against every other one, creating quadratic time complexity in naive implementations. Semantic deduplication (finding near-duplicates rather than exact copies) is particularly challenging because it requires embedding generation and nearest-neighbor search at scale.
How does NVIDIA NeMo Curator speed up data curation?
NeMo Curator uses GPU-accelerated processing through NVIDIA RAPIDS and Dask for distributed computation. It provides end-to-end pipelines with optimized classifier models that process terabytes of data in hours rather than weeks. Linear scaling across multiple GPUs means that adding more hardware proportionally reduces processing time.
Can you curate multimodal data (text, images, video) in one pipeline?
Currently, most curation pipelines handle each modality separately because the processing steps and tools differ significantly. Text curation focuses on quality filtering and deduplication; image curation adds captioning and semantic deduplication; video curation adds frame splitting and temporal analysis. NeMo Curator primarily handles text, with expanding support for multimodal pipelines.
How much data is needed to train an LLM from scratch?
Training an LLM from scratch typically requires 1-15 trillion tokens of curated text, depending on model size. Curating this volume of data from raw web crawls involves downloading 5-10x more data than the final training set, then filtering, deduplicating, and balancing to produce the final blend. This curation process is why data preparation often takes longer than model training itself.
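The raw-download volume implied by those figures is simple to estimate. In the sketch below, the ~4 bytes per token and the 7.5x midpoint of the 5-10x overcollection range are both illustrative assumptions, not measured constants:

```python
def raw_data_estimate_tb(target_tokens: float,
                         bytes_per_token: float = 4.0,
                         overcollect_factor: float = 7.5) -> float:
    """Rough terabytes of raw data to download before filtering.
    bytes_per_token (~4 for English web text) and the 7.5x midpoint
    of the 5-10x overcollection range are illustrative assumptions."""
    return target_tokens * bytes_per_token * overcollect_factor / 1e12
```

Under these assumptions, a 10-trillion-token target implies roughly 300 TB of raw downloads before any filtering begins.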