
What Is the Best Data Format for Fine-Tuning LLMs? A Complete JSONL Guide

JSONL is the standard data format for LLM fine-tuning. Learn why JSON Lines works best, how NeMo Curator processes raw data into JSONL, and best practices for training datasets.

Why Data Format Matters for LLM Fine-Tuning

Before a large language model can learn from your data, that data needs to be in a format the training pipeline can efficiently process. The wrong format creates bottlenecks, wastes compute, and introduces errors. The right format enables scalable, parallel, distributed processing across GPU clusters.

The industry standard for LLM fine-tuning data is JSONL (JSON Lines) — a lightweight, line-delimited format where each line contains a separate, self-contained JSON object.

What Is JSONL?

JSONL (also called JSON Lines or newline-delimited JSON) is a text format where each line is a valid JSON object. Unlike standard JSON, which wraps everything in a single array or object, JSONL treats each line independently.

Example JSONL for instruction fine-tuning:

{"instruction": "Summarize the key benefits of RAG.", "response": "RAG combines retrieval with generation to reduce hallucinations, ground responses in source documents, and enable knowledge updates without retraining."}
{"instruction": "What is LoRA fine-tuning?", "response": "LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that trains small adapter matrices instead of updating all model weights, reducing compute and memory requirements by 10-100x."}

Each line is a complete training example. No commas between lines. No wrapping array. This simplicity is what makes JSONL powerful at scale.
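Producing this format from Python takes one `json.dumps` call per record; a minimal sketch using only the standard library, with field names matching the example above:

```python
import json

examples = [
    {"instruction": "Summarize the key benefits of RAG.",
     "response": "RAG grounds generation in retrieved source documents."},
    {"instruction": "What is LoRA fine-tuning?",
     "response": "LoRA trains small low-rank adapter matrices instead of all weights."},
]

# One JSON object per line, no wrapping array, no commas between lines.
with open("train.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        # ensure_ascii=False keeps non-ASCII text readable in the file
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")
```

Note that `json.dumps` handles all escaping (quotes, newlines inside fields) automatically, which is why no manual formatting rules are needed.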

Why JSONL Is the Standard for LLM Training

1. Streaming and Parallel Processing

Because each line is independent, JSONL files can be processed line by line without loading the entire file into memory. This enables streaming processing of terabyte-scale datasets and parallel ingestion across distributed GPU clusters.
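Line independence is what makes constant-memory streaming possible; a sketch of a generator that yields one parsed example at a time without ever holding the whole file:

```python
import json

def stream_jsonl(path):
    """Yield one parsed example per line; memory use stays constant
    regardless of file size, because only one line is held at a time."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate trailing blank lines
                yield json.loads(line)

# Usage sketch: iterate a terabyte-scale file in constant memory.
# for example in stream_jsonl("train.jsonl"):
#     process(example["instruction"])
```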

2. Easy Splitting and Sharding

JSONL files can be split at any line boundary without breaking the format. This makes it trivial to shard datasets across multiple training nodes or to create train/validation/test splits.
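Because every line boundary is a safe split point, train/validation/test splits need only a few lines of standard-library code (the split ratios and output file names here are illustrative):

```python
import random

def split_jsonl(path, val_frac=0.05, test_frac=0.05, seed=42):
    """Shuffle lines and write train/val/test shards.
    Any line boundary is a valid split point, so no parsing is needed."""
    with open(path, encoding="utf-8") as f:
        lines = [ln for ln in f if ln.strip()]
    random.Random(seed).shuffle(lines)  # seeded for reproducible splits
    n = len(lines)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    splits = {
        "test.jsonl": lines[:n_test],
        "val.jsonl": lines[n_test:n_test + n_val],
        "train.jsonl": lines[n_test + n_val:],
    }
    for name, chunk in splits.items():
        with open(name, "w", encoding="utf-8") as out:
            out.writelines(chunk)
    return {name: len(chunk) for name, chunk in splits.items()}
```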

3. Framework Compatibility

Every major LLM training framework — Hugging Face Transformers, NVIDIA NeMo, DeepSpeed, Megatron-LM — natively supports JSONL input. It is also directly compatible with data processing tools like RAPIDS cuDF for GPU-accelerated data manipulation.

4. Human Readable and Debuggable

Unlike binary formats, JSONL is human-readable. You can inspect, debug, and validate individual training examples with standard text tools — grep, head, jq, or any text editor.

The NeMo Curator Processing Pipeline

NVIDIA's NeMo Curator provides a production-grade pipeline for converting raw data from diverse sources into clean, training-ready JSONL files. The pipeline follows five stages:

Stage 1: Input — URLs or File Paths

The pipeline begins with pointers to raw data sources — web URLs, local file paths, or cloud storage locations. Sources can include HTML pages, PDFs, XML documents, plain text files, or any other structured or unstructured format.

Stage 2: Download — Parallel Retrieval

Files are downloaded in parallel across multiple workers. For web sources, this includes handling rate limiting, retries, and deduplication of URLs. For local sources, files are read from disk with efficient I/O scheduling.

Stage 3: Load — Memory-Efficient Preparation

Downloaded files are loaded into memory-efficient data structures. For large-scale datasets, this uses Dask DataFrames backed by GPU-accelerated cuDF, enabling processing of datasets that exceed available RAM.

Stage 4: Extract — Format Conversion

This is the critical transformation step. Raw formats are converted into clean text:

  • HTML: Boilerplate removal, tag stripping, content extraction
  • PDF: Text extraction with layout-aware parsing
  • XML: Tag parsing and content flattening
  • Custom formats: User-defined extraction functions for proprietary data types

Stage 5: Output — Clean JSONL

The extracted text is written as JSONL files, ready for downstream processing (deduplication, quality filtering, classification) and ultimately for model training.

The entire pipeline is parallelized and distributed, configurable through YAML configuration files, and supports custom extraction functions for specialized data types.
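To make stages 4 and 5 concrete, here is a minimal standard-library sketch of HTML-to-JSONL conversion. This is not the NeMo Curator API — Curator's production extractors do far more boilerplate removal and run distributed — but it shows the shape of the transformation:

```python
import json
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Minimal HTML-to-text extractor. Illustrative only: it skips a few
    boilerplate tags, where production extractors use content heuristics."""
    SKIP = {"script", "style", "nav", "footer"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # >0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def html_to_jsonl_record(html, source_url):
    """Stage 4 (extract clean text) + stage 5 (emit one JSONL line)."""
    parser = TextExtractor()
    parser.feed(html)
    record = {"text": " ".join(parser.chunks), "source": source_url}
    return json.dumps(record, ensure_ascii=False)
```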

Best Practices for JSONL Training Data

  • One example per line. Never split a training example across multiple lines.
  • Consistent schema. Use the same field names across all examples (e.g., always "instruction" and "response", not sometimes "prompt" and "completion").
  • UTF-8 encoding. Always use UTF-8 to avoid character encoding issues across languages.
  • Validate before training. Run a JSON validator across every line before starting training — a single malformed line can crash the entire pipeline.
  • Include metadata fields. Add fields like "source", "domain", and "quality_score" for filtering and analysis during data curation.
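The validation practice above is easy to automate before every training run; a minimal checker that reports malformed JSON and schema drift (the required field names follow this article's examples — adjust them to your schema):

```python
import json

REQUIRED_FIELDS = {"instruction", "response"}  # match your dataset's schema

def validate_jsonl(path):
    """Return a list of (line_number, error) problems; empty means clean."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            if not line.strip():
                continue
            try:
                obj = json.loads(line)
            except json.JSONDecodeError as e:
                problems.append((i, f"invalid JSON: {e}"))
                continue
            missing = REQUIRED_FIELDS - obj.keys()
            if missing:
                problems.append((i, f"missing fields: {sorted(missing)}"))
    return problems
```

Running this before training turns "a single malformed line crashes the pipeline" into a line-numbered report you can fix in seconds.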

Frequently Asked Questions

Why is JSONL better than CSV for LLM fine-tuning?

JSONL handles nested structures, multi-line text, and special characters naturally, while CSV requires complex escaping rules that frequently break with real-world text data. JSONL also supports arbitrary fields per record and is natively compatible with all major LLM training frameworks. CSV is better suited for simple tabular data, not instruction-response pairs with long-form text.
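The escaping difference is easy to demonstrate: a response containing newlines, quotes, and commas round-trips through JSON with no special handling, exactly the text that routinely breaks naive CSV tooling:

```python
import json

# Real-world training text: embedded newline, quotes, and a comma.
messy = {"instruction": "Explain CSV pitfalls.",
         "response": 'Line one,\nthen "quoted" text, and a comma: all fine in JSON.'}

line = json.dumps(messy)          # one self-contained JSONL line
assert json.loads(line) == messy  # exact round-trip, no escaping rules to tune
```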

What fields should a JSONL fine-tuning file contain?

For instruction fine-tuning, the minimum fields are "instruction" (the user prompt) and "response" (the target model output). For chat fine-tuning, use a "messages" array with role/content objects. Optional but recommended fields include "system" (system prompt), "source" (data provenance), and metadata fields for filtering.
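For the chat case, each JSONL line holds a full conversation; a sketch of one such record using the commonly seen role/content structure (exact field names vary slightly between frameworks, so check your framework's data spec):

```python
import json

chat_example = {
    "system": "You are a concise technical assistant.",  # optional system prompt
    "messages": [
        {"role": "user", "content": "What is JSONL?"},
        {"role": "assistant", "content": "A text format with one JSON object per line."},
    ],
    "source": "docs-faq",  # provenance metadata for later filtering
}

# One conversation becomes one JSONL line.
line = json.dumps(chat_example, ensure_ascii=False)
```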

How large can a JSONL file be for LLM training?

Individual JSONL files can be any size, but practical considerations suggest splitting at 1-10 GB per file for efficient parallel loading. Most training frameworks support reading from multiple JSONL files (a directory of shards), which enables better parallelism and fault tolerance during distributed training.

Can I use other formats like Parquet instead of JSONL?

Yes. Parquet is increasingly popular for large-scale LLM training because it offers columnar compression, efficient filtering, and better I/O performance for very large datasets. However, JSONL remains the most universal format — every framework supports it, it is human-readable, and it requires no special tooling to create or inspect. Many teams use JSONL for development and Parquet for production-scale training.

How does NeMo Curator handle PDFs and HTML in the pipeline?

NeMo Curator uses specialized extractors for each input format. HTML extraction removes boilerplate (navigation, footers, ads) and extracts main content text. PDF extraction handles layout-aware text parsing, including multi-column layouts and embedded tables. Both extractors output clean text that is then written to JSONL format for downstream processing.
