
Document Chunking Strategies for RAG: Fixed-Size, Semantic, and Recursive

Learn the most effective document chunking methods for RAG pipelines including fixed-size, semantic, and recursive splitting, with guidance on overlap, chunk sizes, and markdown-aware strategies.

Why Chunking Matters More Than You Think

Chunking is the single most impactful decision in a RAG pipeline. If your chunks are too large, they contain too much noise and the embedding becomes a blurry average of unrelated ideas. If they are too small, they lose context and the retrieved snippet is meaningless on its own. The embedding model and the LLM both perform best when each chunk represents one coherent idea.

This post covers the three primary chunking strategies, their tradeoffs, and production-ready implementations, plus a markdown-aware variant for structured documents.

Strategy 1: Fixed-Size Chunking

The simplest approach splits text into chunks of a fixed token or character count with optional overlap.

from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=500,       # characters
    chunk_overlap=50,     # overlap between consecutive chunks
    length_function=len,
)

chunks = splitter.split_text(document_text)
print(f"Created {len(chunks)} chunks")
print(f"Average chunk length: {sum(len(c) for c in chunks) / len(chunks):.0f} chars")

Pros: Simple, predictable chunk sizes, easy to reason about token costs.

Cons: Splits mid-sentence and mid-paragraph, breaking semantic coherence. A chunk might start with "...the patient should take 200mg" without any indication of which medication is being discussed.

Best for: Unstructured plain text where no natural boundaries exist, or as a baseline to compare against smarter methods.
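Under the hood, fixed-size chunking is just a sliding window over the text. As a rough illustration, here is the same logic in plain Python (the `chunk_text` helper is a sketch, not part of any library):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks with overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # advance by chunk_size minus the overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("a" * 1200, chunk_size=500, overlap=50)
print([len(c) for c in chunks])  # [500, 500, 300] -- the last chunk is shorter
```

Note that the last chunk is whatever remains, so real chunk sizes are only an upper bound.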

Strategy 2: Recursive Character Splitting

This is the most popular strategy in production RAG systems. It tries to split on natural boundaries — paragraphs first, then sentences, then words — and only falls back to character-level splits when necessary.

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=[
        "\n\n",  # Try paragraph breaks first
        "\n",    # Then line breaks
        ". ",    # Then sentence endings
        ", ",    # Then clause boundaries
        " ",     # Then word boundaries
        "",      # Last resort: character-level
    ]
)

chunks = splitter.split_text(document_text)

The algorithm walks through the separator list in order. It first tries to split on double newlines (paragraphs). If a resulting chunk still exceeds chunk_size, it recursively splits that chunk using the next separator; pieces that come out smaller than chunk_size are then merged back together until adding another would exceed the limit.
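The recursion itself can be sketched in a few lines of plain Python. This is a simplified toy, not LangChain's actual implementation — notably, it discards the separators it splits on and does not merge small pieces back up toward chunk_size:

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Recursively split text on the first separator that produces small-enough pieces."""
    if len(text) <= chunk_size:
        return [text]
    if not separators:
        # Last resort: hard character-level cut
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    parts = text.split(sep) if sep else list(text)
    chunks = []
    for part in parts:
        if len(part) > chunk_size:
            # This piece is still too big: try the next, finer separator
            chunks.extend(recursive_split(part, rest, chunk_size))
        elif part:
            chunks.append(part)
    return chunks

text = "First paragraph.\n\nSecond paragraph. It is quite a bit longer than the limit we set."
print(recursive_split(text, ["\n\n", ". ", " "], chunk_size=60))
```

The first paragraph fits as-is; the second is too long, so it falls through to the sentence-level separator.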

Pros: Preserves semantic boundaries in most cases. Paragraphs stay intact when possible.


Cons: Chunk sizes still vary. Does not understand the actual meaning of the text.

Strategy 3: Semantic Chunking

Semantic chunking uses embedding similarity to detect topic boundaries. It embeds each sentence, then groups consecutive sentences that are semantically similar into the same chunk. When the similarity drops below a threshold, a new chunk begins.

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=75,  # split at 75th percentile dissimilarity
)

chunks = chunker.split_text(document_text)

for i, chunk in enumerate(chunks[:3]):
    print(f"Chunk {i}: {len(chunk)} chars | Preview: {chunk[:80]}...")

Pros: Each chunk genuinely covers one coherent topic. Embedding quality improves significantly because the vector represents a single concept.

Cons: Requires an embedding API call for every sentence during indexing (higher cost). Chunk sizes are unpredictable. Slower ingestion.
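The breakpoint idea itself needs no API at all. In the sketch below, a toy bag-of-words vector stands in for a real embedding model; `embed`, `cosine`, and `semantic_chunks` are illustrative names, not library functions:

```python
import math

def embed(sentence: str) -> dict[str, float]:
    """Toy bag-of-words vector; swap in a real embedding model in practice."""
    vec: dict[str, float] = {}
    for word in sentence.lower().split():
        vec[word] = vec.get(word, 0.0) + 1.0
    return vec

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[str]:
    """Start a new chunk whenever adjacent-sentence similarity drops below threshold."""
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(cur)) < threshold:
            chunks.append(" ".join(current))  # similarity dropped: close the chunk
            current = []
        current.append(cur)
    chunks.append(" ".join(current))
    return chunks

sentences = [
    "Cats are small domestic animals.",
    "Cats are playful animals.",
    "Python is a programming language.",
    "Python emphasizes readable code.",
]
print(semantic_chunks(sentences))  # two chunks: one about cats, one about Python
```

A real implementation would also smooth similarities over a window rather than comparing only adjacent sentences, which is roughly what the percentile threshold in SemanticChunker buys you.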

Markdown-Aware Splitting

Technical documentation, wikis, and README files use markdown headers as natural section boundaries. A markdown-aware splitter respects these headings:

from langchain.text_splitter import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "h1"),
    ("##", "h2"),
    ("###", "h3"),
]

md_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)

chunks = md_splitter.split_text(markdown_text)

# Each chunk carries its header hierarchy as metadata
for chunk in chunks[:2]:
    print(f"Headers: {chunk.metadata}")
    print(f"Content: {chunk.page_content[:100]}...")
    print()

The metadata (which headers this chunk falls under) is extremely valuable for retrieval. You can prepend headers to the chunk text before embedding so the vector captures the full context.
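A sketch of that idea, using a minimal stand-in for the Document objects the splitter returns (the `Chunk` class and `contextualize` helper are illustrative, and the `h1`/`h2`/`h3` keys assume the mapping configured above):

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    """Minimal stand-in for the documents MarkdownHeaderTextSplitter returns."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def contextualize(chunk: Chunk) -> str:
    """Prepend the header path so the embedding captures section context."""
    path = " > ".join(chunk.metadata[k] for k in ("h1", "h2", "h3") if k in chunk.metadata)
    return f"{path}\n\n{chunk.page_content}" if path else chunk.page_content

chunk = Chunk("Overlap ensures boundary information is not lost.",
              {"h1": "Chunking", "h2": "Overlap"})
print(contextualize(chunk))
```

Embed the contextualized text, but store and display the original page_content at retrieval time.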

Choosing Chunk Size: A Practical Guide

There is no universal optimal chunk size. Here are guidelines based on production experience:

| Use case | Chunk size | Overlap | Reasoning |
| --- | --- | --- | --- |
| Q&A over docs | 256-512 tokens | 10-15% | Small, focused chunks match specific questions |
| Summarization | 1024-2048 tokens | 5% | Larger chunks preserve narrative flow |
| Code search | 64-256 tokens | 0% | Functions/classes are natural boundaries |
| Legal/medical | 512-1024 tokens | 15-20% | Higher overlap prevents splitting critical clauses |
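Note that this table is in tokens while the splitter examples above measure chunk_size in characters. A rough rule of thumb for English text is about four characters per token; the helper below (`splitter_settings`, an illustrative name) turns a token budget and overlap percentage into character-based settings, though a real tokenizer gives exact counts:

```python
def splitter_settings(chunk_tokens: int, overlap_pct: float,
                      chars_per_token: float = 4.0) -> tuple[int, int]:
    """Convert a token budget and overlap percentage into character-based
    chunk_size / chunk_overlap values (rough English-text approximation)."""
    chunk_chars = int(chunk_tokens * chars_per_token)
    overlap_chars = int(chunk_chars * overlap_pct / 100)
    return chunk_chars, overlap_chars

# Q&A row from the table: 512 tokens with ~12% overlap
print(splitter_settings(512, 12))  # (2048, 245)
```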

Overlap: Why It Matters

Overlap ensures that information spanning a chunk boundary is not lost. Consider a document where paragraph A ends with a key fact and paragraph B provides the explanation. Without overlap, the fact and its explanation land in separate chunks. With a 64-token overlap, the end of chunk N is repeated at the start of chunk N+1.

# Visualize overlap
for i in range(min(3, len(chunks) - 1)):
    end_of_current = chunks[i][-80:]
    start_of_next = chunks[i + 1][:80]
    overlap = set(end_of_current.split()) & set(start_of_next.split())
    print(f"Chunks {i}-{i+1} share {len(overlap)} words in overlap region")

FAQ

What chunk size should I start with for a new RAG project?

Start with 512 tokens using recursive character splitting with 64-token overlap. This works well for most question-answering use cases. Then measure retrieval quality and adjust — decrease chunk size if retrieved chunks contain too much irrelevant text, increase if chunks lack sufficient context.

Should I use semantic chunking in production?

Semantic chunking produces higher-quality chunks but is slower and more expensive during ingestion because every sentence requires an embedding call. Use it when ingestion is infrequent (you index documents once or nightly) and retrieval quality is critical. For real-time or high-volume ingestion, recursive splitting is more practical.

How do I handle tables and images in documents?

Tables should be extracted as structured text (CSV or markdown table format) and chunked as complete units — never split a table row across chunks. For images, use a multimodal embedding model or generate a text description of the image and embed that description alongside the surrounding text.
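The "never split a table" rule can be sketched as a pre-pass that treats any run of pipe-prefixed lines as one atomic unit. The `split_keeping_tables` helper below is hypothetical and only handles simple markdown pipe tables:

```python
def split_keeping_tables(markdown: str) -> list[str]:
    """Split markdown into paragraph chunks, keeping each pipe table intact."""
    chunks, buffer, in_table = [], [], False
    for line in markdown.splitlines():
        is_table_row = line.lstrip().startswith("|")
        if buffer and is_table_row != in_table:
            # Crossing a table boundary: flush what we have as one chunk
            chunks.append("\n".join(buffer))
            buffer = []
        in_table = is_table_row
        if line.strip():
            buffer.append(line)
        elif buffer:
            # Blank line ends the current paragraph
            chunks.append("\n".join(buffer))
            buffer = []
    if buffer:
        chunks.append("\n".join(buffer))
    return chunks

doc = "Intro paragraph.\n\n| a | b |\n| - | - |\n| 1 | 2 |\n\nClosing paragraph."
print(split_keeping_tables(doc))  # three chunks; the table stays whole
```

The table chunks this produces can then be embedded as-is, or alongside a short generated summary of what the table contains.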


#RAG #DocumentChunking #TextSplitting #NLP #VectorSearch #AgenticAI #LearnAI #AIEngineering


CallSphere Team

Expert insights on AI voice agents and customer communication automation.
