Document Chunking Strategies for RAG: Fixed-Size, Semantic, and Recursive
Learn the most effective document chunking methods for RAG pipelines including fixed-size, semantic, and recursive splitting, with guidance on overlap, chunk sizes, and markdown-aware strategies.
Why Chunking Matters More Than You Think
Chunking is the single most impactful decision in a RAG pipeline. If your chunks are too large, they contain too much noise and the embedding becomes a blurry average of unrelated ideas. If they are too small, they lose context and the retrieved snippet is meaningless on its own. The embedding model and the LLM both perform best when each chunk represents one coherent idea.
This post covers the three primary chunking strategies, their tradeoffs, and production-ready implementations.
Strategy 1: Fixed-Size Chunking
The simplest approach splits text into chunks of a fixed token or character count with optional overlap.
```python
from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=500,      # characters
    chunk_overlap=50,    # overlap between consecutive chunks
    length_function=len,
)
chunks = splitter.split_text(document_text)

print(f"Created {len(chunks)} chunks")
print(f"Average chunk length: {sum(len(c) for c in chunks) / len(chunks):.0f} chars")
```
Pros: Simple, predictable chunk sizes, easy to reason about token costs.
Cons: Splits mid-sentence and mid-paragraph, breaking semantic coherence. A chunk might start with "...the patient should take 200mg" without any indication of which medication is being discussed.
Best for: Unstructured plain text where no natural boundaries exist, or as a baseline to compare against smarter methods.
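The mechanics are easy to see in a from-scratch sketch (plain Python, no library; `fixed_size_chunks` is a name chosen here for illustration):

```python
def fixed_size_chunks(text, chunk_size=500, overlap=50):
    """Split text into fixed-size character chunks with overlap.

    Each chunk starts chunk_size - overlap characters after the previous
    one, so consecutive chunks share exactly `overlap` characters.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "".join(str(i % 10) for i in range(1200))
chunks = fixed_size_chunks(doc, chunk_size=500, overlap=50)
# Starts at offsets 0, 450, 900 → 3 chunks of length 500, 500, 300
```

Note that the step size, not the chunk size, determines how many chunks you pay to embed: halving the overlap shrinks your index almost as much as halving the chunk size.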
Strategy 2: Recursive Character Splitting
This is the most popular strategy in production RAG systems. It tries to split on natural boundaries — paragraphs first, then sentences, then words — and only falls back to character-level splits when necessary.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=[
        "\n\n",  # Try paragraph breaks first
        "\n",    # Then line breaks
        ". ",    # Then sentence endings
        ", ",    # Then clause boundaries
        " ",     # Then word boundaries
        "",      # Last resort: character-level
    ],
)
chunks = splitter.split_text(document_text)
```
The algorithm walks through the separator list in order. It first tries to split on double newlines (paragraphs). If a resulting chunk exceeds chunk_size, it recursively splits that chunk using the next separator.
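The fallback cascade can be sketched in a few lines. This is a simplification: real splitters (including LangChain's) also merge small adjacent pieces back up toward `chunk_size` and can preserve the separators, both omitted here for brevity.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", ". ", " ", "")):
    """Try separators in order, recursing into any piece that is still
    too large with the next (finer-grained) separator."""
    if len(text) <= chunk_size:
        return [text]
    sep, rest = separators[0], separators[1:]
    if sep == "":  # last resort: hard character-level split
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks = []
    for piece in filter(None, text.split(sep)):
        if len(piece) <= chunk_size:
            chunks.append(piece)  # fits: keep the natural unit intact
        else:
            chunks.extend(recursive_split(piece, chunk_size, rest))
    return chunks
```

Every chunk is guaranteed to fit within `chunk_size`, and a piece is only broken at a finer boundary when the coarser one failed, which is exactly why paragraphs survive whenever they can.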
Pros: Preserves semantic boundaries in most cases. Paragraphs stay intact when possible.
Cons: Chunk sizes still vary. Does not understand the actual meaning of the text.
Strategy 3: Semantic Chunking
Semantic chunking uses embedding similarity to detect topic boundaries. It embeds each sentence, then groups consecutive sentences that are semantically similar into the same chunk. When the similarity drops below a threshold, a new chunk begins.
```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=75,  # split at 75th percentile dissimilarity
)
chunks = chunker.split_text(document_text)

for i, chunk in enumerate(chunks[:3]):
    print(f"Chunk {i}: {len(chunk)} chars | Preview: {chunk[:80]}...")
```
Pros: Each chunk genuinely covers one coherent topic. Embedding quality improves significantly because the vector represents a single concept.
Cons: Requires an embedding API call for every sentence during indexing (higher cost). Chunk sizes are unpredictable. Slower ingestion.
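The core idea fits in a short sketch, assuming any sentence-embedding function (`toy_embed` below is a stand-in, not a real model):

```python
import numpy as np

def semantic_chunks(sentences, embed, percentile=75):
    """Start a new chunk wherever adjacent-sentence similarity falls into
    the most-dissimilar (100 - percentile)% of all neighbour pairs."""
    if len(sentences) < 2:
        return [" ".join(sentences)]
    vecs = np.asarray([embed(s) for s in sentences], dtype=float)
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # unit-normalize
    sims = (vecs[:-1] * vecs[1:]).sum(axis=1)            # cosine of each neighbour pair
    threshold = np.percentile(sims, 100 - percentile)
    chunks, current = [], [sentences[0]]
    for sent, sim in zip(sentences[1:], sims):
        if sim < threshold:                              # topic shift: break here
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks

# Toy embedding: two topics mapped to two orthogonal directions.
def toy_embed(sentence):
    return [1.0, 0.0] if "cat" in sentence else [0.0, 1.0]

parts = semantic_chunks(["cats purr", "cats nap", "cars honk", "cars race"], toy_embed)
# parts == ["cats purr cats nap", "cars honk cars race"]
```

The percentile threshold adapts to each document: a document with uniform topic flow gets few breaks, while one that jumps between topics gets many.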
Markdown-Aware Splitting
Technical documentation, wikis, and README files use markdown headers as natural section boundaries. A markdown-aware splitter respects these headings:
```python
from langchain.text_splitter import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "h1"),
    ("##", "h2"),
    ("###", "h3"),
]
md_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)
chunks = md_splitter.split_text(markdown_text)

# Each chunk carries its header hierarchy as metadata
for chunk in chunks[:2]:
    print(f"Headers: {chunk.metadata}")
    print(f"Content: {chunk.page_content[:100]}...")
    print()
```
The metadata (which headers this chunk falls under) is extremely valuable for retrieval. You can prepend headers to the chunk text before embedding so the vector captures the full context.
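A minimal sketch of that prepending step (`contextualize` is a hypothetical helper name; `metadata` is the dict attached by the splitter, keyed by the names chosen in `headers_to_split_on` above):

```python
def contextualize(chunk_text, metadata):
    """Prepend the header path to the chunk body before embedding, so
    the vector captures where in the document this chunk lives.
    metadata example: {"h1": "API", "h2": "Auth"}."""
    path = " > ".join(metadata[k] for k in ("h1", "h2", "h3") if k in metadata)
    return f"{path}\n\n{chunk_text}" if path else chunk_text

contextualize("Use bearer tokens.", {"h1": "API", "h2": "Auth"})
# → "API > Auth\n\nUse bearer tokens."
```

A chunk that reads "Use bearer tokens." alone is ambiguous; prefixed with "API > Auth" it matches queries about authentication even when the body never uses that word.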
Choosing Chunk Size: A Practical Guide
There is no universal optimal chunk size. Here are guidelines based on production experience:
| Use Case | Chunk Size | Overlap | Reasoning |
|---|---|---|---|
| Q&A over docs | 256-512 tokens | 10-15% | Small focused chunks match specific questions |
| Summarization | 1024-2048 tokens | 5% | Larger chunks preserve narrative flow |
| Code search | 64-256 tokens | 0 | Functions/classes are natural boundaries |
| Legal/medical | 512-1024 tokens | 15-20% | Higher overlap prevents splitting critical clauses |
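If you encode these guidelines in configuration, a simple preset table keeps the choice explicit and reviewable. The specific numbers below are midpoints picked from the ranges above, not canonical values; tune them against your own retrieval metrics.

```python
# Assumed presets distilled from the table above -- adjust per corpus.
CHUNK_PRESETS = {
    "qa":            {"chunk_size": 384,  "overlap_pct": 0.12},
    "summarization": {"chunk_size": 1536, "overlap_pct": 0.05},
    "code":          {"chunk_size": 160,  "overlap_pct": 0.0},
    "legal_medical": {"chunk_size": 768,  "overlap_pct": 0.18},
}

def chunk_params(use_case):
    """Return (chunk_size, overlap) in tokens for a named use case."""
    p = CHUNK_PRESETS[use_case]
    return p["chunk_size"], int(p["chunk_size"] * p["overlap_pct"])
```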
Overlap: Why It Matters
Overlap ensures that information spanning a chunk boundary is not lost. Consider a document where paragraph A ends with a key fact and paragraph B provides the explanation. Without overlap, the fact and its explanation land in separate chunks. With a 64-token overlap, the end of chunk N is repeated at the start of chunk N+1.
```python
# Visualize overlap
for i in range(min(3, len(chunks) - 1)):
    end_of_current = chunks[i][-80:]
    start_of_next = chunks[i + 1][:80]
    overlap = set(end_of_current.split()) & set(start_of_next.split())
    print(f"Chunks {i}-{i+1} share {len(overlap)} words in overlap region")
```
FAQ
What chunk size should I start with for a new RAG project?
Start with 512 tokens using recursive character splitting with 64-token overlap. This works well for most question-answering use cases. Then measure retrieval quality and adjust — decrease chunk size if retrieved chunks contain too much irrelevant text, increase if chunks lack sufficient context.
Should I use semantic chunking in production?
Semantic chunking produces higher-quality chunks but is slower and more expensive during ingestion because every sentence requires an embedding call. Use it when ingestion is infrequent (you index documents once or nightly) and retrieval quality is critical. For real-time or high-volume ingestion, recursive splitting is more practical.
How do I handle tables and images in documents?
Tables should be extracted as structured text (CSV or markdown table format) and chunked as complete units — never split a table row across chunks. For images, use a multimodal embedding model or generate a text description of the image and embed that description alongside the surrounding text.
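One way to keep tables intact is to separate table blocks from prose before running your regular splitter on the prose. A minimal sketch, assuming tables are written as markdown (every row starts with `|`):

```python
def split_keeping_tables(markdown):
    """Separate markdown-table blocks from prose so each table can be
    chunked as one complete unit. A 'table block' is a run of
    consecutive lines whose first non-space character is '|'."""
    blocks, current, in_table = [], [], False
    for line in markdown.splitlines():
        is_row = line.lstrip().startswith("|")
        if is_row != in_table and current:
            # Block type changed: flush what we have accumulated so far.
            blocks.append(("table" if in_table else "prose", "\n".join(current)))
            current = []
        in_table = is_row
        current.append(line)
    if current:
        blocks.append(("table" if in_table else "prose", "\n".join(current)))
    return blocks
```

Feed the `prose` blocks to your normal splitter and pass each `table` block through whole (or summarize it first if it exceeds your chunk budget).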
CallSphere Team
Expert insights on AI voice agents and customer communication automation.