Parent-Child Chunking for RAG: Small Chunks for Search, Large Chunks for Context
Learn the parent-child chunking strategy where small chunks provide precise search matches while their larger parent chunks provide the full context needed for accurate generation.
The Chunking Dilemma
Every RAG system faces a fundamental tension in chunk sizing. Small chunks (100-200 tokens) produce precise embeddings that match specific queries accurately, but they lack the surrounding context needed for the LLM to generate comprehensive answers. Large chunks (1000-2000 tokens) provide rich context for generation, but their embeddings average over too many concepts, reducing retrieval precision.
This is not a theoretical problem. In practice, a 100-token chunk containing "The annual renewal rate increased to 94% in Q3" will match a revenue retention query perfectly. But the LLM needs the surrounding paragraphs to understand what drove that increase, which segments improved, and what caveats apply. Conversely, a 2000-token chunk about Q3 performance might not rank highly for a specific retention query because the embedding averages over dozens of different topics.
Parent-child chunking resolves this by decoupling search from context.
How Parent-Child Chunking Works
The strategy maintains two levels of chunks:
- Child chunks (small, 100-300 tokens) — Used for embedding and similarity search. These are precise and topically focused.
- Parent chunks (large, 1000-2000 tokens) — Used for context in generation. Each parent contains multiple children.
When a query comes in, the system searches against child chunk embeddings. When a child matches, the system retrieves its parent chunk and sends that larger context to the LLM.
Implementation
```python
from dataclasses import dataclass, field
import uuid


@dataclass
class Chunk:
    id: str
    content: str
    parent_id: str | None = None
    children: list[str] = field(default_factory=list)
    embedding: list[float] | None = None


class ParentChildChunker:
    def __init__(
        self,
        parent_size: int = 1500,
        child_size: int = 300,
        child_overlap: int = 50,
    ):
        self.parent_size = parent_size
        self.child_size = child_size
        self.child_overlap = child_overlap
        self.parents: dict[str, Chunk] = {}
        self.children: dict[str, Chunk] = {}

    def chunk_document(self, text: str) -> list[Chunk]:
        """Split a document into parent and child chunks.

        Sizes are measured in whitespace-separated words as a
        simple proxy for tokens.
        """
        words = text.split()
        all_children = []

        # Create parent chunks
        for i in range(0, len(words), self.parent_size):
            parent_text = " ".join(words[i:i + self.parent_size])
            parent_id = str(uuid.uuid4())
            parent = Chunk(id=parent_id, content=parent_text)
            self.parents[parent_id] = parent

            # Create overlapping child chunks within this parent
            parent_words = parent_text.split()
            step = self.child_size - self.child_overlap
            for j in range(0, len(parent_words), step):
                child_text = " ".join(
                    parent_words[j:j + self.child_size]
                )
                if len(child_text.split()) < 20:
                    continue  # Skip tiny fragments
                child_id = str(uuid.uuid4())
                child = Chunk(
                    id=child_id,
                    content=child_text,
                    parent_id=parent_id,
                )
                self.children[child_id] = child
                parent.children.append(child_id)
                all_children.append(child)

        return all_children
```
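As a quick sanity check on the defaults, the windowing arithmetic can be sketched on its own. `window_starts` is a throwaway helper written for this illustration; it mirrors the inner loop above while ignoring the under-20-word filter:

```python
def window_starts(n_words: int, child_size: int, overlap: int) -> list[int]:
    """Start offsets of the overlapping child windows within one parent."""
    step = child_size - overlap
    return list(range(0, n_words, step))

# A full 1500-word parent with the default child settings
starts = window_starts(1500, 300, 50)
print(len(starts))   # 6 children per parent
print(starts)        # [0, 250, 500, 750, 1000, 1250]
```

So with the defaults, each full parent produces six children, and every word except those in the first and last windows appears in exactly two.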
Embedding and Retrieval
Only the child chunks get embedded and stored in the vector index:
```python
from openai import OpenAI

client = OpenAI()


def embed_children(
    chunker: ParentChildChunker,
) -> list[Chunk]:
    """Embed only child chunks for search indexing."""
    children = list(chunker.children.values())
    batch_size = 100
    for i in range(0, len(children), batch_size):
        batch = children[i:i + batch_size]
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=[c.content for c in batch],
        )
        for chunk, emb in zip(batch, response.data):
            chunk.embedding = emb.embedding
    return children
```
```python
def parent_child_search(
    query: str,
    chunker: ParentChildChunker,
    vectorstore,
    k: int = 5,
) -> list[str]:
    """Search children, return parents for context."""
    # Search against child embeddings
    child_results = vectorstore.similarity_search(query, k=k)

    # Retrieve unique parent chunks
    seen_parents = set()
    parent_contexts = []
    for child_doc in child_results:
        child_id = child_doc.metadata["chunk_id"]
        child = chunker.children.get(child_id)
        if child and child.parent_id not in seen_parents:
            seen_parents.add(child.parent_id)
            parent = chunker.parents[child.parent_id]
            parent_contexts.append(parent.content)
    return parent_contexts
```
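The `vectorstore` argument only needs to expose `similarity_search(query, k)` returning documents that carry `metadata["chunk_id"]`. To make that contract concrete, here is a minimal in-memory stand-in; `Doc`, `toy_embed`, and `InMemoryVectorStore` are illustrative names invented for this sketch, and a real deployment would use a proper vector database (FAISS, Chroma, pgvector, etc.) and real embeddings:

```python
import math
from dataclasses import dataclass, field


@dataclass
class Doc:
    content: str
    metadata: dict = field(default_factory=dict)


def toy_embed(text: str, dim: int = 64) -> list[float]:
    """Toy bag-of-words embedding, for illustration only."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    return vec


class InMemoryVectorStore:
    """Minimal stand-in for the vectorstore interface assumed above."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.entries: list[tuple[list[float], Doc]] = []

    def add(self, text: str, chunk_id: str) -> None:
        vec = self.embed_fn(text)
        self.entries.append((vec, Doc(text, {"chunk_id": chunk_id})))

    def similarity_search(self, query: str, k: int = 5) -> list[Doc]:
        q = self.embed_fn(query)

        def cosine(a: list[float], b: list[float]) -> float:
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return dot / (na * nb) if na and nb else 0.0

        ranked = sorted(self.entries, key=lambda e: cosine(q, e[0]), reverse=True)
        return [doc for _, doc in ranked[:k]]
```

Anything satisfying this shape plugs straight into `parent_child_search`.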
Handling Section-Aware Parent Chunks
For structured documents, align parent chunks with document sections rather than using fixed token counts:
```python
import re


def section_aware_chunking(
    markdown_text: str,
) -> list[tuple[str, str]]:
    """Create parent chunks aligned with document sections."""
    # Split before each level-1 or level-2 heading
    sections = re.split(
        r'(?=^##?\s)', markdown_text, flags=re.MULTILINE
    )
    parents = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        # Extract heading as metadata
        lines = section.split("\n")
        heading = lines[0].strip("# ").strip()
        body = "\n".join(lines[1:]).strip()
        if len(body.split()) > 50:  # Skip near-empty sections
            parents.append((heading, body))
    return parents
```
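The lookahead in the split pattern keeps each heading attached to the section it opens rather than consuming it as a delimiter. A quick check of the pattern on a toy document (`doc` here is purely illustrative):

```python
import re

doc = """# Overview
Intro paragraph for the overview section.
## Details
Body text for the details subsection.
"""

# Zero-width split: each piece starts at a heading, nothing is discarded
sections = re.split(r'(?=^##?\s)', doc, flags=re.MULTILINE)
headings = [s.splitlines()[0] for s in sections if s.strip()]
print(headings)  # ['# Overview', '## Details']
```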
Choosing Chunk Sizes
The optimal sizes depend on your documents and queries. Here are guidelines based on empirical testing:
- Technical documentation: Parent 1500 tokens, Child 200 tokens. Technical queries are precise and benefit from small child chunks.
- Legal contracts: Parent 2000 tokens, Child 300 tokens. Legal context requires broad surrounding text for accurate interpretation.
- Support conversations: Parent 1000 tokens, Child 150 tokens. Individual messages are short but need thread context.
Always evaluate on your specific query patterns. Measure retrieval precision at the child level and answer quality at the parent level.
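These guidelines can be kept as named presets so the numbers live in one place. `CHUNK_PRESETS` and `preset_for` are just one way to organize it; the values restate the list above and are starting points, not fixed answers:

```python
# Starting-point presets restating the guidelines above;
# tune against your own evaluation set.
CHUNK_PRESETS = {
    "technical_docs":  {"parent_size": 1500, "child_size": 200},
    "legal_contracts": {"parent_size": 2000, "child_size": 300},
    "support_threads": {"parent_size": 1000, "child_size": 150},
}


def preset_for(doc_type: str, child_overlap: int = 50) -> dict:
    """Constructor kwargs, e.g. ParentChildChunker(**preset_for("legal_contracts"))."""
    return {**CHUNK_PRESETS[doc_type], "child_overlap": child_overlap}

print(preset_for("legal_contracts"))
# {'parent_size': 2000, 'child_size': 300, 'child_overlap': 50}
```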
FAQ
Does parent-child chunking increase storage requirements?
It increases storage by roughly 5-15% compared to single-level chunking because child chunks overlap within parents. However, you only embed and index the children, so vector storage scales with the number of children, not parents. The parent documents can be stored in a simple key-value store.
Can I use more than two levels in the hierarchy?
Yes, three-level hierarchies (grandparent-parent-child) work well for very long documents. Grandparent chunks represent entire sections, parents represent subsections, and children represent individual paragraphs. However, more levels add complexity to the retrieval logic, so only add a level if two levels provably underperform on your evaluation dataset.
How does this compare to overlapping windows in standard chunking?
Overlapping windows add context at the edges of each chunk but do not solve the core precision-context tradeoff. A 500-token chunk with 100-token overlap is still a compromise. Parent-child chunking fully decouples search precision from generation context, giving you the best of both worlds.
#ChunkingStrategy #RAG #ParentChildChunks #VectorSearch #DocumentProcessing #AgenticAI #LearnAI #AIEngineering
CallSphere Team