Retrieval-Augmented Generation in 2026: Beyond the Basics
Move past naive RAG implementations with advanced techniques including hybrid search, re-ranking, query decomposition, contextual compression, and agentic RAG patterns used in production systems.
The Problem With Naive RAG
The basic RAG pipeline -- chunk documents, embed them, retrieve top-k, stuff them into the prompt -- works for demos but fails in production. Teams consistently report three categories of failure:
- Retrieval failures: The relevant information exists in the corpus but the retriever does not surface it
- Context failures: Retrieved chunks lack sufficient context to answer the question
- Generation failures: The LLM ignores or misinterprets the retrieved context
Production RAG in 2026 addresses each of these failures with specific techniques. This guide covers the patterns that have proven effective across real deployments.
Advanced Chunking Strategies
Semantic Chunking
Instead of splitting on fixed token counts, semantic chunking uses embedding similarity to find natural breakpoints in the text:
```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

def semantic_chunk(text: str, threshold: float = 0.75) -> list[str]:
    sentences = text.split(". ")
    embeddings = model.encode(sentences)
    chunks = []
    current_chunk = [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = np.dot(embeddings[i], embeddings[i - 1]) / (
            np.linalg.norm(embeddings[i]) * np.linalg.norm(embeddings[i - 1])
        )
        if similarity < threshold:
            # Low similarity = topic shift = chunk boundary
            chunks.append(". ".join(current_chunk) + ".")
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])
    chunks.append(". ".join(current_chunk) + ".")
    return chunks
```
Parent-Child Chunking
Store small chunks for precise retrieval but return their parent context for generation. This solves the core tension between retrieval precision (small chunks match better) and generation quality (larger context produces better answers).
```python
class ParentChildChunker:
    def __init__(self, parent_size=2000, child_size=400, overlap=50):
        self.parent_size = parent_size
        self.child_size = child_size
        self.overlap = overlap

    def _split(self, text: str, size: int, overlap: int) -> list[str]:
        # Simple character-based windowing; swap in a token-aware splitter in practice.
        step = size - overlap
        return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

    def chunk(self, document: str) -> list[dict]:
        parents = self._split(document, self.parent_size, self.overlap)
        result = []
        for parent_idx, parent in enumerate(parents):
            children = self._split(parent, self.child_size, self.overlap)
            for child in children:
                result.append({
                    "child_text": child,    # Embedded for retrieval
                    "parent_text": parent,  # Returned for generation
                    "parent_id": parent_idx,
                })
        return result
```
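At query time, the split pays off: small child chunks are matched against the query, but the surrounding parent text is what reaches the generation prompt. A minimal sketch of that lookup, using a toy keyword matcher in place of real vector search (the record dicts mirror the chunker's output):

```python
# Records mirror the dicts the chunker produces; the "retriever" here is
# a toy keyword match standing in for embedding similarity search.
records = [
    {"child_text": "Refunds are issued within 14 days.",
     "parent_text": "Billing policy. Refunds are issued within 14 days. Contact support for disputes.",
     "parent_id": 0},
    {"child_text": "Contact support for disputes.",
     "parent_text": "Billing policy. Refunds are issued within 14 days. Contact support for disputes.",
     "parent_id": 0},
    {"child_text": "Passwords must be rotated yearly.",
     "parent_text": "Security policy. Passwords must be rotated yearly.",
     "parent_id": 1},
]

def retrieve_parent_context(query: str, records: list[dict]) -> list[str]:
    """Match against small child chunks, but return deduplicated parents."""
    hits = [r for r in records
            if any(w in r["child_text"].lower() for w in query.lower().split())]
    seen, parents = set(), []
    for r in hits:
        if r["parent_id"] not in seen:
            seen.add(r["parent_id"])
            parents.append(r["parent_text"])
    return parents
```

Deduplicating by `parent_id` matters: several children of the same parent often match the same query, and the parent should appear in the prompt only once.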
Hybrid Search: Dense + Sparse Retrieval
Pure vector search fails when queries contain specific identifiers (error codes, product names, dates). Hybrid search combines dense embeddings with sparse keyword matching (BM25) to handle both semantic and lexical queries.
```python
from qdrant_client import QdrantClient, models

client = QdrantClient("localhost", port=6333)

# Create a collection with both dense and sparse vectors
client.create_collection(
    collection_name="documents",
    vectors_config={
        "dense": models.VectorParams(
            size=1024, distance=models.Distance.COSINE
        )
    },
    sparse_vectors_config={
        "bm25": models.SparseVectorParams(
            modifier=models.Modifier.IDF,
        )
    },
)

# Hybrid search with Reciprocal Rank Fusion
results = client.query_points(
    collection_name="documents",
    prefetch=[
        models.Prefetch(
            query=dense_embedding,
            using="dense",
            limit=20,
        ),
        models.Prefetch(
            query=sparse_vector,
            using="bm25",
            limit=20,
        ),
    ],
    query=models.FusionQuery(fusion=models.Fusion.RRF),
    limit=10,
)
```
Benchmarks on production datasets consistently show hybrid search improving recall by 15-25% over dense-only search, particularly on queries with specific technical terms.
Re-Ranking: The Missing Middle Layer
The initial retrieval step optimizes for recall (do not miss relevant documents). A re-ranker then optimizes for precision (rank the most relevant results highest). Cross-encoder re-rankers like Cohere Rerank or BGE-reranker evaluate query-document pairs jointly, producing far more accurate relevance scores than embedding cosine similarity.
```python
from cohere import Client

cohere_client = Client(api_key="...")

def rerank_results(query: str, documents: list[str], top_n: int = 5):
    response = cohere_client.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=documents,
        top_n=top_n,
    )
    return [
        {"text": documents[r.index], "score": r.relevance_score}
        for r in response.results
    ]
```
The retrieval pipeline becomes: retrieve 20-50 candidates with hybrid search, then re-rank down to the top 5. This two-stage approach consistently outperforms simply retrieving the top 5 directly.
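The two stages compose cleanly as a single function. A sketch of that shape, with toy word-overlap functions standing in for hybrid search and a cross-encoder (in production these would call your vector database and a model such as Cohere Rerank or BGE-reranker):

```python
def two_stage_search(query, corpus, retrieve, score, k_retrieve=20, k_final=5):
    candidates = retrieve(query, corpus, k_retrieve)            # recall-oriented
    ranked = sorted(candidates, key=lambda d: score(query, d), reverse=True)
    return ranked[:k_final]                                     # precision-oriented

# Toy stand-ins: word-overlap retrieval, length-normalized overlap scoring.
def toy_retrieve(query, corpus, k):
    qw = set(query.lower().split())
    return sorted(corpus, key=lambda d: len(qw & set(d.lower().split())),
                  reverse=True)[:k]

def toy_score(query, doc):
    qw, dw = set(query.lower().split()), set(doc.lower().split())
    return len(qw & dw) / max(len(dw), 1)

corpus = ["rag retrieval pipeline", "cooking pasta at home", "retrieval and ranking"]
top = two_stage_search("retrieval pipeline", corpus, toy_retrieve, toy_score,
                       k_retrieve=3, k_final=1)
```

The key property is that the first stage may rank loosely as long as the relevant document is somewhere in the candidate set; the second stage fixes the ordering.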
Query Transformation
Multi-Query Expansion
A single user query often fails to capture all the ways relevant information might be phrased. Multi-query expansion generates multiple reformulations and retrieves results for each:
```python
async def multi_query_retrieve(query: str, retriever, llm) -> list[Document]:
    # Generate query variations
    expansion_prompt = f"""Generate 3 different search queries that would help
answer this question. Return only the queries, one per line.
Question: {query}"""
    variations = await llm.generate(expansion_prompt)
    all_queries = [query] + variations.strip().split("\n")
    # Retrieve for each query and deduplicate
    seen_ids = set()
    results = []
    for q in all_queries:
        docs = await retriever.search(q, top_k=5)
        for doc in docs:
            if doc.id not in seen_ids:
                seen_ids.add(doc.id)
                results.append(doc)
    return results
```
Step-Back Prompting
For complex questions, generate a more abstract "step-back" question that retrieves broader context:
- Original: "Why did the Q3 revenue drop for the enterprise segment?"
- Step-back: "What factors affect enterprise segment revenue?"
The step-back results provide foundational context, while the original query retrieves specific details. Combining both produces more complete answers.
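One way to wire this up is to retrieve for both queries and merge, putting the broad context first. A minimal sketch, where `ask_llm` and `retriever` are hypothetical hooks (a real system would prompt a model for the step-back question and call a vector store):

```python
STEP_BACK_PROMPT = (
    "Rewrite the question as a more general question about the "
    "underlying concepts.\nQuestion: {question}\nStep-back question:"
)

def step_back_retrieve(question, retriever, ask_llm, top_k=5):
    """Retrieve for the abstract query first, then the specific one, deduplicated."""
    broad_q = ask_llm(STEP_BACK_PROMPT.format(question=question))
    broad = retriever(broad_q, top_k)
    specific = retriever(question, top_k)
    merged, seen = [], set()
    for doc in broad + specific:   # foundational context first
        if doc not in seen:
            seen.add(doc)
            merged.append(doc)
    return merged
```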
Contextual Compression
Retrieved chunks often contain irrelevant sentences mixed with relevant ones. Contextual compression uses an LLM to extract only the query-relevant portions before generation:
```python
async def compress_context(query: str, documents: list[str], llm) -> list[str]:
    compressed = []
    for doc in documents:
        prompt = f"""Extract only the sentences from the following document
that are directly relevant to answering: "{query}"
If nothing is relevant, respond with "NOT_RELEVANT".
Document:
{doc}"""
        result = await llm.generate(prompt)
        if result.strip() != "NOT_RELEVANT":
            compressed.append(result)
    return compressed
```
This technique reduces prompt token usage by 40-60% while maintaining or improving answer quality, because the generation model does not have to filter through irrelevant content.
Agentic RAG
The most powerful RAG pattern in 2026 makes the retrieval pipeline itself agentic. Instead of a fixed retrieve-then-generate pipeline, an agent decides when to retrieve, what to retrieve, and whether the results are sufficient.
```python
class AgenticRAG:
    def __init__(self, llm, retriever, max_iterations=5):
        self.llm = llm
        self.retriever = retriever
        self.max_iterations = max_iterations

    async def answer(self, question: str) -> str:
        context = []
        for i in range(self.max_iterations):
            # Ask the LLM what to do next
            action = await self.llm.decide(
                question=question,
                context=context,
                options=["search", "answer", "refine_query"],
            )
            if action.type == "answer":
                return action.content
            elif action.type == "search":
                results = await self.retriever.search(action.query)
                context.extend(results)
            elif action.type == "refine_query":
                # The agent reformulates based on what it has learned
                results = await self.retriever.search(action.refined_query)
                context.extend(results)
        return await self._forced_answer(question, context)
```
Evaluation: Measuring RAG Quality
You cannot improve what you do not measure. The standard RAG evaluation framework uses three metrics:
| Metric | Measures | How |
|---|---|---|
| Context Relevance | Did the retriever find the right documents? | Judge each retrieved chunk for relevance to the query |
| Faithfulness | Does the answer stick to the retrieved context? | Check every claim in the answer against the context |
| Answer Relevance | Does the answer actually address the question? | Judge the answer against the original query |
```python
# Using ragas for evaluation
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

result = evaluate(
    dataset=eval_dataset,  # Questions + ground truth + retrieved contexts
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(result)
# {'faithfulness': 0.87, 'answer_relevancy': 0.91, 'context_precision': 0.78}
```
Production RAG systems in 2026 run these evaluations on every deployment, treating retrieval quality as a regression test.
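The regression-test framing can be as simple as a threshold gate in CI: fail the deployment when any metric drops below its floor. A sketch, where the scores dict mirrors what an evaluation framework like ragas returns and the threshold values are illustrative:

```python
# Illustrative per-metric floors; tune against your own baseline runs.
THRESHOLDS = {"faithfulness": 0.85, "answer_relevancy": 0.85, "context_precision": 0.70}

def check_regression(scores: dict[str, float]) -> list[str]:
    """Return the metrics that fall below their threshold (empty = pass)."""
    return [m for m, floor in THRESHOLDS.items() if scores.get(m, 0.0) < floor]

failures = check_regression(
    {"faithfulness": 0.87, "answer_relevancy": 0.91, "context_precision": 0.78}
)
assert not failures, f"RAG quality regression: {failures}"
```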
Key Architectural Decisions
Building production RAG comes down to a series of engineering tradeoffs:
- Chunk size: Smaller chunks (200-400 tokens) improve retrieval precision; larger chunks (800-1500 tokens) improve generation quality. Use parent-child chunking to get both.
- Embedding model: Larger embeddings (1024-dim, e.g. BGE-large) are more accurate but slower and more expensive to store. For most use cases, a 768-dim model like BGE-base is the sweet spot.
- Top-k: Retrieve more candidates (20-50) and re-rank down to fewer (3-7) for the final prompt.
- Update strategy: Decide between full re-indexing (simpler but slower) and incremental updates (faster but more complex) based on how frequently your data changes.
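These tradeoffs are easier to review and version when pinned down as one explicit config object rather than scattered as magic numbers. A sketch, with illustrative field names and defaults:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RagConfig:
    parent_chunk_tokens: int = 1200   # larger context for generation
    child_chunk_tokens: int = 300     # smaller chunks for retrieval precision
    embedding_dim: int = 768
    retrieve_k: int = 30              # recall-oriented first stage
    rerank_k: int = 5                 # precision-oriented final prompt
    incremental_index: bool = True    # vs. full re-indexing

cfg = RagConfig()
assert cfg.retrieve_k > cfg.rerank_k  # over-retrieve, then re-rank
```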
The teams getting the best results in 2026 treat RAG as an engineering system, not a one-time setup. They instrument every stage, measure quality continuously, and iterate on each component independently.