Retrieval-Augmented Generation in 2026: Beyond the Basics
Move past naive RAG implementations with advanced techniques including hybrid search, re-ranking, query decomposition, contextual compression, and agentic RAG patterns used in production systems.
The Problem With Naive RAG
The basic RAG pipeline -- chunk documents, embed them, retrieve top-k, stuff them into the prompt -- works for demos but fails in production. Teams consistently report three categories of failure:
- Retrieval failures: The relevant information exists in the corpus but the retriever does not surface it
- Context failures: Retrieved chunks lack sufficient context to answer the question
- Generation failures: The LLM ignores or misinterprets the retrieved context
Production RAG in 2026 addresses each of these failures with specific techniques. This guide covers the patterns that have proven effective across real deployments.
Advanced Chunking Strategies
Semantic Chunking
Instead of splitting on fixed token counts, semantic chunking uses embedding similarity to find natural breakpoints in the text:
```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

def semantic_chunk(text: str, threshold: float = 0.75) -> list[str]:
    sentences = text.split(". ")
    embeddings = model.encode(sentences)
    chunks = []
    current_chunk = [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = np.dot(embeddings[i], embeddings[i - 1]) / (
            np.linalg.norm(embeddings[i]) * np.linalg.norm(embeddings[i - 1])
        )
        if similarity < threshold:
            # Low similarity = topic shift = chunk boundary
            chunks.append(". ".join(current_chunk) + ".")
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])
    chunks.append(". ".join(current_chunk) + ".")
    return chunks
```
Parent-Child Chunking
Store small chunks for precise retrieval but return their parent context for generation. This solves the core tension between retrieval precision (small chunks match better) and generation quality (larger context produces better answers).
```python
class ParentChildChunker:
    def __init__(self, parent_size=2000, child_size=400, overlap=50):
        self.parent_size = parent_size
        self.child_size = child_size
        self.overlap = overlap

    def _split(self, text: str, size: int, overlap: int) -> list[str]:
        # Simple character-based windowing; swap in a token-aware splitter in practice.
        step = size - overlap
        return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

    def chunk(self, document: str) -> list[dict]:
        parents = self._split(document, self.parent_size, self.overlap)
        result = []
        for parent_idx, parent in enumerate(parents):
            children = self._split(parent, self.child_size, self.overlap)
            for child in children:
                result.append({
                    "child_text": child,    # Embedded for retrieval
                    "parent_text": parent,  # Returned for generation
                    "parent_id": parent_idx,
                })
        return result
```
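At query time, the split pays off: small child chunks are matched against the query, but the surrounding parent text is what reaches the generation prompt. A minimal sketch of that lookup, using a toy keyword matcher in place of real vector search (the record dicts mirror the chunker's output):

```python
# Records mirror the dicts the chunker produces; the "retriever" here is
# a toy keyword match standing in for embedding similarity search.
records = [
    {"child_text": "Refunds are issued within 14 days.",
     "parent_text": "Billing policy. Refunds are issued within 14 days. Contact support for disputes.",
     "parent_id": 0},
    {"child_text": "Contact support for disputes.",
     "parent_text": "Billing policy. Refunds are issued within 14 days. Contact support for disputes.",
     "parent_id": 0},
    {"child_text": "Passwords must be rotated yearly.",
     "parent_text": "Security policy. Passwords must be rotated yearly.",
     "parent_id": 1},
]

def retrieve_parent_context(query: str, records: list[dict]) -> list[str]:
    """Match against small child chunks, but return deduplicated parents."""
    hits = [r for r in records
            if any(w in r["child_text"].lower() for w in query.lower().split())]
    seen, parents = set(), []
    for r in hits:
        if r["parent_id"] not in seen:
            seen.add(r["parent_id"])
            parents.append(r["parent_text"])
    return parents
```

Deduplicating by `parent_id` matters: several children of the same parent often match the same query, and the parent should appear in the prompt only once.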
Hybrid Search: Dense + Sparse Retrieval
Pure vector search fails when queries contain specific identifiers (error codes, product names, dates). Hybrid search combines dense embeddings with sparse keyword matching (BM25) to handle both semantic and lexical queries.
```python
from qdrant_client import QdrantClient, models

client = QdrantClient("localhost", port=6333)

# Create a collection with both dense and sparse vectors
client.create_collection(
    collection_name="documents",
    vectors_config={
        "dense": models.VectorParams(
            size=1024, distance=models.Distance.COSINE
        )
    },
    sparse_vectors_config={
        "bm25": models.SparseVectorParams(
            modifier=models.Modifier.IDF,
        )
    },
)

# Hybrid search with Reciprocal Rank Fusion
results = client.query_points(
    collection_name="documents",
    prefetch=[
        models.Prefetch(
            query=dense_embedding,
            using="dense",
            limit=20,
        ),
        models.Prefetch(
            query=sparse_vector,
            using="bm25",
            limit=20,
        ),
    ],
    query=models.FusionQuery(fusion=models.Fusion.RRF),
    limit=10,
)
```
Benchmarks on production datasets consistently show hybrid search improving recall by 15-25% over dense-only search, particularly on queries with specific technical terms.
Re-Ranking: The Missing Middle Layer
The initial retrieval step optimizes for recall (do not miss relevant documents). A re-ranker then optimizes for precision (rank the most relevant results highest). Cross-encoder re-rankers like Cohere Rerank or BGE-reranker evaluate query-document pairs jointly, producing far more accurate relevance scores than embedding cosine similarity.
```python
from cohere import Client

cohere_client = Client(api_key="...")

def rerank_results(query: str, documents: list[str], top_n: int = 5):
    response = cohere_client.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=documents,
        top_n=top_n,
    )
    return [
        {"text": documents[r.index], "score": r.relevance_score}
        for r in response.results
    ]
```
The retrieval pipeline becomes: retrieve 20-50 candidates with hybrid search, then re-rank down to the top 5. This two-stage approach consistently outperforms simply retrieving the top 5 directly.
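The two stages compose cleanly as a single function. A sketch of that shape, with toy word-overlap functions standing in for hybrid search and a cross-encoder (in production these would call your vector database and a model such as Cohere Rerank or BGE-reranker):

```python
def two_stage_search(query, corpus, retrieve, score, k_retrieve=20, k_final=5):
    candidates = retrieve(query, corpus, k_retrieve)            # recall-oriented
    ranked = sorted(candidates, key=lambda d: score(query, d), reverse=True)
    return ranked[:k_final]                                     # precision-oriented

# Toy stand-ins: word-overlap retrieval, length-normalized overlap scoring.
def toy_retrieve(query, corpus, k):
    qw = set(query.lower().split())
    return sorted(corpus, key=lambda d: len(qw & set(d.lower().split())),
                  reverse=True)[:k]

def toy_score(query, doc):
    qw, dw = set(query.lower().split()), set(doc.lower().split())
    return len(qw & dw) / max(len(dw), 1)

corpus = ["rag retrieval pipeline", "cooking pasta at home", "retrieval and ranking"]
top = two_stage_search("retrieval pipeline", corpus, toy_retrieve, toy_score,
                       k_retrieve=3, k_final=1)
```

The key property is that the first stage may rank loosely as long as the relevant document is somewhere in the candidate set; the second stage fixes the ordering.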
Query Transformation
Multi-Query Expansion
A single user query often fails to capture all the ways relevant information might be phrased. Multi-query expansion generates multiple reformulations and retrieves results for each:
```python
async def multi_query_retrieve(query: str, retriever, llm) -> list[Document]:
    # Generate query variations
    expansion_prompt = f"""Generate 3 different search queries that would help
answer this question. Return only the queries, one per line.
Question: {query}"""
    variations = await llm.generate(expansion_prompt)
    all_queries = [query] + variations.strip().split("\n")
    # Retrieve for each query and deduplicate
    seen_ids = set()
    results = []
    for q in all_queries:
        docs = await retriever.search(q, top_k=5)
        for doc in docs:
            if doc.id not in seen_ids:
                seen_ids.add(doc.id)
                results.append(doc)
    return results
```
Step-Back Prompting
For complex questions, generate a more abstract "step-back" question that retrieves broader context:
- Original: "Why did the Q3 revenue drop for the enterprise segment?"
- Step-back: "What factors affect enterprise segment revenue?"
The step-back results provide foundational context, while the original query retrieves specific details. Combining both produces more complete answers.
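One way to wire this up is to retrieve for both queries and merge, putting the broad context first. A minimal sketch, where `ask_llm` and `retriever` are hypothetical hooks (a real system would prompt a model for the step-back question and call a vector store):

```python
STEP_BACK_PROMPT = (
    "Rewrite the question as a more general question about the "
    "underlying concepts.\nQuestion: {question}\nStep-back question:"
)

def step_back_retrieve(question, retriever, ask_llm, top_k=5):
    """Retrieve for the abstract query first, then the specific one, deduplicated."""
    broad_q = ask_llm(STEP_BACK_PROMPT.format(question=question))
    broad = retriever(broad_q, top_k)
    specific = retriever(question, top_k)
    merged, seen = [], set()
    for doc in broad + specific:   # foundational context first
        if doc not in seen:
            seen.add(doc)
            merged.append(doc)
    return merged
```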
Contextual Compression
Retrieved chunks often contain irrelevant sentences mixed with relevant ones. Contextual compression uses an LLM to extract only the query-relevant portions before generation:
```python
async def compress_context(query: str, documents: list[str], llm) -> list[str]:
    compressed = []
    for doc in documents:
        prompt = f"""Extract only the sentences from the following document
that are directly relevant to answering: "{query}"
If nothing is relevant, respond with "NOT_RELEVANT".
Document:
{doc}"""
        result = await llm.generate(prompt)
        if result.strip() != "NOT_RELEVANT":
            compressed.append(result)
    return compressed
```
This technique reduces prompt token usage by 40-60% while maintaining or improving answer quality, because the generation model does not have to filter through irrelevant content.
Agentic RAG
The most powerful RAG pattern in 2026 makes the retrieval pipeline itself agentic. Instead of a fixed retrieve-then-generate pipeline, an agent decides when to retrieve, what to retrieve, and whether the results are sufficient.
```python
class AgenticRAG:
    def __init__(self, llm, retriever, max_iterations=5):
        self.llm = llm
        self.retriever = retriever
        self.max_iterations = max_iterations

    async def answer(self, question: str) -> str:
        context = []
        for i in range(self.max_iterations):
            # Ask the LLM what to do next
            action = await self.llm.decide(
                question=question,
                context=context,
                options=["search", "answer", "refine_query"],
            )
            if action.type == "answer":
                return action.content
            elif action.type == "search":
                results = await self.retriever.search(action.query)
                context.extend(results)
            elif action.type == "refine_query":
                # The agent reformulates based on what it has learned
                results = await self.retriever.search(action.refined_query)
                context.extend(results)
        return await self._forced_answer(question, context)
```
Evaluation: Measuring RAG Quality
You cannot improve what you do not measure. The standard RAG evaluation framework uses three metrics:
| Metric | Measures | How |
|---|---|---|
| Context Relevance | Did the retriever find the right documents? | Judge each retrieved chunk for relevance to the query |
| Faithfulness | Does the answer stick to the retrieved context? | Check every claim in the answer against the context |
| Answer Relevance | Does the answer actually address the question? | Judge the answer against the original query |
```python
# Using ragas for evaluation
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

result = evaluate(
    dataset=eval_dataset,  # Questions + ground truth + retrieved contexts
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(result)
# {'faithfulness': 0.87, 'answer_relevancy': 0.91, 'context_precision': 0.78}
```
Production RAG systems in 2026 run these evaluations on every deployment, treating retrieval quality as a regression test.
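The regression-test framing can be as simple as a threshold gate in CI: fail the deployment when any metric drops below its floor. A sketch, where the scores dict mirrors what an evaluation framework like ragas returns and the threshold values are illustrative:

```python
# Illustrative per-metric floors; tune against your own baseline runs.
THRESHOLDS = {"faithfulness": 0.85, "answer_relevancy": 0.85, "context_precision": 0.70}

def check_regression(scores: dict[str, float]) -> list[str]:
    """Return the metrics that fall below their threshold (empty = pass)."""
    return [m for m, floor in THRESHOLDS.items() if scores.get(m, 0.0) < floor]

failures = check_regression(
    {"faithfulness": 0.87, "answer_relevancy": 0.91, "context_precision": 0.78}
)
assert not failures, f"RAG quality regression: {failures}"
```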
Key Architectural Decisions
Building production RAG comes down to a series of engineering tradeoffs:
- Chunk size: Smaller chunks (200-400 tokens) improve retrieval precision; larger chunks (800-1500 tokens) improve generation quality. Use parent-child chunking to get both.
- Embedding model: Larger embeddings (1024-dim, e.g. BGE-large) are more accurate but slower and more expensive to store. For most use cases, a 768-dim model like BGE-base is the sweet spot.
- Top-k: Retrieve more candidates (20-50) and re-rank down to fewer (3-7) for the final prompt.
- Update strategy: Decide between full re-indexing (simpler but slower) and incremental updates (faster but more complex) based on how frequently your data changes.
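These tradeoffs are easier to review and version when pinned down as one explicit config object rather than scattered as magic numbers. A sketch, with illustrative field names and defaults:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RagConfig:
    parent_chunk_tokens: int = 1200   # larger context for generation
    child_chunk_tokens: int = 300     # smaller chunks for retrieval precision
    embedding_dim: int = 768
    retrieve_k: int = 30              # recall-oriented first stage
    rerank_k: int = 5                 # precision-oriented final prompt
    incremental_index: bool = True    # vs. full re-indexing

cfg = RagConfig()
assert cfg.retrieve_k > cfg.rerank_k  # over-retrieve, then re-rank
```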
The teams getting the best results in 2026 treat RAG as an engineering system, not a one-time setup. They instrument every stage, measure quality continuously, and iterate on each component independently.