
Debugging RAG Retrieval: When the Agent Retrieves Wrong or Irrelevant Documents

Learn systematic approaches to debugging RAG retrieval failures including query analysis, embedding inspection, relevance scoring evaluation, and chunk quality review for more accurate AI agent responses.

The Right Question, the Wrong Answer

Your RAG-powered agent has access to thousands of documents. A user asks a straightforward question. The agent retrieves three chunks, synthesizes a response, and delivers it confidently. The response is wrong — not because the model hallucinated, but because it was given the wrong documents to work with.

RAG retrieval failures are particularly dangerous because the agent has no way to know it retrieved bad chunks. It trusts what it receives and generates a plausible-sounding answer from irrelevant source material. Debugging this requires inspecting every stage of the retrieval pipeline.

The RAG Retrieval Pipeline

Every RAG query passes through four stages, and failures can occur at each one:

  1. Query formation: The user question is transformed into a search query
  2. Embedding: The query is converted to a vector
  3. Vector search: The nearest neighbor chunks are retrieved
  4. Relevance filtering: Results below a threshold are discarded

Build a debugger that captures data at every stage:

import numpy as np
from dataclasses import dataclass, field

@dataclass
class RetrievalDebugInfo:
    original_query: str = ""
    search_query: str = ""
    query_embedding: list[float] = field(default_factory=list)
    raw_results: list[dict] = field(default_factory=list)
    filtered_results: list[dict] = field(default_factory=list)
    similarity_scores: list[float] = field(default_factory=list)

class RAGDebugger:
    def __init__(self, embedding_client, vector_store):
        self.embedding_client = embedding_client
        self.vector_store = vector_store

    async def debug_retrieve(
        self,
        query: str,
        top_k: int = 5,
        threshold: float = 0.7,
    ) -> RetrievalDebugInfo:
        info = RetrievalDebugInfo(original_query=query)

        # Stage 1: Query formation
        info.search_query = query  # or apply transformation
        print(f"[1] Query: {info.search_query}")

        # Stage 2: Embedding
        response = await self.embedding_client.embeddings.create(
            model="text-embedding-3-small",
            input=info.search_query,
        )
        info.query_embedding = response.data[0].embedding
        print(f"[2] Embedding dim: {len(info.query_embedding)}")

        # Stage 3: Vector search
        results = await self.vector_store.query(
            embedding=info.query_embedding,
            top_k=top_k,
        )
        info.raw_results = results
        info.similarity_scores = [r["score"] for r in results]
        print(f"[3] Raw results: {len(results)}")
        for i, r in enumerate(results):
            print(f"    [{i}] score={r['score']:.4f} | {r['text'][:80]}...")

        # Stage 4: Filtering
        info.filtered_results = [
            r for r in results if r["score"] >= threshold
        ]
        print(f"[4] After filter (>={threshold}): {len(info.filtered_results)}")

        return info

Diagnosing Query-Document Mismatch

The most common RAG failure is a semantic gap between the query and the stored chunks. The user asks one thing, but the embedding model interprets it differently:

async def diagnose_query_mismatch(
    debugger, query: str, expected_doc_ids: list[str]
):
    """Check if expected documents score higher than retrieved ones."""
    info = await debugger.debug_retrieve(query, top_k=20)

    retrieved_ids = {r["id"] for r in info.raw_results}
    expected_set = set(expected_doc_ids)

    found = expected_set & retrieved_ids
    missed = expected_set - retrieved_ids

    print(f"Expected docs found in top-20: {len(found)}/{len(expected_set)}")
    if missed:
        print(f"Missing doc IDs: {missed}")
        # Fetch embeddings for missing docs and compute similarity
        for doc_id in missed:
            doc = await debugger.vector_store.get_by_id(doc_id)
            if doc:
                doc_emb = doc["embedding"]
                query_emb = np.array(info.query_embedding)
                similarity = np.dot(query_emb, np.array(doc_emb)) / (
                    np.linalg.norm(query_emb) * np.linalg.norm(doc_emb)
                )
                print(f"  {doc_id}: similarity={similarity:.4f}")
                print(f"    Content: {doc['text'][:100]}...")

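The inline cosine computation above appears in several places; it can be factored into a small helper and sanity-checked against vectors with known geometry:

```python
import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    a_arr, b_arr = np.array(a), np.array(b)
    return float(np.dot(a_arr, b_arr) / (np.linalg.norm(a_arr) * np.linalg.norm(b_arr)))

# Parallel vectors score 1.0; orthogonal vectors score 0.0.
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

If a missing document scores well below your filter threshold here, the problem is upstream of filtering: either the chunk text or the query phrasing needs to change.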
Inspecting Chunk Quality

Bad chunking is a silent killer of RAG accuracy. Chunks that split important information across boundaries lose semantic coherence:

class ChunkQualityAnalyzer:
    def __init__(self, embedding_client):
        self.client = embedding_client

    async def analyze_chunks(self, chunks: list[str], query: str):
        """Score each chunk for self-containedness and relevance."""
        # Embed query and all chunks
        all_texts = [query] + chunks
        response = await self.client.embeddings.create(
            model="text-embedding-3-small",
            input=all_texts,
        )
        embeddings = [d.embedding for d in response.data]
        query_emb = np.array(embeddings[0])

        print(f"Analyzing {len(chunks)} chunks against query")
        print("-" * 60)

        for i, chunk in enumerate(chunks):
            chunk_emb = np.array(embeddings[i + 1])
            similarity = float(np.dot(query_emb, chunk_emb) / (
                np.linalg.norm(query_emb) * np.linalg.norm(chunk_emb)
            ))
            word_count = len(chunk.split())
            has_incomplete_sentence = (
                not chunk.strip().endswith((".", "!", "?", '."', ".'"))
            )

            print(f"Chunk {i}: similarity={similarity:.4f}, "
                  f"words={word_count}, "
                  f"incomplete={'YES' if has_incomplete_sentence else 'no'}")
            if has_incomplete_sentence:
                print(f"  Ends with: ...{chunk[-60:]}")

Testing with Known-Good Queries

Build a test suite of queries with expected document matches to catch retrieval regressions:

class RAGTestSuite:
    def __init__(self, debugger):
        self.debugger = debugger
        self.test_cases = []

    def add_case(self, query: str, expected_doc_ids: list[str], threshold=0.7):
        self.test_cases.append({
            "query": query,
            "expected": expected_doc_ids,
            "threshold": threshold,
        })

    async def run(self):
        results = []
        for case in self.test_cases:
            info = await self.debugger.debug_retrieve(
                case["query"], top_k=10, threshold=case["threshold"]
            )
            retrieved_ids = {r["id"] for r in info.filtered_results}
            expected = set(case["expected"])
            recall = len(expected & retrieved_ids) / len(expected) if expected else 1.0

            results.append({
                "query": case["query"],
                "recall": recall,
                "pass": recall >= 0.8,
            })
            status = "PASS" if recall >= 0.8 else "FAIL"
            print(f"[{status}] recall={recall:.0%} | {case['query'][:60]}")
        return results
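The recall metric computed in `run` reduces to a set intersection over document IDs; isolated, it looks like this:

```python
def retrieval_recall(expected_ids: set[str], retrieved_ids: set[str]) -> float:
    """Fraction of expected documents present in the retrieved set."""
    if not expected_ids:
        return 1.0  # vacuously perfect when nothing was expected
    return len(expected_ids & retrieved_ids) / len(expected_ids)

# 2 of 3 expected docs retrieved -> recall 0.67, below the 0.8 pass bar.
print(round(retrieval_recall({"a", "b", "c"}, {"a", "b", "x"}), 2))  # 0.67
```

Tracking this number per query across deployments is what turns retrieval debugging from a one-off exercise into a regression test.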

FAQ

What if the agent keeps retrieving chunks that are related to the topic but don't actually answer the question?

This is a precision problem. Increase your similarity threshold to filter out loosely related chunks. Also consider using a reranker model as a second-stage filter — cross-encoder rerankers like Cohere Rerank or BGE Reranker evaluate query-document pairs more accurately than cosine similarity on embeddings alone.
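A second-stage reranker slots in behind any query-document scoring function. The sketch below uses a pluggable `score_fn` as a stand-in for a real cross-encoder call (the function names and result shape here are illustrative assumptions, not a specific library's API):

```python
from typing import Callable

def rerank(
    query: str,
    results: list[dict],
    score_fn: Callable[[str, str], float],
    top_k: int = 3,
) -> list[dict]:
    """Re-score candidates with a query-document scorer and keep the best."""
    rescored = [
        {**r, "rerank_score": score_fn(query, r["text"])} for r in results
    ]
    rescored.sort(key=lambda r: r["rerank_score"], reverse=True)
    return rescored[:top_k]

# Toy scorer: word overlap between query and document. A real system
# would call a cross-encoder model here instead.
def overlap_score(query: str, doc: str) -> float:
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

candidates = [
    {"id": "a", "text": "shipping times vary by region"},
    {"id": "b", "text": "our refund policy covers 30 days"},
]
top = rerank("what is the refund policy", candidates, overlap_score, top_k=1)
print(top[0]["id"])  # b
```

Because the reranker only sees the top-k candidates from vector search, it is cheap enough to run on every query while fixing most precision problems.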

Should I embed the user question directly or rewrite it before searching?

Query rewriting often improves retrieval significantly. Use the LLM to expand abbreviations, resolve pronouns from conversation history, and rephrase colloquial language into terminology that matches your documents. Even a simple rewriting step can increase recall substantially in practice.

How do I decide the right chunk size for my documents?

There is no universal answer — it depends on your content. Start with 500 to 800 tokens with 100-token overlap. Test with your actual queries and measure recall. If chunks are too small, they lack context. If too large, they dilute relevance. Technical documentation often benefits from smaller chunks while narrative content works better with larger ones.
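A minimal sliding-window chunker makes the size/overlap trade-off concrete (word-based here for simplicity; a production version would count model tokens with a tokenizer such as tiktoken):

```python
def chunk_text(text: str, chunk_size: int = 600, overlap: int = 100) -> list[str]:
    """Split text into overlapping word windows approximating token chunks."""
    words = text.split()
    if not words:
        return []
    step = max(chunk_size - overlap, 1)
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

doc = " ".join(f"word{i}" for i in range(1000))
chunks = chunk_text(doc, chunk_size=600, overlap=100)
print(len(chunks))  # 2: words 0-599 and words 500-999
```

Run your known-good query suite after every change to these two parameters; chunking changes silently invalidate previously tuned thresholds.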


#Debugging #RAG #Embeddings #VectorSearch #AIAgents #AgenticAI #LearnAI #AIEngineering

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
