Debugging RAG Retrieval: When the Agent Retrieves Wrong or Irrelevant Documents
Learn systematic approaches to debugging RAG retrieval failures including query analysis, embedding inspection, relevance scoring evaluation, and chunk quality review for more accurate AI agent responses.
The Right Question, the Wrong Answer
Your RAG-powered agent has access to thousands of documents. A user asks a straightforward question. The agent retrieves three chunks, synthesizes a response, and delivers it confidently. The response is wrong — not because the model hallucinated, but because it was given the wrong documents to work with.
RAG retrieval failures are particularly dangerous because the agent has no way to know it retrieved bad chunks. It trusts what it receives and generates a plausible-sounding answer from irrelevant source material. Debugging this requires inspecting every stage of the retrieval pipeline.
The RAG Retrieval Pipeline
Every RAG query passes through four stages, and failures can occur at each one:
- Query formation: The user question is transformed into a search query
- Embedding: The query is converted to a vector
- Vector search: The nearest neighbor chunks are retrieved
- Relevance filtering: Results below a threshold are discarded
Build a debugger that captures data at every stage:
import numpy as np
from dataclasses import dataclass, field


@dataclass
class RetrievalDebugInfo:
    original_query: str = ""
    search_query: str = ""
    query_embedding: list[float] = field(default_factory=list)
    raw_results: list[dict] = field(default_factory=list)
    filtered_results: list[dict] = field(default_factory=list)
    similarity_scores: list[float] = field(default_factory=list)


class RAGDebugger:
    def __init__(self, embedding_client, vector_store):
        self.embedding_client = embedding_client
        self.vector_store = vector_store

    async def debug_retrieve(
        self,
        query: str,
        top_k: int = 5,
        threshold: float = 0.7,
    ) -> RetrievalDebugInfo:
        info = RetrievalDebugInfo(original_query=query)

        # Stage 1: Query formation
        info.search_query = query  # or apply transformation
        print(f"[1] Query: {info.search_query}")

        # Stage 2: Embedding
        response = await self.embedding_client.embeddings.create(
            model="text-embedding-3-small",
            input=info.search_query,
        )
        info.query_embedding = response.data[0].embedding
        print(f"[2] Embedding dim: {len(info.query_embedding)}")

        # Stage 3: Vector search
        results = await self.vector_store.query(
            embedding=info.query_embedding,
            top_k=top_k,
        )
        info.raw_results = results
        info.similarity_scores = [r["score"] for r in results]
        print(f"[3] Raw results: {len(results)}")
        for i, r in enumerate(results):
            print(f"  [{i}] score={r['score']:.4f} | {r['text'][:80]}...")

        # Stage 4: Filtering
        info.filtered_results = [
            r for r in results if r["score"] >= threshold
        ]
        print(f"[4] After filter (>={threshold}): {len(info.filtered_results)}")
        return info
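The scores surfaced in stage 3 are typically cosine similarities between the query vector and each chunk vector, though the exact metric depends on how your vector store is configured. For ad-hoc checks outside the store, a minimal reference implementation:

```python
import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors: dot product over the product of norms."""
    a_arr, b_arr = np.array(a, dtype=float), np.array(b, dtype=float)
    return float(np.dot(a_arr, b_arr) / (np.linalg.norm(a_arr) * np.linalg.norm(b_arr)))
```

Identical vectors score 1.0 and orthogonal vectors 0.0, which gives you a quick sanity check on any score your store reports.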
Diagnosing Query-Document Mismatch
The most common RAG failure is a semantic gap between the query and the stored chunks. The user asks one thing, but the embedding model interprets it differently:
async def diagnose_query_mismatch(
    debugger, query: str, expected_doc_ids: list[str]
):
    """Check if expected documents score higher than retrieved ones."""
    info = await debugger.debug_retrieve(query, top_k=20)

    retrieved_ids = {r["id"] for r in info.raw_results}
    expected_set = set(expected_doc_ids)
    found = expected_set & retrieved_ids
    missed = expected_set - retrieved_ids

    print(f"Expected docs found in top-20: {len(found)}/{len(expected_set)}")
    if missed:
        print(f"Missing doc IDs: {missed}")
        # Fetch embeddings for missing docs and compute similarity
        for doc_id in missed:
            doc = await debugger.vector_store.get_by_id(doc_id)
            if doc:
                doc_emb = doc["embedding"]
                query_emb = np.array(info.query_embedding)
                similarity = np.dot(query_emb, np.array(doc_emb)) / (
                    np.linalg.norm(query_emb) * np.linalg.norm(doc_emb)
                )
                print(f"  {doc_id}: similarity={similarity:.4f}")
                print(f"  Content: {doc['text'][:100]}...")
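A quick way to see why such misses happen: the lexical overlap between a question and the document that answers it is often zero, which is exactly the gap embeddings are meant to bridge, and sometimes fail to. A toy illustration with made-up strings:

```python
def word_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets -- a crude lexical-match proxy."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

query = "how do i get my money back"
doc = "refund policy: customers receive reimbursement within 30 days"
print(word_overlap(query, doc))  # zero shared words despite the identical topic
```

When a diagnosis like the one above shows low similarity for the right document, suspect this kind of vocabulary gap before blaming the vector store.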
Inspecting Chunk Quality
Bad chunking is a silent killer of RAG accuracy. Chunks that split important information across boundaries lose semantic coherence:
class ChunkQualityAnalyzer:
    def __init__(self, embedding_client):
        self.client = embedding_client

    async def analyze_chunks(self, chunks: list[str], query: str):
        """Score each chunk for self-containedness and relevance."""
        # Embed query and all chunks in one batch
        all_texts = [query] + chunks
        response = await self.client.embeddings.create(
            model="text-embedding-3-small",
            input=all_texts,
        )
        embeddings = [d.embedding for d in response.data]
        query_emb = np.array(embeddings[0])

        print(f"Analyzing {len(chunks)} chunks against query")
        print("-" * 60)
        for i, chunk in enumerate(chunks):
            chunk_emb = np.array(embeddings[i + 1])
            similarity = float(np.dot(query_emb, chunk_emb) / (
                np.linalg.norm(query_emb) * np.linalg.norm(chunk_emb)
            ))
            word_count = len(chunk.split())
            has_incomplete_sentence = (
                not chunk.strip().endswith((".", "!", "?", '."', ".'"))
            )
            print(f"Chunk {i}: similarity={similarity:.4f}, "
                  f"words={word_count}, "
                  f"incomplete={'YES' if has_incomplete_sentence else 'no'}")
            if has_incomplete_sentence:
                print(f"  Ends with: ...{chunk[-60:]}")
Testing with Known-Good Queries
Build a test suite of queries with expected document matches to catch retrieval regressions:
class RAGTestSuite:
    def __init__(self, debugger):
        self.debugger = debugger
        self.test_cases = []

    def add_case(self, query: str, expected_doc_ids: list[str], threshold=0.7):
        self.test_cases.append({
            "query": query,
            "expected": expected_doc_ids,
            "threshold": threshold,
        })

    async def run(self):
        results = []
        for case in self.test_cases:
            info = await self.debugger.debug_retrieve(
                case["query"], top_k=10, threshold=case["threshold"]
            )
            retrieved_ids = {r["id"] for r in info.filtered_results}
            expected = set(case["expected"])
            recall = len(expected & retrieved_ids) / len(expected) if expected else 1.0
            results.append({
                "query": case["query"],
                "recall": recall,
                "pass": recall >= 0.8,
            })
            status = "PASS" if recall >= 0.8 else "FAIL"
            print(f"[{status}] recall={recall:.0%} | {case['query'][:60]}")
        return results
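The recall computation inside run() is worth factoring out so the same metric can be reported from CI jobs or dashboards without pulling in the whole suite; a minimal standalone version:

```python
def recall_at_k(expected_ids: list[str], retrieved_ids: list[str]) -> float:
    """Fraction of expected documents present in the retrieved set."""
    expected = set(expected_ids)
    if not expected:
        return 1.0  # vacuously perfect, matching the suite's convention
    return len(expected & set(retrieved_ids)) / len(expected)
```

Tracking this number per query over time is what turns the suite into a regression detector rather than a one-off check.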
FAQ
My RAG retrieves documents that are topically related but do not answer the specific question. How do I fix this?
This is a precision problem. Increase your similarity threshold to filter out loosely related chunks. Also consider using a reranker model as a second-stage filter — cross-encoder rerankers like Cohere Rerank or BGE Reranker evaluate query-document pairs more accurately than cosine similarity on embeddings alone.
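The reranking step itself can be sketched independently of any particular model. Here score_fn stands in for a real cross-encoder call (for example, a Cohere Rerank or BGE Reranker client), and the word-overlap scorer below is only a toy stand-in for illustration:

```python
def rerank(query: str, candidates: list[dict], score_fn, top_n: int = 3) -> list[dict]:
    """Second-stage filter: re-score each (query, chunk) pair and keep the top_n."""
    scored = [(score_fn(query, c["text"]), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_n]]

# Toy scorer standing in for a cross-encoder's relevance score
def overlap_score(query: str, text: str) -> float:
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / max(len(q), 1)
```

The pattern matters more than the scorer: retrieve generously (say top-20) with vector search, then let the more expensive pairwise model pick the final handful.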
Should I embed the user question directly or rewrite it before searching?
Query rewriting often improves retrieval significantly. Use the LLM to expand abbreviations, resolve pronouns from conversation history, and rephrase colloquial language into terminology that matches your documents. A simple rewriting step can increase recall by 20 to 40 percent.
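A minimal sketch of that rewriting step, assuming you pass the assembled prompt to whatever chat-completion client you already use; the prompt wording here is illustrative, not prescriptive:

```python
REWRITE_PROMPT = """Rewrite the user's question as a standalone search query.
- Expand abbreviations and acronyms.
- Resolve pronouns using the conversation history.
- Prefer formal terminology likely to appear in the documents.

Conversation history:
{history}

User question: {question}

Search query:"""

def build_rewrite_prompt(question: str, history: list[str]) -> str:
    """Assemble the rewriting prompt; the actual LLM call is left to the caller."""
    joined = "\n".join(history) if history else "(none)"
    return REWRITE_PROMPT.format(history=joined, question=question)
```

Embed the model's rewritten query instead of the raw question, and log both so you can see when the rewrite itself is the failure point.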
How do I decide the right chunk size for my documents?
There is no universal answer — it depends on your content. Start with 500 to 800 tokens with 100-token overlap. Test with your actual queries and measure recall. If chunks are too small, they lack context. If too large, they dilute relevance. Technical documentation often benefits from smaller chunks while narrative content works better with larger ones.
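A starting-point chunker matching those numbers, using words as a rough proxy for tokens (a production version would count with your actual tokenizer, e.g. tiktoken):

```python
def chunk_words(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into fixed-size word windows, with overlap shared between neighbors."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```

Run your known-good query suite against a few (chunk_size, overlap) settings and keep the one with the best recall, rather than trusting any rule of thumb.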
CallSphere Team
Expert insights on AI voice agents and customer communication automation.