Corrective RAG: Self-Correcting Retrieval with Relevance Checking and Web Fallback
Learn how Corrective RAG (CRAG) adds relevance scoring, re-retrieval, and web search fallback to catch and fix bad retrievals before they reach the user. Full Python implementation included.
The Problem CRAG Solves
Standard RAG has a silent failure mode: when the retriever returns irrelevant documents, the LLM either hallucinates an answer based on unrelated context or produces a vague response. The user has no way to know the retrieval failed because the system confidently presents whatever it generates.
Corrective RAG (CRAG) adds a quality gate between retrieval and generation. After retrieving documents, a relevance evaluator scores each result. If scores are high, generation proceeds normally. If scores are low, the system triggers corrective actions — rewriting the query, searching alternative sources, or falling back to web search.
This simple addition dramatically improves answer quality because most RAG failures originate in the retrieval step, not the generation step. Fix retrieval, and generation quality follows.
The CRAG Pipeline
The corrective RAG pipeline has four stages:
- Initial retrieval — Standard vector search returns top-K documents
- Relevance evaluation — Each document is scored for relevance to the query
- Corrective action — Based on scores, the system decides: proceed, refine, or fall back
- Generation — Only verified-relevant context reaches the LLM
Full Implementation
import json

from dataclasses import dataclass
from enum import Enum

from openai import OpenAI

client = OpenAI()


class RelevanceLevel(Enum):
    CORRECT = "correct"
    AMBIGUOUS = "ambiguous"
    INCORRECT = "incorrect"


@dataclass
class ScoredDocument:
    content: str
    relevance: RelevanceLevel
    score: float


def evaluate_relevance(
    query: str, document: str
) -> tuple[RelevanceLevel, float]:
    """Score a retrieved document for relevance to the query."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "system",
            "content": """Rate the relevance of the document
to the query. Return JSON:
{"relevance": "correct|ambiguous|incorrect",
 "score": 0.0-1.0,
 "reasoning": "brief explanation"}"""
        }, {
            "role": "user",
            "content": f"Query: {query}\nDocument: {document}"
        }],
        response_format={"type": "json_object"},
    )
    result = json.loads(response.choices[0].message.content)
    return (
        RelevanceLevel(result["relevance"]),
        result["score"],
    )
def rewrite_query(original_query: str) -> str:
    """Rewrite the query for better retrieval results."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "system",
            "content": "Rewrite this search query to be more "
                       "specific and likely to retrieve relevant "
                       "documents. Return only the rewritten query."
        }, {
            "role": "user",
            "content": original_query
        }],
    )
    return response.choices[0].message.content
Adding Web Search Fallback
When internal documents are insufficient, CRAG falls back to web search:
import requests


def web_search_fallback(query: str) -> list[str]:
    """Search the web when internal retrieval fails."""
    # Using a search API (Tavily, Serper, or similar)
    response = requests.post(
        "https://api.tavily.com/search",
        json={
            "api_key": "your-tavily-key",
            "query": query,
            "max_results": 5,
            "include_raw_content": True,
        },
        timeout=15,
    )
    response.raise_for_status()
    results = response.json().get("results", [])
    # raw_content can be null for some results; skip those
    return [r["raw_content"][:2000] for r in results if r.get("raw_content")]
def corrective_rag(
    query: str,
    retriever,
    relevance_threshold: float = 0.5,
) -> str:
    """Full CRAG pipeline with relevance checking and web fallback."""
    # Step 1: Initial retrieval
    raw_docs = retriever.search(query, k=5)

    # Step 2: Evaluate relevance of each document
    scored_docs = []
    for doc in raw_docs:
        level, score = evaluate_relevance(query, doc)
        scored_docs.append(ScoredDocument(doc, level, score))

    # Step 3: Determine corrective action
    relevant = [
        d for d in scored_docs
        if d.relevance == RelevanceLevel.CORRECT
        and d.score >= relevance_threshold
    ]
    ambiguous = [
        d for d in scored_docs
        if d.relevance == RelevanceLevel.AMBIGUOUS
    ]

    if relevant:
        # Enough good context — proceed with relevant docs
        context_docs = [d.content for d in relevant]
    elif ambiguous:
        # Rewrite query and try again
        new_query = rewrite_query(query)
        context_docs = retriever.search(new_query, k=5)
    else:
        # All irrelevant — fall back to web search
        context_docs = web_search_fallback(query)

    # Step 4: Generate with verified context
    context = "\n\n".join(context_docs)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": "Answer the question using only the "
                       "provided context. If the context is "
                       "insufficient, say so clearly."
        }, {
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {query}"
        }],
    )
    return response.choices[0].message.content
Tuning Relevance Thresholds
The relevance evaluator is the heart of CRAG. Set thresholds too high and you trigger unnecessary web searches. Set them too low and irrelevant documents slip through. Start with a threshold of 0.5 and calibrate against a labeled dataset of query-document pairs. Use GPT-4o-mini for evaluation to keep costs low — it is accurate enough for binary relevance judgments and 10x cheaper than GPT-4o.
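That calibration step can be sketched in a few lines. Given labeled (score, is_relevant) pairs from your evaluation set, sweep candidate thresholds and keep the one that maximizes F1 (the `pick_threshold` helper is illustrative, not part of any library):

```python
def pick_threshold(labeled: list[tuple[float, bool]]) -> float:
    """Pick the score cutoff that maximizes F1 over labeled
    (score, is_relevant) pairs from a calibration set."""
    best_t, best_f1 = 0.5, -1.0
    for t in [i / 20 for i in range(1, 20)]:  # candidates 0.05 .. 0.95
        tp = sum(1 for s, y in labeled if s >= t and y)
        fp = sum(1 for s, y in labeled if s >= t and not y)
        fn = sum(1 for s, y in labeled if s < t and y)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t
```

A few dozen labeled pairs are usually enough to see whether 0.5 is in the right neighborhood for your evaluator.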
Production Considerations
In production, log every relevance evaluation with the query, document, and score. This creates a dataset for fine-tuning a smaller, faster relevance model. Track your fallback rate — if more than 20% of queries trigger web search, your knowledge base likely has coverage gaps that should be addressed at the indexing level.
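A minimal sketch of that logging, assuming a simple in-memory record (the `EvalLog` shape and `fallback_rate` helper are illustrative; in production you would write JSON lines to durable storage):

```python
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class EvalLog:
    query: str
    document: str
    relevance: str
    score: float
    action: str  # "proceed", "rewrite", or "web_fallback"
    timestamp: float

logs: list[EvalLog] = []

def log_evaluation(query, document, relevance, score, action):
    """Record one relevance evaluation; returns the JSON line."""
    entry = EvalLog(query, document, relevance, score, action, time.time())
    logs.append(entry)
    return json.dumps(asdict(entry))

def fallback_rate(entries: list[EvalLog]) -> float:
    """Fraction of logged queries that hit web fallback.
    Sustained values above ~0.2 suggest coverage gaps."""
    if not entries:
        return 0.0
    return sum(e.action == "web_fallback" for e in entries) / len(entries)
```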
FAQ
Does the relevance evaluation step add significant latency?
Each evaluation takes 200-400ms with GPT-4o-mini. Since you can evaluate all documents in parallel, the total added latency is roughly one LLM call regardless of how many documents you retrieved. This 300ms investment prevents far costlier failures from irrelevant context.
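The parallel fan-out can be sketched with a thread pool; here a stand-in scoring function replaces the real `evaluate_relevance` call for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_all(query: str, docs: list[str], evaluate) -> list[tuple]:
    """Score every document concurrently; wall-clock time is
    roughly one evaluation rather than len(docs) evaluations."""
    with ThreadPoolExecutor(max_workers=max(len(docs), 1)) as pool:
        futures = [pool.submit(evaluate, query, d) for d in docs]
        return [f.result() for f in futures]

# Stand-in evaluator (the real one calls the LLM)
def fake_evaluate(query: str, doc: str) -> tuple[str, float]:
    return ("correct", 0.9) if query.split()[0] in doc else ("incorrect", 0.1)
```

Because the work is I/O-bound API calls, threads are sufficient; no process pool is needed.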
Can I use a local model for relevance scoring instead of an API?
Yes. A fine-tuned BERT or DeBERTa classifier trained on query-document relevance pairs can score documents in under 10ms each. Start with an LLM-based evaluator to collect training data, then distill it into a local model for production speed.
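As a sketch of that distillation step, the logged LLM judgments can be turned into binary training pairs for a small classifier (the record shape here is hypothetical, matching the logging example above only loosely):

```python
def build_training_pairs(logged_evals: list[dict]) -> list[dict]:
    """Convert logged LLM relevance judgments into (query, document,
    label) examples for fine-tuning a small relevance classifier.
    Ambiguous judgments are dropped to keep the labels clean."""
    pairs = []
    for e in logged_evals:
        if e["relevance"] == "ambiguous":
            continue  # noisy labels hurt small models most
        pairs.append({
            "query": e["query"],
            "document": e["document"],
            "label": 1 if e["relevance"] == "correct" else 0,
        })
    return pairs
```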
How does CRAG compare to simply retrieving more documents?
Retrieving more documents increases the chance of finding relevant content but also increases noise. CRAG is more surgical — it retrieves a focused set, evaluates quality, and only expands the search when necessary. This keeps context windows clean and generation quality high.
#CorrectiveRAG #CRAG #RAG #RelevanceScoring #WebSearchFallback #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.