
Corrective RAG: Self-Correcting Retrieval with Relevance Checking and Web Fallback

Learn how Corrective RAG (CRAG) adds relevance scoring, re-retrieval, and web search fallback to catch and fix bad retrievals before they reach the user. Full Python implementation included.

The Problem CRAG Solves

Standard RAG has a silent failure mode: when the retriever returns irrelevant documents, the LLM either hallucinates an answer based on unrelated context or produces a vague response. The user has no way to know the retrieval failed because the system confidently presents whatever it generates.

Corrective RAG (CRAG) adds a quality gate between retrieval and generation. After retrieving documents, a relevance evaluator scores each result. If scores are high, generation proceeds normally. If scores are low, the system triggers corrective actions — rewriting the query, searching alternative sources, or falling back to web search.

This simple addition dramatically improves answer quality because most RAG failures originate in the retrieval step, not the generation step. Fix retrieval, and generation quality follows.

The CRAG Pipeline

The corrective RAG pipeline has four stages:

  1. Initial retrieval — Standard vector search returns top-K documents
  2. Relevance evaluation — Each document is scored for relevance to the query
  3. Corrective action — Based on scores, the system decides: proceed, refine, or fall back
  4. Generation — Only verified-relevant context reaches the LLM
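
The branching in stage 3 can be sketched as a small triage function. Everything here is illustrative (the names `Action` and `choose_action` are not from any library), and the "ambiguous" band at half the threshold is an arbitrary choice you would tune:

```python
from enum import Enum

class Action(Enum):
    PROCEED = "proceed"    # at least one clearly relevant doc
    REFINE = "refine"      # only middling docs: rewrite query and retry
    FALLBACK = "fallback"  # nothing usable: go to web search

def choose_action(scores: list[float], threshold: float = 0.5) -> Action:
    """Map per-document relevance scores to a corrective action."""
    if any(s >= threshold for s in scores):
        return Action.PROCEED
    # Middling scores may improve with a better-phrased query
    if any(s >= threshold / 2 for s in scores):
        return Action.REFINE
    return Action.FALLBACK
```

For example, `choose_action([0.9, 0.2])` proceeds, while `choose_action([0.1, 0.0])` falls back to web search.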

Full Implementation

import json

from dataclasses import dataclass
from enum import Enum

from openai import OpenAI

client = OpenAI()

class RelevanceLevel(Enum):
    CORRECT = "correct"
    AMBIGUOUS = "ambiguous"
    INCORRECT = "incorrect"

@dataclass
class ScoredDocument:
    content: str
    relevance: RelevanceLevel
    score: float

def evaluate_relevance(
    query: str, document: str
) -> tuple[RelevanceLevel, float]:
    """Score a retrieved document for relevance to the query."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "system",
            "content": """Rate the relevance of the document
            to the query. Return JSON:
            {"relevance": "correct|ambiguous|incorrect",
             "score": 0.0-1.0,
             "reasoning": "brief explanation"}"""
        }, {
            "role": "user",
            "content": f"Query: {query}\nDocument: {document}"
        }],
        response_format={"type": "json_object"}
    )
    result = json.loads(response.choices[0].message.content)
    return (
        RelevanceLevel(result["relevance"]),
        float(result["score"]),  # coerce in case the model returns a string
    )

def rewrite_query(original_query: str) -> str:
    """Rewrite the query for better retrieval results."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "system",
            "content": "Rewrite this search query to be more "
                       "specific and likely to retrieve relevant "
                       "documents. Return only the rewritten query."
        }, {
            "role": "user",
            "content": original_query
        }],
    )
    return response.choices[0].message.content

Adding Web Search Fallback

When internal documents are insufficient, CRAG falls back to web search:


import os

import requests

def web_search_fallback(query: str) -> list[str]:
    """Search the web when internal retrieval fails."""
    # Using a search API (Tavily, Serper, or similar)
    response = requests.post(
        "https://api.tavily.com/search",
        json={
            "api_key": os.environ["TAVILY_API_KEY"],
            "query": query,
            "max_results": 5,
            "include_raw_content": True,
        },
        timeout=30,
    )
    response.raise_for_status()
    results = response.json().get("results", [])
    # raw_content can be null; fall back to the snippet field
    return [
        (r.get("raw_content") or r.get("content", ""))[:2000]
        for r in results
    ]

def corrective_rag(
    query: str,
    retriever,
    relevance_threshold: float = 0.5,
) -> str:
    """Full CRAG pipeline with relevance checking
    and web fallback."""
    # Step 1: Initial retrieval
    raw_docs = retriever.search(query, k=5)

    # Step 2: Evaluate relevance of each document
    scored_docs = []
    for doc in raw_docs:
        level, score = evaluate_relevance(query, doc)
        scored_docs.append(ScoredDocument(doc, level, score))

    # Step 3: Determine corrective action
    relevant = [
        d for d in scored_docs
        if d.relevance == RelevanceLevel.CORRECT
        and d.score >= relevance_threshold
    ]
    ambiguous = [
        d for d in scored_docs
        if d.relevance == RelevanceLevel.AMBIGUOUS
    ]

    if relevant:
        # Enough good context — proceed with relevant docs
        context_docs = [d.content for d in relevant]
    elif ambiguous:
        # Only middling matches: rewrite the query and retry
        # (a stricter variant would re-score the retry results too)
        new_query = rewrite_query(query)
        retry_docs = retriever.search(new_query, k=5)
        context_docs = retry_docs
    else:
        # All irrelevant — fall back to web search
        context_docs = web_search_fallback(query)

    # Step 4: Generate with verified context
    context = "\n\n".join(context_docs)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": "Answer the question using only the "
                       "provided context. If the context is "
                       "insufficient, say so clearly."
        }, {
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {query}"
        }],
    )
    return response.choices[0].message.content

Tuning Relevance Thresholds

The relevance evaluator is the heart of CRAG. Set the threshold too high and you trigger unnecessary rewrites and web searches; set it too low and irrelevant documents slip through. Start with a threshold of 0.5 and calibrate against a labeled dataset of query-document pairs. Use GPT-4o-mini for evaluation to keep costs low — it is accurate enough for binary relevance judgments and roughly an order of magnitude cheaper than GPT-4o.
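
One way to calibrate, sketched below: assuming you have labeled pairs of (evaluator score, human relevance judgment), sweep a handful of candidate thresholds and keep the one with the best F1. The helper name and candidate grid are illustrative:

```python
def sweep_thresholds(
    pairs: list[tuple[float, bool]],
    candidates: tuple[float, ...] = (0.3, 0.4, 0.5, 0.6, 0.7),
) -> float:
    """Pick the threshold with the best F1 on labeled pairs.

    Each pair is (evaluator_score, human_label_is_relevant).
    """
    def f1(threshold: float) -> float:
        tp = sum(1 for s, y in pairs if s >= threshold and y)
        fp = sum(1 for s, y in pairs if s >= threshold and not y)
        fn = sum(1 for s, y in pairs if s < threshold and y)
        if tp == 0:
            return 0.0
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        return 2 * precision * recall / (precision + recall)

    return max(candidates, key=f1)
```

On a labeled set where all relevant documents score above 0.7 and all irrelevant ones below it, the sweep picks 0.7; on noisier data it will trade precision against recall across the grid.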

Production Considerations

In production, log every relevance evaluation with the query, document, and score. This creates a dataset for fine-tuning a smaller, faster relevance model. Track your fallback rate — if more than 20% of queries trigger web search, your knowledge base likely has coverage gaps that should be addressed at the indexing level.
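
A minimal sketch of the monitoring side, assuming each query produces one log record with the corrective action taken (the `"action"` field name is made up for illustration):

```python
from collections import Counter

def fallback_rate(log_records: list[dict]) -> float:
    """Fraction of queries that ended in web-search fallback.

    Each record is expected to carry an "action" field with one of
    "proceed", "refine", or "fallback".
    """
    if not log_records:
        return 0.0
    counts = Counter(r["action"] for r in log_records)
    return counts["fallback"] / len(log_records)
```

Alert when this rate crosses your budget (e.g. 0.2) to surface knowledge-base coverage gaps early.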

FAQ

Does the relevance evaluation step add significant latency?

Each evaluation takes 200-400ms with GPT-4o-mini. Since you can evaluate all documents in parallel, the total added latency is roughly one LLM call regardless of how many documents you retrieved. This 300ms investment prevents far costlier failures from irrelevant context.
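
Parallel evaluation is a few lines with a thread pool. The stub scorer below stands in for the real LLM call so the shape of the code is clear; in practice you would map `evaluate_relevance` over the documents instead:

```python
from concurrent.futures import ThreadPoolExecutor

def score_stub(query: str, document: str) -> float:
    """Placeholder for the LLM-based evaluator (a network call in practice)."""
    return 1.0 if query.lower() in document.lower() else 0.0

def evaluate_all(query: str, documents: list[str]) -> list[float]:
    """Score every document concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(lambda d: score_stub(query, d), documents))
```

Because the work is I/O-bound API calls, threads are enough; wall-clock time is roughly one call regardless of document count.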

Can I use a local model for relevance scoring instead of an API?

Yes. A fine-tuned BERT or DeBERTa classifier trained on query-document relevance pairs can score documents in under 10ms each. Start with an LLM-based evaluator to collect training data, then distill it into a local model for production speed.
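
The distillation step starts from the evaluation logs mentioned above. A sketch of the export, assuming records shaped like the evaluator's JSON output (field names are illustrative), collapsing the three-way judgment into binary labels:

```python
def to_training_rows(log_records: list[dict]) -> list[tuple[str, str, int]]:
    """Turn logged LLM relevance judgments into classifier training rows.

    Expects records like {"query": ..., "document": ..., "relevance": ...};
    "correct" maps to label 1, "ambiguous" and "incorrect" to 0.
    """
    return [
        (r["query"], r["document"], 1 if r["relevance"] == "correct" else 0)
        for r in log_records
    ]
```

These (query, document, label) triples are the standard input format for fine-tuning a cross-encoder relevance classifier.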

How does CRAG compare to simply retrieving more documents?

Retrieving more documents increases the chance of finding relevant content but also increases noise. CRAG is more surgical — it retrieves a focused set, evaluates quality, and only expands the search when necessary. This keeps context windows clean and generation quality high.


#CorrectiveRAG #CRAG #RAG #RelevanceScoring #WebSearchFallback #AgenticAI #LearnAI #AIEngineering
