Self-RAG: Teaching Models to Retrieve, Critique, and Regenerate Adaptively
Learn how Self-RAG enables language models to decide when to retrieve, evaluate their own outputs for relevance and support, and regenerate when quality is insufficient. Full implementation guide.
What Self-RAG Changes About Retrieval
Standard RAG retrieves for every query, regardless of whether the model already knows the answer. Agentic RAG moves that decision to an external agent. Self-RAG goes further: it trains the language model itself to decide when to retrieve, critique its own outputs, and regenerate when its self-assessment indicates poor quality.
The Self-RAG paper introduced four special reflection tokens that the model learns to generate:
- Retrieve — Should I retrieve information for this? (yes/no/continue)
- IsRelevant — Is this retrieved passage relevant? (relevant/irrelevant)
- IsSupported — Is my generation supported by the evidence? (fully/partially/no)
- IsUseful — Is this response useful to the user? (5/4/3/2/1)
These tokens act as inline quality gates, making the model self-aware about when it needs help and whether its output is trustworthy.
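In a trained Self-RAG model these reflection tokens are emitted inline with the generated text, so a downstream system has to separate them from the answer. Here is a minimal parser to illustrate the idea; the bracketed `[Token=value]` surface form and the `parse_reflection_tokens` helper are assumptions for this sketch, not the paper's actual token vocabulary.

```python
import re

# Assumed inline format for illustration: reflection tokens appear as
# bracketed markers interleaved with text, e.g. "[Retrieve=yes]".
TOKEN_PATTERN = re.compile(
    r"\[(Retrieve|IsRelevant|IsSupported|IsUseful)=([^\]]+)\]"
)

def parse_reflection_tokens(generation: str) -> tuple[str, dict[str, str]]:
    """Split a generation into plain answer text and its reflection tokens."""
    tokens = dict(TOKEN_PATTERN.findall(generation))
    text = TOKEN_PATTERN.sub("", generation).strip()
    return text, tokens

text, tokens = parse_reflection_tokens(
    "[Retrieve=yes]Paris is the capital of France."
    "[IsSupported=fully][IsUseful=5]"
)
# text   -> "Paris is the capital of France."
# tokens -> {"Retrieve": "yes", "IsSupported": "fully", "IsUseful": "5"}
```

A real deployment would check `tokens` against quality thresholds (for example, regenerate when `IsSupported` is not `fully`) before returning `text` to the user.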
Implementing Self-RAG Logic
While training a full Self-RAG model requires significant compute, you can implement the Self-RAG decision pattern using prompt engineering and structured outputs:
```python
import json
from enum import Enum

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class RetrievalDecision(str, Enum):
    YES = "yes"
    NO = "no"

class RelevanceJudgment(str, Enum):
    RELEVANT = "relevant"
    IRRELEVANT = "irrelevant"

class SupportLevel(str, Enum):
    FULLY = "fully_supported"
    PARTIALLY = "partially_supported"
    NOT = "not_supported"

class SelfRAGAssessment(BaseModel):
    needs_retrieval: RetrievalDecision
    reasoning: str

class GenerationCritique(BaseModel):
    support_level: SupportLevel
    usefulness: int  # 1-5 scale
    issues: list[str]
    should_regenerate: bool

def decide_retrieval(query: str) -> SelfRAGAssessment:
    """Model decides if retrieval is needed."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "Assess whether you need to retrieve external "
                    "information to answer this query well.\n"
                    "Consider:\n"
                    "- Is this about specific facts, data, or recent events?\n"
                    "- Could you answer accurately from general knowledge?\n"
                    "- Is precision critical (medical, legal, financial)?\n"
                    'Return JSON with keys "needs_retrieval" '
                    '("yes" or "no") and "reasoning".'
                ),
            },
            {"role": "user", "content": query},
        ],
        response_format={"type": "json_object"},
    )
    data = json.loads(response.choices[0].message.content)
    return SelfRAGAssessment(**data)
```
The Self-Critique and Regeneration Loop
```python
import json

def critique_generation(
    query: str,
    response_text: str,
    evidence: list[str],
) -> GenerationCritique:
    """Model critiques its own output against evidence."""
    evidence_text = "\n".join(
        f"[{i + 1}] {e}" for i, e in enumerate(evidence)
    )
    critique_response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "Critically evaluate whether the generated response is:\n"
                    "1. Supported by the provided evidence\n"
                    "2. Useful for answering the user's question\n"
                    "3. Free from hallucinated claims\n"
                    "Return JSON with:\n"
                    "- support_level: fully_supported / "
                    "partially_supported / not_supported\n"
                    "- usefulness: 1-5\n"
                    "- issues: list of specific problems found\n"
                    "- should_regenerate: true if quality is insufficient"
                ),
            },
            {
                "role": "user",
                "content": (
                    f"Query: {query}\n\n"
                    f"Evidence:\n{evidence_text}\n\n"
                    f"Generated response:\n{response_text}"
                ),
            },
        ],
        response_format={"type": "json_object"},
    )
    data = json.loads(critique_response.choices[0].message.content)
    return GenerationCritique(**data)

def self_rag_pipeline(
    query: str,
    retriever,
    max_attempts: int = 3,
) -> str:
    """Full Self-RAG pipeline with adaptive retrieval
    and self-correction."""
    # Step 1: Decide if retrieval is needed
    assessment = decide_retrieval(query)
    evidence = []
    if assessment.needs_retrieval == RetrievalDecision.YES:
        evidence = retriever.search(query, k=5)
        # Filter for relevance
        relevant_evidence = []
        for doc in evidence:
            rel_check = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {
                        "role": "user",
                        "content": (
                            f"Is this document relevant to '{query}'? "
                            "Answer 'relevant' or 'irrelevant'.\n"
                            f"Document: {doc}"
                        ),
                    }
                ],
            )
            judgment = rel_check.choices[0].message.content
            # Check the prefix: "irrelevant" contains "relevant" as a
            # substring, so a plain `in` test would accept everything.
            if judgment.strip().lower().startswith("relevant"):
                relevant_evidence.append(doc)
        evidence = relevant_evidence or evidence[:3]

    # Step 2: Generate and critique loop
    answer = ""
    for attempt in range(max_attempts):
        # Generate response
        context = "\n\n".join(evidence) if evidence else ""
        gen_prompt = (
            f"Context:\n{context}\n\n" if context else ""
        ) + f"Question: {query}"
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": (
                        "Answer the question accurately. Only use "
                        "information from the provided context when "
                        "available."
                    ),
                },
                {"role": "user", "content": gen_prompt},
            ],
        )
        answer = response.choices[0].message.content
        # Skip critique if no evidence to check against
        if not evidence:
            return answer
        # Critique the response
        critique = critique_generation(query, answer, evidence)
        if not critique.should_regenerate:
            return answer
        # If regeneration is needed, refine the query for the next pass
        if attempt < max_attempts - 1:
            refined = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {
                        "role": "user",
                        "content": (
                            f"The answer to '{query}' had issues: "
                            f"{critique.issues}. Rewrite the query "
                            "to get better retrieval results."
                        ),
                    }
                ],
            )
            new_query = refined.choices[0].message.content
            evidence = retriever.search(new_query, k=5)
    return answer  # Return best attempt after max retries
When Self-RAG Beats Standard Approaches
Self-RAG outperforms standard RAG in two specific scenarios. First, on open-domain questions where retrieval is sometimes unnecessary — Self-RAG avoids polluting the context with irrelevant retrievals. Second, on fact-critical tasks where hallucination is dangerous — the self-critique loop catches unsupported claims before they reach the user.
The cost is 2-4x more LLM calls per query. For latency-sensitive applications, consider caching common query patterns and using smaller models for the retrieval decision and relevance checks.
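One way to recover some of that cost is to memoize the retrieval decision for repeated queries. A minimal sketch, assuming queries can be normalized by lowercasing and collapsing whitespace; the recency-marker heuristic is a stand-in for the real `decide_retrieval` LLM call, used only to keep the example self-contained.

```python
from functools import lru_cache

@lru_cache(maxsize=4096)
def cached_retrieval_decision(normalized_query: str) -> bool:
    # In the real pipeline this would call decide_retrieval(); a cheap
    # heuristic stands in here so the sketch runs offline.
    recency_markers = ("latest", "today", "current", "2024", "2025")
    return any(m in normalized_query for m in recency_markers)

def needs_retrieval(query: str) -> bool:
    # Normalize so trivially different phrasings share a cache entry.
    return cached_retrieval_decision(" ".join(query.lower().split()))

needs_retrieval("What is the LATEST pricing?")  # computed
needs_retrieval("what is the  latest pricing?")  # served from cache
```

The cache key matters: too-aggressive normalization conflates distinct queries, while no normalization wastes entries on casing and spacing variants.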
FAQ
Is Self-RAG the same as chain-of-thought with retrieval?
No. Chain-of-thought adds reasoning steps but does not include explicit quality assessment of retrieved evidence or generated output. Self-RAG adds structured self-evaluation — deciding whether to retrieve, judging relevance of retrieved passages, and critiquing whether the response is supported by evidence. These are fundamentally different capabilities.
Can I implement Self-RAG without fine-tuning a model?
Yes, the implementation above uses prompt engineering to simulate Self-RAG behavior with any instruction-following model. True Self-RAG fine-tunes special tokens into the model, which is faster at inference because the model generates reflection tokens natively rather than requiring separate LLM calls. The prompt-based approach is a practical alternative that captures most of the benefits.
How do I measure whether Self-RAG is improving my system?
Track three metrics: retrieval skip rate (how often the model decides retrieval is unnecessary), critique rejection rate (how often generated answers fail self-assessment), and final answer quality (measured via human evaluation or automated scoring). A well-tuned Self-RAG system should skip retrieval for 20-40% of queries and reject/regenerate 10-20% of initial answers.
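The first two metrics can be computed directly from pipeline logs. A minimal sketch, assuming each query is logged with its retrieval decision and regeneration count; `QueryLog` and `self_rag_metrics` are illustrative names, and answer quality still needs separate human or automated scoring.

```python
from dataclasses import dataclass

@dataclass
class QueryLog:
    retrieved: bool     # did the model decide to retrieve?
    regenerations: int  # how many drafts failed self-critique

def self_rag_metrics(logs: list[QueryLog]) -> dict[str, float]:
    """Compute skip and rejection rates from per-query logs."""
    n = len(logs)
    skipped = sum(1 for log in logs if not log.retrieved)
    rejected = sum(1 for log in logs if log.regenerations > 0)
    return {
        "retrieval_skip_rate": skipped / n,
        "critique_rejection_rate": rejected / n,
    }

logs = [
    QueryLog(retrieved=True, regenerations=0),
    QueryLog(retrieved=False, regenerations=0),
    QueryLog(retrieved=True, regenerations=1),
    QueryLog(retrieved=False, regenerations=0),
]
self_rag_metrics(logs)
# -> {"retrieval_skip_rate": 0.5, "critique_rejection_rate": 0.25}
```

Rates outside the 20-40% skip and 10-20% rejection bands suggested above are a signal to tune the retrieval-decision and critique prompts.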
Written by
CallSphere Team