
RAG Evaluation: Measuring Retrieval Quality and Answer Accuracy

Learn how to evaluate RAG pipelines systematically using the RAGAS framework, measuring faithfulness, answer relevancy, context recall, and context precision to identify and fix retrieval failures.

Why You Must Evaluate RAG Systematically

Most teams build a RAG pipeline, try a few queries, see reasonable answers, and ship it. Then users report hallucinations, missing information, and wrong citations. The fundamental problem is that RAG has two independent failure modes — retrieval failures and generation failures — and you cannot diagnose which one is broken without measuring each stage separately.

A retrieval failure means the right document was not in the top-k results. No prompt engineering can fix this. A generation failure means the right document was retrieved but the LLM misinterpreted it, ignored it, or hallucinated beyond it. These require different fixes, and evaluation tells you which one to pursue.

The RAGAS Framework

RAGAS (Retrieval-Augmented Generation Assessment) is the most widely adopted evaluation framework for RAG. It provides four core metrics that decompose RAG quality into measurable components:

| Metric | Measures | Range | Higher is better |
|---|---|---|---|
| Faithfulness | Is the answer supported by the context? | 0-1 | Yes |
| Answer Relevancy | Does the answer address the question? | 0-1 | Yes |
| Context Precision | Are the retrieved docs relevant and well-ordered? | 0-1 | Yes |
| Context Recall | Was all necessary information retrieved? | 0-1 | Yes |

Setting Up RAGAS

pip install ragas langchain-openai datasets

# RAGAS metrics are judged by an LLM, so set OPENAI_API_KEY in your
# environment (or pass another provider to evaluate()) before running.
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

# Prepare evaluation data
eval_data = {
    "question": [
        "What is the refund policy for enterprise plans?",
        "How do I reset my password?",
        "What compliance certifications does the platform have?",
    ],
    "answer": [
        "Enterprise customers can request a full refund within 30 days of purchase.",
        "You can reset your password by clicking Forgot Password on the login page.",
        "The platform is SOC2 Type II and HIPAA compliant.",
    ],
    "contexts": [
        ["Enterprise refund policy: Full refunds available within 30 days of purchase. Pro rata refunds after 30 days."],
        ["Password reset: Navigate to login page, click Forgot Password, enter your email."],
        ["Compliance: SOC2 Type II certified. HIPAA compliant for healthcare customers. GDPR ready."],
    ],
    "ground_truth": [
        "Enterprise customers can get a full refund within 30 days. After 30 days, pro rata refunds apply.",
        "Click Forgot Password on the login page and enter your email to receive a reset link.",
        "The platform has SOC2 Type II, HIPAA compliance, and is GDPR ready.",
    ],
}

dataset = Dataset.from_dict(eval_data)

# Run evaluation
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)

print(results)

Understanding Each Metric

Faithfulness: Is the Answer Grounded?

Faithfulness measures whether every claim in the generated answer can be traced back to the retrieved context. A faithfulness score of 0.8 means 80% of the claims are supported.

Low faithfulness means the LLM is hallucinating — adding information not present in the context. Fix this by strengthening the system prompt with stricter grounding instructions or lowering the temperature.
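Conceptually, faithfulness is just the fraction of answer claims the context supports. A minimal sketch of that arithmetic — claim extraction and verification, which RAGAS delegates to an LLM judge, are stubbed out here as plain lists:

```python
def faithfulness_score(claims: list[str], supported: set[str]) -> float:
    """Fraction of answer claims attributable to the retrieved context."""
    if not claims:
        return 0.0
    return sum(1 for c in claims if c in supported) / len(claims)

claims = [
    "Full refunds are available within 30 days.",
    "Refunds are processed within 24 hours.",  # not in context -> hallucinated
]
supported = {"Full refunds are available within 30 days."}
print(faithfulness_score(claims, supported))  # 0.5
```

In the real metric, an LLM first decomposes the answer into claims and then checks each claim against the retrieved chunks; the division at the end is the same.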

Answer Relevancy: Does It Address the Question?

Answer relevancy measures whether the answer actually responds to what was asked. A technically correct answer about the wrong topic scores low.

Low relevancy often indicates the prompt template is not guiding the model well, or the retrieved context is leading the model off-topic.
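RAGAS computes answer relevancy by asking an LLM to generate questions *from the answer*, then comparing them to the original question with embedding similarity. A toy sketch of that comparison, using bag-of-words cosine similarity as a stand-in for embeddings:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def answer_relevancy_score(question: str, generated_questions: list[str]) -> float:
    """Mean similarity between the original question and questions an LLM
    generated back from the answer (embeddings replaced by word overlap)."""
    q = Counter(question.lower().split())
    sims = [cosine(q, Counter(g.lower().split())) for g in generated_questions]
    return sum(sims) / len(sims) if sims else 0.0

print(round(answer_relevancy_score(
    "how do i reset my password",
    ["how do i reset my password", "what is the refund policy"],
), 3))  # 0.5 -- one regenerated question on-topic, one off-topic
```

An answer that fully addresses the question yields regenerated questions close to the original (score near 1); an off-topic answer yields dissimilar questions (score near 0).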

Context Precision: Are Retrieved Docs Relevant?

Context precision measures how many of the retrieved chunks are actually relevant to the question, with extra weight on ranking relevant chunks near the top. If you retrieve 5 chunks but only 2 are relevant, precision is low — roughly 0.4 before accounting for order, and lower still if the relevant chunks are buried at the bottom of the list.


Low precision means your retrieval is noisy — pulling in irrelevant content. Fix this by improving chunking, tuning the embedding model, or adding metadata filters.
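Because the metric is rank-aware, it averages precision@k at every rank that holds a relevant chunk, so relevant chunks ranked earlier score higher. A runnable sketch of that averaging, with the LLM relevance judgments replaced by a boolean list:

```python
def context_precision_score(relevance: list[bool]) -> float:
    """Rank-aware precision over retrieved chunks, in retrieval order.
    Averages precision@k at each rank k holding a relevant chunk."""
    hits, precisions = 0, []
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Same 2 relevant chunks out of 5, but ranking matters:
print(context_precision_score([True, True, False, False, False]))   # 1.0
print(context_precision_score([False, False, False, True, True]))   # 0.325
```

This is why re-ranking can raise context precision even when the set of retrieved chunks stays the same.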

Context Recall: Is All Needed Info Retrieved?

Context recall measures whether the retrieved context contains all the information needed to produce the ground-truth answer. Low recall means relevant documents are being missed.

Low recall requires improvements to the retrieval system: better embeddings, hybrid search, or query expansion.
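Mechanically, recall is the fraction of ground-truth statements that the retrieved context covers. The sketch below uses case-insensitive substring matching for the attribution step — RAGAS uses an LLM judge instead, precisely because paraphrases do not substring-match:

```python
def context_recall_score(ground_truth: str, retrieved_chunks: list[str]) -> float:
    """Fraction of ground-truth sentences found in the retrieved chunks
    (substring matching stands in for the LLM attribution step)."""
    sentences = [s.strip() for s in ground_truth.split(".") if s.strip()]
    if not sentences:
        return 0.0
    blob = " ".join(retrieved_chunks).lower()
    covered = sum(1 for s in sentences if s.lower() in blob)
    return covered / len(sentences)

# Only the first ground-truth sentence was retrieved:
print(context_recall_score(
    "Full refunds available within 30 days of purchase. "
    "Pro rata refunds after 30 days.",
    ["Enterprise refund policy: Full refunds available within 30 days of purchase."],
))  # 0.5
```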

Building an Evaluation Pipeline

For continuous evaluation as your RAG system evolves, automate the process:

import json
from datetime import datetime

# evaluate, the four metrics, and Dataset are reused from the imports above.

class RAGEvaluator:
    def __init__(self, rag_pipeline, eval_questions: list[dict]):
        """
        eval_questions: list of dicts with "question" and "ground_truth" keys
        """
        self.pipeline = rag_pipeline
        self.eval_questions = eval_questions

    def run_evaluation(self) -> dict:
        """Run the RAG pipeline on eval questions and measure quality."""
        questions = []
        answers = []
        contexts = []
        ground_truths = []

        for item in self.eval_questions:
            q = item["question"]
            # Run the RAG pipeline
            result = self.pipeline.ask_with_context(q)

            questions.append(q)
            answers.append(result["answer"])
            contexts.append(result["retrieved_chunks"])
            ground_truths.append(item["ground_truth"])

        dataset = Dataset.from_dict({
            "question": questions,
            "answer": answers,
            "contexts": contexts,
            "ground_truth": ground_truths,
        })

        results = evaluate(
            dataset,
            metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
        )

        # Log results (the object returned by evaluate() behaves like a
        # mapping of metric name -> aggregate score)
        report = {
            "timestamp": datetime.now().isoformat(),
            "num_questions": len(questions),
            "metrics": {k: float(v) for k, v in results.items()},
        }

        with open("rag_eval_history.jsonl", "a") as f:
            f.write(json.dumps(report) + "\n")

        return report
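Once reports accumulate in rag_eval_history.jsonl, you can gate deploys on regressions. A minimal sketch of such a check, comparing the latest run's metrics against a baseline with a tolerance (the 0.05 threshold is an assumption to tune):

```python
def check_regression(baseline: dict, latest: dict, tolerance: float = 0.05) -> list[str]:
    """Return the metrics whose score dropped more than `tolerance`
    below the baseline run."""
    return [
        metric for metric, base in baseline.items()
        if latest.get(metric, 0.0) < base - tolerance
    ]

baseline = {"faithfulness": 0.92, "context_recall": 0.85}
latest = {"faithfulness": 0.80, "context_recall": 0.86}
print(check_regression(baseline, latest))  # ['faithfulness']
```

In CI, a non-empty return value would fail the build and point directly at the degraded stage.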

Diagnosing Failures

Use the evaluation results to pinpoint what to fix:

| Symptom | Likely Cause | Fix |
|---|---|---|
| Low context recall, low faithfulness | Retrieval missing key docs | Better embeddings, hybrid search, query expansion |
| High context recall, low faithfulness | LLM ignoring or misreading context | Stronger prompt, lower temperature, better model |
| High context recall, low precision | Too much noise in retrieved chunks | Smaller chunks, metadata filtering, re-ranking |
| Low answer relevancy, high faithfulness | Answer is grounded but off-topic | Improve prompt to focus on the question |
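The same decision table can be encoded as a small triage helper that runs after each evaluation. The 0.7 cutoff below is illustrative — calibrate it against your own score distributions:

```python
def diagnose(scores: dict, low: float = 0.7) -> str:
    """Map a RAGAS score pattern to the likeliest failure mode and fix
    (thresholds are illustrative, not part of RAGAS)."""
    recall_ok = scores["context_recall"] >= low
    if not recall_ok and scores["faithfulness"] < low:
        return "retrieval missing key docs: better embeddings, hybrid search, query expansion"
    if recall_ok and scores["faithfulness"] < low:
        return "LLM ignoring or misreading context: stronger prompt, lower temperature, better model"
    if recall_ok and scores["context_precision"] < low:
        return "noisy retrieval: smaller chunks, metadata filtering, re-ranking"
    if scores["answer_relevancy"] < low and scores["faithfulness"] >= low:
        return "grounded but off-topic: improve prompt to focus on the question"
    return "no dominant failure mode detected"

scores = {"context_recall": 0.9, "faithfulness": 0.4,
          "context_precision": 0.9, "answer_relevancy": 0.9}
print(diagnose(scores))  # points at the generation stage, not retrieval
```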

Creating a Golden Evaluation Set

The most important investment is building a high-quality evaluation dataset:

# Structure for a golden eval set
golden_eval_set = [
    {
        "question": "What is the SLA for enterprise support?",
        "ground_truth": "Enterprise support SLA guarantees 1-hour response for critical issues and 4-hour response for standard issues.",
        "difficulty": "easy",        # answer in one chunk
        "category": "support",
    },
    {
        "question": "Compare the pricing of Pro and Enterprise plans",
        "ground_truth": "Pro costs $49/month per user with 10GB storage. Enterprise costs $99/month per user with unlimited storage and dedicated support.",
        "difficulty": "multi-hop",   # answer spans multiple chunks
        "category": "pricing",
    },
]

# Save as JSON for version control
with open("eval_golden_set.json", "w") as f:
    json.dump(golden_eval_set, f, indent=2)

Aim for at least 50-100 question-answer pairs covering easy (single chunk), medium (multi-chunk), and hard (requires reasoning across documents) questions.
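Before committing the golden set, it is worth running a structural sanity check — missing fields, duplicate questions, or a set that only covers one difficulty level all degrade the signal. A sketch using the field names from the structure above:

```python
def validate_golden_set(entries: list[dict]) -> list[str]:
    """Flag structural problems in a golden eval set before it is committed."""
    problems = []
    required = {"question", "ground_truth", "difficulty", "category"}
    seen = set()
    for i, entry in enumerate(entries):
        missing = required - entry.keys()
        if missing:
            problems.append(f"entry {i}: missing {sorted(missing)}")
        q = entry.get("question", "").strip().lower()
        if q in seen:
            problems.append(f"entry {i}: duplicate question")
        seen.add(q)
    if len({entry.get("difficulty") for entry in entries}) < 2:
        problems.append("eval set covers only one difficulty level")
    return problems
```

Running this as a pre-commit hook keeps the eval set trustworthy as the team adds questions over time.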

FAQ

How many evaluation questions do I need?

Start with 30-50 questions that cover your key use cases and edge cases. For production systems, aim for 100+ questions across different categories. More important than quantity is coverage — make sure you have questions that test single-hop retrieval, multi-hop reasoning, and cases where the answer is not in the knowledge base.

Can I use RAGAS without ground truth answers?

Faithfulness and answer relevancy can be computed without ground truth — they only need the question, retrieved context, and generated answer. Context recall requires ground truth because it needs to know what information should have been retrieved. Start with the metrics you can compute, then invest in building ground truth labels over time.

How often should I re-evaluate?

Run evaluation whenever you change the chunking strategy, embedding model, retrieval parameters, prompt template, or generation model. Also run weekly evaluations even when nothing changes to catch regressions from external factors like API model updates or data drift in your knowledge base.


#RAG #Evaluation #RAGAS #LLMTesting #RetrievalQuality #AIMetrics #AgenticAI #LearnAI #AIEngineering

CallSphere Team