Evaluating RAG in Production: Building Automated Quality Monitoring for Retrieval Systems
Learn how to build comprehensive RAG evaluation systems with online metrics, user feedback loops, automated quality scoring, A/B testing, and degradation detection for production retrieval pipelines.
Why Offline Evaluation Is Not Enough
Most teams evaluate their RAG system once during development using a curated test set, declare the results acceptable, and ship to production. Then reality hits. Documents get updated, new content is added, user query patterns shift, and embedding model behavior drifts on edge cases. The system that scored 85% on your test set six weeks ago might be producing incorrect answers 30% of the time today, and nobody knows until users complain.
Production RAG evaluation must be continuous, automated, and multi-dimensional. You need to monitor retrieval quality, generation faithfulness, and user satisfaction — all in real time.
The Four Pillars of RAG Evaluation
1. Retrieval Quality
Are the right documents being retrieved? Measured by context relevance and recall.
2. Generation Faithfulness
Is the LLM's answer actually supported by the retrieved documents? Measured by groundedness.
3. Answer Correctness
Does the answer actually address the user's question? Measured by answer relevance.
4. User Satisfaction
Do users find the answers helpful? Measured by explicit feedback and behavioral signals.
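If you want a single number to track over time, the four pillars can be folded into a weighted composite. This is a minimal sketch; the weights are illustrative defaults, not a standard:

```python
def composite_quality_score(
    context_relevance: float,
    faithfulness: float,
    answer_relevance: float,
    user_satisfaction: float,
    weights: tuple[float, float, float, float] = (0.25, 0.35, 0.25, 0.15),
) -> float:
    """Weighted average of the four pillar scores, each in [0.0, 1.0].

    Weights are illustrative; faithfulness is weighted highest here
    because hallucinated answers are usually the costliest failure.
    """
    scores = (context_relevance, faithfulness, answer_relevance, user_satisfaction)
    return sum(w * s for w, s in zip(weights, scores))
```

Tune the weights to your domain: a legal assistant might weight faithfulness even higher, while a search-like product might emphasize answer relevance.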
Building an Automated Quality Scorer
from openai import OpenAI
from dataclasses import dataclass
from datetime import datetime
import json

client = OpenAI()

@dataclass
class RAGEvaluation:
    query: str
    retrieved_docs: list[str]
    generated_answer: str
    context_relevance: float
    faithfulness: float
    answer_relevance: float
    timestamp: datetime
def evaluate_context_relevance(
    query: str, documents: list[str]
) -> float:
    """Score how relevant retrieved documents are to the query.
    Returns 0.0 to 1.0."""
    scores = []
    for doc in documents:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "system",
                "content": """Rate the relevance of this document
to the query on a scale of 0.0 to 1.0.
Return JSON: {"score": 0.X, "reason": "..."}"""
            }, {
                "role": "user",
                "content": f"Query: {query}\nDocument: {doc}"
            }],
            response_format={"type": "json_object"}
        )
        result = json.loads(
            response.choices[0].message.content
        )
        scores.append(result["score"])
    return sum(scores) / len(scores) if scores else 0.0
def evaluate_faithfulness(
    answer: str, documents: list[str]
) -> float:
    """Score whether the answer is grounded in the documents.
    Returns 0.0 to 1.0."""
    context = "\n\n".join(documents)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "system",
            "content": """Evaluate if each claim in the answer
is supported by the provided documents.
Return JSON:
{
  "claims": [
    {"claim": "...", "supported": true/false}
  ],
  "faithfulness_score": 0.0-1.0
}"""
        }, {
            "role": "user",
            "content": (
                f"Documents:\n{context}\n\n"
                f"Answer:\n{answer}"
            )
        }],
        response_format={"type": "json_object"}
    )
    result = json.loads(response.choices[0].message.content)
    return result["faithfulness_score"]
def evaluate_answer_relevance(
    query: str, answer: str
) -> float:
    """Score whether the answer addresses the question.
    Returns 0.0 to 1.0."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "system",
            "content": """Rate how well the answer addresses
the user's question on a scale of 0.0 to 1.0.
Return JSON: {"score": 0.X, "reason": "..."}"""
        }, {
            "role": "user",
            "content": f"Question: {query}\nAnswer: {answer}"
        }],
        response_format={"type": "json_object"}
    )
    result = json.loads(response.choices[0].message.content)
    return result["score"]
Integrating Evaluation into Your RAG Pipeline
import logging
import random
import threading

logger = logging.getLogger("rag_eval")

class MonitoredRAGPipeline:
    def __init__(self, retriever, eval_sample_rate: float = 0.1):
        self.retriever = retriever
        self.sample_rate = eval_sample_rate
        self.evaluations: list[RAGEvaluation] = []

    def answer(self, query: str) -> str:
        """Answer with optional quality evaluation."""
        # Retrieve and generate as normal
        docs = self.retriever.search(query, k=5)
        doc_texts = [d.page_content for d in docs]
        context = "\n\n".join(doc_texts)
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": "Answer using the provided context."
            }, {
                "role": "user",
                "content": (
                    f"Context:\n{context}"
                    f"\n\nQuestion: {query}"
                )
            }],
        )
        answer = response.choices[0].message.content
        # Evaluate a sample of responses
        if random.random() < self.sample_rate:
            self._async_evaluate(query, doc_texts, answer)
        return answer

    def _async_evaluate(
        self, query: str, docs: list[str], answer: str
    ):
        """Run evaluation in a background thread to avoid
        adding latency to the response."""
        def evaluate():
            try:
                eval_result = RAGEvaluation(
                    query=query,
                    retrieved_docs=docs,
                    generated_answer=answer,
                    context_relevance=evaluate_context_relevance(
                        query, docs
                    ),
                    faithfulness=evaluate_faithfulness(
                        answer, docs
                    ),
                    answer_relevance=evaluate_answer_relevance(
                        query, answer
                    ),
                    timestamp=datetime.now(),
                )
                self.evaluations.append(eval_result)
                self._check_degradation(eval_result)
            except Exception as e:
                logger.error(f"Evaluation failed: {e}")

        thread = threading.Thread(target=evaluate, daemon=True)
        thread.start()

    def _check_degradation(self, evaluation: RAGEvaluation):
        """Alert if quality drops below thresholds."""
        thresholds = {
            "context_relevance": 0.6,
            "faithfulness": 0.7,
            "answer_relevance": 0.6,
        }
        for metric, threshold in thresholds.items():
            value = getattr(evaluation, metric)
            if value < threshold:
                logger.warning(
                    f"Quality degradation detected: "
                    f"{metric}={value:.2f} < {threshold} "
                    f"for query: {evaluation.query[:100]}"
                )
Building a Degradation Detection System
Track rolling averages to detect systemic quality drops, not just individual bad answers:
from collections import deque

class DegradationDetector:
    def __init__(self, window_size: int = 100):
        self.window_size = window_size
        self.context_scores = deque(maxlen=window_size)
        self.faith_scores = deque(maxlen=window_size)
        self.relevance_scores = deque(maxlen=window_size)
        self.alert_threshold = 0.1  # 10% drop triggers alert

    def add_evaluation(self, evaluation: RAGEvaluation):
        self.context_scores.append(
            evaluation.context_relevance
        )
        self.faith_scores.append(evaluation.faithfulness)
        self.relevance_scores.append(
            evaluation.answer_relevance
        )

    def check_trends(self) -> list[str]:
        """Compare recent scores to historical baseline."""
        alerts = []
        if len(self.context_scores) < self.window_size:
            return alerts
        for name, scores in [
            ("context_relevance", self.context_scores),
            ("faithfulness", self.faith_scores),
            ("answer_relevance", self.relevance_scores),
        ]:
            scores_list = list(scores)
            midpoint = len(scores_list) // 2
            first_half_avg = (
                sum(scores_list[:midpoint]) / midpoint
            )
            second_half_avg = (
                sum(scores_list[midpoint:])
                / (len(scores_list) - midpoint)
            )
            drop = first_half_avg - second_half_avg
            if drop > self.alert_threshold:
                alerts.append(
                    f"{name} dropped by {drop:.2%}: "
                    f"{first_half_avg:.2f} -> "
                    f"{second_half_avg:.2f}"
                )
        return alerts
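If waiting for a fixed window to fill is too slow for your traffic, an exponentially weighted moving average is a common alternative: a fast tracker follows recent quality while a slow baseline remembers history, and a widening gap signals drift. This is a self-contained sketch with illustrative smoothing factors, not part of the pipeline above:

```python
class EWMADetector:
    """Degradation detector using two exponential moving averages.

    The fast average tracks recent scores; the slow average serves
    as the historical baseline. An alert fires when the fast average
    falls more than `threshold` below the baseline. The smoothing
    factors here are illustrative starting points.
    """

    def __init__(self, fast: float = 0.2, slow: float = 0.02,
                 threshold: float = 0.1):
        self.fast_alpha = fast
        self.slow_alpha = slow
        self.threshold = threshold
        self.fast_avg: float | None = None
        self.slow_avg: float | None = None

    def add_score(self, score: float) -> bool:
        """Update both averages; return True if degradation is detected."""
        if self.fast_avg is None:
            # Seed both averages with the first observation
            self.fast_avg = self.slow_avg = score
            return False
        self.fast_avg += self.fast_alpha * (score - self.fast_avg)
        self.slow_avg += self.slow_alpha * (score - self.slow_avg)
        return self.slow_avg - self.fast_avg > self.threshold
```

The trade-off mirrors the window-size choice: a larger `fast` alpha reacts to drops within a handful of samples but is noisier on individual bad answers.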
Incorporating User Feedback
Automated evaluation catches technical quality issues, but user feedback captures real-world usefulness. Implement thumbs-up/thumbs-down on every response, track which answers get follow-up questions (indicating the first answer was insufficient), and correlate user feedback with automated scores to calibrate your thresholds.
The combination of automated scoring and user signals gives you a complete picture. Automated scoring runs on every sampled response with consistent criteria. User feedback provides ground truth on actual helpfulness. Together, they enable you to detect problems early, diagnose root causes, and continuously improve your RAG system.
FAQ
What sample rate should I use for automated evaluation?
Start with 10% of queries. This gives you statistically meaningful data without excessive LLM evaluation costs. For critical applications (medical, financial, legal), increase to 25-50%. You can also evaluate 100% of queries from specific user segments or query categories that are high risk.
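Per-segment sampling can be a simple rate lookup in front of the evaluator. A sketch, with illustrative category names and rates:

```python
import random

# Illustrative rates: always evaluate high-risk categories,
# sample everything else at the default rate.
SAMPLE_RATES = {"medical": 1.0, "billing": 0.5}

def should_evaluate(query_category: str,
                    rates: dict[str, float],
                    default_rate: float = 0.1) -> bool:
    """Decide whether to run automated evaluation for this query,
    using a per-category sample rate when one is configured."""
    rate = rates.get(query_category, default_rate)
    return random.random() < rate
```

Categorizing queries can itself be a cheap classifier call; the important part is that high-risk traffic is never invisible to evaluation.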
How quickly can degradation detection catch a problem?
With a 10% sample rate and 100-query window, you need approximately 1,000 queries before the window fills. At high traffic volumes this happens within hours. For faster detection, increase the sample rate or reduce the window size, accepting more noise in exchange for quicker alerts.
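The arithmetic above generalizes to a one-liner you can use when tuning either knob:

```python
def queries_until_window_full(window_size: int,
                              sample_rate: float) -> int:
    """Expected number of production queries before the rolling
    evaluation window of sampled responses fills."""
    return round(window_size / sample_rate)
```

For example, a 100-query window at a 10% sample rate needs about 1,000 production queries; halving the window or doubling the sample rate halves that.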
Should I use an LLM judge or fine-tuned classifier for evaluation?
Start with an LLM judge (GPT-4o-mini is cost-effective and accurate enough). As you accumulate labeled evaluation data, train a fine-tuned classifier that can evaluate in milliseconds instead of hundreds of milliseconds. The LLM judge becomes your labeling tool, and the classifier becomes your production evaluator.
#RAGEvaluation #ProductionMonitoring #QualityMetrics #ABTesting #MLOps #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.