Evaluating RAG in Production: Building Automated Quality Monitoring for Retrieval Systems
Learn how to build comprehensive RAG evaluation systems with online metrics, user feedback loops, automated quality scoring, A/B testing, and degradation detection for production retrieval pipelines.
Why Offline Evaluation Is Not Enough
Most teams evaluate their RAG system once during development using a curated test set, declare the results acceptable, and ship to production. Then reality hits. Documents get updated, new content is added, user query patterns shift, and embedding model behavior drifts on edge cases. The system that scored 85% on your test set six weeks ago might be producing incorrect answers 30% of the time today, and nobody knows until users complain.
Production RAG evaluation must be continuous, automated, and multi-dimensional. You need to monitor retrieval quality, generation faithfulness, and user satisfaction — all in real time.
The Four Pillars of RAG Evaluation
1. Retrieval Quality
Are the right documents being retrieved? Measured by context relevance and recall.
2. Generation Faithfulness
Is the LLM's answer actually supported by the retrieved documents? Measured by groundedness.
3. Answer Correctness
Does the answer actually address the user's question? Measured by answer relevance.
4. User Satisfaction
Do users find the answers helpful? Measured by explicit feedback and behavioral signals.
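If you want a single number to track over time, the four pillars can be folded into a weighted composite. This is a minimal sketch; the weights are illustrative defaults, not a standard:

```python
def composite_quality_score(
    context_relevance: float,
    faithfulness: float,
    answer_relevance: float,
    user_satisfaction: float,
    weights: tuple[float, float, float, float] = (0.25, 0.35, 0.25, 0.15),
) -> float:
    """Weighted average of the four pillar scores, each in [0.0, 1.0].

    Weights are illustrative; faithfulness is weighted highest here
    because hallucinated answers are usually the costliest failure.
    """
    scores = (context_relevance, faithfulness, answer_relevance, user_satisfaction)
    return sum(w * s for w, s in zip(weights, scores))
```

Tune the weights to your domain: a legal assistant might weight faithfulness even higher, while a search-like product might emphasize answer relevance.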
Building an Automated Quality Scorer
from openai import OpenAI
from dataclasses import dataclass
from datetime import datetime
import json

client = OpenAI()

@dataclass
class RAGEvaluation:
    query: str
    retrieved_docs: list[str]
    generated_answer: str
    context_relevance: float
    faithfulness: float
    answer_relevance: float
    timestamp: datetime
def evaluate_context_relevance(
    query: str, documents: list[str]
) -> float:
    """Score how relevant retrieved documents are to the query.
    Returns 0.0 to 1.0."""
    scores = []
    for doc in documents:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "system",
                "content": """Rate the relevance of this document
to the query on a scale of 0.0 to 1.0.
Return JSON: {"score": 0.X, "reason": "..."}"""
            }, {
                "role": "user",
                "content": f"Query: {query}\nDocument: {doc}"
            }],
            response_format={"type": "json_object"}
        )
        result = json.loads(
            response.choices[0].message.content
        )
        scores.append(result["score"])
    return sum(scores) / len(scores) if scores else 0.0
def evaluate_faithfulness(
    answer: str, documents: list[str]
) -> float:
    """Score whether the answer is grounded in the documents.
    Returns 0.0 to 1.0."""
    context = "\n\n".join(documents)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "system",
            "content": """Evaluate if each claim in the answer
is supported by the provided documents.
Return JSON:
{
  "claims": [
    {"claim": "...", "supported": true/false}
  ],
  "faithfulness_score": 0.0-1.0
}"""
        }, {
            "role": "user",
            "content": (
                f"Documents:\n{context}\n\n"
                f"Answer:\n{answer}"
            )
        }],
        response_format={"type": "json_object"}
    )
    result = json.loads(response.choices[0].message.content)
    return result["faithfulness_score"]
def evaluate_answer_relevance(
    query: str, answer: str
) -> float:
    """Score whether the answer addresses the question.
    Returns 0.0 to 1.0."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "system",
            "content": """Rate how well the answer addresses
the user's question on a scale of 0.0 to 1.0.
Return JSON: {"score": 0.X, "reason": "..."}"""
        }, {
            "role": "user",
            "content": f"Question: {query}\nAnswer: {answer}"
        }],
        response_format={"type": "json_object"}
    )
    result = json.loads(response.choices[0].message.content)
    return result["score"]
Integrating Evaluation into Your RAG Pipeline
import logging
import random
import threading

logger = logging.getLogger("rag_eval")

class MonitoredRAGPipeline:
    def __init__(self, retriever, eval_sample_rate: float = 0.1):
        self.retriever = retriever
        self.sample_rate = eval_sample_rate
        self.evaluations: list[RAGEvaluation] = []

    def answer(self, query: str) -> str:
        """Answer with optional quality evaluation."""
        # Retrieve and generate as normal
        docs = self.retriever.search(query, k=5)
        doc_texts = [d.page_content for d in docs]
        context = "\n\n".join(doc_texts)
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": "Answer using the provided context."
            }, {
                "role": "user",
                "content": (
                    f"Context:\n{context}"
                    f"\n\nQuestion: {query}"
                )
            }],
        )
        answer = response.choices[0].message.content
        # Evaluate a sample of responses
        if random.random() < self.sample_rate:
            self._async_evaluate(query, doc_texts, answer)
        return answer

    def _async_evaluate(
        self, query: str, docs: list[str], answer: str
    ):
        """Run evaluation in a background thread to avoid
        adding latency to the response."""
        def evaluate():
            try:
                eval_result = RAGEvaluation(
                    query=query,
                    retrieved_docs=docs,
                    generated_answer=answer,
                    context_relevance=evaluate_context_relevance(
                        query, docs
                    ),
                    faithfulness=evaluate_faithfulness(
                        answer, docs
                    ),
                    answer_relevance=evaluate_answer_relevance(
                        query, answer
                    ),
                    timestamp=datetime.now(),
                )
                self.evaluations.append(eval_result)
                self._check_degradation(eval_result)
            except Exception as e:
                logger.error(f"Evaluation failed: {e}")

        thread = threading.Thread(target=evaluate, daemon=True)
        thread.start()

    def _check_degradation(self, evaluation: RAGEvaluation):
        """Alert if quality drops below thresholds."""
        thresholds = {
            "context_relevance": 0.6,
            "faithfulness": 0.7,
            "answer_relevance": 0.6,
        }
        for metric, threshold in thresholds.items():
            value = getattr(evaluation, metric)
            if value < threshold:
                logger.warning(
                    f"Quality degradation detected: "
                    f"{metric}={value:.2f} < {threshold} "
                    f"for query: {evaluation.query[:100]}"
                )
Building a Degradation Detection System
Track rolling averages to detect systemic quality drops, not just individual bad answers:
from collections import deque

class DegradationDetector:
    def __init__(self, window_size: int = 100):
        self.window_size = window_size
        self.context_scores = deque(maxlen=window_size)
        self.faith_scores = deque(maxlen=window_size)
        self.relevance_scores = deque(maxlen=window_size)
        self.alert_threshold = 0.1  # 10% drop triggers alert

    def add_evaluation(self, evaluation: RAGEvaluation):
        self.context_scores.append(
            evaluation.context_relevance
        )
        self.faith_scores.append(evaluation.faithfulness)
        self.relevance_scores.append(
            evaluation.answer_relevance
        )

    def check_trends(self) -> list[str]:
        """Compare recent scores to historical baseline."""
        alerts = []
        if len(self.context_scores) < self.window_size:
            return alerts
        for name, scores in [
            ("context_relevance", self.context_scores),
            ("faithfulness", self.faith_scores),
            ("answer_relevance", self.relevance_scores),
        ]:
            scores_list = list(scores)
            midpoint = len(scores_list) // 2
            first_half_avg = (
                sum(scores_list[:midpoint]) / midpoint
            )
            second_half_avg = (
                sum(scores_list[midpoint:])
                / (len(scores_list) - midpoint)
            )
            drop = first_half_avg - second_half_avg
            if drop > self.alert_threshold:
                alerts.append(
                    f"{name} dropped by {drop:.2%}: "
                    f"{first_half_avg:.2f} -> "
                    f"{second_half_avg:.2f}"
                )
        return alerts
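If waiting for a fixed window to fill is too slow for your traffic, an exponentially weighted moving average is a common alternative: a fast tracker follows recent quality while a slow baseline remembers history, and a widening gap signals drift. This is a self-contained sketch with illustrative smoothing factors, not part of the pipeline above:

```python
class EWMADetector:
    """Degradation detector using two exponential moving averages.

    The fast average tracks recent scores; the slow average serves
    as the historical baseline. An alert fires when the fast average
    falls more than `threshold` below the baseline. The smoothing
    factors here are illustrative starting points.
    """

    def __init__(self, fast: float = 0.2, slow: float = 0.02,
                 threshold: float = 0.1):
        self.fast_alpha = fast
        self.slow_alpha = slow
        self.threshold = threshold
        self.fast_avg: float | None = None
        self.slow_avg: float | None = None

    def add_score(self, score: float) -> bool:
        """Update both averages; return True if degradation is detected."""
        if self.fast_avg is None:
            # Seed both averages with the first observation
            self.fast_avg = self.slow_avg = score
            return False
        self.fast_avg += self.fast_alpha * (score - self.fast_avg)
        self.slow_avg += self.slow_alpha * (score - self.slow_avg)
        return self.slow_avg - self.fast_avg > self.threshold
```

The trade-off mirrors the window-size choice: a larger `fast` alpha reacts to drops within a handful of samples but is noisier on individual bad answers.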
Incorporating User Feedback
Automated evaluation catches technical quality issues, but user feedback captures real-world usefulness. Implement thumbs-up/thumbs-down on every response, track which answers get follow-up questions (indicating the first answer was insufficient), and correlate user feedback with automated scores to calibrate your thresholds.
The combination of automated scoring and user signals gives you a complete picture. Automated scoring runs on every sampled response with consistent criteria. User feedback provides ground truth on actual helpfulness. Together, they enable you to detect problems early, diagnose root causes, and continuously improve your RAG system.
FAQ
What sample rate should I use for automated evaluation?
Start with 10% of queries. This gives you statistically meaningful data without excessive LLM evaluation costs. For critical applications (medical, financial, legal), increase to 25-50%. You can also evaluate 100% of queries from specific user segments or query categories that are high risk.
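Per-segment sampling can be a simple rate lookup in front of the evaluator. A sketch, with illustrative category names and rates:

```python
import random

# Illustrative rates: always evaluate high-risk categories,
# sample everything else at the default rate.
SAMPLE_RATES = {"medical": 1.0, "billing": 0.5}

def should_evaluate(query_category: str,
                    rates: dict[str, float],
                    default_rate: float = 0.1) -> bool:
    """Decide whether to run automated evaluation for this query,
    using a per-category sample rate when one is configured."""
    rate = rates.get(query_category, default_rate)
    return random.random() < rate
```

Categorizing queries can itself be a cheap classifier call; the important part is that high-risk traffic is never invisible to evaluation.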
How quickly can degradation detection catch a problem?
With a 10% sample rate and 100-query window, you need approximately 1,000 queries before the window fills. At high traffic volumes this happens within hours. For faster detection, increase the sample rate or reduce the window size, accepting more noise in exchange for quicker alerts.
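The arithmetic above generalizes to a one-liner you can use when tuning either knob:

```python
def queries_until_window_full(window_size: int,
                              sample_rate: float) -> int:
    """Expected number of production queries before the rolling
    evaluation window of sampled responses fills."""
    return round(window_size / sample_rate)
```

For example, a 100-query window at a 10% sample rate needs about 1,000 production queries; halving the window or doubling the sample rate halves that.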
Should I use an LLM judge or fine-tuned classifier for evaluation?
Start with an LLM judge (GPT-4o-mini is cost-effective and accurate enough). As you accumulate labeled evaluation data, train a fine-tuned classifier that can evaluate in milliseconds instead of hundreds of milliseconds. The LLM judge becomes your labeling tool, and the classifier becomes your production evaluator.
#RAGEvaluation #ProductionMonitoring #QualityMetrics #ABTesting #MLOps #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.