
Semantic Search Evaluation: nDCG, MRR, and Recall at K Metrics

Master the essential metrics for evaluating semantic search quality — nDCG, MRR, and Recall@K — with practical Python implementations, test set creation methodology, and benchmarking workflows.

Why Search Evaluation Matters

Building a semantic search system without proper evaluation is like developing software without tests. You cannot reliably improve what you cannot measure. Search evaluation metrics quantify how well your system ranks relevant results, enabling data-driven decisions about model selection, parameter tuning, and architectural changes.

Three metrics form the foundation of search evaluation: Recall@K measures how many relevant documents you retrieve, MRR measures how quickly you surface the first relevant result, and nDCG measures the quality of the entire ranked list.

Recall at K

Recall@K answers: "Of all relevant documents, how many did we return in the top K results?"

from typing import List, Set
import numpy as np

def recall_at_k(
    retrieved: List[str],
    relevant: Set[str],
    k: int,
) -> float:
    """Calculate Recall@K.

    Args:
        retrieved: Ordered list of retrieved document IDs.
        relevant: Set of all relevant document IDs.
        k: Number of top results to consider.

    Returns:
        Float between 0 and 1.
    """
    if not relevant:
        return 0.0
    top_k = set(retrieved[:k])
    hits = top_k.intersection(relevant)
    return len(hits) / len(relevant)

# Example
retrieved = ["doc_3", "doc_7", "doc_1", "doc_9", "doc_5"]
relevant = {"doc_1", "doc_5", "doc_12"}

print(f"Recall@3: {recall_at_k(retrieved, relevant, 3):.2f}")  # 0.33
print(f"Recall@5: {recall_at_k(retrieved, relevant, 5):.2f}")  # 0.67

Recall@K is essential for retrieval-augmented generation (RAG) systems: if a relevant document is not retrieved, the LLM cannot use it. A common rule of thumb is to target Recall@10 above 0.85 for RAG pipelines, though the right threshold depends on your corpus and annotation quality.
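To choose a retrieval depth for a RAG pipeline, it helps to sweep K and see where recall plateaus. A minimal sketch reusing the recall_at_k logic above; the retrieved list and relevant set are illustrative, not from a real system.

```python
from typing import List, Set

def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents found in the top K results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

# Illustrative ranking and judgment set.
retrieved = ["doc_3", "doc_7", "doc_1", "doc_9", "doc_5", "doc_2", "doc_12"]
relevant = {"doc_1", "doc_5", "doc_12"}

# Recall rises from 0.00 at K=1 to 1.00 at K=10 for this toy example.
for k in (1, 3, 5, 10):
    print(f"Recall@{k}: {recall_at_k(retrieved, relevant, k):.2f}")
```

Plotting this curve across your test set shows the smallest K that captures nearly all relevant documents, which directly sets how many chunks you pass to the LLM.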

Mean Reciprocal Rank (MRR)

MRR answers: "On average, how far down the result list is the first relevant document?"

def reciprocal_rank(
    retrieved: List[str],
    relevant: Set[str],
) -> float:
    """Calculate reciprocal rank for a single query."""
    for i, doc_id in enumerate(retrieved):
        if doc_id in relevant:
            return 1.0 / (i + 1)
    return 0.0

def mean_reciprocal_rank(
    queries: List[dict],
) -> float:
    """Calculate MRR across multiple queries.

    Each query dict has 'retrieved' and 'relevant' keys.
    """
    rr_scores = [
        reciprocal_rank(q["retrieved"], set(q["relevant"]))
        for q in queries
    ]
    return np.mean(rr_scores) if rr_scores else 0.0

# Example
queries = [
    {
        "retrieved": ["doc_3", "doc_1", "doc_7"],
        "relevant": ["doc_1"],
    },  # RR = 1/2 = 0.5
    {
        "retrieved": ["doc_5", "doc_2", "doc_8"],
        "relevant": ["doc_5"],
    },  # RR = 1/1 = 1.0
    {
        "retrieved": ["doc_4", "doc_6", "doc_9"],
        "relevant": ["doc_11"],
    },  # RR = 0.0
]
print(f"MRR: {mean_reciprocal_rank(queries):.3f}")  # 0.500

MRR is ideal for search experiences where users typically only click the first relevant result, like question-answering or navigational search.
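In practice MRR is often reported with a cutoff (e.g. MRR@10): a relevant result that only appears below position K contributes zero. A small sketch extending the reciprocal_rank function above; the function name and example lists are illustrative.

```python
from typing import List, Set

def reciprocal_rank_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Per-query reciprocal rank with a cutoff: hits below position k score 0."""
    for i, doc_id in enumerate(retrieved[:k]):
        if doc_id in relevant:
            return 1.0 / (i + 1)
    return 0.0

print(reciprocal_rank_at_k(["a", "b", "c", "d"], {"c"}, k=3))  # first hit at rank 3 -> 1/3
print(reciprocal_rank_at_k(["a", "b", "c", "d"], {"d"}, k=3))  # hit below the cutoff -> 0.0
```

The cutoff keeps the metric honest about what users actually see: a hit at rank 40 is indistinguishable from a miss if your UI shows ten results.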

Normalized Discounted Cumulative Gain (nDCG)

nDCG is the gold standard for search evaluation. It measures ranking quality while accounting for the position of each relevant result — a relevant document at position 1 is worth more than the same document at position 5.

def dcg_at_k(relevance_scores: List[float], k: int) -> float:
    """Calculate Discounted Cumulative Gain at K."""
    scores = relevance_scores[:k]
    gains = []
    for i, score in enumerate(scores):
        discount = np.log2(i + 2)  # log2(rank + 1), where rank = i + 1
        gains.append(score / discount)
    return sum(gains)

def ndcg_at_k(
    retrieved: List[str],
    relevance_map: dict,  # {doc_id: relevance_score}
    k: int,
) -> float:
    """Calculate nDCG@K.

    Args:
        retrieved: Ordered list of retrieved document IDs.
        relevance_map: Maps doc_id to graded relevance (0, 1, 2, 3).
        k: Cutoff position.

    Returns:
        Float between 0 and 1.
    """
    # Actual relevance scores in retrieved order
    actual_scores = [
        relevance_map.get(doc_id, 0) for doc_id in retrieved[:k]
    ]
    actual_dcg = dcg_at_k(actual_scores, k)

    # Ideal ordering: sort all relevance scores descending
    ideal_scores = sorted(relevance_map.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_scores, k)

    if ideal_dcg == 0:
        return 0.0
    return actual_dcg / ideal_dcg

# Example with graded relevance (0=irrelevant, 1=marginal, 2=relevant, 3=highly relevant)
retrieved = ["doc_A", "doc_B", "doc_C", "doc_D", "doc_E"]
relevance = {
    "doc_A": 2,  # relevant
    "doc_B": 0,  # irrelevant
    "doc_C": 3,  # highly relevant
    "doc_D": 1,  # marginal
    "doc_F": 3,  # relevant but not retrieved
}
print(f"nDCG@5: {ndcg_at_k(retrieved, relevance, 5):.3f}")

Building a Test Set

Evaluation is only as good as your test set. Here is a structured approach to creating one.

from dataclasses import dataclass, field
from typing import List, Optional
import json

@dataclass
class SearchTestCase:
    query: str
    relevant_docs: dict  # {doc_id: relevance_grade}
    category: str = "general"
    difficulty: str = "medium"  # easy, medium, hard
    notes: Optional[str] = None

class TestSetBuilder:
    def __init__(self):
        self.test_cases: List[SearchTestCase] = []

    def add_from_query_log(
        self, query: str, clicked_docs: List[str], shown_docs: List[str]
    ):
        """Create a test case from click-through data."""
        relevance = {}
        for doc_id in clicked_docs:
            relevance[doc_id] = 2  # clicked = relevant
        for doc_id in shown_docs:
            if doc_id not in relevance:
                relevance[doc_id] = 0  # shown but not clicked
        self.test_cases.append(SearchTestCase(
            query=query,
            relevant_docs=relevance,
            category="click_log",
        ))

    def add_manual(
        self, query: str, relevance: dict, difficulty: str = "medium"
    ):
        """Add a manually annotated test case."""
        self.test_cases.append(SearchTestCase(
            query=query,
            relevant_docs=relevance,
            difficulty=difficulty,
        ))

    def save(self, path: str):
        data = [
            {
                "query": tc.query,
                "relevant_docs": tc.relevant_docs,
                "category": tc.category,
                "difficulty": tc.difficulty,
            }
            for tc in self.test_cases
        ]
        with open(path, "w") as f:
            json.dump(data, f, indent=2)

    def load(self, path: str):
        with open(path) as f:
            data = json.load(f)
        self.test_cases = [
            SearchTestCase(**item) for item in data
        ]
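Here is a quick usage sketch of the save/load round trip. To keep it self-contained it re-declares a minimal SearchTestCase; the queries, doc IDs, and file path are made up.

```python
import json
import os
import tempfile
from dataclasses import dataclass
from typing import Optional

# Minimal stand-in for the SearchTestCase above, just enough to show the
# JSON round trip. Queries and doc IDs are illustrative.
@dataclass
class SearchTestCase:
    query: str
    relevant_docs: dict  # {doc_id: relevance_grade}
    category: str = "general"
    difficulty: str = "medium"
    notes: Optional[str] = None

cases = [
    SearchTestCase("how do I reset my password", {"kb_101": 3, "kb_102": 1}),
    SearchTestCase("refund policy", {"kb_201": 2}, difficulty="easy"),
]

# Save in the same shape TestSetBuilder.save produces.
path = os.path.join(tempfile.mkdtemp(), "test_set.json")
with open(path, "w") as f:
    json.dump([{"query": c.query, "relevant_docs": c.relevant_docs,
                "category": c.category, "difficulty": c.difficulty}
               for c in cases], f, indent=2)

# Load back into dataclasses, exactly as TestSetBuilder.load does.
with open(path) as f:
    loaded = [SearchTestCase(**item) for item in json.load(f)]
print(len(loaded), loaded[0].query)
```

Keeping the test set as plain JSON makes it easy to version-control and diff alongside code changes.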

Running a Benchmark

class SearchBenchmark:
    def __init__(self, test_cases: List[SearchTestCase]):
        self.test_cases = test_cases

    def evaluate(
        self, search_fn, k_values: Optional[List[int]] = None
    ) -> dict:
        """Evaluate a search function against the test set."""
        if k_values is None:
            k_values = [1, 3, 5, 10]

        metrics = {f"ndcg@{k}": [] for k in k_values}
        metrics.update({f"recall@{k}": [] for k in k_values})
        metrics["mrr"] = []

        for tc in self.test_cases:
            results = search_fn(tc.query)
            retrieved_ids = [r["id"] for r in results]
            relevant_set = set(tc.relevant_docs.keys())

            for k in k_values:
                ndcg = ndcg_at_k(retrieved_ids, tc.relevant_docs, k)
                metrics[f"ndcg@{k}"].append(ndcg)
                rec = recall_at_k(retrieved_ids, relevant_set, k)
                metrics[f"recall@{k}"].append(rec)

            rr = reciprocal_rank(retrieved_ids, relevant_set)
            metrics["mrr"].append(rr)

        return {
            name: float(np.mean(values))
            for name, values in metrics.items()
        }
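To tie everything together, here is an end-to-end sketch: a toy keyword-overlap search_fn stands in for a real vector search, and the metric functions are re-declared in compact form so the snippet runs on its own. The corpus and test cases are invented for illustration.

```python
import math
from typing import List

# Compact re-declarations of the metrics above so this sketch is standalone.
def recall_at_k(retrieved, relevant, k):
    return len(set(retrieved[:k]) & relevant) / len(relevant) if relevant else 0.0

def reciprocal_rank(retrieved, relevant):
    for i, d in enumerate(retrieved):
        if d in relevant:
            return 1.0 / (i + 1)
    return 0.0

def dcg(scores, k):
    return sum(s / math.log2(i + 2) for i, s in enumerate(scores[:k]))

def ndcg_at_k(retrieved, relevance_map, k):
    actual = dcg([relevance_map.get(d, 0) for d in retrieved[:k]], k)
    ideal = dcg(sorted(relevance_map.values(), reverse=True), k)
    return actual / ideal if ideal else 0.0

# Toy corpus and ranking function; a real search_fn would query your index.
corpus = {"d1": "reset password email", "d2": "billing refund policy",
          "d3": "change password settings", "d4": "contact support phone"}

def search_fn(query: str) -> List[dict]:
    # Naive keyword-overlap ranking, purely illustrative.
    q = set(query.split())
    ranked = sorted(corpus, key=lambda d: -len(q & set(corpus[d].split())))
    return [{"id": d} for d in ranked]

test_cases = [
    {"query": "reset password", "relevant_docs": {"d1": 3, "d3": 2}},
    {"query": "refund policy", "relevant_docs": {"d2": 3, "d4": 1}},
]

for tc in test_cases:
    ids = [r["id"] for r in search_fn(tc["query"])]
    rel = set(tc["relevant_docs"])
    print(f"{tc['query']!r}: nDCG@3={ndcg_at_k(ids, tc['relevant_docs'], 3):.3f} "
          f"Recall@3={recall_at_k(ids, rel, 3):.2f} RR={reciprocal_rank(ids, rel):.2f}")
```

Swapping the toy search_fn for your embedding-based retriever, and the inline test cases for a loaded test set, gives you the full SearchBenchmark workflow described above.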

FAQ

How many test queries do I need for reliable evaluation?

Aim for at least 50 queries for directional insights and 200+ queries for statistically significant comparisons between search systems. Include a mix of query types: short keyword queries, natural language questions, ambiguous queries, and queries with no relevant results. Balance across your content categories.
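One way to check whether a metric difference between two systems is more than noise, given a couple of hundred queries, is a paired bootstrap over per-query scores. A sketch with synthetic score arrays standing in for real per-query nDCG values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-query nDCG for two systems on the SAME 200 queries (synthetic data);
# system B is constructed to be slightly better on average.
system_a = rng.uniform(0.4, 0.9, size=200)
system_b = system_a + rng.normal(0.03, 0.05, size=200)

def paired_bootstrap(a, b, n_resamples=10_000, seed=0):
    """Fraction of resamples where B does NOT beat A (a one-sided p-value)."""
    rng = np.random.default_rng(seed)
    diffs = b - a
    idx = rng.integers(0, len(diffs), size=(n_resamples, len(diffs)))
    resampled_means = diffs[idx].mean(axis=1)
    return float((resampled_means <= 0).mean())

p = paired_bootstrap(system_a, system_b)
print(f"mean A={system_a.mean():.3f}  mean B={system_b.mean():.3f}  p~{p:.4f}")
```

Pairing by query matters: it removes per-query difficulty as a source of variance, so smaller test sets can still detect real differences.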

Should I use binary or graded relevance judgments?

Graded relevance (0-3 scale) is more informative than binary (relevant/not relevant) because it captures the difference between a perfect answer and a marginally related document. Use graded relevance with nDCG for ranking evaluation, and binary relevance with Recall@K and MRR for simpler pass/fail evaluation. If manual annotation budget is limited, binary judgments are faster to produce.
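If you keep graded judgments in the test set, deriving binary labels for Recall@K and MRR is a one-liner: pick a grade threshold. The threshold of 2 below is an assumption; align it with your annotation rubric.

```python
# Graded judgments (0-3) kept for nDCG; binary labels derived for Recall/MRR.
grades = {"doc_A": 3, "doc_B": 1, "doc_C": 2, "doc_D": 0}

THRESHOLD = 2  # assumption: grade >= 2 counts as "relevant"
binary_relevant = {doc for doc, grade in grades.items() if grade >= THRESHOLD}
print(sorted(binary_relevant))  # ['doc_A', 'doc_C']
```

This lets one annotated test set serve both the graded nDCG evaluation and the binary pass/fail metrics.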

How do I detect when search quality has degraded over time?

Run your benchmark suite as part of your CI/CD pipeline or on a daily schedule. Set threshold alerts: if nDCG@10 drops more than 5% from the baseline, trigger a notification. Track metrics over time in a dashboard. Quality degradation often comes from data drift — new documents that shift the embedding space — rather than code changes.
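The threshold alert described above can be a few lines in CI: load the baseline metrics, run the benchmark, and flag any metric that dropped more than 5% relative to baseline. A sketch with hard-coded numbers standing in for real benchmark output:

```python
# Compare a benchmark run against a stored baseline; flag any metric with a
# >5% relative drop. The metric values here are illustrative placeholders.
baseline = {"ndcg@10": 0.78, "recall@10": 0.88, "mrr": 0.71}
current = {"ndcg@10": 0.72, "recall@10": 0.87, "mrr": 0.70}

MAX_RELATIVE_DROP = 0.05

regressions = {
    name: (baseline[name], value)
    for name, value in current.items()
    if value < baseline[name] * (1 - MAX_RELATIVE_DROP)
}

for name, (base, cur) in regressions.items():
    print(f"REGRESSION {name}: {base:.3f} -> {cur:.3f}")
# In CI, exit nonzero when regressions is non-empty (e.g. sys.exit(1)).
```

Storing the baseline as JSON next to the test set means a deliberate improvement is recorded by updating the baseline in the same commit.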


#SearchEvaluation #NDCG #MRR #RecallK #InformationRetrieval #AgenticAI #LearnAI #AIEngineering

CallSphere Team