Re-Ranking Search Results with Cross-Encoders: Improving Retrieval Precision
Understand the difference between bi-encoders and cross-encoders, then build a re-ranking pipeline that dramatically improves search precision by scoring query-document pairs jointly rather than independently.
The Precision Problem in First-Stage Retrieval
Bi-encoder models (like sentence-transformers) embed queries and documents independently, then compare them with cosine similarity. This independence is what makes them fast — you can pre-compute document embeddings — but it also limits their accuracy. A bi-encoder cannot model fine-grained interactions between specific query terms and specific document phrases.
Cross-encoders solve this by processing the query and document together as a single input pair, allowing the transformer's attention layers to directly compare every query token against every document token. The result is significantly higher precision, at the cost of speed.
Bi-Encoder vs Cross-Encoder
The key architectural difference:
- Bi-encoder: Embeds query and document separately, compares with dot product. Fast (pre-compute docs), but lower precision.
- Cross-encoder: Concatenates query + document, passes through transformer together, outputs a single relevance score. Slow (must run for each pair), but much higher precision.
The standard pattern is a two-stage pipeline: use a bi-encoder to retrieve the top 50-100 candidates quickly, then re-rank those candidates with a cross-encoder.
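The shape of that pipeline can be sketched with stub scorers — plain word overlap standing in for the bi-encoder, and a length-penalized overlap standing in for the cross-encoder. The scoring functions here are toys; only the retrieve-then-re-rank structure carries over to the real implementation below.

```python
def bi_score(query: str, doc: str) -> float:
    # Toy stand-in for bi-encoder similarity: fraction of query words present.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)


def cross_score(query: str, doc: str) -> float:
    # Toy stand-in for a cross-encoder: a scorer that sees query and
    # document together (here, overlap penalized by document length).
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / (len(d) + 1)


def two_stage_search(query, docs, retrieve_k=50, final_k=10):
    # Stage 1: cheap scoring over the whole corpus.
    candidates = sorted(docs, key=lambda d: bi_score(query, d), reverse=True)[:retrieve_k]
    # Stage 2: expensive scoring over the shortlist only.
    return sorted(candidates, key=lambda d: cross_score(query, d), reverse=True)[:final_k]


docs = [
    "fast red fox",
    "fast red fox jumps over the lazy dog",
    "blue whale",
]
print(two_stage_search("fast red fox", docs, retrieve_k=2, final_k=1))
# -> ['fast red fox']
```

Note that the first two documents tie under the overlap scorer; only the second-stage scorer, which looks at the pair as a whole, separates them.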
Building the Re-Ranking Pipeline
```python
from sentence_transformers import SentenceTransformer, CrossEncoder
import numpy as np
from typing import List, Dict, Tuple


class TwoStageSearchPipeline:
    def __init__(
        self,
        bi_encoder_name: str = "all-MiniLM-L6-v2",
        cross_encoder_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2",
    ):
        self.bi_encoder = SentenceTransformer(bi_encoder_name)
        self.cross_encoder = CrossEncoder(cross_encoder_name)
        self.doc_embeddings = None
        self.documents = []

    def index_documents(self, documents: List[Dict]):
        """Pre-compute bi-encoder embeddings for all documents."""
        self.documents = documents
        texts = [f"{d['title']}. {d['body']}" for d in documents]
        self.doc_embeddings = self.bi_encoder.encode(
            texts, normalize_embeddings=True, show_progress_bar=True
        )

    def first_stage_retrieve(
        self, query: str, top_k: int = 50
    ) -> List[Tuple[int, float]]:
        """Fast retrieval using bi-encoder similarity."""
        query_emb = self.bi_encoder.encode(
            [query], normalize_embeddings=True
        )
        scores = np.dot(self.doc_embeddings, query_emb.T).flatten()
        top_indices = np.argsort(scores)[::-1][:top_k]
        return [(idx, scores[idx]) for idx in top_indices]

    def re_rank(
        self, query: str, candidates: List[Tuple[int, float]], top_k: int = 10
    ) -> List[Dict]:
        """Re-rank candidates using cross-encoder."""
        pairs = []
        for idx, _ in candidates:
            doc = self.documents[idx]
            text = f"{doc['title']}. {doc['body']}"
            pairs.append((query, text))

        # Cross-encoder scores each (query, document) pair jointly
        ce_scores = self.cross_encoder.predict(pairs)

        # Sort by cross-encoder score
        scored = list(zip(candidates, ce_scores))
        scored.sort(key=lambda x: x[1], reverse=True)

        results = []
        for (idx, bi_score), ce_score in scored[:top_k]:
            doc = self.documents[idx].copy()
            doc["bi_encoder_score"] = float(bi_score)
            doc["cross_encoder_score"] = float(ce_score)
            results.append(doc)
        return results

    def search(self, query: str, retrieve_k: int = 50, final_k: int = 10):
        candidates = self.first_stage_retrieve(query, top_k=retrieve_k)
        return self.re_rank(query, candidates, top_k=final_k)
```
Choosing the Right Cross-Encoder Model
Model selection depends on your latency budget:
```python
# Model comparison (approximate, on CPU)
CROSS_ENCODER_MODELS = {
    # Model name: (params, ms/pair, nDCG@10 on MS MARCO)
    "cross-encoder/ms-marco-TinyBERT-L-2-v2": ("4.4M", 1.5, 0.325),
    "cross-encoder/ms-marco-MiniLM-L-6-v2": ("22.7M", 4.0, 0.349),
    "cross-encoder/ms-marco-MiniLM-L-12-v2": ("33.4M", 8.0, 0.357),
    "cross-encoder/ms-marco-electra-base": ("109M", 12.0, 0.365),
}


def select_model(latency_budget_ms: float, num_candidates: int) -> str:
    """Select the best model that fits within the latency budget."""
    for name, (params, ms_per_pair, quality) in sorted(
        CROSS_ENCODER_MODELS.items(),
        key=lambda x: x[1][2],
        reverse=True,  # prefer higher quality
    ):
        total_latency = ms_per_pair * num_candidates
        if total_latency <= latency_budget_ms:
            return name
    return "cross-encoder/ms-marco-TinyBERT-L-2-v2"  # fallback
```
Managing Latency
Cross-encoders are expensive. Re-ranking 100 candidates with a 12-layer model at 8ms per pair takes 800ms. Strategies to reduce this:
- Reduce candidate count — retrieve 30-50 instead of 100. Diminishing returns beyond the top 50.
- Use smaller models — TinyBERT at 1.5ms/pair re-ranks 50 candidates in 75ms.
- Batch on GPU — GPU batching drops per-pair time by 10x.
- Cache re-ranked results — popular queries hit the same documents repeatedly.
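The arithmetic behind these numbers is worth making explicit. A small helper — reusing the approximate per-pair CPU timings from the model table above, with `gpu_speedup` as an assumed batching factor — estimates total re-ranking latency as per-pair cost times candidate count:

```python
# Approximate per-pair CPU latencies (ms) from the model table above.
PER_PAIR_MS = {
    "cross-encoder/ms-marco-TinyBERT-L-2-v2": 1.5,
    "cross-encoder/ms-marco-MiniLM-L-6-v2": 4.0,
    "cross-encoder/ms-marco-MiniLM-L-12-v2": 8.0,
}


def rerank_latency_ms(model: str, num_candidates: int, gpu_speedup: float = 1.0) -> float:
    """Estimated latency: per-pair cost times candidate count, divided by any batching speedup."""
    return PER_PAIR_MS[model] * num_candidates / gpu_speedup


# 100 candidates through the 12-layer model on CPU:
print(rerank_latency_ms("cross-encoder/ms-marco-MiniLM-L-12-v2", 100))  # 800.0
# 50 candidates through TinyBERT:
print(rerank_latency_ms("cross-encoder/ms-marco-TinyBERT-L-2-v2", 50))  # 75.0
```

A 10x GPU batching speedup brings the 12-layer, 100-candidate case down to 80 ms, which fits most interactive budgets.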
```python
import hashlib

from sentence_transformers import CrossEncoder


class CachedReRanker:
    def __init__(self, cross_encoder: CrossEncoder, cache_size: int = 1024):
        self.cross_encoder = cross_encoder
        self._cache = {}
        self.cache_size = cache_size

    def _cache_key(self, query: str, doc_text: str) -> str:
        combined = f"{query}|||{doc_text}"
        return hashlib.md5(combined.encode()).hexdigest()

    def predict(self, pairs: list) -> list:
        scores = []
        uncached_pairs = []
        uncached_indices = []

        for i, (query, doc) in enumerate(pairs):
            key = self._cache_key(query, doc)
            if key in self._cache:
                scores.append(self._cache[key])
            else:
                scores.append(None)
                uncached_pairs.append((query, doc))
                uncached_indices.append(i)

        if uncached_pairs:
            new_scores = self.cross_encoder.predict(uncached_pairs)
            for idx, score in zip(uncached_indices, new_scores):
                key = self._cache_key(*pairs[idx])
                if len(self._cache) >= self.cache_size:
                    # Evict the oldest entry so the cache honors cache_size
                    # (dicts preserve insertion order).
                    self._cache.pop(next(iter(self._cache)))
                self._cache[key] = float(score)
                scores[idx] = float(score)
        return scores
```
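To see the cache pay off without loading a real model, the same pattern can be exercised with a stub scorer that counts how often inference actually runs. Both `CountingStubModel` and `TinyCachedReRanker` are hypothetical names for this sketch — the latter is a trimmed copy of the wrapper above, and the stub works because the wrapper only ever calls `.predict()` on whatever model it is given:

```python
import hashlib


class CountingStubModel:
    """Hypothetical stand-in for a CrossEncoder: canned scores, call counter."""

    def __init__(self):
        self.calls = 0

    def predict(self, pairs):
        self.calls += 1
        return [0.5 for _ in pairs]


class TinyCachedReRanker:
    """Trimmed version of the cached wrapper, for demonstration."""

    def __init__(self, model):
        self.model = model
        self._cache = {}

    def _key(self, query, doc):
        return hashlib.md5(f"{query}|||{doc}".encode()).hexdigest()

    def predict(self, pairs):
        out, missing, where = [], [], []
        for i, (q, d) in enumerate(pairs):
            k = self._key(q, d)
            if k in self._cache:
                out.append(self._cache[k])
            else:
                out.append(None)
                missing.append((q, d))
                where.append(i)
        if missing:
            for i, s in zip(where, self.model.predict(missing)):
                self._cache[self._key(*pairs[i])] = s
                out[i] = s
        return out


stub = CountingStubModel()
reranker = TinyCachedReRanker(stub)
pairs = [("what is rag", "doc a"), ("what is rag", "doc b")]
reranker.predict(pairs)
reranker.predict(pairs)  # served entirely from cache
print(stub.calls)  # 1 -- the model ran only once
```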
Measuring the Impact
Re-ranking typically improves nDCG@10 by 15-30% over bi-encoder-only retrieval. The improvement is most pronounced for ambiguous or complex queries where surface-level similarity is misleading.
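To quantify that lift on your own data, nDCG@10 is straightforward to compute from graded relevance labels listed in ranked order. This is the standard formula, not anything specific to this pipeline, and the example relevance lists below are illustrative:

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG normalized by the DCG of the ideal (sorted) ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels of the top results, in the order each system returned them:
bi_only = [0, 1, 0, 2, 1]   # relevant docs buried lower
reranked = [2, 1, 1, 0, 0]  # relevant docs promoted to the top
print(ndcg_at_k(bi_only), ndcg_at_k(reranked))  # reranked scores 1.0 (ideal order)
```

Run this over a held-out query set before and after enabling the cross-encoder stage to measure the improvement on your corpus rather than trusting benchmark numbers.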
FAQ
When should I skip re-ranking and use only a bi-encoder?
Skip re-ranking when latency is critical (under 50ms), when your corpus is small enough that a flat exact search is already precise, or when queries are simple keyword lookups. Re-ranking shines on natural language questions and long-form queries where nuance matters.
Can I fine-tune a cross-encoder on my own data?
Yes, and it is one of the highest-impact improvements you can make. Collect query-document relevance pairs from click logs or manual annotations. Even 1,000-2,000 labeled pairs can significantly boost domain-specific precision. Use the sentence-transformers training API with CrossEncoder.fit().
How many candidates should the first stage retrieve for re-ranking?
Start with 50 candidates. Going beyond 100 rarely improves final results because relevant documents almost always appear in the top 50 of a decent bi-encoder. Profile your pipeline to find the sweet spot between recall and re-ranking latency.