Hybrid Search for RAG: Combining Vector Similarity with Keyword Search
Learn how to implement hybrid search for RAG by combining BM25 keyword search with vector similarity, using reciprocal rank fusion and re-ranking to maximize retrieval quality.
Why Vector Search Alone Is Not Enough
Vector search excels at finding semantically similar content — it knows that "automobile" and "car" are related even though they share no characters. But it has blind spots. When a user searches for a specific error code like ERR_SSL_PROTOCOL_ERROR, an exact product name like iPhone 15 Pro Max, or an acronym like HIPAA, vector similarity can miss the exact match in favor of semantically similar but incorrect results.
Keyword search (BM25) excels at exact matching but fails on semantic understanding. It would not connect "how to terminate an employee" with a document titled "staff separation procedures."
Hybrid search combines both approaches, covering each method's weaknesses with the other's strengths. Most production RAG stacks now support hybrid retrieval, and major vector databases ship it as a built-in feature.
BM25: The Keyword Search Foundation
BM25 (Best Match 25) is a probabilistic ranking function that scores documents based on term frequency, inverse document frequency, and document length normalization:
from rank_bm25 import BM25Okapi
import re

def tokenize(text: str) -> list[str]:
    """Lowercase word tokenizer."""
    return re.findall(r"\w+", text.lower())

# Index documents
documents = [
    "Enterprise refund policy allows full refunds within 30 days",
    "HIPAA compliance checklist for healthcare data processing",
    "Staff separation procedures and exit interview guidelines",
    "ERR_SSL_PROTOCOL_ERROR troubleshooting for nginx servers",
]
tokenized_docs = [tokenize(doc) for doc in documents]
bm25 = BM25Okapi(tokenized_docs)

# Search
query = "ERR_SSL_PROTOCOL_ERROR"
scores = bm25.get_scores(tokenize(query))
for doc, score in sorted(zip(documents, scores), key=lambda x: -x[1]):
    if score > 0:
        print(f"[BM25: {score:.2f}] {doc}")
BM25 finds the exact error code match immediately, something vector search might rank lower.
Implementing Hybrid Search from Scratch
Here is a complete hybrid search implementation that combines Chroma vector search with BM25:
from dataclasses import dataclass
import re

import numpy as np
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from rank_bm25 import BM25Okapi

@dataclass
class SearchResult:
    content: str
    metadata: dict
    score: float
    source: str  # "hybrid" or "reranked"

class HybridRetriever:
    def __init__(self, documents: list[dict], persist_dir: str = "./hybrid_db"):
        self.documents = documents
        texts = [d["content"] for d in documents]

        # Build vector index
        embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
        self.vectorstore = Chroma.from_texts(
            texts=texts,
            embedding=embeddings,
            metadatas=[d.get("metadata", {}) for d in documents],
            persist_directory=persist_dir,
        )

        # Build BM25 index
        self.tokenized_docs = [self._tokenize(t) for t in texts]
        self.bm25 = BM25Okapi(self.tokenized_docs)
        self.raw_texts = texts

    def _tokenize(self, text: str) -> list[str]:
        return re.findall(r"\w+", text.lower())

    def search(self, query: str, k: int = 5, alpha: float = 0.7) -> list[SearchResult]:
        """
        Hybrid search with weighted reciprocal rank fusion.

        alpha: weight for vector search (1 - alpha for BM25)
        """
        # Vector search
        vector_results = self.vectorstore.similarity_search_with_score(query, k=k * 2)

        # BM25 search
        bm25_scores = self.bm25.get_scores(self._tokenize(query))
        bm25_ranked = np.argsort(bm25_scores)[::-1][: k * 2]

        # Reciprocal Rank Fusion
        rrf_scores = {}
        rrf_constant = 60  # standard RRF constant

        # Score vector results
        for rank, (doc, _score) in enumerate(vector_results):
            doc_key = doc.page_content
            rrf_scores[doc_key] = rrf_scores.get(doc_key, 0)
            rrf_scores[doc_key] += alpha * (1 / (rrf_constant + rank + 1))

        # Score BM25 results
        for rank, doc_idx in enumerate(bm25_ranked):
            if bm25_scores[doc_idx] > 0:
                doc_key = self.raw_texts[doc_idx]
                rrf_scores[doc_key] = rrf_scores.get(doc_key, 0)
                rrf_scores[doc_key] += (1 - alpha) * (1 / (rrf_constant + rank + 1))

        # Sort by combined score and return top k
        sorted_results = sorted(rrf_scores.items(), key=lambda x: -x[1])[:k]
        return [
            SearchResult(content=text, metadata={}, score=score, source="hybrid")
            for text, score in sorted_results
        ]
Reciprocal Rank Fusion Explained
RRF combines ranked lists from different retrieval methods without requiring score normalization. The formula for each document is:
RRF_score = sum(1 / (k + rank_i)) for each retrieval method i
Where k is a constant (typically 60) that prevents high-ranked documents from dominating. This works because ranks are comparable across methods even when raw scores are not — BM25 scores might range from 0-15 while vector cosine similarities range from 0-1.
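The formula is easy to verify by hand. Here is a minimal, self-contained sketch of unweighted RRF over two hypothetical ranked lists of document IDs (the IDs and lists are made up for illustration):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Fuse several ranked lists of doc IDs with reciprocal rank fusion."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores

# The two methods disagree on order, but "B" sits near the top of both
vector_ranking = ["B", "A", "C"]
bm25_ranking = ["D", "B", "A"]

fused = rrf_fuse([vector_ranking, bm25_ranking])
for doc_id, score in sorted(fused.items(), key=lambda x: -x[1]):
    print(f"{doc_id}: {score:.4f}")
```

"B" wins the fused ranking because it is highly ranked by both methods, even though neither method put it in a unique first place with a dominant raw score. That consensus effect is exactly why RRF works without score normalization.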
Adding a Re-Ranker for Maximum Quality
A cross-encoder re-ranker takes the union of results from both methods and re-scores each document against the query. This is slower but significantly more accurate than bi-encoder similarity:
from sentence_transformers import CrossEncoder

class ReRankedHybridRetriever(HybridRetriever):
    def __init__(self, documents, persist_dir="./hybrid_db"):
        super().__init__(documents, persist_dir)
        self.reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

    def search_with_rerank(
        self, query: str, k: int = 5, initial_k: int = 20, alpha: float = 0.7
    ) -> list[SearchResult]:
        # Get initial candidates from hybrid search
        candidates = self.search(query, k=initial_k, alpha=alpha)

        # Re-rank with cross-encoder
        pairs = [(query, c.content) for c in candidates]
        rerank_scores = self.reranker.predict(pairs)

        # Sort by re-ranker scores
        reranked = sorted(
            zip(candidates, rerank_scores),
            key=lambda x: -x[1],
        )
        return [
            SearchResult(
                content=r.content,
                metadata=r.metadata,
                score=float(score),
                source="reranked",
            )
            for r, score in reranked[:k]
        ]
The pattern is: retrieve broadly (top 20-50 from hybrid search), then re-rank precisely (pick top 5).
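To see the two-stage pattern in isolation, here is a toy sketch that swaps the real components for stand-ins: a cheap term-overlap retriever for stage one and a hypothetical `toy_rerank_score` function in place of the cross-encoder (which would require downloading model weights). The documents and query are invented for illustration:

```python
def retrieve_candidates(query: str, docs: list[str], initial_k: int) -> list[str]:
    """Stage 1 (stand-in): broad, recall-oriented retrieval by term overlap."""
    def overlap(doc: str) -> int:
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(docs, key=overlap, reverse=True)[:initial_k]

def toy_rerank_score(query: str, doc: str) -> float:
    """Stage 2 (stand-in): a 'precise' score; a real system would call
    CrossEncoder.predict on (query, doc) pairs here instead."""
    q_terms = query.lower().split()
    return sum(doc.lower().count(t) for t in q_terms) / len(doc.split())

docs = [
    "refund refund refund policy details and more words here padding",
    "refund policy",
    "shipping policy",
    "cats and dogs",
]
query = "refund policy"

candidates = retrieve_candidates(query, docs, initial_k=3)  # broad first pass
reranked = sorted(candidates, key=lambda d: -toy_rerank_score(query, d))[:2]
```

The first stage ranks the keyword-stuffed document on top, but the second stage reorders the candidate pool and promotes the most relevant one, which is the whole point of re-ranking a broad candidate set.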
Tuning the Alpha Parameter
The alpha parameter controls the balance between vector and keyword search. Optimal values depend on your data:
def tune_alpha(retriever, eval_queries, expected_docs, k=5):
    """Find the best alpha by sweeping values."""
    best_alpha = 0.5
    best_recall = 0.0
    for alpha in [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]:
        hits = 0
        for query, expected_id in zip(eval_queries, expected_docs):
            results = retriever.search(query, k=k, alpha=alpha)
            retrieved = [r.content for r in results]
            if any(expected_id in r for r in retrieved):
                hits += 1
        recall = hits / len(eval_queries)
        print(f"alpha={alpha:.1f}: Recall@{k} = {recall:.2%}")
        if recall > best_recall:
            best_recall = recall
            best_alpha = alpha
    print(f"\nBest alpha: {best_alpha} (Recall@{k} = {best_recall:.2%})")
    return best_alpha
In practice, alpha between 0.5 and 0.7 works well for most RAG applications — slightly favoring vector search while still benefiting from keyword matching.
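The effect of alpha is easy to demonstrate with the same weighted-RRF arithmetic the retriever uses. In this sketch the two document labels and rankings are hypothetical: the two methods disagree on the top result, and alpha decides which one wins:

```python
def weighted_rrf_top(vector_ranking, bm25_ranking, alpha, k=60):
    """Return the top document under alpha-weighted reciprocal rank fusion."""
    scores = {}
    for rank, doc in enumerate(vector_ranking):
        scores[doc] = scores.get(doc, 0.0) + alpha / (k + rank + 1)
    for rank, doc in enumerate(bm25_ranking):
        scores[doc] = scores.get(doc, 0.0) + (1 - alpha) / (k + rank + 1)
    return max(scores, key=scores.get)

# The two methods disagree on the winner
vector_ranking = ["semantic-match", "exact-match"]
bm25_ranking = ["exact-match", "semantic-match"]

high = weighted_rrf_top(vector_ranking, bm25_ranking, alpha=0.9)  # vector dominates
low = weighted_rrf_top(vector_ranking, bm25_ranking, alpha=0.1)   # BM25 dominates
print(high, low)
```

At alpha near 1.0 the fused ranking follows the vector results; near 0.0 it follows BM25. The sweep in `tune_alpha` above simply searches this spectrum for the value that maximizes recall on your own evaluation set.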
FAQ
When should I use pure vector search instead of hybrid?
Pure vector search is sufficient when your queries are natural language questions without specific identifiers (no product names, error codes, or acronyms) and your documents are written in consistent natural language. If your corpus contains technical content with specific terms that must match exactly, hybrid search will outperform vector-only retrieval.
Is re-ranking worth the added latency?
Re-ranking typically adds 50-200 ms depending on the model and the number of candidates. For user-facing applications where answer quality matters more than sub-second latency, re-ranking consistently improves retrieval quality, often by 10-25% on standard benchmarks. For latency-critical paths or very high-throughput pipelines, skip re-ranking.
Can I use hybrid search with Pinecone or pgvector?
Pinecone supports hybrid retrieval through sparse-dense vectors: you store sparse BM25-style term weights alongside dense embeddings and blend the two at query time. Weaviate has native hybrid search built in. For pgvector, you can approximate keyword search with PostgreSQL full-text search (tsvector and tsquery), which is not exactly BM25 but is a capable lexical ranker, and combine the two result lists in your application layer with RRF; keeping both indexes in the same database makes this straightforward.
#RAG #HybridSearch #BM25 #VectorSearch #Reranking #InformationRetrieval #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.