
Hybrid Search for RAG: Combining Vector Similarity with Keyword Search

Learn how to implement hybrid search for RAG by combining BM25 keyword search with vector similarity, using reciprocal rank fusion and re-ranking to maximize retrieval quality.

Why Vector Search Alone Is Not Enough

Vector search excels at finding semantically similar content — it knows that "automobile" and "car" are related even though they share no characters. But it has blind spots. When a user searches for a specific error code like ERR_SSL_PROTOCOL_ERROR, an exact product name like iPhone 15 Pro Max, or an acronym like HIPAA, vector similarity can miss the exact match in favor of semantically similar but incorrect results.

Keyword search (BM25) excels at exact matching but fails on semantic understanding. It would not connect "how to terminate an employee" with a document titled "staff separation procedures."

Hybrid search combines both approaches so that each method's strengths cover the other's weaknesses. Production RAG systems at companies such as Anthropic, Google, and Microsoft rely heavily on hybrid retrieval for this reason.

BM25: The Keyword Search Foundation

BM25 (Best Match 25) is a probabilistic ranking function that scores documents based on term frequency, inverse document frequency, and document length normalization:

from rank_bm25 import BM25Okapi
import re

def tokenize(text: str) -> list[str]:
    """Lowercase the text and extract word tokens (\\w+ keeps underscores intact)."""
    return re.findall(r"\w+", text.lower())

# Index documents
documents = [
    "Enterprise refund policy allows full refunds within 30 days",
    "HIPAA compliance checklist for healthcare data processing",
    "Staff separation procedures and exit interview guidelines",
    "ERR_SSL_PROTOCOL_ERROR troubleshooting for nginx servers",
]

tokenized_docs = [tokenize(doc) for doc in documents]
bm25 = BM25Okapi(tokenized_docs)

# Search
query = "ERR_SSL_PROTOCOL_ERROR"
scores = bm25.get_scores(tokenize(query))

for doc, score in sorted(zip(documents, scores), key=lambda x: -x[1]):
    if score > 0:
        print(f"[BM25: {score:.2f}] {doc}")

BM25 finds the exact error code match immediately, something vector search might rank lower.

Implementing Hybrid Search from Scratch

Here is a complete hybrid search implementation that combines Chroma vector search with BM25:

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from rank_bm25 import BM25Okapi
import numpy as np
from dataclasses import dataclass

@dataclass
class SearchResult:
    content: str
    metadata: dict
    score: float
    source: str  # retrieval stage that produced this result, e.g. "hybrid" or "reranked"

class HybridRetriever:
    def __init__(self, documents: list[dict], persist_dir: str = "./hybrid_db"):
        self.documents = documents
        texts = [d["content"] for d in documents]

        # Build vector index
        embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
        self.vectorstore = Chroma.from_texts(
            texts=texts,
            embedding=embeddings,
            metadatas=[d.get("metadata", {}) for d in documents],
            persist_directory=persist_dir,
        )

        # Build BM25 index
        self.tokenized_docs = [self._tokenize(t) for t in texts]
        self.bm25 = BM25Okapi(self.tokenized_docs)
        self.raw_texts = texts

    def _tokenize(self, text: str) -> list[str]:
        import re
        return re.findall(r"\w+", text.lower())

    def search(self, query: str, k: int = 5, alpha: float = 0.7) -> list[SearchResult]:
        """
        Hybrid search with reciprocal rank fusion.
        alpha: weight for vector search (1-alpha for BM25)
        """
        # Vector search
        vector_results = self.vectorstore.similarity_search_with_score(query, k=k*2)

        # BM25 search
        bm25_scores = self.bm25.get_scores(self._tokenize(query))
        bm25_ranked = np.argsort(bm25_scores)[::-1][:k*2]

        # Reciprocal Rank Fusion
        rrf_scores = {}
        rrf_constant = 60  # standard RRF constant

        # Score vector results
        for rank, (doc, _score) in enumerate(vector_results):
            doc_key = doc.page_content
            rrf_scores[doc_key] = rrf_scores.get(doc_key, 0)
            rrf_scores[doc_key] += alpha * (1 / (rrf_constant + rank + 1))

        # Score BM25 results
        for rank, doc_idx in enumerate(bm25_ranked):
            if bm25_scores[doc_idx] > 0:
                doc_key = self.raw_texts[doc_idx]
                rrf_scores[doc_key] = rrf_scores.get(doc_key, 0)
                rrf_scores[doc_key] += (1 - alpha) * (1 / (rrf_constant + rank + 1))

        # Sort by combined score and return top k
        sorted_results = sorted(rrf_scores.items(), key=lambda x: -x[1])[:k]
        return [
            SearchResult(content=text, metadata={}, score=score, source="hybrid")
            for text, score in sorted_results
        ]

Reciprocal Rank Fusion Explained

RRF combines ranked lists from different retrieval methods without requiring score normalization. The formula for each document is:

RRF_score = sum(1 / (k + rank_i)) for each retrieval method i

Where k is a smoothing constant (typically 60, not to be confused with the top-k result count) that prevents any single high-ranked document from dominating the fused score. RRF works because ranks are comparable across methods even when raw scores are not: BM25 scores are unbounded and might span 0 to 15 on one corpus, while cosine similarities fall in a fixed range (roughly 0 to 1 for typical text embeddings).
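The fusion step is easy to see in isolation. Here is a minimal, self-contained RRF sketch over two hypothetical ranked lists of document IDs (the doc_a, doc_b names are purely illustrative):

```python
def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each method contributes 1 / (k + rank) for every doc it returns.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings from two retrievers
vector_ranking = ["doc_a", "doc_b", "doc_c"]
bm25_ranking = ["doc_c", "doc_a", "doc_d"]

fused = rrf_fuse([vector_ranking, bm25_ranking])
print(fused)  # doc_a and doc_c rise to the top: each appears in both lists
```

Note that doc_a and doc_c outrank doc_b even though doc_b was ranked second by the vector retriever; appearing in both lists is worth more than a single high rank, which is exactly the behavior hybrid search wants.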

Adding a Re-Ranker for Maximum Quality

A cross-encoder re-ranker takes the union of results from both methods and re-scores each document against the query. This is slower but significantly more accurate than bi-encoder similarity:

from sentence_transformers import CrossEncoder

class ReRankedHybridRetriever(HybridRetriever):
    def __init__(self, documents, persist_dir="./hybrid_db"):
        super().__init__(documents, persist_dir)
        self.reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

    def search_with_rerank(
        self, query: str, k: int = 5, initial_k: int = 20, alpha: float = 0.7
    ) -> list[SearchResult]:
        # Get initial candidates from hybrid search
        candidates = self.search(query, k=initial_k, alpha=alpha)

        # Re-rank with cross-encoder
        pairs = [(query, c.content) for c in candidates]
        rerank_scores = self.reranker.predict(pairs)

        # Sort by re-ranker scores
        reranked = sorted(
            zip(candidates, rerank_scores),
            key=lambda x: -x[1]
        )

        return [
            SearchResult(
                content=r.content,
                metadata=r.metadata,
                score=float(score),
                source="reranked"
            )
            for r, score in reranked[:k]
        ]

The pattern is: retrieve broadly (top 20-50 from hybrid search), then re-rank precisely (pick top 5).

Tuning the Alpha Parameter

The alpha parameter controls the balance between vector and keyword search. Optimal values depend on your data:

def tune_alpha(retriever, eval_queries, expected_docs, k=5):
    """Sweep alpha values and report Recall@k.

    expected_docs: for each query, a substring that must appear in the
    content of a relevant retrieved document.
    """
    best_alpha = 0.5
    best_recall = 0.0

    for alpha in [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]:
        hits = 0
        for query, expected_id in zip(eval_queries, expected_docs):
            results = retriever.search(query, k=k, alpha=alpha)
            retrieved = [r.content for r in results]
            if any(expected_id in r for r in retrieved):
                hits += 1
        recall = hits / len(eval_queries)
        print(f"alpha={alpha:.1f}: Recall@{k} = {recall:.2%}")
        if recall > best_recall:
            best_recall = recall
            best_alpha = alpha

    print(f"\nBest alpha: {best_alpha} (Recall@{k} = {best_recall:.2%})")
    return best_alpha

In practice, alpha between 0.5 and 0.7 works well for most RAG applications — slightly favoring vector search while still benefiting from keyword matching.

FAQ

When should I use pure vector search instead of hybrid?

Pure vector search is sufficient when your queries are natural language questions without specific identifiers (no product names, error codes, or acronyms) and your documents are written in consistent natural language. If your corpus contains technical content with specific terms that must match exactly, hybrid search will outperform vector-only retrieval.

Is re-ranking worth the added latency?

Re-ranking adds 50-200ms depending on the model and number of candidates. For user-facing applications where answer quality matters more than sub-second latency, re-ranking consistently improves retrieval quality by 10-25% on standard benchmarks. For high-throughput pipelines with tight per-query latency budgets, skip re-ranking.

Can I use hybrid search with Pinecone or pgvector?

Pinecone supports metadata filtering but not true BM25 keyword search. Weaviate has native hybrid search built in. For pgvector, you can implement BM25 separately using PostgreSQL full-text search (tsvector and tsquery) and combine results in your application layer using RRF, which works well since everything lives in the same database.
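For the pgvector route, the two retrieval legs can live in SQL while the fusion stays in the application. The sketch below assumes a hypothetical documents table with an embedding vector column and a tsv tsvector column; the table, column names, and dummy IDs are illustrative, not a prescribed schema:

```python
# Hypothetical schema: documents(id, content, embedding vector(1536), tsv tsvector)
VECTOR_SQL = """
    SELECT id FROM documents
    ORDER BY embedding <=> %(query_embedding)s  -- pgvector cosine distance
    LIMIT 20
"""

KEYWORD_SQL = """
    SELECT id FROM documents
    WHERE tsv @@ plainto_tsquery('english', %(query)s)
    ORDER BY ts_rank(tsv, plainto_tsquery('english', %(query)s)) DESC
    LIMIT 20
"""

def fuse_id_lists(vector_ids: list[int], keyword_ids: list[int],
                  k: int = 60) -> list[int]:
    """Combine the two ranked ID lists with reciprocal rank fusion."""
    scores: dict[int, float] = {}
    for ids in (vector_ids, keyword_ids):
        for rank, doc_id in enumerate(ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# With ID lists returned by the two queries (dummy values shown here):
merged = fuse_id_lists([11, 7, 3], [3, 11, 42])
```

Each SQL statement runs as an ordinary parameterized query; only the fused ID list then needs a final lookup to fetch document content.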


#RAG #HybridSearch #BM25 #VectorSearch #Reranking #InformationRetrieval #AgenticAI #LearnAI #AIEngineering

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
