Multi-Index RAG: Searching Across Multiple Document Collections Simultaneously
Learn how to build a multi-index RAG system that routes queries to appropriate collections, merges results, and normalizes relevance scores across heterogeneous document stores.
Why One Index Is Not Enough
Real organizations do not store all their knowledge in a single place. Product documentation lives in Confluence, customer conversations sit in a CRM, financial data resides in data warehouses, and research papers are in a separate repository. Each source has different document structures, update frequencies, and access patterns.
A single vector index that ingests everything creates problems. Embedding models optimized for technical documentation perform poorly on conversational support tickets. Chunking strategies that work for structured reports break down on free-form emails. And when your index grows to millions of documents, retrieval precision degrades because unrelated domains pollute each other's embedding space.
Multi-index RAG solves this by maintaining separate, optimized indexes for each document collection and intelligently routing queries to the right ones.
Architecture of Multi-Index RAG
A multi-index RAG system has three components working together:
- Index registry — Metadata about each collection: what it contains, when it was last updated, and what embedding model it uses
- Query router — Determines which indexes are relevant for a given query
- Result merger — Combines results from multiple indexes with normalized scoring
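Before the full implementation below, the three components can be sketched end to end with a toy keyword router and in-memory "indexes" (all names and data here are hypothetical, for illustration only):

```python
# Toy sketch: registry -> router -> merger, with hypothetical in-memory data.
# The sections below replace each piece with a real implementation.
registry = {
    "product_docs": {
        "keywords": {"install", "configure"},
        "docs": [("Install guide", 0.9), ("Config reference", 0.7)],
    },
    "support_tickets": {
        "keywords": {"refund", "billing"},
        "docs": [("Refund policy ticket", 0.8)],
    },
}

def route(query: str) -> list[str]:
    """Pick indexes whose keywords overlap the query; fall back to all."""
    words = set(query.lower().split())
    hits = [name for name, idx in registry.items() if idx["keywords"] & words]
    return hits or list(registry)

def search(query: str) -> list[tuple[str, float, str]]:
    """Fan the query out to routed indexes and merge results by score."""
    results = []
    for name in route(query):
        results.extend((doc, score, name) for doc, score in registry[name]["docs"])
    return sorted(results, key=lambda r: r[1], reverse=True)

print(search("how do I install the agent"))
```

The scores here are assumed to already be comparable; the normalization section below deals with the realistic case where they are not.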
Building the Index Registry and Router
```python
from dataclasses import dataclass, field
import json

from openai import OpenAI

client = OpenAI()

@dataclass
class IndexConfig:
    name: str
    description: str
    vectorstore: object  # FAISS, Pinecone, etc.
    embedding_model: str
    doc_count: int
    domains: list[str] = field(default_factory=list)
    score_type: str = "cosine"  # "cosine" or "distance", per the store

class MultiIndexRAG:
    def __init__(self, indexes: list[IndexConfig]):
        self.indexes = {idx.name: idx for idx in indexes}
        self.index_descriptions = "\n".join(
            f"- {idx.name}: {idx.description} "
            f"(domains: {', '.join(idx.domains)})"
            for idx in indexes
        )

    def route_query(self, query: str) -> list[str]:
        """Use an LLM to decide which indexes to search."""
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "system",
                "content": f"""Given a user query, select which indexes to search.
Available indexes:
{self.index_descriptions}

Return a JSON object with:
- indexes: list of index names to search
- reasoning: why these indexes were chosen""",
            }, {
                "role": "user",
                "content": query,
            }],
            response_format={"type": "json_object"},
        )
        result = json.loads(response.choices[0].message.content)
        # Guard against hallucinated names: keep only registered indexes
        return [name for name in result["indexes"] if name in self.indexes]
```
Normalizing Scores Across Indexes
Different vector stores return scores on different scales. FAISS returns L2 distances (lower is better), Pinecone returns cosine similarity (higher is better), and Chroma returns distances whose scale depends on the configured distance function. You must normalize before merging:
```python
@dataclass
class ScoredResult:
    content: str
    source_index: str
    raw_score: float
    normalized_score: float

def normalize_scores(
    results: list[tuple[str, float]],
    score_type: str = "cosine",
) -> list[tuple[str, float]]:
    """Min-max normalize scores to the 0-1 range (1.0 = best)."""
    if not results:
        return []
    scores = [s for _, s in results]
    min_s, max_s = min(scores), max(scores)
    if max_s == min_s:
        return [(doc, 1.0) for doc, _ in results]
    if score_type == "distance":
        # Lower distance = better, so invert the scale
        return [
            (doc, 1.0 - (s - min_s) / (max_s - min_s))
            for doc, s in results
        ]
    else:
        # Higher similarity = better
        return [
            (doc, (s - min_s) / (max_s - min_s))
            for doc, s in results
        ]
```
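The core of this normalization is plain min-max scaling with an optional inversion for distance metrics. A minimal standalone sketch, using made-up L2 distances as input:

```python
# Min-max scaling, as the normalization above does internally.
# invert=True handles distance metrics where lower is better.
def minmax(scores: list[float], invert: bool = False) -> list[float]:
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0] * len(scores)  # all ties rank equally
    norm = [(s - lo) / (hi - lo) for s in scores]
    return [1.0 - n for n in norm] if invert else norm

# Hypothetical FAISS-style L2 distances: 0.2 is the best match
print(minmax([0.2, 0.8, 0.5], invert=True))
```

Note that min-max scaling within each result set means the best hit from every index gets 1.0, even if one index's matches are objectively weaker; cross-encoder reranking is a common follow-up when that matters.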
Full Search and Merge Pipeline
```python
from concurrent.futures import ThreadPoolExecutor, as_completed

class MultiIndexRAG:
    # ... (previous methods)

    def search_single_index(
        self, index_name: str, query: str, k: int = 5
    ) -> list[ScoredResult]:
        """Search a single index and normalize its results."""
        config = self.indexes[index_name]
        raw_results = config.vectorstore.similarity_search_with_score(
            query, k=k
        )
        normalized = normalize_scores(
            [(doc.page_content, score) for doc, score in raw_results],
            # Use the scoring scheme declared for this store
            score_type=getattr(config, "score_type", "cosine"),
        )
        return [
            ScoredResult(
                content=content,
                source_index=index_name,
                raw_score=raw_results[i][1],
                normalized_score=norm_score,
            )
            for i, (content, norm_score) in enumerate(normalized)
        ]

    def search(
        self, query: str, k_per_index: int = 5, top_k: int = 10
    ) -> list[ScoredResult]:
        """Search across multiple indexes in parallel."""
        # Step 1: Route the query to relevant indexes
        target_indexes = self.route_query(query)

        # Step 2: Search all selected indexes in parallel
        all_results = []
        with ThreadPoolExecutor() as executor:
            futures = {
                executor.submit(
                    self.search_single_index, idx_name, query, k_per_index
                ): idx_name
                for idx_name in target_indexes
            }
            for future in as_completed(futures):
                all_results.extend(future.result())

        # Step 3: Sort by normalized score and return the top-K
        all_results.sort(key=lambda r: r.normalized_score, reverse=True)
        return all_results[:top_k]
```
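The merge step itself is just a sort over the normalized scores, regardless of which index each result came from. A standalone illustration with hypothetical results from two indexes:

```python
from dataclasses import dataclass

@dataclass
class ScoredResult:
    content: str
    source_index: str
    raw_score: float
    normalized_score: float

# Hypothetical normalized results gathered from two indexes:
# note the raw scores are on different scales (L2 distance vs cosine),
# but the normalized scores are directly comparable.
merged = [
    ScoredResult("Install guide", "product_docs", 0.12, 0.95),
    ScoredResult("Refund ticket", "support_tickets", 0.81, 0.81),
    ScoredResult("Config reference", "product_docs", 0.40, 0.60),
]
merged.sort(key=lambda r: r.normalized_score, reverse=True)
print([r.content for r in merged[:2]])  # top-2 across both indexes
```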
Keyword-Based Routing as a Fast Alternative
LLM-based routing adds latency. For production systems with predictable query patterns, use keyword or classifier-based routing instead:
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

class FastRouter:
    def __init__(self):
        self.vectorizer = TfidfVectorizer(max_features=5000)
        # LogisticRegression is not multi-label on its own:
        # wrap it in one-vs-rest and binarize the label lists
        self.binarizer = MultiLabelBinarizer()
        self.classifier = OneVsRestClassifier(LogisticRegression())

    def train(
        self,
        queries: list[str],
        labels: list[list[str]],
    ):
        """Train the router on historical query-to-index mappings."""
        X = self.vectorizer.fit_transform(queries)
        Y = self.binarizer.fit_transform(labels)
        self.classifier.fit(X, Y)

    def route(self, query: str) -> list[str]:
        X = self.vectorizer.transform([query])
        Y = self.classifier.predict(X)
        # Map the binary label matrix back to index names
        return list(self.binarizer.inverse_transform(Y)[0])
```
FAQ
How many indexes should I maintain separately versus combining?
Keep indexes separate when document types have fundamentally different structures, different optimal chunking strategies, or different access control requirements. A rule of thumb: if you would use a different embedding model or chunk size for two document types, they belong in separate indexes.
Does multi-index RAG increase latency compared to single-index search?
If you search indexes in parallel, the latency equals the slowest single-index search plus the routing overhead (50-300ms for LLM routing, under 5ms for classifier routing). This is often comparable to searching one very large index.
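The fan-out arithmetic is simple: with parallel searches, total latency is the slowest single index plus routing. A back-of-the-envelope check with made-up per-index latencies:

```python
# Hypothetical per-index search latencies for one query (ms)
index_latencies_ms = {"product_docs": 80, "support_tickets": 120, "research": 95}

routing_overhead_ms = 5  # classifier routing; ~50-300 for LLM routing

# Parallel fan-out: bounded by the slowest index, not the sum
total_ms = max(index_latencies_ms.values()) + routing_overhead_ms
print(total_ms)
```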
How do I handle access control across indexes?
Enforce access control at the index level. Each user query should first determine which indexes the user has permission to search, then route only among permitted indexes. This is simpler and more secure than row-level filtering within a combined index.
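The permission check is an intersection between the routed candidates and the user's allowed set, applied before any search runs. A minimal sketch, with a hypothetical `user_permissions` mapping standing in for your real ACL store:

```python
# Hypothetical per-user index permissions (in practice, from your ACL system)
user_permissions = {
    "alice": {"product_docs", "support_tickets"},
    "bob": {"product_docs"},
}

def permitted_indexes(user: str, candidate_indexes: list[str]) -> list[str]:
    """Drop any routed index the user may not search."""
    allowed = user_permissions.get(user, set())
    return [name for name in candidate_indexes if name in allowed]

# bob's query was routed to both indexes, but he can only search one
print(permitted_indexes("bob", ["support_tickets", "product_docs"]))
```

An unknown user gets an empty allowed set and therefore searches nothing, which is the safe default.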
#MultiIndexRAG #RAG #IndexRouting #VectorSearch #RelevanceNormalization #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.