Multi-Index RAG: Searching Across Multiple Document Collections Simultaneously
Learn how to build a multi-index RAG system that routes queries to appropriate collections, merges results, and normalizes relevance scores across heterogeneous document stores.
Why One Index Is Not Enough
Real organizations do not store all their knowledge in a single place. Product documentation lives in Confluence, customer conversations sit in a CRM, financial data resides in data warehouses, and research papers are in a separate repository. Each source has different document structures, update frequencies, and access patterns.
A single vector index that ingests everything creates problems. Embedding models optimized for technical documentation perform poorly on conversational support tickets. Chunking strategies that work for structured reports break down on free-form emails. And when your index grows to millions of documents, retrieval precision degrades because unrelated domains pollute each other's embedding space.
Multi-index RAG solves this by maintaining separate, optimized indexes for each document collection and intelligently routing queries to the right ones.
Architecture of Multi-Index RAG
A multi-index RAG system has three components working together:
- Index registry — Metadata about each collection: what it contains, when it was last updated, and what embedding model it uses
- Query router — Determines which indexes are relevant for a given query
- Result merger — Combines results from multiple indexes with normalized scoring
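Before the full implementation below, the three components can be sketched end to end with a toy keyword router and in-memory "indexes" (all names and data here are hypothetical, for illustration only):

```python
# Toy sketch: registry -> router -> merger, with hypothetical in-memory data.
# The sections below replace each piece with a real implementation.
registry = {
    "product_docs": {
        "keywords": {"install", "configure"},
        "docs": [("Install guide", 0.9), ("Config reference", 0.7)],
    },
    "support_tickets": {
        "keywords": {"refund", "billing"},
        "docs": [("Refund policy ticket", 0.8)],
    },
}

def route(query: str) -> list[str]:
    """Pick indexes whose keywords overlap the query; fall back to all."""
    words = set(query.lower().split())
    hits = [name for name, idx in registry.items() if idx["keywords"] & words]
    return hits or list(registry)

def search(query: str) -> list[tuple[str, float, str]]:
    """Fan the query out to routed indexes and merge results by score."""
    results = []
    for name in route(query):
        results.extend((doc, score, name) for doc, score in registry[name]["docs"])
    return sorted(results, key=lambda r: r[1], reverse=True)

print(search("how do I install the agent"))
```

The scores here are assumed to already be comparable; the normalization section below deals with the realistic case where they are not.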
Building the Index Registry and Router
```python
from dataclasses import dataclass, field
import json

from openai import OpenAI

client = OpenAI()

@dataclass
class IndexConfig:
    name: str
    description: str
    vectorstore: object  # FAISS, Pinecone, etc.
    embedding_model: str
    doc_count: int
    domains: list[str] = field(default_factory=list)
    score_type: str = "cosine"  # "cosine" or "distance", per the store

class MultiIndexRAG:
    def __init__(self, indexes: list[IndexConfig]):
        self.indexes = {idx.name: idx for idx in indexes}
        self.index_descriptions = "\n".join(
            f"- {idx.name}: {idx.description} "
            f"(domains: {', '.join(idx.domains)})"
            for idx in indexes
        )

    def route_query(self, query: str) -> list[str]:
        """Use an LLM to decide which indexes to search."""
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "system",
                "content": f"""Given a user query, select which indexes to search.
Available indexes:
{self.index_descriptions}

Return a JSON object with:
- indexes: list of index names to search
- reasoning: why these indexes were chosen""",
            }, {
                "role": "user",
                "content": query,
            }],
            response_format={"type": "json_object"},
        )
        result = json.loads(response.choices[0].message.content)
        # Guard against hallucinated names: keep only registered indexes
        return [name for name in result["indexes"] if name in self.indexes]
```
Normalizing Scores Across Indexes
Different vector stores return scores on different scales. FAISS returns L2 distances (lower is better), Pinecone returns cosine similarity (higher is better), and Chroma returns distances whose scale depends on the configured distance function. You must normalize before merging:
```python
@dataclass
class ScoredResult:
    content: str
    source_index: str
    raw_score: float
    normalized_score: float

def normalize_scores(
    results: list[tuple[str, float]],
    score_type: str = "cosine",
) -> list[tuple[str, float]]:
    """Min-max normalize scores to the 0-1 range (1.0 = best)."""
    if not results:
        return []
    scores = [s for _, s in results]
    min_s, max_s = min(scores), max(scores)
    if max_s == min_s:
        return [(doc, 1.0) for doc, _ in results]
    if score_type == "distance":
        # Lower distance = better, so invert the scale
        return [
            (doc, 1.0 - (s - min_s) / (max_s - min_s))
            for doc, s in results
        ]
    else:
        # Higher similarity = better
        return [
            (doc, (s - min_s) / (max_s - min_s))
            for doc, s in results
        ]
```
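The core of this normalization is plain min-max scaling with an optional inversion for distance metrics. A minimal standalone sketch, using made-up L2 distances as input:

```python
# Min-max scaling, as the normalization above does internally.
# invert=True handles distance metrics where lower is better.
def minmax(scores: list[float], invert: bool = False) -> list[float]:
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0] * len(scores)  # all ties rank equally
    norm = [(s - lo) / (hi - lo) for s in scores]
    return [1.0 - n for n in norm] if invert else norm

# Hypothetical FAISS-style L2 distances: 0.2 is the best match
print(minmax([0.2, 0.8, 0.5], invert=True))
```

Note that min-max scaling within each result set means the best hit from every index gets 1.0, even if one index's matches are objectively weaker; cross-encoder reranking is a common follow-up when that matters.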
Full Search and Merge Pipeline
```python
from concurrent.futures import ThreadPoolExecutor, as_completed

class MultiIndexRAG:
    # ... (previous methods)

    def search_single_index(
        self, index_name: str, query: str, k: int = 5
    ) -> list[ScoredResult]:
        """Search a single index and normalize its results."""
        config = self.indexes[index_name]
        raw_results = config.vectorstore.similarity_search_with_score(
            query, k=k
        )
        normalized = normalize_scores(
            [(doc.page_content, score) for doc, score in raw_results],
            # Use the scoring scheme declared for this store
            score_type=getattr(config, "score_type", "cosine"),
        )
        return [
            ScoredResult(
                content=content,
                source_index=index_name,
                raw_score=raw_results[i][1],
                normalized_score=norm_score,
            )
            for i, (content, norm_score) in enumerate(normalized)
        ]

    def search(
        self, query: str, k_per_index: int = 5, top_k: int = 10
    ) -> list[ScoredResult]:
        """Search across multiple indexes in parallel."""
        # Step 1: Route the query to relevant indexes
        target_indexes = self.route_query(query)

        # Step 2: Search all selected indexes in parallel
        all_results = []
        with ThreadPoolExecutor() as executor:
            futures = {
                executor.submit(
                    self.search_single_index, idx_name, query, k_per_index
                ): idx_name
                for idx_name in target_indexes
            }
            for future in as_completed(futures):
                all_results.extend(future.result())

        # Step 3: Sort by normalized score and return the top-K
        all_results.sort(key=lambda r: r.normalized_score, reverse=True)
        return all_results[:top_k]
```
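The merge step itself is just a sort over the normalized scores, regardless of which index each result came from. A standalone illustration with hypothetical results from two indexes:

```python
from dataclasses import dataclass

@dataclass
class ScoredResult:
    content: str
    source_index: str
    raw_score: float
    normalized_score: float

# Hypothetical normalized results gathered from two indexes:
# note the raw scores are on different scales (L2 distance vs cosine),
# but the normalized scores are directly comparable.
merged = [
    ScoredResult("Install guide", "product_docs", 0.12, 0.95),
    ScoredResult("Refund ticket", "support_tickets", 0.81, 0.81),
    ScoredResult("Config reference", "product_docs", 0.40, 0.60),
]
merged.sort(key=lambda r: r.normalized_score, reverse=True)
print([r.content for r in merged[:2]])  # top-2 across both indexes
```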
Keyword-Based Routing as a Fast Alternative
LLM-based routing adds latency. For production systems with predictable query patterns, use keyword or classifier-based routing instead:
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

class FastRouter:
    def __init__(self):
        self.vectorizer = TfidfVectorizer(max_features=5000)
        # LogisticRegression is not multi-label on its own:
        # wrap it in one-vs-rest and binarize the label lists
        self.binarizer = MultiLabelBinarizer()
        self.classifier = OneVsRestClassifier(LogisticRegression())

    def train(
        self,
        queries: list[str],
        labels: list[list[str]],
    ):
        """Train the router on historical query-to-index mappings."""
        X = self.vectorizer.fit_transform(queries)
        Y = self.binarizer.fit_transform(labels)
        self.classifier.fit(X, Y)

    def route(self, query: str) -> list[str]:
        X = self.vectorizer.transform([query])
        Y = self.classifier.predict(X)
        # Map the binary label matrix back to index names
        return list(self.binarizer.inverse_transform(Y)[0])
```
FAQ
How many indexes should I maintain separately versus combining?
Keep indexes separate when document types have fundamentally different structures, different optimal chunking strategies, or different access control requirements. A rule of thumb: if you would use a different embedding model or chunk size for two document types, they belong in separate indexes.
Does multi-index RAG increase latency compared to single-index search?
If you search indexes in parallel, the latency equals the slowest single-index search plus the routing overhead (50-300ms for LLM routing, under 5ms for classifier routing). This is often comparable to searching one very large index.
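The fan-out arithmetic is simple: with parallel searches, total latency is the slowest single index plus routing. A back-of-the-envelope check with made-up per-index latencies:

```python
# Hypothetical per-index search latencies for one query (ms)
index_latencies_ms = {"product_docs": 80, "support_tickets": 120, "research": 95}

routing_overhead_ms = 5  # classifier routing; ~50-300 for LLM routing

# Parallel fan-out: bounded by the slowest index, not the sum
total_ms = max(index_latencies_ms.values()) + routing_overhead_ms
print(total_ms)
```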
How do I handle access control across indexes?
Enforce access control at the index level. Each user query should first determine which indexes the user has permission to search, then route only among permitted indexes. This is simpler and more secure than row-level filtering within a combined index.
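The permission check is an intersection between the routed candidates and the user's allowed set, applied before any search runs. A minimal sketch, with a hypothetical `user_permissions` mapping standing in for your real ACL store:

```python
# Hypothetical per-user index permissions (in practice, from your ACL system)
user_permissions = {
    "alice": {"product_docs", "support_tickets"},
    "bob": {"product_docs"},
}

def permitted_indexes(user: str, candidate_indexes: list[str]) -> list[str]:
    """Drop any routed index the user may not search."""
    allowed = user_permissions.get(user, set())
    return [name for name in candidate_indexes if name in allowed]

# bob's query was routed to both indexes, but he can only search one
print(permitted_indexes("bob", ["support_tickets", "product_docs"]))
```

An unknown user gets an empty allowed set and therefore searches nothing, which is the safe default.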
#MultiIndexRAG #RAG #IndexRouting #VectorSearch #RelevanceNormalization #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.