RAG with Metadata Filtering: Narrowing Search with Structured Attributes
Learn how to use metadata filtering in RAG to narrow vector search results using structured attributes like document type, date ranges, departments, and access levels for more precise retrieval.
Why Metadata Filtering Matters
Vector similarity search finds semantically related content, but it has no concept of structured attributes. When a user asks "What was the Q3 2025 revenue?" the vector search might return revenue figures from any quarter because the numbers and language are all semantically similar. Metadata filtering solves this by restricting the search to documents tagged with the correct quarter, department, or document type before computing similarity.
Think of it as a WHERE clause for vector search. You get the precision of structured queries combined with the semantic understanding of embeddings.
Designing a Metadata Schema
Good metadata design starts with understanding how users will filter their searches. Here is a practical schema for a corporate knowledge base:
metadata_schema = {
    "source": str,        # "policies.md", "handbook.pdf"
    "department": str,    # "engineering", "hr", "finance"
    "doc_type": str,      # "policy", "tutorial", "report", "faq"
    "created_date": str,  # ISO format: "2025-09-15"
    "last_updated": str,  # ISO format: "2026-01-10"
    "access_level": str,  # "public", "internal", "confidential"
    "version": int,       # document version number
    "author": str,        # "jane.doe@company.com"
    "product": str,       # "platform", "mobile-app", "api"
}
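Since most vector stores do not enforce a schema at write time, it can help to validate metadata before indexing. A minimal sketch (the `validate_metadata` helper is hypothetical, not part of any library):

```python
def validate_metadata(metadata: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the metadata is valid."""
    problems = []
    for field, expected_type in schema.items():
        if field not in metadata:
            problems.append(f"missing field: {field}")
        elif not isinstance(metadata[field], expected_type):
            problems.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(metadata[field]).__name__}"
            )
    return problems

schema = {"department": str, "doc_type": str, "version": int}
print(validate_metadata({"department": "hr", "doc_type": "faq", "version": "2"}, schema))
# ['version: expected int, got str']
```

Rejecting bad metadata at index time is much cheaper than debugging silently empty filter results later.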
Apply this metadata during the chunking and indexing phase:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
def index_document(content: str, metadata: dict, vectorstore):
    """Chunk a document and store with metadata."""
    splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
    chunks = splitter.split_text(content)
    # Every chunk inherits the parent document's metadata
    metadatas = [metadata.copy() for _ in chunks]
    vectorstore.add_texts(texts=chunks, metadatas=metadatas)
# Usage
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma(
    persist_directory="./metadata_db",
    embedding_function=embeddings,
    collection_name="corp_docs",
)
index_document(
    content="Enterprise refund policy: Full refunds available within 30 days...",
    metadata={
        "source": "refund-policy.md",
        "department": "finance",
        "doc_type": "policy",
        "created_date": "2025-06-01",
        "last_updated": "2026-01-15",
        "access_level": "internal",
        "product": "platform",
    },
    vectorstore=vectorstore,
)
Pre-Filtering vs Post-Filtering
There are two approaches to combining metadata filters with vector search:
Pre-filtering narrows the candidate set before computing similarity: only documents matching the filter are considered, so every returned result is guaranteed to satisfy the filter and no similarity work is wasted on ineligible documents.
Post-filtering computes similarity across all documents first, then removes results that do not match the filter. This can return fewer results than requested if many of the top-k matches are filtered out.
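Post-filtering is easy to emulate on top of any store by over-fetching ranked candidates and discarding non-matches. A minimal sketch with plain dicts standing in for retrieved documents (the `post_filter` helper and the dict shape are illustrative assumptions):

```python
def post_filter(results, predicate, k):
    """Keep only results whose metadata passes the predicate, up to k.
    If too many candidates fail the predicate, fewer than k come back."""
    return [doc for doc in results if predicate(doc["metadata"])][:k]

# Simulated similarity-ranked candidates (over-fetched for k=3)
candidates = [
    {"content": "Q3 revenue report", "metadata": {"department": "finance"}},
    {"content": "Hiring policy", "metadata": {"department": "hr"}},
    {"content": "Q2 forecast", "metadata": {"department": "finance"}},
    {"content": "Onboarding FAQ", "metadata": {"department": "hr"}},
]
top = post_filter(candidates, lambda m: m["department"] == "finance", k=3)
print([d["content"] for d in top])  # ['Q3 revenue report', 'Q2 forecast']
```

Note that the call asks for k=3 but only two finance documents survive: exactly the shortfall that pre-filtering avoids.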
Most vector databases use pre-filtering by default. Here is how it works in Chroma:
# Pre-filtering: only search within finance department policies
results = vectorstore.similarity_search(
    query="What is the refund policy?",
    k=5,
    filter={
        "$and": [
            {"department": {"$eq": "finance"}},
            {"doc_type": {"$eq": "policy"}},
        ]
    },
)
for doc in results:
    print(f"[{doc.metadata['source']}] {doc.page_content[:100]}...")
Filter Operators Across Vector Databases
Each vector database supports a different filter syntax. One caveat: some stores restrict range operators such as $gte and $lte to numeric metadata (Pinecone, for example), in which case dates must be stored as numbers such as Unix timestamps or YYYYMMDD integers rather than ISO strings:
# --- Chroma ---
chroma_filter = {
    "$and": [
        {"department": {"$eq": "engineering"}},
        {"created_date": {"$gte": "2025-01-01"}},
    ]
}
# --- Pinecone ---
# Pinecone applies range operators only to numeric metadata,
# so store dates as numbers (Unix timestamp or YYYYMMDD integer)
pinecone_filter = {
    "$and": [
        {"department": {"$eq": "engineering"}},
        {"created_date": {"$gte": 20250101}},
    ]
}
# --- pgvector (via SQL WHERE clause) ---
pgvector_query = """
SELECT id, content, 1 - (embedding <=> %s::vector) AS similarity
FROM documents
WHERE metadata->>'department' = 'engineering'
AND metadata->>'created_date' >= '2025-01-01'
ORDER BY embedding <=> %s::vector
LIMIT 5
"""
# --- Weaviate ---
import weaviate.classes.query as wq
weaviate_filter = (
    wq.Filter.by_property("department").equal("engineering")
    & wq.Filter.by_property("created_date").greater_or_equal("2025-01-01")
)
Automatic Filter Extraction from Natural Language
Instead of requiring users to specify filters manually, use an LLM to extract structured filters from natural language queries:
from langchain_openai import ChatOpenAI
import json
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
def extract_filters(query: str) -> dict:
    """Extract metadata filters from a natural language query."""
    prompt = f"""Analyze this search query and extract any metadata filters.

Available filter fields:
- department: engineering, hr, finance, sales, support
- doc_type: policy, tutorial, report, faq, changelog
- product: platform, mobile-app, api
- access_level: public, internal, confidential
- date range: created_date or last_updated (ISO format)

Query: "{query}"

Return a JSON object with:
- "filters": dict of field->value pairs to filter on
- "search_query": the remaining natural language query for vector search

If no filters can be extracted, return an empty filters dict.
Return ONLY valid JSON."""
    response = llm.invoke(prompt)
    return json.loads(response.content)
# Examples
result = extract_filters("What engineering policies were updated after January 2026?")
print(json.dumps(result, indent=2))
# {
# "filters": {
# "department": "engineering",
# "doc_type": "policy",
# "last_updated": {"$gte": "2026-01-01"}
# },
# "search_query": "engineering policies"
# }
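json.loads will raise if the model wraps its answer in markdown fences or adds stray prose. A defensive parsing wrapper (an illustrative addition, not an OpenAI or LangChain feature) keeps the pipeline running:

```python
import json
import re

def parse_filter_response(raw: str) -> dict:
    """Parse the LLM's filter JSON, tolerating markdown fences and extra text.
    Falls back to 'no filters' so retrieval still runs on a parse failure."""
    # Grab the outermost {...} span, ignoring fences or prose around it
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass
    return {"filters": {}, "search_query": raw}

raw = '```json\n{"filters": {"department": "hr"}, "search_query": "leave policy"}\n```'
print(parse_filter_response(raw))
# {'filters': {'department': 'hr'}, 'search_query': 'leave policy'}
```

Falling back to an unfiltered search degrades gracefully: the user still gets semantically relevant results, just without the metadata narrowing.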
Then use the extracted filters in your retrieval:
def filtered_rag_query(user_query: str, vectorstore) -> dict:
    """Full RAG pipeline with automatic filter extraction."""
    # Extract filters from the natural language query
    parsed = extract_filters(user_query)
    filters = parsed.get("filters", {})
    search_query = parsed.get("search_query", user_query)

    # Build a Chroma filter from the extracted fields
    chroma_filter = None
    if filters:
        conditions = []
        for key, value in filters.items():
            if isinstance(value, dict):
                # Already an operator expression, e.g. {"$gte": "2026-01-01"}
                conditions.append({key: value})
            else:
                conditions.append({key: {"$eq": value}})
        if len(conditions) == 1:
            chroma_filter = conditions[0]
        elif len(conditions) > 1:
            chroma_filter = {"$and": conditions}

    # Retrieve with filters applied
    results = vectorstore.similarity_search(
        query=search_query,
        k=5,
        filter=chroma_filter,
    )
    return {
        "results": results,
        "filters_applied": filters,
        "search_query": search_query,
    }
Metadata for Access Control
In enterprise RAG, metadata filtering enforces access control. Different users should only see documents they are authorized to access:
def secure_retrieve(query: str, user_role: str, user_dept: str, vectorstore):
    """Retrieve documents respecting access control."""
    access_levels = {
        "admin": ["public", "internal", "confidential"],
        "manager": ["public", "internal"],
        "employee": ["public"],
    }
    allowed = access_levels.get(user_role, ["public"])
    results = vectorstore.similarity_search(
        query=query,
        k=5,
        filter={
            "$and": [
                {"access_level": {"$in": allowed}},
                {"department": {"$in": [user_dept, "company-wide"]}},
            ]
        },
    )
    return results
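Factoring the filter construction into a pure function makes the access logic unit-testable without a live vector store. A sketch reusing the role table above (the "company-wide" department sentinel is an assumption of this example):

```python
def build_access_filter(user_role: str, user_dept: str) -> dict:
    """Build a Chroma-style access-control filter for a user."""
    access_levels = {
        "admin": ["public", "internal", "confidential"],
        "manager": ["public", "internal"],
        "employee": ["public"],
    }
    # Unknown roles fall back to public-only access (fail closed)
    allowed = access_levels.get(user_role, ["public"])
    return {
        "$and": [
            {"access_level": {"$in": allowed}},
            {"department": {"$in": [user_dept, "company-wide"]}},
        ]
    }

print(build_access_filter("contractor", "engineering")["$and"][0])
# {'access_level': {'$in': ['public']}}
```

Failing closed on unknown roles is the important design choice here: a typo in a role name should never widen access.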
FAQ
Should I store metadata in the vector database or in a separate relational database?
For simple key-value metadata (department, type, date), storing it in the vector database is simpler and supports pre-filtering natively. For complex relational metadata (user permissions, organizational hierarchies, document relationships), store it in a relational database and use it to build filter conditions before querying the vector store. Many production systems use both: lightweight metadata in the vector DB for fast filtering, and rich relational data in PostgreSQL for complex access control.
How does metadata filtering affect search performance?
Pre-filtering narrows the search space, which generally makes vector search cheaper by reducing the number of similarity comparisons (though some approximate-nearest-neighbor indexes handle very restrictive filters less efficiently). The tradeoff is that overly restrictive filters can leave too few candidates, resulting in poor-quality matches. Monitor the number of candidates that survive filtering: if it drops below 50-100, your filters may be too narrow.
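One way to implement that guard is a retrieval wrapper that progressively drops filter conditions until enough candidates survive. A sketch with a pluggable search function (the helper and its drop-the-last-condition policy are illustrative assumptions, not a library feature):

```python
def retrieve_with_fallback(search_fn, query, conditions, k=5, min_results=3):
    """Retry retrieval with progressively fewer filter conditions.

    search_fn(query, k, filter_dict) -> list is a thin wrapper around
    your vector store, e.g. vectorstore.similarity_search. Conditions
    are ordered most- to least-important; the last is dropped first.
    """
    conds = list(conditions)
    while True:
        if len(conds) > 1:
            filt = {"$and": conds}
        else:
            filt = conds[0] if conds else None  # None = unfiltered search
        results = search_fn(query, k, filt)
        if len(results) >= min_results or not conds:
            return results, conds
        conds = conds[:-1]  # widen the search by dropping a condition

# Simulated store: the two-condition filter is too narrow, one condition works
def fake_search(query, k, filt):
    if isinstance(filt, dict) and "$and" in filt:
        return []
    return ["doc-a", "doc-b", "doc-c"]

results, used = retrieve_with_fallback(
    fake_search, "refund policy",
    [{"department": {"$eq": "finance"}}, {"doc_type": {"$eq": "policy"}}],
)
print(results, used)  # three docs come back once one condition is dropped
```

For access-control filters this widening must never apply; only relevance filters are safe to relax.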
Can I filter by date ranges in vector databases?
Yes, but the implementation varies. Chroma and Pinecone expose $gte and $lte operators, though Pinecone applies range operators only to numeric metadata, so store dates there as Unix timestamps or YYYYMMDD integers. Where string comparisons are supported, store dates as zero-padded ISO strings ("2026-01-15"), which sort lexicographically in chronological order. pgvector gives you full SQL date functions, and Weaviate supports native date types with range filters. Whichever store you use, keep the date format consistent during indexing.
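The lexicographic claim is easy to verify in plain Python: zero-padded ISO 8601 strings compare in the same order as the dates they represent.

```python
dates = ["2026-01-15", "2025-09-05", "2025-12-31"]
print(sorted(dates))  # ['2025-09-05', '2025-12-31', '2026-01-15']

# String comparison agrees with chronological comparison:
assert "2025-12-31" >= "2025-09-05"

# Caveat: zero-padding matters. Without it, string order diverges
# from date order ("2025-9-5" sorts after "2025-10-01"):
assert sorted(["2025-9-5", "2025-10-01"]) == ["2025-10-01", "2025-9-5"]
```

This is why the schema examples in this article always use two-digit months and days.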
#RAG #MetadataFiltering #VectorSearch #InformationRetrieval #SearchOptimization #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.