RAG with Metadata Filtering: Narrowing Search with Structured Attributes
Learn how to use metadata filtering in RAG to narrow vector search results using structured attributes like document type, date ranges, departments, and access levels for more precise retrieval.
Why Metadata Filtering Matters
Vector similarity search finds semantically related content, but it has no concept of structured attributes. When a user asks "What was the Q3 2025 revenue?" the vector search might return revenue figures from any quarter because the numbers and language are all semantically similar. Metadata filtering solves this by restricting the search to documents tagged with the correct quarter, department, or document type before computing similarity.
Think of it as a WHERE clause for vector search. You get the precision of structured queries combined with the semantic understanding of embeddings.
Designing a Metadata Schema
Good metadata design starts with understanding how users will filter their searches. Here is a practical schema for a corporate knowledge base:
metadata_schema = {
    "source": str,        # "policies.md", "handbook.pdf"
    "department": str,    # "engineering", "hr", "finance"
    "doc_type": str,      # "policy", "tutorial", "report", "faq"
    "created_date": str,  # ISO format: "2025-09-15"
    "last_updated": str,  # ISO format: "2026-01-10"
    "access_level": str,  # "public", "internal", "confidential"
    "version": int,       # document version number
    "author": str,        # "jane.doe@company.com"
    "product": str,       # "platform", "mobile-app", "api"
}
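Since most vector stores do not enforce a schema at write time, it can help to validate metadata before indexing. A minimal sketch (the `validate_metadata` helper is hypothetical, not part of any library):

```python
def validate_metadata(metadata: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the metadata is valid."""
    problems = []
    for field, expected_type in schema.items():
        if field not in metadata:
            problems.append(f"missing field: {field}")
        elif not isinstance(metadata[field], expected_type):
            problems.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(metadata[field]).__name__}"
            )
    return problems

schema = {"department": str, "doc_type": str, "version": int}
print(validate_metadata({"department": "hr", "doc_type": "faq", "version": "2"}, schema))
# ['version: expected int, got str']
```

Rejecting bad metadata at index time is much cheaper than debugging silently empty filter results later.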
Apply this metadata during the chunking and indexing phase:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
def index_document(content: str, metadata: dict, vectorstore):
    """Chunk a document and store with metadata."""
    splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
    chunks = splitter.split_text(content)
    # Every chunk inherits the parent document's metadata
    metadatas = [metadata.copy() for _ in chunks]
    vectorstore.add_texts(texts=chunks, metadatas=metadatas)
# Usage
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma(
    persist_directory="./metadata_db",
    embedding_function=embeddings,
    collection_name="corp_docs",
)
index_document(
    content="Enterprise refund policy: Full refunds available within 30 days...",
    metadata={
        "source": "refund-policy.md",
        "department": "finance",
        "doc_type": "policy",
        "created_date": "2025-06-01",
        "last_updated": "2026-01-15",
        "access_level": "internal",
        "product": "platform",
    },
    vectorstore=vectorstore,
)
Pre-Filtering vs Post-Filtering
There are two approaches to combining metadata filters with vector search:
Pre-filtering narrows the candidate set before computing similarity: only documents matching the filter are considered, so every returned result is guaranteed to satisfy the filter and no similarity work is wasted on ineligible documents.
Post-filtering computes similarity across all documents first, then removes results that do not match the filter. This can return fewer results than requested if many of the top-k matches are filtered out.
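Post-filtering is easy to emulate on top of any store by over-fetching ranked candidates and discarding non-matches. A minimal sketch with plain dicts standing in for retrieved documents (the `post_filter` helper and the dict shape are illustrative assumptions):

```python
def post_filter(results, predicate, k):
    """Keep only results whose metadata passes the predicate, up to k.
    If too many candidates fail the predicate, fewer than k come back."""
    return [doc for doc in results if predicate(doc["metadata"])][:k]

# Simulated similarity-ranked candidates (over-fetched for k=3)
candidates = [
    {"content": "Q3 revenue report", "metadata": {"department": "finance"}},
    {"content": "Hiring policy", "metadata": {"department": "hr"}},
    {"content": "Q2 forecast", "metadata": {"department": "finance"}},
    {"content": "Onboarding FAQ", "metadata": {"department": "hr"}},
]
top = post_filter(candidates, lambda m: m["department"] == "finance", k=3)
print([d["content"] for d in top])  # ['Q3 revenue report', 'Q2 forecast']
```

Note that the call asks for k=3 but only two finance documents survive: exactly the shortfall that pre-filtering avoids.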
Most vector databases use pre-filtering by default. Here is how it works in Chroma:
# Pre-filtering: only search within finance department policies
results = vectorstore.similarity_search(
    query="What is the refund policy?",
    k=5,
    filter={
        "$and": [
            {"department": {"$eq": "finance"}},
            {"doc_type": {"$eq": "policy"}},
        ]
    },
)
for doc in results:
    print(f"[{doc.metadata['source']}] {doc.page_content[:100]}...")
Filter Operators Across Vector Databases
Each vector database supports a different filter syntax. One caveat: some stores restrict range operators such as $gte and $lte to numeric metadata (Pinecone, for example), in which case dates must be stored as numbers such as Unix timestamps or YYYYMMDD integers rather than ISO strings:
# --- Chroma ---
chroma_filter = {
    "$and": [
        {"department": {"$eq": "engineering"}},
        {"created_date": {"$gte": "2025-01-01"}},
    ]
}
# --- Pinecone ---
# Pinecone applies range operators only to numeric metadata,
# so store dates as numbers (Unix timestamp or YYYYMMDD integer)
pinecone_filter = {
    "$and": [
        {"department": {"$eq": "engineering"}},
        {"created_date": {"$gte": 20250101}},
    ]
}
# --- pgvector (via SQL WHERE clause) ---
pgvector_query = """
SELECT id, content, 1 - (embedding <=> %s::vector) AS similarity
FROM documents
WHERE metadata->>'department' = 'engineering'
AND metadata->>'created_date' >= '2025-01-01'
ORDER BY embedding <=> %s::vector
LIMIT 5
"""
# --- Weaviate ---
import weaviate.classes.query as wq
weaviate_filter = (
    wq.Filter.by_property("department").equal("engineering")
    & wq.Filter.by_property("created_date").greater_or_equal("2025-01-01")
)
Automatic Filter Extraction from Natural Language
Instead of requiring users to specify filters manually, use an LLM to extract structured filters from natural language queries:
from langchain_openai import ChatOpenAI
import json
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
def extract_filters(query: str) -> dict:
    """Extract metadata filters from a natural language query."""
    prompt = f"""Analyze this search query and extract any metadata filters.

Available filter fields:
- department: engineering, hr, finance, sales, support
- doc_type: policy, tutorial, report, faq, changelog
- product: platform, mobile-app, api
- access_level: public, internal, confidential
- date range: created_date or last_updated (ISO format)

Query: "{query}"

Return a JSON object with:
- "filters": dict of field->value pairs to filter on
- "search_query": the remaining natural language query for vector search

If no filters can be extracted, return an empty filters dict.
Return ONLY valid JSON."""
    response = llm.invoke(prompt)
    return json.loads(response.content)
# Examples
result = extract_filters("What engineering policies were updated after January 2026?")
print(json.dumps(result, indent=2))
# {
# "filters": {
# "department": "engineering",
# "doc_type": "policy",
# "last_updated": {"$gte": "2026-01-01"}
# },
# "search_query": "engineering policies"
# }
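json.loads will raise if the model wraps its answer in markdown fences or adds stray prose. A defensive parsing wrapper (an illustrative addition, not an OpenAI or LangChain feature) keeps the pipeline running:

```python
import json
import re

def parse_filter_response(raw: str) -> dict:
    """Parse the LLM's filter JSON, tolerating markdown fences and extra text.
    Falls back to 'no filters' so retrieval still runs on a parse failure."""
    # Grab the outermost {...} span, ignoring fences or prose around it
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass
    return {"filters": {}, "search_query": raw}

raw = '```json\n{"filters": {"department": "hr"}, "search_query": "leave policy"}\n```'
print(parse_filter_response(raw))
# {'filters': {'department': 'hr'}, 'search_query': 'leave policy'}
```

Falling back to an unfiltered search degrades gracefully: the user still gets semantically relevant results, just without the metadata narrowing.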
Then use the extracted filters in your retrieval:
def filtered_rag_query(user_query: str, vectorstore) -> dict:
    """Full RAG pipeline with automatic filter extraction."""
    # Extract filters from the natural language query
    parsed = extract_filters(user_query)
    filters = parsed.get("filters", {})
    search_query = parsed.get("search_query", user_query)

    # Build a Chroma filter from the extracted fields
    chroma_filter = None
    if filters:
        conditions = []
        for key, value in filters.items():
            if isinstance(value, dict):
                # Already an operator expression, e.g. {"$gte": "2026-01-01"}
                conditions.append({key: value})
            else:
                conditions.append({key: {"$eq": value}})
        if len(conditions) == 1:
            chroma_filter = conditions[0]
        elif len(conditions) > 1:
            chroma_filter = {"$and": conditions}

    # Retrieve with filters applied
    results = vectorstore.similarity_search(
        query=search_query,
        k=5,
        filter=chroma_filter,
    )
    return {
        "results": results,
        "filters_applied": filters,
        "search_query": search_query,
    }
Metadata for Access Control
In enterprise RAG, metadata filtering enforces access control. Different users should only see documents they are authorized to access:
def secure_retrieve(query: str, user_role: str, user_dept: str, vectorstore):
    """Retrieve documents respecting access control."""
    access_levels = {
        "admin": ["public", "internal", "confidential"],
        "manager": ["public", "internal"],
        "employee": ["public"],
    }
    allowed = access_levels.get(user_role, ["public"])
    results = vectorstore.similarity_search(
        query=query,
        k=5,
        filter={
            "$and": [
                {"access_level": {"$in": allowed}},
                {"department": {"$in": [user_dept, "company-wide"]}},
            ]
        },
    )
    return results
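Factoring the filter construction into a pure function makes the access logic unit-testable without a live vector store. A sketch reusing the role table above (the "company-wide" department sentinel is an assumption of this example):

```python
def build_access_filter(user_role: str, user_dept: str) -> dict:
    """Build a Chroma-style access-control filter for a user."""
    access_levels = {
        "admin": ["public", "internal", "confidential"],
        "manager": ["public", "internal"],
        "employee": ["public"],
    }
    # Unknown roles fall back to public-only access (fail closed)
    allowed = access_levels.get(user_role, ["public"])
    return {
        "$and": [
            {"access_level": {"$in": allowed}},
            {"department": {"$in": [user_dept, "company-wide"]}},
        ]
    }

print(build_access_filter("contractor", "engineering")["$and"][0])
# {'access_level': {'$in': ['public']}}
```

Failing closed on unknown roles is the important design choice here: a typo in a role name should never widen access.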
FAQ
Should I store metadata in the vector database or in a separate relational database?
For simple key-value metadata (department, type, date), storing it in the vector database is simpler and supports pre-filtering natively. For complex relational metadata (user permissions, organizational hierarchies, document relationships), store it in a relational database and use it to build filter conditions before querying the vector store. Many production systems use both: lightweight metadata in the vector DB for fast filtering, and rich relational data in PostgreSQL for complex access control.
How does metadata filtering affect search performance?
Pre-filtering narrows the search space, which generally makes vector search cheaper by reducing the number of similarity comparisons (though some approximate-nearest-neighbor indexes handle very restrictive filters less efficiently). The tradeoff is that overly restrictive filters can leave too few candidates, resulting in poor-quality matches. Monitor the number of candidates that survive filtering: if it drops below 50-100, your filters may be too narrow.
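One way to implement that guard is a retrieval wrapper that progressively drops filter conditions until enough candidates survive. A sketch with a pluggable search function (the helper and its drop-the-last-condition policy are illustrative assumptions, not a library feature):

```python
def retrieve_with_fallback(search_fn, query, conditions, k=5, min_results=3):
    """Retry retrieval with progressively fewer filter conditions.

    search_fn(query, k, filter_dict) -> list is a thin wrapper around
    your vector store, e.g. vectorstore.similarity_search. Conditions
    are ordered most- to least-important; the last is dropped first.
    """
    conds = list(conditions)
    while True:
        if len(conds) > 1:
            filt = {"$and": conds}
        else:
            filt = conds[0] if conds else None  # None = unfiltered search
        results = search_fn(query, k, filt)
        if len(results) >= min_results or not conds:
            return results, conds
        conds = conds[:-1]  # widen the search by dropping a condition

# Simulated store: the two-condition filter is too narrow, one condition works
def fake_search(query, k, filt):
    if isinstance(filt, dict) and "$and" in filt:
        return []
    return ["doc-a", "doc-b", "doc-c"]

results, used = retrieve_with_fallback(
    fake_search, "refund policy",
    [{"department": {"$eq": "finance"}}, {"doc_type": {"$eq": "policy"}}],
)
print(results, used)  # three docs come back once one condition is dropped
```

For access-control filters this widening must never apply; only relevance filters are safe to relax.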
Can I filter by date ranges in vector databases?
Yes, but the implementation varies. Chroma and Pinecone expose $gte and $lte operators, though Pinecone applies range operators only to numeric metadata, so store dates there as Unix timestamps or YYYYMMDD integers. Where string comparisons are supported, store dates as zero-padded ISO strings ("2026-01-15"), which sort lexicographically in chronological order. pgvector gives you full SQL date functions, and Weaviate supports native date types with range filters. Whichever store you use, keep the date format consistent during indexing.
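The lexicographic claim is easy to verify in plain Python: zero-padded ISO 8601 strings compare in the same order as the dates they represent.

```python
dates = ["2026-01-15", "2025-09-05", "2025-12-31"]
print(sorted(dates))  # ['2025-09-05', '2025-12-31', '2026-01-15']

# String comparison agrees with chronological comparison:
assert "2025-12-31" >= "2025-09-05"

# Caveat: zero-padding matters. Without it, string order diverges
# from date order ("2025-9-5" sorts after "2025-10-01"):
assert sorted(["2025-9-5", "2025-10-01"]) == ["2025-10-01", "2025-9-5"]
```

This is why the schema examples in this article always use two-digit months and days.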
#RAG #MetadataFiltering #VectorSearch #InformationRetrieval #SearchOptimization #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.