Agentic AI with Vector Databases: Building Semantic Search and RAG Agents
Build RAG-powered agentic AI with vector databases. Compare Pinecone, Weaviate, Chroma, and pgvector for semantic search agent systems.
Why Agents Need Vector Databases
Language models have impressive general knowledge but lack specific, up-to-date information about your business, products, and customers. Retrieval-Augmented Generation (RAG) bridges this gap by retrieving relevant documents from a knowledge base and injecting them into the agent's context before it generates a response.
Vector databases are purpose-built for the similarity search that RAG requires. They store document embeddings — dense numerical representations of text meaning — and efficiently retrieve the most semantically similar documents for any query. When a user asks your agent a question, the system embeds the question, searches the vector database for similar content, and provides the top results as context for the agent to reason with.
This guide covers the full pipeline: embedding generation, index design, hybrid search strategies, RAG agent patterns, vector database comparison, and chunking strategies.
Embedding Generation
Embeddings convert text into fixed-length numerical vectors where semantic similarity in text space maps to geometric proximity in vector space. Documents about similar topics produce embeddings that are close together.
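The "geometric proximity" idea can be shown with cosine similarity, the standard comparison metric for embeddings. The three-dimensional vectors below are toy stand-ins (real embedding models produce hundreds or thousands of dimensions), but the math is the same:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: closer to 1.0 = more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" for illustration only
billing_doc = [0.9, 0.1, 0.2]
invoice_doc = [0.8, 0.2, 0.1]
hiking_doc = [0.1, 0.9, 0.3]

print(cosine_similarity(billing_doc, invoice_doc))  # high: related topics
print(cosine_similarity(billing_doc, hiking_doc))   # lower: unrelated topics
```

Vector databases compute exactly this kind of comparison, but over millions of vectors using approximate nearest-neighbor indexes rather than a brute-force loop.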
Choosing an Embedding Model
| Model | Dimensions | Context Window | Speed | Quality |
|---|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | 8,191 tokens | Fast (API) | Excellent |
| OpenAI text-embedding-3-small | 1536 | 8,191 tokens | Fast (API) | Good |
| Cohere embed-v3 | 1024 | 512 tokens | Fast (API) | Very Good |
| BGE-large-en-v1.5 | 1024 | 512 tokens | Self-hosted | Very Good |
| all-MiniLM-L6-v2 | 384 | 256 tokens | Self-hosted, fast | Acceptable |
For production agent systems, OpenAI's text-embedding-3-large provides the best quality-to-convenience ratio. For self-hosted deployments where data cannot leave your infrastructure, BGE-large or similar open-source models are the standard choice.
Generating Embeddings
```python
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def generate_embedding(text: str) -> list[float]:
    response = await client.embeddings.create(
        model="text-embedding-3-large",
        input=text,
    )
    return response.data[0].embedding

async def generate_embeddings_batch(
    texts: list[str],
    batch_size: int = 100,
) -> list[list[float]]:
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = await client.embeddings.create(
            model="text-embedding-3-large",
            input=batch,
        )
        all_embeddings.extend([d.embedding for d in response.data])
    return all_embeddings
```
Index Design for Agent Workloads
Vector database index design depends on your query patterns, data volume, and latency requirements.
Metadata Filtering
Agent queries are rarely pure similarity search. They typically combine semantic similarity with metadata filters — "find documents similar to this query that were published in the last 30 days and belong to the 'billing' category."
```python
# Pinecone example with metadata filtering
results = index.query(
    vector=query_embedding,
    top_k=10,
    filter={
        "category": {"$eq": "billing"},
        "published_date": {"$gte": "2026-01-01"},
        "document_type": {"$in": ["faq", "guide", "policy"]},
    },
    include_metadata=True,
)
```
Design your metadata schema upfront. Common metadata fields for agent knowledge bases include document category or type, publication and last-updated dates, source system, access level or tenant ID, and language.
Namespace Separation
For multi-agent systems, use namespaces or separate collections to isolate different knowledge domains. A customer support agent's knowledge base should not be mixed with an internal HR agent's knowledge base.
Hybrid Search: Keyword + Semantic
Pure semantic search sometimes misses exact matches. If a user asks about "order #12345", the semantic embedding captures the concept of "asking about an order" but may not prioritize the exact order number match. Hybrid search combines semantic similarity with keyword (BM25) matching for better results.
```python
import asyncio

class HybridSearcher:
    def __init__(self, vector_store, keyword_index):
        self.vector_store = vector_store
        self.keyword_index = keyword_index

    async def search(
        self,
        query: str,
        query_embedding: list[float],
        top_k: int = 10,
        semantic_weight: float = 0.7,
        keyword_weight: float = 0.3,
    ) -> list[dict]:
        # Retrieve from both indexes in parallel
        semantic_results, keyword_results = await asyncio.gather(
            self.vector_store.query(query_embedding, top_k=top_k * 2),
            self.keyword_index.search(query, top_k=top_k * 2),
        )

        # Score fusion using Reciprocal Rank Fusion (RRF)
        scores: dict[str, float] = {}
        for rank, result in enumerate(semantic_results):
            doc_id = result["id"]
            scores[doc_id] = scores.get(doc_id, 0) + (
                semantic_weight / (rank + 60)
            )
        for rank, result in enumerate(keyword_results):
            doc_id = result["id"]
            scores[doc_id] = scores.get(doc_id, 0) + (
                keyword_weight / (rank + 60)
            )

        # Sort by fused score and return the top_k documents
        ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
        return [{"id": doc_id, "score": score} for doc_id, score in ranked[:top_k]]
```
RAG Agent Patterns
There are several architectural patterns for integrating RAG into agent systems.
Tool-Based RAG
The agent has a "search knowledge base" tool that it calls when it needs information. This is the most flexible pattern because the agent decides when to search and what to search for.
```python
async def search_knowledge_base(
    query: str,
    category: str | None = None,
    max_results: int = 5,
) -> list[dict]:
    """
    Search the company knowledge base for relevant information.

    Args:
        query: Natural language search query
        category: Optional filter by category (billing, technical, policy)
        max_results: Number of results to return (1-10)
    """
    embedding = await generate_embedding(query)

    filters = {}
    if category:
        filters["category"] = {"$eq": category}

    results = await vector_store.query(
        vector=embedding,
        top_k=max_results,
        filter=filters if filters else None,
        include_metadata=True,
    )
    return [
        {
            "title": r.metadata["title"],
            "content": r.metadata["content"],
            "source": r.metadata["source_url"],
            "relevance_score": r.score,
        }
        for r in results.matches
    ]
```
Automatic RAG (Pre-Retrieval)
Every user message triggers a retrieval step before the agent processes it. The retrieved documents are injected into the system prompt or user context automatically.
```python
async def process_with_rag(
    user_message: str,
    conversation_history: list[dict],
    agent_system_prompt: str,
) -> str:
    # Always retrieve context before the agent sees the message
    embedding = await generate_embedding(user_message)
    context_docs = await vector_store.query(vector=embedding, top_k=5)

    # Build the augmented prompt from sufficiently relevant documents
    context_block = "\n\n".join(
        f"[Source: {doc.metadata['title']}]\n{doc.metadata['content']}"
        for doc in context_docs.matches
        if doc.score > 0.7  # Relevance threshold
    )

    augmented_system = f"""{agent_system_prompt}

## Relevant Knowledge Base Context

{context_block}

Use the above context to answer the user's question. If the context
does not contain relevant information, say so rather than guessing.
"""

    response = await llm.chat(
        system=augmented_system,
        messages=conversation_history + [{"role": "user", "content": user_message}],
    )
    return response
```
Agentic RAG with Re-Ranking
The agent performs an initial retrieval, evaluates the results, and optionally refines the query and searches again. This iterative approach handles complex questions that require multiple retrieval passes.
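The control loop behind this pattern can be sketched as a retrieve-evaluate-refine cycle. The `search` and `refine_query` callables below are stand-ins for your retrieval backend and an LLM-powered query reformulation step, not any particular vendor's implementation:

```python
import asyncio
from typing import Awaitable, Callable

SearchFn = Callable[[str], Awaitable[list[dict]]]
RefineFn = Callable[[str, list[dict]], Awaitable[str]]

async def iterative_retrieve(
    query: str,
    search: SearchFn,
    refine_query: RefineFn,
    min_score: float = 0.7,
    max_passes: int = 3,
) -> list[dict]:
    """Retrieve, evaluate, and re-query until results clear the relevance bar."""
    results: list[dict] = []
    for _ in range(max_passes):
        results = await search(query)
        # Good enough? Stop iterating.
        if results and results[0]["score"] >= min_score:
            return results
        # Otherwise reformulate the query (e.g. via an LLM) and try again
        query = await refine_query(query, results)
    return results  # best effort after max_passes
```

Capping the number of passes is important in production: without `max_passes`, a query with no good answer in the knowledge base would loop indefinitely, burning latency and LLM tokens.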
CallSphere's IT helpdesk RAG system uses this agentic approach. When a support ticket comes in, the agent first searches for similar resolved tickets, evaluates whether the resolutions are applicable, and if not, searches the technical documentation with a refined query derived from its analysis of why the initial results were insufficient.
Vector Database Comparison
| Feature | Pinecone | Weaviate | Chroma | pgvector |
|---|---|---|---|---|
| Hosting | Managed cloud | Self-hosted or cloud | Self-hosted or embedded | PostgreSQL extension |
| Scale | Billions of vectors | Hundreds of millions | Millions | Millions |
| Hybrid search | Sparse + dense | BM25 + vector built-in | Basic metadata | Full PostgreSQL text search |
| Metadata filtering | Rich filters | GraphQL filters | Where clauses | SQL WHERE |
| Latency (p99) | < 50ms | < 100ms | < 50ms (embedded) | Varies by index type |
| Pricing | Per-usage | Free (self-hosted) | Free (open-source) | Free (PostgreSQL) |
| Best for | Production SaaS | Feature-rich self-hosted | Prototyping, small scale | Teams already on PostgreSQL |
Pinecone is the safe choice for production SaaS applications. It is fully managed, scales automatically, and provides consistent low-latency queries.
Weaviate is ideal for teams that want rich features (built-in hybrid search, GraphQL API) and are comfortable managing infrastructure.
Chroma is excellent for prototyping and small-scale applications. Its embedded mode means zero infrastructure overhead during development.
pgvector is the pragmatic choice for teams already running PostgreSQL. It avoids adding a new database to your stack and supports vector search through familiar SQL queries. Performance is adequate for knowledge bases under a few million documents.
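With pgvector, similarity search is just SQL: the `<=>` operator computes cosine distance, so ordering by it returns nearest neighbors. The query below is a sketch against a hypothetical `documents` table with an `embedding` vector column; you would execute it through your usual PostgreSQL driver with the query embedding and category as bound parameters:

```python
def pgvector_query_sql(top_k: int = 5) -> str:
    """SQL for cosine-distance nearest-neighbor search with pgvector.

    Assumes a hypothetical `documents` table with an `embedding vector`
    column; `<=>` is pgvector's cosine distance operator.
    """
    return f"""
        SELECT id, title, content,
               1 - (embedding <=> %(query_embedding)s::vector) AS similarity
        FROM documents
        WHERE category = %(category)s
        ORDER BY embedding <=> %(query_embedding)s::vector
        LIMIT {int(top_k)}
    """
```

Note how the metadata filter is a plain SQL `WHERE` clause and can use any existing index, which is the main ergonomic win over a separate vector database.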
Chunking Strategies
How you split documents into chunks for embedding significantly impacts retrieval quality.
Fixed-Size Chunking
Split documents into chunks of N tokens with M token overlap. Simple but ignores document structure.
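A minimal sketch of the sliding-window approach, using a pre-tokenized list (a real pipeline would use the embedding model's own tokenizer; whitespace tokens stand in here):

```python
def fixed_size_chunks(
    tokens: list[str],
    chunk_size: int = 200,
    overlap: int = 40,
) -> list[str]:
    """Split a token list into chunks of chunk_size, sharing overlap tokens."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # last window already covers the tail
    return chunks
```

The overlap exists so that a sentence straddling a chunk boundary still appears whole in at least one chunk; without it, boundary-spanning content is unretrievable.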
Semantic Chunking
Split on natural boundaries — paragraphs, sections, headings. Preserves semantic coherence within chunks.
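One common implementation splits on blank lines and greedily merges short paragraphs until a chunk approaches a word budget, so chunks follow the author's own structure. This is a simplified sketch (a production version would split on headings and sentences too):

```python
import re

def semantic_chunks(document: str, max_words: int = 150) -> list[str]:
    """Split on paragraph boundaries, merging paragraphs up to max_words."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for para in paragraphs:
        words = len(para.split())
        # Flush the current chunk if adding this paragraph would overflow it
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Because paragraphs are never split mid-way, every chunk is a coherent unit of meaning, which generally embeds and retrieves better than an arbitrary token window.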
Hierarchical Chunking
Create embeddings at multiple granularities — document summaries, section summaries, and paragraph-level chunks. Search at the appropriate level based on query type. Broad questions match document summaries; specific questions match paragraph chunks.
```python
class HierarchicalChunker:
    def chunk_document(self, document: str, metadata: dict) -> list[dict]:
        chunks = []

        # Level 1: Document summary
        summary = self.summarize(document)
        chunks.append({
            "content": summary,
            "level": "document",
            "metadata": {**metadata, "chunk_level": "summary"},
        })

        # Level 2: Section-level chunks
        sections = self.split_by_headings(document)
        for i, section in enumerate(sections):
            chunks.append({
                "content": section["content"],
                "level": "section",
                "metadata": {
                    **metadata,
                    "chunk_level": "section",
                    "section_title": section["heading"],
                    "section_index": i,
                },
            })

            # Level 3: Paragraph-level chunks within sections
            paragraphs = self.split_by_paragraphs(section["content"])
            for j, para in enumerate(paragraphs):
                if len(para.split()) > 30:  # Skip very short paragraphs
                    chunks.append({
                        "content": para,
                        "level": "paragraph",
                        "metadata": {
                            **metadata,
                            "chunk_level": "paragraph",
                            "section_title": section["heading"],
                            "paragraph_index": j,
                        },
                    })
        return chunks
```
Frequently Asked Questions
What is the difference between RAG and fine-tuning for agent knowledge?
RAG retrieves relevant information at query time and injects it into the agent's context. Fine-tuning modifies the model's weights to encode knowledge directly. RAG is better for frequently changing information (product catalogs, policies, knowledge bases) because you update the vector database without retraining. Fine-tuning is better for teaching the model new behaviors, styles, or domain-specific reasoning patterns that do not change frequently.
How many documents can a vector database handle for a RAG agent?
Modern vector databases scale to billions of vectors. Pinecone and Weaviate handle hundreds of millions of vectors in production. For most agent applications, the practical limit is not the vector database but the quality of your chunking and embedding — poorly chunked documents produce poor retrieval regardless of database scale.
How do you evaluate RAG quality for agents?
Measure retrieval quality (are the right documents being retrieved?) and generation quality (does the agent use the retrieved context correctly?). Key metrics include recall at k (what fraction of relevant documents appear in the top k results), precision at k (what fraction of retrieved documents are relevant), faithfulness (does the agent's response align with the retrieved context?), and answer relevancy (does the response actually answer the question?).
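Recall at k and precision at k are simple to compute once you have a labeled set of relevant document IDs per query. A minimal sketch:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top-k results."""
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant) if relevant else 0.0

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k if k else 0.0

# Example: 2 of 3 relevant docs retrieved in the top 5
retrieved = ["d1", "d7", "d3", "d9", "d2"]
relevant = {"d1", "d2", "d4"}
# recall@5 = 2/3, precision@5 = 2/5
```

Faithfulness and answer relevancy, by contrast, usually require an LLM-as-judge or human evaluation, since they measure properties of the generated text rather than the ranked list.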
Should I use pgvector or a dedicated vector database?
Use pgvector if you are already running PostgreSQL and your knowledge base is under 2-3 million documents. The operational simplicity of not adding another database to your stack is significant. Switch to a dedicated vector database when you need higher scale, faster query latency at high concurrency, or advanced features like built-in hybrid search.
How does CallSphere use RAG in its agent products?
CallSphere's IT helpdesk product uses an agentic RAG architecture where the support agent searches a vector database of resolved tickets and technical documentation. The agent performs iterative retrieval — if initial results are insufficient, it reformulates the query and searches again. This approach achieves higher resolution rates than single-pass RAG because complex issues often require synthesizing information from multiple knowledge base articles.
CallSphere Team
Expert insights on AI voice agents and customer communication automation.