Agentic AI with Vector Databases: Building Semantic Search and RAG Agents
Build RAG-powered agentic AI with vector databases. Compare Pinecone, Weaviate, Chroma, and pgvector for semantic search agent systems.
Why Agents Need Vector Databases
Language models have impressive general knowledge but lack specific, up-to-date information about your business, products, and customers. Retrieval-Augmented Generation (RAG) bridges this gap by retrieving relevant documents from a knowledge base and injecting them into the agent's context before it generates a response.
Vector databases are purpose-built for the similarity search that RAG requires. They store document embeddings — dense numerical representations of text meaning — and efficiently retrieve the most semantically similar documents for any query. When a user asks your agent a question, the system embeds the question, searches the vector database for similar content, and provides the top results as context for the agent to reason with.
This guide covers the full pipeline: embedding generation, index design, hybrid search strategies, RAG agent patterns, vector database comparison, and chunking strategies.
Embedding Generation
Embeddings convert text into fixed-length numerical vectors where semantic similarity in text space maps to geometric proximity in vector space. Documents about similar topics produce embeddings that are close together.
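The "geometric proximity" idea can be shown with cosine similarity, the standard comparison metric for embeddings. The three-dimensional vectors below are toy stand-ins (real embedding models produce hundreds or thousands of dimensions), but the math is the same:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: closer to 1.0 = more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" for illustration only
billing_doc = [0.9, 0.1, 0.2]
invoice_doc = [0.8, 0.2, 0.1]
hiking_doc = [0.1, 0.9, 0.3]

print(cosine_similarity(billing_doc, invoice_doc))  # high: related topics
print(cosine_similarity(billing_doc, hiking_doc))   # lower: unrelated topics
```

Vector databases compute exactly this kind of comparison, but over millions of vectors using approximate nearest-neighbor indexes rather than a brute-force loop.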
Choosing an Embedding Model
| Model | Dimensions | Context Window | Speed | Quality |
|---|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | 8,191 tokens | Fast (API) | Excellent |
| OpenAI text-embedding-3-small | 1536 | 8,191 tokens | Fast (API) | Good |
| Cohere embed-v3 | 1024 | 512 tokens | Fast (API) | Very Good |
| BGE-large-en-v1.5 | 1024 | 512 tokens | Self-hosted | Very Good |
| all-MiniLM-L6-v2 | 384 | 256 tokens | Self-hosted, fast | Acceptable |
For production agent systems, OpenAI's text-embedding-3-large provides the best quality-to-convenience ratio. For self-hosted deployments where data cannot leave your infrastructure, BGE-large or similar open-source models are the standard choice.
Generating Embeddings
```python
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def generate_embedding(text: str) -> list[float]:
    response = await client.embeddings.create(
        model="text-embedding-3-large",
        input=text,
    )
    return response.data[0].embedding

async def generate_embeddings_batch(
    texts: list[str],
    batch_size: int = 100,
) -> list[list[float]]:
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = await client.embeddings.create(
            model="text-embedding-3-large",
            input=batch,
        )
        all_embeddings.extend([d.embedding for d in response.data])
    return all_embeddings
```
Index Design for Agent Workloads
Vector database index design depends on your query patterns, data volume, and latency requirements.
Metadata Filtering
Agent queries are rarely pure similarity search. They typically combine semantic similarity with metadata filters — "find documents similar to this query that were published in the last 30 days and belong to the 'billing' category."
```python
# Pinecone example with metadata filtering
results = index.query(
    vector=query_embedding,
    top_k=10,
    filter={
        "category": {"$eq": "billing"},
        "published_date": {"$gte": "2026-01-01"},
        "document_type": {"$in": ["faq", "guide", "policy"]},
    },
    include_metadata=True,
)
```
Design your metadata schema upfront. Common metadata fields for agent knowledge bases include document category or type, publication and last-updated dates, source system, access level or tenant ID, and language.
Namespace Separation
For multi-agent systems, use namespaces or separate collections to isolate different knowledge domains. A customer support agent's knowledge base should not be mixed with an internal HR agent's knowledge base.
Hybrid Search: Keyword + Semantic
Pure semantic search sometimes misses exact matches. If a user asks about "order #12345", the semantic embedding captures the concept of "asking about an order" but may not prioritize the exact order number match. Hybrid search combines semantic similarity with keyword (BM25) matching for better results.
```python
import asyncio

class HybridSearcher:
    def __init__(self, vector_store, keyword_index):
        self.vector_store = vector_store
        self.keyword_index = keyword_index

    async def search(
        self,
        query: str,
        query_embedding: list[float],
        top_k: int = 10,
        semantic_weight: float = 0.7,
        keyword_weight: float = 0.3,
    ) -> list[dict]:
        # Retrieve from both indexes in parallel
        semantic_results, keyword_results = await asyncio.gather(
            self.vector_store.query(query_embedding, top_k=top_k * 2),
            self.keyword_index.search(query, top_k=top_k * 2),
        )

        # Score fusion using Reciprocal Rank Fusion (RRF)
        scores: dict[str, float] = {}
        for rank, result in enumerate(semantic_results):
            doc_id = result["id"]
            scores[doc_id] = scores.get(doc_id, 0) + (
                semantic_weight / (rank + 60)
            )
        for rank, result in enumerate(keyword_results):
            doc_id = result["id"]
            scores[doc_id] = scores.get(doc_id, 0) + (
                keyword_weight / (rank + 60)
            )

        # Sort by fused score and return the top_k documents
        ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
        return [{"id": doc_id, "score": score} for doc_id, score in ranked[:top_k]]
```
RAG Agent Patterns
There are several architectural patterns for integrating RAG into agent systems.
Tool-Based RAG
The agent has a "search knowledge base" tool that it calls when it needs information. This is the most flexible pattern because the agent decides when to search and what to search for.
```python
async def search_knowledge_base(
    query: str,
    category: str | None = None,
    max_results: int = 5,
) -> list[dict]:
    """
    Search the company knowledge base for relevant information.

    Args:
        query: Natural language search query
        category: Optional filter by category (billing, technical, policy)
        max_results: Number of results to return (1-10)
    """
    embedding = await generate_embedding(query)

    filters = {}
    if category:
        filters["category"] = {"$eq": category}

    results = await vector_store.query(
        vector=embedding,
        top_k=max_results,
        filter=filters if filters else None,
        include_metadata=True,
    )
    return [
        {
            "title": r.metadata["title"],
            "content": r.metadata["content"],
            "source": r.metadata["source_url"],
            "relevance_score": r.score,
        }
        for r in results.matches
    ]
```
Automatic RAG (Pre-Retrieval)
Every user message triggers a retrieval step before the agent processes it. The retrieved documents are injected into the system prompt or user context automatically.
```python
async def process_with_rag(
    user_message: str,
    conversation_history: list[dict],
    agent_system_prompt: str,
) -> str:
    # Always retrieve context before the agent sees the message
    embedding = await generate_embedding(user_message)
    context_docs = await vector_store.query(vector=embedding, top_k=5)

    # Build the augmented prompt from sufficiently relevant documents
    context_block = "\n\n".join(
        f"[Source: {doc.metadata['title']}]\n{doc.metadata['content']}"
        for doc in context_docs.matches
        if doc.score > 0.7  # Relevance threshold
    )

    augmented_system = f"""{agent_system_prompt}

## Relevant Knowledge Base Context

{context_block}

Use the above context to answer the user's question. If the context
does not contain relevant information, say so rather than guessing.
"""

    response = await llm.chat(
        system=augmented_system,
        messages=conversation_history + [{"role": "user", "content": user_message}],
    )
    return response
```
Agentic RAG with Re-Ranking
The agent performs an initial retrieval, evaluates the results, and optionally refines the query and searches again. This iterative approach handles complex questions that require multiple retrieval passes.
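The control loop behind this pattern can be sketched as a retrieve-evaluate-refine cycle. The `search` and `refine_query` callables below are stand-ins for your retrieval backend and an LLM-powered query reformulation step, not any particular vendor's implementation:

```python
import asyncio
from typing import Awaitable, Callable

SearchFn = Callable[[str], Awaitable[list[dict]]]
RefineFn = Callable[[str, list[dict]], Awaitable[str]]

async def iterative_retrieve(
    query: str,
    search: SearchFn,
    refine_query: RefineFn,
    min_score: float = 0.7,
    max_passes: int = 3,
) -> list[dict]:
    """Retrieve, evaluate, and re-query until results clear the relevance bar."""
    results: list[dict] = []
    for _ in range(max_passes):
        results = await search(query)
        # Good enough? Stop iterating.
        if results and results[0]["score"] >= min_score:
            return results
        # Otherwise reformulate the query (e.g. via an LLM) and try again
        query = await refine_query(query, results)
    return results  # best effort after max_passes
```

Capping the number of passes is important in production: without `max_passes`, a query with no good answer in the knowledge base would loop indefinitely, burning latency and LLM tokens.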
CallSphere's IT helpdesk RAG system uses this agentic approach. When a support ticket comes in, the agent first searches for similar resolved tickets, evaluates whether the resolutions are applicable, and if not, searches the technical documentation with a refined query derived from its analysis of why the initial results were insufficient.
Vector Database Comparison
| Feature | Pinecone | Weaviate | Chroma | pgvector |
|---|---|---|---|---|
| Hosting | Managed cloud | Self-hosted or cloud | Self-hosted or embedded | PostgreSQL extension |
| Scale | Billions of vectors | Hundreds of millions | Millions | Millions |
| Hybrid search | Sparse + dense | BM25 + vector built-in | Basic metadata | Full PostgreSQL text search |
| Metadata filtering | Rich filters | GraphQL filters | Where clauses | SQL WHERE |
| Latency (p99) | < 50ms | < 100ms | < 50ms (embedded) | Varies by index type |
| Pricing | Per-usage | Free (self-hosted) | Free (open-source) | Free (PostgreSQL) |
| Best for | Production SaaS | Feature-rich self-hosted | Prototyping, small scale | Teams already on PostgreSQL |
Pinecone is the safe choice for production SaaS applications. It is fully managed, scales automatically, and provides consistent low-latency queries.
Weaviate is ideal for teams that want rich features (built-in hybrid search, GraphQL API) and are comfortable managing infrastructure.
Chroma is excellent for prototyping and small-scale applications. Its embedded mode means zero infrastructure overhead during development.
pgvector is the pragmatic choice for teams already running PostgreSQL. It avoids adding a new database to your stack and supports vector search through familiar SQL queries. Performance is adequate for knowledge bases under a few million documents.
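With pgvector, similarity search is just SQL: the `<=>` operator computes cosine distance, so ordering by it returns nearest neighbors. The query below is a sketch against a hypothetical `documents` table with an `embedding` vector column; you would execute it through your usual PostgreSQL driver with the query embedding and category as bound parameters:

```python
def pgvector_query_sql(top_k: int = 5) -> str:
    """SQL for cosine-distance nearest-neighbor search with pgvector.

    Assumes a hypothetical `documents` table with an `embedding vector`
    column; `<=>` is pgvector's cosine distance operator.
    """
    return f"""
        SELECT id, title, content,
               1 - (embedding <=> %(query_embedding)s::vector) AS similarity
        FROM documents
        WHERE category = %(category)s
        ORDER BY embedding <=> %(query_embedding)s::vector
        LIMIT {int(top_k)}
    """
```

Note how the metadata filter is a plain SQL `WHERE` clause and can use any existing index, which is the main ergonomic win over a separate vector database.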
Chunking Strategies
How you split documents into chunks for embedding significantly impacts retrieval quality.
Fixed-Size Chunking
Split documents into chunks of N tokens with M token overlap. Simple but ignores document structure.
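A minimal sketch of the sliding-window approach, using a pre-tokenized list (a real pipeline would use the embedding model's own tokenizer; whitespace tokens stand in here):

```python
def fixed_size_chunks(
    tokens: list[str],
    chunk_size: int = 200,
    overlap: int = 40,
) -> list[str]:
    """Split a token list into chunks of chunk_size, sharing overlap tokens."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # last window already covers the tail
    return chunks
```

The overlap exists so that a sentence straddling a chunk boundary still appears whole in at least one chunk; without it, boundary-spanning content is unretrievable.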
Semantic Chunking
Split on natural boundaries — paragraphs, sections, headings. Preserves semantic coherence within chunks.
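One common implementation splits on blank lines and greedily merges short paragraphs until a chunk approaches a word budget, so chunks follow the author's own structure. This is a simplified sketch (a production version would split on headings and sentences too):

```python
import re

def semantic_chunks(document: str, max_words: int = 150) -> list[str]:
    """Split on paragraph boundaries, merging paragraphs up to max_words."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for para in paragraphs:
        words = len(para.split())
        # Flush the current chunk if adding this paragraph would overflow it
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Because paragraphs are never split mid-way, every chunk is a coherent unit of meaning, which generally embeds and retrieves better than an arbitrary token window.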
Hierarchical Chunking
Create embeddings at multiple granularities — document summaries, section summaries, and paragraph-level chunks. Search at the appropriate level based on query type. Broad questions match document summaries; specific questions match paragraph chunks.
```python
class HierarchicalChunker:
    def chunk_document(self, document: str, metadata: dict) -> list[dict]:
        chunks = []

        # Level 1: Document summary
        summary = self.summarize(document)
        chunks.append({
            "content": summary,
            "level": "document",
            "metadata": {**metadata, "chunk_level": "summary"},
        })

        # Level 2: Section-level chunks
        sections = self.split_by_headings(document)
        for i, section in enumerate(sections):
            chunks.append({
                "content": section["content"],
                "level": "section",
                "metadata": {
                    **metadata,
                    "chunk_level": "section",
                    "section_title": section["heading"],
                    "section_index": i,
                },
            })

            # Level 3: Paragraph-level chunks within sections
            paragraphs = self.split_by_paragraphs(section["content"])
            for j, para in enumerate(paragraphs):
                if len(para.split()) > 30:  # Skip very short paragraphs
                    chunks.append({
                        "content": para,
                        "level": "paragraph",
                        "metadata": {
                            **metadata,
                            "chunk_level": "paragraph",
                            "section_title": section["heading"],
                            "paragraph_index": j,
                        },
                    })
        return chunks
```
Frequently Asked Questions
What is the difference between RAG and fine-tuning for agent knowledge?
RAG retrieves relevant information at query time and injects it into the agent's context. Fine-tuning modifies the model's weights to encode knowledge directly. RAG is better for frequently changing information (product catalogs, policies, knowledge bases) because you update the vector database without retraining. Fine-tuning is better for teaching the model new behaviors, styles, or domain-specific reasoning patterns that do not change frequently.
How many documents can a vector database handle for a RAG agent?
Modern vector databases scale to billions of vectors. Pinecone and Weaviate handle hundreds of millions of vectors in production. For most agent applications, the practical limit is not the vector database but the quality of your chunking and embedding — poorly chunked documents produce poor retrieval regardless of database scale.
How do you evaluate RAG quality for agents?
Measure retrieval quality (are the right documents being retrieved?) and generation quality (does the agent use the retrieved context correctly?). Key metrics include recall at k (what fraction of relevant documents appear in the top k results), precision at k (what fraction of retrieved documents are relevant), faithfulness (does the agent's response align with the retrieved context?), and answer relevancy (does the response actually answer the question?).
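Recall at k and precision at k are simple to compute once you have a labeled set of relevant document IDs per query. A minimal sketch:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top-k results."""
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant) if relevant else 0.0

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k if k else 0.0

# Example: 2 of 3 relevant docs retrieved in the top 5
retrieved = ["d1", "d7", "d3", "d9", "d2"]
relevant = {"d1", "d2", "d4"}
# recall@5 = 2/3, precision@5 = 2/5
```

Faithfulness and answer relevancy, by contrast, usually require an LLM-as-judge or human evaluation, since they measure properties of the generated text rather than the ranked list.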
Should I use pgvector or a dedicated vector database?
Use pgvector if you are already running PostgreSQL and your knowledge base is under 2-3 million documents. The operational simplicity of not adding another database to your stack is significant. Switch to a dedicated vector database when you need higher scale, faster query latency at high concurrency, or advanced features like built-in hybrid search.
How does CallSphere use RAG in its agent products?
CallSphere's IT helpdesk product uses an agentic RAG architecture where the support agent searches a vector database of resolved tickets and technical documentation. The agent performs iterative retrieval — if initial results are insufficient, it reformulates the query and searches again. This approach achieves higher resolution rates than single-pass RAG because complex issues often require synthesizing information from multiple knowledge base articles.
CallSphere Team
Expert insights on AI voice agents and customer communication automation.