
Semantic Search and Vector Databases: The Memory Layer for AI Agents

How vector databases and semantic search power AI agent memory, RAG systems, and knowledge retrieval with practical guidance on embedding models, indexing, and query strategies.

AI agents are only as capable as the information they can access. LLMs have broad general knowledge from training, but they lack access to private data, recent information, and domain-specific knowledge. Semantic search with vector databases bridges this gap by giving agents the ability to find relevant information based on meaning rather than keyword matching.

This capability underpins retrieval-augmented generation (RAG), agent long-term memory, and knowledge base search — three foundational patterns in production agent systems.

How Semantic Search Works

Embedding Models

Embedding models convert text into dense numerical vectors that capture semantic meaning. Similar texts produce vectors that are close together in the embedding space.

from openai import OpenAI

client = OpenAI()

response = client.embeddings.create(
    model="text-embedding-3-large",
    input="How do I reset my password?"
)
vector = response.data[0].embedding  # 3072-dimensional vector

Model | Dimensions | Max Tokens | Strengths
OpenAI text-embedding-3-large | 3072 | 8191 | Best general-purpose, adjustable dimensions
Cohere embed-v4 | 1024 | 512 | Strong multilingual support
Voyage voyage-3-large | 1024 | 32000 | Long document embedding
BGE-M3 (open source) | 1024 | 8192 | Free, competitive quality
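The "adjustable dimensions" of text-embedding-3-large come from Matryoshka-style training: the API accepts a `dimensions` parameter, and you can also truncate a full vector yourself as long as you renormalize it before cosine comparison. A minimal numpy sketch (the random vector stands in for a real embedding):

```python
import numpy as np

def truncate_embedding(vec, dims):
    """Truncate a Matryoshka-style embedding to its first `dims`
    components and renormalize so cosine similarity still behaves."""
    v = np.asarray(vec, dtype=np.float64)[:dims]
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

full = np.random.default_rng(0).normal(size=3072)  # stand-in embedding
short = truncate_embedding(full, 256)
print(short.shape, round(float(np.linalg.norm(short)), 6))
```

Shorter vectors trade a little retrieval quality for much cheaper storage and faster search.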

Distance Metrics

Given a query vector, the database finds the most similar stored vectors using distance metrics:

  • Cosine similarity: Measures the angle between vectors. Most common, works well with normalized embeddings.
  • Euclidean distance (L2): Measures absolute distance. Sensitive to vector magnitude.
  • Dot product: Fastest computation. Equivalent to cosine similarity for normalized vectors.
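The relationship between these metrics is easy to see in a few lines of numpy. Note how cosine similarity ignores magnitude, and how the dot product of two unit-normalized vectors equals their cosine similarity:

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity: the angle between vectors, magnitude-invariant
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction, twice the magnitude

print(cosine(a, b))             # ~1.0: maximally similar despite different norms
print(np.linalg.norm(a - b))    # Euclidean distance is nonzero: magnitude matters

# For unit-normalized vectors, the dot product equals cosine similarity
an, bn = a / np.linalg.norm(a), b / np.linalg.norm(b)
print(float(np.dot(an, bn)))    # same value as cosine(a, b), cheaper to compute
```

This is why many systems normalize embeddings at indexing time and use plain dot product at query time.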

Vector Database Options

Managed Services

  • Pinecone: Fully managed, serverless option with strong query performance. Good for teams that want to avoid infrastructure management.
  • Weaviate Cloud: Managed Weaviate with hybrid search (vector + keyword) built in.
  • MongoDB Atlas Vector Search: Vector search integrated into MongoDB, useful when your primary data store is already MongoDB.

Self-Hosted

  • pgvector (PostgreSQL): Adds vector operations to PostgreSQL. Ideal when you want to keep vector data alongside relational data without adding a new database.
  • Qdrant: Purpose-built vector database with advanced filtering and payload management.
  • Chroma: Lightweight, developer-friendly, commonly used for prototyping.
  • Milvus: High-performance, distributed vector database for large-scale deployments.

Choosing Between Them

For most teams starting out, pgvector is the pragmatic choice if you already use PostgreSQL — one fewer database to manage. Pinecone is appropriate when you want zero infrastructure overhead. Qdrant or Milvus make sense at scale when query performance and advanced filtering are critical.


RAG Architecture with Vector Databases

The standard RAG pipeline:

  1. Indexing (offline): Chunk documents, generate embeddings, store in vector database with metadata
  2. Retrieval (online): Embed the user query, search for similar chunks, return top-K results
  3. Generation (online): Feed retrieved chunks as context to the LLM along with the user query
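The three steps above can be sketched end to end. This is a deliberately minimal in-memory illustration: `toy_embed` is a bag-of-words stand-in for a real embedding model, and the "generation" step just assembles the prompt you would send to the LLM:

```python
import math
import re
from collections import Counter

def toy_embed(text):
    # Stand-in for a real embedding model: word-count vectors are
    # enough to demonstrate the pipeline mechanics
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# 1. Indexing (offline): embed chunks and store them with their text
docs = [
    "To reset your password, open Settings and choose Security.",
    "Billing invoices are emailed on the first of each month.",
    "Password resets require access to your registered email.",
]
index = [(d, toy_embed(d)) for d in docs]

# 2. Retrieval (online): embed the query, take top-K by similarity
query = "how do I reset my password"
qv = toy_embed(query)
top_k = sorted(index, key=lambda item: cosine(qv, item[1]), reverse=True)[:2]

# 3. Generation (online): feed retrieved chunks as context to the LLM
context = "\n".join(d for d, _ in top_k)
prompt = f"Answer using this context:\n{context}\n\nQuestion: {query}"
print(prompt)
```

A production system swaps `toy_embed` for a real embedding model, the list for a vector database, and sends the final prompt to an LLM.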

Chunking Strategies

How you split documents into chunks directly affects retrieval quality:

  • Fixed-size chunks (512-1024 tokens): Simple, consistent, but may split sentences or paragraphs
  • Semantic chunking: Split at paragraph or section boundaries to preserve meaning
  • Recursive splitting: Try larger chunks first, split smaller only when needed
  • Sliding window with overlap: Overlap of 10-20 percent prevents information loss at chunk boundaries
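A fixed-size chunker with sliding-window overlap can be sketched in a few lines. Chunk size is approximated in whitespace-separated words here; a real pipeline would count tokens with the embedding model's tokenizer:

```python
def chunk_text(text, chunk_size=512, overlap=64):
    """Split text into fixed-size word chunks with a sliding-window
    overlap, so content cut at a boundary appears in both chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

doc = " ".join(f"w{i}" for i in range(1200))
chunks = chunk_text(doc, chunk_size=512, overlap=64)
print(len(chunks))  # 3 chunks; each shares 64 words with the previous one
```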

Improving Retrieval Quality

  • Hybrid search: Combine vector similarity with keyword (BM25) search. Keyword search catches exact matches that embeddings may miss.
  • Re-ranking: Use a cross-encoder model to re-rank the top 20-50 results from the initial retrieval. Cross-encoders are more accurate than bi-encoders but too slow for first-stage retrieval.
  • Metadata filtering: Filter by date, source, category, or other metadata before or during vector search to narrow results.
  • Query expansion: Use the LLM to generate multiple search queries from the original question, then merge results.
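For hybrid search and query expansion, the separate result lists must be merged. A common technique is reciprocal rank fusion (RRF), which uses only each document's rank in each list, so it works even when vector and BM25 scores are not comparable. A minimal sketch, with k=60 as the conventional smoothing constant:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked result lists: each list contributes
    1 / (k + rank) per document; higher fused score is better."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc-3", "doc-7", "doc-1", "doc-9"]   # by cosine similarity
keyword_hits = ["doc-7", "doc-4", "doc-3"]           # by BM25
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)  # doc-7 first: it ranks near the top of both lists
```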

Agent Memory with Vector Databases

Beyond RAG, vector databases serve as long-term memory for agents:

  • Conversation history: Store past interactions with embeddings for retrieval when similar topics arise
  • Learned facts: Store information the agent has gathered during previous sessions
  • User preferences: Track user-specific context that should influence future interactions

# Illustrative pseudocode: embed() stands in for your embedding call,
# vector_db for your database client (a Pinecone-style API is assumed)

# Store a memory
memory_text = "User prefers Python code examples over JavaScript"
embedding = embed(memory_text)
vector_db.upsert(id="mem-001", vector=embedding, metadata={
    "text": memory_text,
    "user_id": "user-123",
    "created_at": "2026-03-05"
})

# Retrieve relevant memories, scoped to this user via metadata filtering
query_embedding = embed("Show me how to parse JSON")
memories = vector_db.query(vector=query_embedding, filter={"user_id": "user-123"}, top_k=5)

Vector databases are foundational infrastructure for the agentic AI stack. Understanding their capabilities and limitations is essential for building agents that can access and reason over large knowledge bases effectively.

Sources: Pinecone Documentation | pgvector GitHub | MTEB Leaderboard
