
Semantic Search for AI Agents: Embedding Models, Chunking Strategies, and Retrieval Optimization

Comprehensive guide to semantic search for AI agents covering embedding model selection, document chunking strategies, and retrieval optimization techniques for production systems.

Semantic Search Is the Foundation of Agent Intelligence

Every AI agent that accesses external knowledge relies on semantic search. When an agent needs to find relevant context — whether from a company knowledge base, product documentation, or historical conversation logs — it translates the query into a vector, searches for similar vectors, and retrieves the matching content. The quality of this retrieval directly determines the quality of the agent's response.

Three technical decisions control retrieval quality: the embedding model that converts text to vectors, the chunking strategy that splits documents into searchable units, and the retrieval pipeline that finds and ranks results. Getting any one of these wrong degrades the entire system. This guide provides the technical depth needed to make each decision correctly.

Embedding Model Selection

Embedding models are the neural networks that convert text into fixed-dimensional vectors. The choice of model affects semantic accuracy, supported languages, vector dimensionality (which affects storage cost and search speed), and maximum input length.
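The storage cost is simple arithmetic: at float32, each vector costs 4 bytes per dimension. A back-of-the-envelope comparison for a million vectors (raw vectors only — an ANN index such as HNSW adds overhead on top):

```python
def storage_gb(n_vectors: int, dimensions: int, bytes_per_dim: int = 4) -> float:
    """Raw vector storage in GB at float32 (4 bytes per dimension)."""
    return n_vectors * dimensions * bytes_per_dim / 1e9

million = 1_000_000
full = storage_gb(million, 3072)   # text-embedding-3-large at full size
small = storage_gb(million, 256)   # same model, reduced dimensions

print(f"{full:.2f} GB vs {small:.2f} GB")  # 12.29 GB vs 1.02 GB
```

A 12x reduction in storage (and a comparable speedup in brute-force similarity search) is why the dimension-reduction options discussed below matter in practice.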

Leading Models in 2026

OpenAI text-embedding-3-large (3072 dimensions, 8191 token max input). The current quality leader for English text. Supports dimension reduction via the dimensions parameter — you can request 1536 or even 256 dimensions for faster search with a modest quality drop. Pricing: $0.13 per million tokens.
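The dimensions parameter performs this reduction server-side; OpenAI documents that you can also approximate it client-side by truncating the full vector and re-normalizing (the text-embedding-3 models are trained so that prefixes of the vector remain meaningful). A minimal sketch:

```python
import numpy as np

def truncate_embedding(vec: list[float], dims: int) -> np.ndarray:
    """Keep the first `dims` components and re-normalize to unit length."""
    v = np.asarray(vec, dtype=np.float32)[:dims]
    return v / np.linalg.norm(v)

# Toy example: a fake 8-dim embedding reduced to 4 dims.
full = [0.5, 0.5, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0]
reduced = truncate_embedding(full, 4)
print(reduced.shape, float(np.linalg.norm(reduced)))  # (4,) 1.0
```

Re-normalization matters: cosine similarity assumes unit-length vectors, and a truncated prefix is no longer unit-length on its own.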

Cohere embed-v4 (1024 dimensions, 512 token max input). Excels at multilingual retrieval and has a unique search-document / search-query input type parameter that optimizes embeddings for asymmetric search. Best price-performance ratio for multilingual use cases.

Voyage AI voyage-3 (1024 dimensions, 16000 token max input). The long-context specialist. If your documents are long and you want to embed large chunks without splitting, Voyage is the strongest option. Also supports code embedding with a dedicated code model.

BGE-M3 (open source, 1024 dimensions, 8192 token max input). The best self-hosted option. Supports dense, sparse, and multi-vector retrieval in a single model. Run it on your own GPU with no API dependency.

from openai import OpenAI
import cohere
import numpy as np

class EmbeddingService:
    """Unified interface for multiple embedding providers."""

    def __init__(self, provider: str = "openai"):
        self.provider = provider
        if provider == "openai":
            self.client = OpenAI()
            self.model = "text-embedding-3-large"
            self.dimensions = 3072
        elif provider == "cohere":
            self.client = cohere.Client()
            self.model = "embed-v4"
            self.dimensions = 1024
        else:
            raise ValueError(f"Unsupported provider: {provider}")

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        if self.provider == "openai":
            response = self.client.embeddings.create(
                input=texts,
                model=self.model,
                dimensions=self.dimensions,
            )
            return [item.embedding for item in response.data]

        elif self.provider == "cohere":
            response = self.client.embed(
                texts=texts,
                model=self.model,
                input_type="search_document",
            )
            return response.embeddings

    def embed_query(self, text: str) -> list[float]:
        if self.provider == "openai":
            response = self.client.embeddings.create(
                input=[text],
                model=self.model,
                dimensions=self.dimensions,
            )
            return response.data[0].embedding

        elif self.provider == "cohere":
            response = self.client.embed(
                texts=[text],
                model=self.model,
                input_type="search_query",
            )
            return response.embeddings[0]

How to Benchmark for Your Domain

Do not trust generic benchmarks like MTEB. Embedding model performance varies dramatically by domain. A model that ranks first on general web text may rank third on legal documents or medical notes. Build a domain-specific evaluation set.

import numpy as np
from dataclasses import dataclass

@dataclass
class RetrievalTestCase:
    query: str
    relevant_doc_ids: list[str]

def evaluate_retrieval(
    embedding_service: EmbeddingService,
    test_cases: list[RetrievalTestCase],
    documents: dict[str, str],
    k: int = 5,
) -> dict:
    # Embed all documents
    doc_ids = list(documents.keys())
    doc_texts = list(documents.values())
    doc_embeddings = embedding_service.embed_documents(doc_texts)

    doc_matrix = np.array(doc_embeddings)
    doc_norms = np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    doc_matrix_normed = doc_matrix / doc_norms

    recall_at_k = []
    mrr_scores = []

    for tc in test_cases:
        query_vec = np.array(embedding_service.embed_query(tc.query))
        query_normed = query_vec / np.linalg.norm(query_vec)

        scores = doc_matrix_normed @ query_normed
        top_k_indices = np.argsort(scores)[-k:][::-1]
        top_k_ids = [doc_ids[i] for i in top_k_indices]

        # Recall@k
        relevant_found = len(
            set(top_k_ids) & set(tc.relevant_doc_ids)
        )
        recall_at_k.append(relevant_found / len(tc.relevant_doc_ids))

        # MRR
        for rank, doc_id in enumerate(top_k_ids, 1):
            if doc_id in tc.relevant_doc_ids:
                mrr_scores.append(1.0 / rank)
                break
        else:
            mrr_scores.append(0.0)

    return {
        "recall_at_k": np.mean(recall_at_k),
        "mrr": np.mean(mrr_scores),
    }
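To make the metrics concrete, here is a self-contained toy run of recall@k and MRR. A deterministic bag-of-words "embedder" stands in for a real model (purely illustrative — a real evaluation should use your actual embedding service and a test set drawn from production queries):

```python
import numpy as np

VOCAB = ["refund", "policy", "shipping", "times", "returns", "tracking"]

def toy_embed(text: str) -> np.ndarray:
    """Deterministic bag-of-words vector; a stand-in for a real model."""
    words = text.lower().split()
    v = np.array([words.count(w) for w in VOCAB], dtype=float)
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

docs = {
    "d1": "refund policy and returns",
    "d2": "shipping times and tracking",
}
doc_ids = list(docs)
doc_mat = np.stack([toy_embed(t) for t in docs.values()])

q = toy_embed("what is the refund policy")
scores = doc_mat @ q
ranked = [doc_ids[i] for i in np.argsort(scores)[::-1]]

relevant = {"d1"}
recall_at_k = len(set(ranked[:1]) & relevant) / len(relevant)
mrr = next((1.0 / (r + 1) for r, d in enumerate(ranked) if d in relevant), 0.0)
print(ranked, recall_at_k, mrr)  # ['d1', 'd2'] 1.0 1.0
```

The scoring and metric logic mirrors evaluate_retrieval above; only the embedder is fake.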

Chunking Strategies

Chunking is how you split documents into searchable units. Get it wrong and your retrieval system either finds irrelevant fragments (chunks too small) or buries the answer in noise (chunks too large). There is no universal best chunk size — it depends on your document types, query patterns, and embedding model.


Fixed-Size Chunking with Overlap

The simplest strategy: split text into chunks of N units with M units of overlap (the splitter below measures characters via length_function=len; token-based splitting works the same way). Overlap ensures that information at chunk boundaries is not lost.

from langchain.text_splitter import RecursiveCharacterTextSplitter

def fixed_size_chunking(
    text: str, chunk_size: int = 512, chunk_overlap: int = 50
) -> list[str]:
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", ". ", " ", ""],
        length_function=len,
    )
    return splitter.split_text(text)

Good defaults: 400-600 characters for Q&A retrieval, 800-1200 characters for summarization retrieval. Overlap should be 10-15% of chunk size.
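The same stride-and-overlap idea can be written without any dependency — a minimal character-based sketch (it ignores separator boundaries like sentence breaks, which RecursiveCharacterTextSplitter handles for you):

```python
def simple_chunk(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Fixed-size character chunks; each chunk starts chunk_size - overlap after the last."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    stride = chunk_size - overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), stride)]

text = "".join(str(i % 10) for i in range(1200))
chunks = simple_chunk(text)
print([len(c) for c in chunks])  # [500, 500, 300]
```

With chunk_size=500 and overlap=50, the last 50 characters of each chunk reappear at the start of the next, so a sentence straddling a boundary is intact in at least one chunk.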

Semantic Chunking

Instead of splitting at arbitrary token boundaries, semantic chunking splits where the topic changes. It measures embedding similarity between consecutive sentences and splits where similarity drops below a threshold.

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

def semantic_chunking(text: str) -> list[str]:
    embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
    chunker = SemanticChunker(
        embeddings,
        breakpoint_threshold_type="percentile",
        breakpoint_threshold_amount=85,
    )
    docs = chunker.create_documents([text])
    return [doc.page_content for doc in docs]

Semantic chunking produces chunks of variable size that align with topic boundaries. This improves retrieval precision because each chunk is topically coherent — you rarely get a chunk that starts talking about one thing and ends talking about another.
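The mechanism can be sketched in a few lines: compute cosine similarity between each pair of neighboring sentence embeddings, then break wherever similarity is unusually low. A toy version with hand-picked 2-D vectors standing in for real embeddings (a real implementation embeds each sentence with a model, as SemanticChunker does):

```python
import numpy as np

def split_at_topic_shifts(
    sentences: list[str], vectors: np.ndarray, percentile: float = 85
) -> list[list[str]]:
    """Break between sentences whose neighbor similarity is unusually low.

    percentile=85 means: treat the lowest 15% of neighbor similarities
    (i.e., the highest 85th-percentile distances) as topic breaks.
    """
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = np.sum(normed[:-1] * normed[1:], axis=1)  # cosine of each neighbor pair
    threshold = np.percentile(sims, 100 - percentile)
    chunks, current = [], [sentences[0]]
    for i, sim in enumerate(sims):
        if sim < threshold:
            chunks.append(current)
            current = []
        current.append(sentences[i + 1])
    chunks.append(current)
    return chunks

sentences = ["Cats purr.", "Cats nap.", "GPUs compute.", "GPUs are fast."]
vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
groups = split_at_topic_shifts(sentences, vectors)
print(groups)  # [['Cats purr.', 'Cats nap.'], ['GPUs compute.', 'GPUs are fast.']]
```

The sharp similarity drop between the second and third sentences becomes the split point, yielding one chunk per topic.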

Hierarchical Chunking

For long documents, use a two-level hierarchy: large parent chunks (1500-2000 tokens) contain small child chunks (300-500 tokens). Search is performed against child chunks for precision, but the parent chunk is returned for context. This gives you the best of both worlds.

from dataclasses import dataclass

@dataclass
class HierarchicalChunk:
    parent_id: str
    child_id: str
    parent_content: str
    child_content: str

def hierarchical_chunking(
    text: str,
    parent_size: int = 1500,
    child_size: int = 400,
    child_overlap: int = 50,
) -> list[HierarchicalChunk]:
    # Split into parent chunks
    parent_splitter = RecursiveCharacterTextSplitter(
        chunk_size=parent_size, chunk_overlap=0
    )
    parents = parent_splitter.split_text(text)

    # Split each parent into children
    child_splitter = RecursiveCharacterTextSplitter(
        chunk_size=child_size, chunk_overlap=child_overlap
    )

    chunks = []
    for p_idx, parent in enumerate(parents):
        children = child_splitter.split_text(parent)
        for c_idx, child in enumerate(children):
            chunks.append(
                HierarchicalChunk(
                    parent_id=f"parent-{p_idx}",
                    child_id=f"parent-{p_idx}-child-{c_idx}",
                    parent_content=parent,
                    child_content=child,
                )
            )
    return chunks
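At query time, child chunks are what the index matches, but parents are what the agent reads. A minimal lookup sketch — assuming search returns child IDs in the parent-{i}-child-{j} scheme above — that deduplicates parents while preserving rank order (the dataclass is repeated here so the snippet runs standalone):

```python
from dataclasses import dataclass

@dataclass
class HierarchicalChunk:  # same shape as the dataclass defined above
    parent_id: str
    child_id: str
    parent_content: str
    child_content: str

def resolve_parents(
    child_hits: list[str], chunks: list[HierarchicalChunk]
) -> list[str]:
    """Map matched child IDs to parent content, deduped in rank order."""
    by_child = {c.child_id: c for c in chunks}
    seen, parents = set(), []
    for child_id in child_hits:
        chunk = by_child[child_id]
        if chunk.parent_id not in seen:
            seen.add(chunk.parent_id)
            parents.append(chunk.parent_content)
    return parents

chunks = [
    HierarchicalChunk("parent-0", "parent-0-child-0", "P0 full text", "c0"),
    HierarchicalChunk("parent-0", "parent-0-child-1", "P0 full text", "c1"),
    HierarchicalChunk("parent-1", "parent-1-child-0", "P1 full text", "c2"),
]
hits = ["parent-0-child-1", "parent-1-child-0", "parent-0-child-0"]
print(resolve_parents(hits, chunks))  # ['P0 full text', 'P1 full text']
```

Deduplication matters: two children of the same parent often both match a query, and returning the parent twice wastes context window.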

Retrieval Optimization Techniques

Contextual Retrieval

Anthropic's contextual retrieval technique prepends a short context summary to each chunk before embedding. This dramatically improves retrieval because the chunk now carries context that would otherwise be lost during splitting.

async def add_context_to_chunks(
    chunks: list[str], full_document: str, llm
) -> list[str]:
    contextualized = []
    for chunk in chunks:
        prompt = f"""Given this document:
{full_document[:3000]}

And this specific chunk from it:
{chunk}

Write a 1-2 sentence context that explains where this chunk fits
in the overall document. Start with 'This chunk is about...'"""

        response = await llm.ainvoke(prompt)
        contextualized.append(f"{response.content}\n\n{chunk}")
    return contextualized
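The loop above awaits each LLM call sequentially; for large documents, the calls can run concurrently with asyncio.gather. A sketch of the pattern — the FakeLLM class is purely illustrative, standing in for any chat model exposing an .ainvoke() method:

```python
import asyncio

class FakeLLM:
    """Illustrative stand-in for a chat model with an .ainvoke() method."""

    class _Response:
        def __init__(self, content: str):
            self.content = content

    async def ainvoke(self, prompt: str):
        return FakeLLM._Response("This chunk is about the document's topic.")

async def add_context_concurrently(
    chunks: list[str], full_document: str, llm
) -> list[str]:
    async def contextualize(chunk: str) -> str:
        prompt = (
            f"Given this document:\n{full_document[:3000]}\n\n"
            f"And this chunk:\n{chunk}\n\n"
            "Write 1-2 sentences of context for the chunk."
        )
        response = await llm.ainvoke(prompt)
        return f"{response.content}\n\n{chunk}"

    # Launch all LLM calls at once; gather preserves input order.
    return await asyncio.gather(*(contextualize(c) for c in chunks))

out = asyncio.run(
    add_context_concurrently(["chunk A", "chunk B"], "full doc text", FakeLLM())
)
print(len(out))  # 2
```

In production, bound the concurrency (e.g., with an asyncio.Semaphore) so a large document does not exhaust your provider's rate limits.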

Query Expansion

Expand a single query into multiple formulations to improve recall. This is especially effective for short or ambiguous queries.

async def expand_query(query: str, llm, n_expansions: int = 3) -> list[str]:
    prompt = f"""Generate {n_expansions} alternative phrasings of this
search query. Each should capture the same intent but use different words.

Original query: {query}

Return only the alternative queries, one per line."""

    response = await llm.ainvoke(prompt)
    expansions = [
        q.strip()
        for q in response.content.strip().split("\n")
        if q.strip()
    ]
    return [query] + expansions[:n_expansions]

async def expanded_search(
    query: str, vector_store, llm, top_k: int = 5
) -> list:
    queries = await expand_query(query, llm)
    all_results = []
    seen_ids = set()

    for q in queries:
        results = vector_store.similarity_search(q, k=top_k)
        for r in results:
            # Dedupe key: content prefix, since stable IDs may not be available
            doc_id = r.page_content[:100]
            if doc_id not in seen_ids:
                all_results.append(r)
                seen_ids.add(doc_id)

    return all_results[:top_k]
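A common refinement over the first-come deduplication above is reciprocal rank fusion (RRF): score each document by the sum of 1/(k + rank) across the result lists of all expanded queries, then sort — documents that rank well under several phrasings rise to the top. A self-contained sketch over lists of document IDs:

```python
from collections import defaultdict

def reciprocal_rank_fusion(
    result_lists: list[list[str]], k: int = 60, top_k: int = 5
) -> list[str]:
    """Fuse ranked lists: score(doc) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.__getitem__, reverse=True)[:top_k]

# "b" appears in both lists, so it outranks "a", which tops only one list.
fused = reciprocal_rank_fusion([["a", "b", "c"], ["b", "d"]], top_k=3)
print(fused)  # ['b', 'a', 'd']
```

The constant k=60 is the conventional default from the RRF literature; it damps the influence of very high ranks so one lucky first-place hit does not dominate.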

Hypothetical Document Embeddings (HyDE)

Instead of embedding the query directly, generate a hypothetical answer and embed that. The hypothesis is closer in embedding space to actual documents than the question is.

async def hyde_search(
    query: str, vector_store, llm, embedding_service, top_k: int = 5
) -> list:
    # Generate hypothetical answer
    prompt = f"""Write a detailed paragraph that would answer this question.
Write as if it is a passage from a reference document.

Question: {query}"""

    response = await llm.ainvoke(prompt)
    hypothesis = response.content

    # Embed the hypothesis instead of the query
    hyp_vector = embedding_service.embed_query(hypothesis)

    # Search with hypothesis embedding
    results = vector_store.similarity_search_by_vector(
        hyp_vector, k=top_k
    )
    return results

Putting It All Together: Production Pipeline

class ProductionRetrievalPipeline:
    def __init__(self, config: dict):
        self.embedding = EmbeddingService(config["embedding_provider"])
        self.vector_store = config["vector_store"]
        self.llm = config["llm"]
        self.use_hyde = config.get("use_hyde", False)
        self.use_expansion = config.get("use_expansion", True)
        self.use_reranking = config.get("use_reranking", True)

    async def ingest(self, documents: list[dict]):
        for doc in documents:
            # Step 1: Chunk
            chunks = semantic_chunking(doc["content"])

            # Step 2: Add context
            chunks = await add_context_to_chunks(
                chunks, doc["content"], self.llm
            )

            # Step 3: Embed and store
            vectors = self.embedding.embed_documents(chunks)
            self.vector_store.add(
                vectors=vectors,
                documents=chunks,
                metadatas=[doc["metadata"]] * len(chunks),
            )

    async def search(self, query: str, top_k: int = 5) -> list[str]:
        # Step 1: Optional query expansion
        if self.use_expansion:
            results = await expanded_search(
                query, self.vector_store, self.llm, top_k=20
            )
        else:
            results = self.vector_store.similarity_search(query, k=20)

        # Step 2: Optional re-ranking. ReRanker and SearchResult are assumed
        # to be defined elsewhere (e.g., a cross-encoder reranker wrapper).
        if self.use_reranking:
            reranker = ReRanker()
            results = reranker.rerank(
                query,
                [SearchResult(content=r.page_content, metadata=r.metadata, score=0)
                 for r in results],
                top_k=top_k,
            )
            return [r.content for r in results]

        return [r.page_content for r in results[:top_k]]

FAQ

What chunk size should I use for my specific use case?

Start with 500 characters and test. For factual Q&A (customer support, documentation), smaller chunks (300-500 characters) work best because answers are typically contained in a single paragraph. For analytical queries (research, summarization), larger chunks (800-1500 characters) provide more context. The most reliable approach is to build a test set of 50 queries with known answers, then benchmark different chunk sizes against recall at k=5. Most teams find their optimal size between 400 and 800 characters.

How much does embedding model quality actually affect retrieval?

Significantly. In controlled benchmarks, the gap between the best and worst mainstream embedding models is 15-20% recall at k=5. However, the gap between the top 3 models is only 2-4%. This means the choice between OpenAI, Cohere, and Voyage matters much less than the choice between any of these and a cheap or outdated model. Where embedding model choice matters most is multilingual retrieval (Cohere leads) and long-document retrieval (Voyage leads).

Should I use semantic chunking or fixed-size chunking?

Semantic chunking produces higher-quality chunks but is slower (requires embedding every sentence to find breakpoints) and non-deterministic (different runs may produce different splits). Use semantic chunking when document quality varies and topics shift frequently within documents. Use fixed-size chunking for homogeneous documents (product specs, legal clauses, API documentation) where the structure is already consistent. For most production systems, fixed-size chunking with a well-tuned size and 10% overlap provides 90% of the quality at 10% of the cost.

How do I evaluate whether my retrieval pipeline is actually good enough?

Build a golden test set: 100 queries paired with the document chunks that contain the correct answer. Measure recall at k=5 (what percentage of queries have the answer in the top 5 results) and MRR (mean reciprocal rank — how high the first correct result appears). Target recall at k=5 above 85% and MRR above 0.6. If you fall short, the improvement priority is: (1) fix chunking, (2) add re-ranking, (3) try query expansion, (4) switch embedding models. Most retrieval failures are caused by bad chunking, not bad embeddings.


#SemanticSearch #Embeddings #Chunking #RetrievalOptimization #RAG #VectorSearch #AIAgents #LLM


Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
