Building a Semantic Search Engine from Scratch: Embeddings, Indexing, and Retrieval
Learn how to build a complete semantic search engine from scratch using sentence embeddings, approximate nearest neighbor indexing, and a query processing pipeline that returns relevant results by meaning rather than keywords.
Why Semantic Search Matters
Traditional keyword search fails when users express the same idea with different words. Searching for "how to fix a leaking faucet" returns nothing if your documents say "repair a dripping tap." Semantic search solves this by comparing meaning rather than surface-level text, using dense vector embeddings to represent documents and queries in a shared mathematical space.
In this guide we will build a complete semantic search engine from the ground up: an embedding pipeline that converts documents into vectors, an approximate nearest neighbor (ANN) index for fast retrieval, and a query processing layer that ranks results by semantic similarity.
Architecture Overview
A semantic search system has three main components:
- Embedding Pipeline — converts raw text into fixed-dimension vectors using a pre-trained model.
- Vector Index — stores embeddings in a structure optimized for fast similarity lookups.
- Query Processor — embeds the user query, searches the index, and returns ranked results.
Step 1: The Embedding Pipeline
We use the sentence-transformers library with the all-MiniLM-L6-v2 model, which produces 384-dimensional vectors and balances speed with quality.
```python
from typing import Dict, List

import numpy as np
from sentence_transformers import SentenceTransformer


class EmbeddingPipeline:
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)
        self.dimension = self.model.get_sentence_embedding_dimension()

    def embed_documents(self, documents: List[Dict]) -> np.ndarray:
        """Embed a list of documents, combining title and body."""
        texts = []
        for doc in documents:
            combined = f"{doc['title']}. {doc['body']}"
            texts.append(combined)
        embeddings = self.model.encode(
            texts,
            show_progress_bar=True,
            batch_size=64,
            normalize_embeddings=True,
        )
        return np.array(embeddings, dtype="float32")

    def embed_query(self, query: str) -> np.ndarray:
        """Embed a single search query as a (1, dimension) array."""
        embedding = self.model.encode(
            [query],
            normalize_embeddings=True,
        )
        return np.array(embedding, dtype="float32")
```
The normalize_embeddings=True flag ensures all vectors have unit length, which means cosine similarity reduces to a simple dot product — a significant performance win.
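A quick sanity check of that claim, in plain NumPy with no model involved: once both vectors have unit length, the cosine denominator is 1, so the cosine similarity and the raw dot product agree.

```python
import numpy as np

# For unit-length vectors, cosine similarity reduces to the dot product.
rng = np.random.default_rng(0)
a = rng.standard_normal(384)
b = rng.standard_normal(384)
a /= np.linalg.norm(a)  # normalize to unit length
b /= np.linalg.norm(b)

cosine = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
assert np.isclose(cosine, a @ b)  # denominator is 1, so the two agree
```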
Step 2: Building the FAISS Index
FAISS (Facebook AI Similarity Search) provides highly optimized ANN index structures. For datasets under a million documents, an IndexFlatIP (exact inner product) works well. For larger corpora, IndexIVFFlat partitions the vector space into clusters and searches only the most promising clusters at query time, trading a little recall for a large speedup.
```python
import faiss
import numpy as np


class VectorIndex:
    def __init__(self, dimension: int, use_ivf: bool = False, nlist: int = 100):
        self.dimension = dimension
        if use_ivf:
            # IVF needs a coarse quantizer to assign vectors to clusters
            quantizer = faiss.IndexFlatIP(dimension)
            self.index = faiss.IndexIVFFlat(
                quantizer, dimension, nlist, faiss.METRIC_INNER_PRODUCT
            )
            self.needs_training = True
        else:
            self.index = faiss.IndexFlatIP(dimension)
            self.needs_training = False

    def build(self, embeddings: np.ndarray):
        """Train the index if required, then add the embeddings."""
        if self.needs_training:
            self.index.train(embeddings)
        self.index.add(embeddings)

    def search(self, query_embedding: np.ndarray, top_k: int = 10):
        """Return top_k most similar document indices and scores."""
        scores, indices = self.index.search(query_embedding, top_k)
        return scores[0], indices[0]

    def save(self, path: str):
        faiss.write_index(self.index, path)

    def load(self, path: str):
        self.index = faiss.read_index(path)
```
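If faiss feels like a black box, the flat index is easy to demystify: IndexFlatIP simply computes the inner product of the query against every stored vector and keeps the top-k. A minimal NumPy sketch of that computation (not the real library, and without its optimizations):

```python
import numpy as np

# Brute-force inner-product search: what a flat IP index computes.
rng = np.random.default_rng(1)
docs = rng.standard_normal((1_000, 384)).astype("float32")
docs /= np.linalg.norm(docs, axis=1, keepdims=True)  # unit length

# Build a query that is a lightly perturbed copy of document 42
query = docs[42] + 0.01 * rng.standard_normal(384).astype("float32")
query /= np.linalg.norm(query)

scores = docs @ query            # one inner product per document
top_k = np.argsort(-scores)[:5]  # best five, highest score first
print(top_k[0])  # → 42
```

This is also why the flat index is exact: every document is scored, so nothing can be missed. IVF trades that guarantee for speed by scoring only the documents in the nearest clusters.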
Step 3: The Query Processor
The query processor ties everything together. It embeds the user query, searches the index, and maps results back to document metadata.
```python
class SemanticSearchEngine:
    def __init__(self, documents: List[Dict]):
        self.documents = documents
        self.pipeline = EmbeddingPipeline()
        self.index = VectorIndex(self.pipeline.dimension)
        # Build the index once, up front
        embeddings = self.pipeline.embed_documents(documents)
        self.index.build(embeddings)

    def search(self, query: str, top_k: int = 5, min_score: float = 0.3):
        query_emb = self.pipeline.embed_query(query)
        scores, indices = self.index.search(query_emb, top_k)
        results = []
        for score, idx in zip(scores, indices):
            if idx == -1 or score < min_score:
                continue
            doc = self.documents[idx].copy()
            doc["score"] = float(score)
            results.append(doc)
        return results


# Usage
documents = [
    {"title": "Plumbing Repair Guide", "body": "How to fix a dripping tap..."},
    {"title": "Garden Watering Tips", "body": "Efficient irrigation methods..."},
]
engine = SemanticSearchEngine(documents)
results = engine.search("leaking faucet repair")
for r in results:
    print(f"{r['score']:.3f} — {r['title']}")
```
Searching for "leaking faucet repair" now correctly returns the plumbing guide even though those exact words never appear in the document.
Performance Considerations
For production deployments, consider these optimizations:
- Batch embedding — process documents in batches of 64-128 to maximize GPU utilization.
- Product quantization — use IndexIVFPQ to compress vectors from 1.5 KB to 48 bytes each, enabling billion-scale search.
- Pre-filtering — apply metadata filters before the vector search to reduce the candidate set.
- Caching — cache frequent query embeddings to avoid re-encoding.
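The caching point is a one-liner in practice: query strings are hashable, so functools.lru_cache can memoize the embedding call directly. A sketch with a hypothetical make_cached_embed helper and a stub embedder standing in for the real model:

```python
from functools import lru_cache


def make_cached_embed(embed_fn, maxsize=10_000):
    """Memoize an embedding function, keyed on the raw query string."""
    @lru_cache(maxsize=maxsize)
    def cached(query: str):
        return embed_fn(query)
    return cached


# Demo with a stub that counts how often the "model" is actually invoked
calls = {"n": 0}

def fake_embed(query):
    calls["n"] += 1
    return hash(query)  # stand-in for a real embedding vector

cached_embed = make_cached_embed(fake_embed)
cached_embed("leaking faucet repair")
cached_embed("leaking faucet repair")  # served from cache, no second call
print(calls["n"])  # → 1
```

In production you would wrap pipeline.embed_query the same way; just remember that repeated hits return the same array object, so treat cached embeddings as read-only.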
FAQ
What embedding model should I use for semantic search?
Start with all-MiniLM-L6-v2 for general English text. It offers an excellent quality-to-speed ratio with 384 dimensions. For higher accuracy at the cost of speed, use all-mpnet-base-v2 (768 dimensions). For domain-specific needs like legal or medical text, fine-tune a base model on your domain corpus.
How does semantic search handle exact keyword matches?
Pure semantic search can sometimes miss exact matches that keyword search catches easily. The recommended approach is hybrid search: combine BM25 keyword scores with vector similarity scores using reciprocal rank fusion. This gives you the best of both worlds.
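Reciprocal rank fusion is simple enough to sketch here: each document scores 1/(k + rank) in every ranked list it appears in, with k = 60 as the commonly used constant, and the sums decide the final order. The function name and the tiny example rankings below are illustrative, not from any particular library:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc IDs: score(d) = sum of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


bm25 = ["d3", "d1", "d7"]    # keyword (BM25) ranking
vector = ["d1", "d5", "d3"]  # semantic (vector) ranking
fused = reciprocal_rank_fusion([bm25, vector])
print(fused)  # → ['d1', 'd3', 'd5', 'd7']
```

Note that d1 wins despite topping neither list: appearing near the top of both rankings beats a single first place, which is exactly the behavior you want from hybrid search.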
How many documents can FAISS handle on a single machine?
A flat index comfortably handles up to one million 384-dimensional vectors in about 1.5 GB of RAM. With product quantization (IndexIVFPQ), a single machine with 64 GB of RAM can index over 100 million documents while maintaining sub-10ms query latency.
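Both figures fall out of simple arithmetic, since a flat float32 index stores every vector verbatim:

```python
# Flat index: 1M vectors x 384 dims x 4 bytes per float32
n_docs, dim, bytes_per_float = 1_000_000, 384, 4
flat_gb = n_docs * dim * bytes_per_float / 1e9
print(flat_gb)  # → 1.536 (GB), the ~1.5 GB figure above

# IndexIVFPQ at 48 bytes per vector: 100M documents
pq_gb = 100_000_000 * 48 / 1e9
print(pq_gb)  # → 4.8 (GB), leaving ample headroom on a 64 GB machine
```

The PQ codes themselves fit easily; the remaining RAM budget goes to the IVF cluster structure, document metadata, and the OS.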
CallSphere Team
Expert insights on AI voice agents and customer communication automation.