Building a Semantic Search Engine from Scratch: Embeddings, Indexing, and Retrieval
Learn how to build a complete semantic search engine from scratch using sentence embeddings, approximate nearest neighbor indexing, and a query processing pipeline that returns relevant results by meaning rather than keywords.
Why Semantic Search Matters
Traditional keyword search fails when users express the same idea with different words. Searching for "how to fix a leaking faucet" returns nothing if your documents say "repair a dripping tap." Semantic search solves this by comparing meaning rather than surface-level text, using dense vector embeddings to represent documents and queries in a shared mathematical space.
In this guide we will build a complete semantic search engine from the ground up: an embedding pipeline that converts documents into vectors, an approximate nearest neighbor (ANN) index for fast retrieval, and a query processing layer that ranks results by semantic similarity.
Architecture Overview
A semantic search system has three main components:
- Embedding Pipeline — converts raw text into fixed-dimension vectors using a pre-trained model.
- Vector Index — stores embeddings in a structure optimized for fast similarity lookups.
- Query Processor — embeds the user query, searches the index, and returns ranked results.
Step 1: The Embedding Pipeline
We use the sentence-transformers library with the all-MiniLM-L6-v2 model, which produces 384-dimensional vectors and balances speed with quality.
```python
from typing import Dict, List

import numpy as np
from sentence_transformers import SentenceTransformer


class EmbeddingPipeline:
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)
        self.dimension = self.model.get_sentence_embedding_dimension()

    def embed_documents(self, documents: List[Dict]) -> np.ndarray:
        """Embed a list of documents, combining title and body."""
        texts = []
        for doc in documents:
            combined = f"{doc['title']}. {doc['body']}"
            texts.append(combined)
        embeddings = self.model.encode(
            texts,
            show_progress_bar=True,
            batch_size=64,
            normalize_embeddings=True,
        )
        return np.array(embeddings, dtype="float32")

    def embed_query(self, query: str) -> np.ndarray:
        """Embed a single search query as a (1, dimension) array."""
        embedding = self.model.encode(
            [query],
            normalize_embeddings=True,
        )
        return np.array(embedding, dtype="float32")
```
The normalize_embeddings=True flag ensures all vectors have unit length, which means cosine similarity reduces to a simple dot product — a significant performance win.
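A quick sanity check of that claim, in plain NumPy with no model involved: once both vectors have unit length, the cosine denominator is 1, so the cosine similarity and the raw dot product agree.

```python
import numpy as np

# For unit-length vectors, cosine similarity reduces to the dot product.
rng = np.random.default_rng(0)
a = rng.standard_normal(384)
b = rng.standard_normal(384)
a /= np.linalg.norm(a)  # normalize to unit length
b /= np.linalg.norm(b)

cosine = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
assert np.isclose(cosine, a @ b)  # denominator is 1, so the two agree
```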
Step 2: Building the FAISS Index
FAISS (Facebook AI Similarity Search) provides highly optimized ANN index structures. For datasets under a million documents, an IndexFlatIP (exact inner product) works well. For larger corpora, IndexIVFFlat partitions the vector space into clusters and searches only the most promising clusters at query time, trading a little recall for a large speedup.
```python
import faiss
import numpy as np


class VectorIndex:
    def __init__(self, dimension: int, use_ivf: bool = False, nlist: int = 100):
        self.dimension = dimension
        if use_ivf:
            # IVF needs a coarse quantizer to assign vectors to clusters
            quantizer = faiss.IndexFlatIP(dimension)
            self.index = faiss.IndexIVFFlat(
                quantizer, dimension, nlist, faiss.METRIC_INNER_PRODUCT
            )
            self.needs_training = True
        else:
            self.index = faiss.IndexFlatIP(dimension)
            self.needs_training = False

    def build(self, embeddings: np.ndarray):
        """Train the index if required, then add the embeddings."""
        if self.needs_training:
            self.index.train(embeddings)
        self.index.add(embeddings)

    def search(self, query_embedding: np.ndarray, top_k: int = 10):
        """Return top_k most similar document indices and scores."""
        scores, indices = self.index.search(query_embedding, top_k)
        return scores[0], indices[0]

    def save(self, path: str):
        faiss.write_index(self.index, path)

    def load(self, path: str):
        self.index = faiss.read_index(path)
```
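If faiss feels like a black box, the flat index is easy to demystify: IndexFlatIP simply computes the inner product of the query against every stored vector and keeps the top-k. A minimal NumPy sketch of that computation (not the real library, and without its optimizations):

```python
import numpy as np

# Brute-force inner-product search: what a flat IP index computes.
rng = np.random.default_rng(1)
docs = rng.standard_normal((1_000, 384)).astype("float32")
docs /= np.linalg.norm(docs, axis=1, keepdims=True)  # unit length

# Build a query that is a lightly perturbed copy of document 42
query = docs[42] + 0.01 * rng.standard_normal(384).astype("float32")
query /= np.linalg.norm(query)

scores = docs @ query            # one inner product per document
top_k = np.argsort(-scores)[:5]  # best five, highest score first
print(top_k[0])  # → 42
```

This is also why the flat index is exact: every document is scored, so nothing can be missed. IVF trades that guarantee for speed by scoring only the documents in the nearest clusters.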
Step 3: The Query Processor
The query processor ties everything together. It embeds the user query, searches the index, and maps results back to document metadata.
```python
class SemanticSearchEngine:
    def __init__(self, documents: List[Dict]):
        self.documents = documents
        self.pipeline = EmbeddingPipeline()
        self.index = VectorIndex(self.pipeline.dimension)
        # Build the index once, up front
        embeddings = self.pipeline.embed_documents(documents)
        self.index.build(embeddings)

    def search(self, query: str, top_k: int = 5, min_score: float = 0.3):
        query_emb = self.pipeline.embed_query(query)
        scores, indices = self.index.search(query_emb, top_k)
        results = []
        for score, idx in zip(scores, indices):
            if idx == -1 or score < min_score:
                continue
            doc = self.documents[idx].copy()
            doc["score"] = float(score)
            results.append(doc)
        return results


# Usage
documents = [
    {"title": "Plumbing Repair Guide", "body": "How to fix a dripping tap..."},
    {"title": "Garden Watering Tips", "body": "Efficient irrigation methods..."},
]
engine = SemanticSearchEngine(documents)
results = engine.search("leaking faucet repair")
for r in results:
    print(f"{r['score']:.3f} — {r['title']}")
```
Searching for "leaking faucet repair" now correctly returns the plumbing guide even though those exact words never appear in the document.
Performance Considerations
For production deployments, consider these optimizations:
- Batch embedding — process documents in batches of 64-128 to maximize GPU utilization.
- Product quantization — use IndexIVFPQ to compress vectors from 1.5 KB to 48 bytes each, enabling billion-scale search.
- Pre-filtering — apply metadata filters before the vector search to reduce the candidate set.
- Caching — cache frequent query embeddings to avoid re-encoding.
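The caching point is a one-liner in practice: query strings are hashable, so functools.lru_cache can memoize the embedding call directly. A sketch with a hypothetical make_cached_embed helper and a stub embedder standing in for the real model:

```python
from functools import lru_cache


def make_cached_embed(embed_fn, maxsize=10_000):
    """Memoize an embedding function, keyed on the raw query string."""
    @lru_cache(maxsize=maxsize)
    def cached(query: str):
        return embed_fn(query)
    return cached


# Demo with a stub that counts how often the "model" is actually invoked
calls = {"n": 0}

def fake_embed(query):
    calls["n"] += 1
    return hash(query)  # stand-in for a real embedding vector

cached_embed = make_cached_embed(fake_embed)
cached_embed("leaking faucet repair")
cached_embed("leaking faucet repair")  # served from cache, no second call
print(calls["n"])  # → 1
```

In production you would wrap pipeline.embed_query the same way; just remember that repeated hits return the same array object, so treat cached embeddings as read-only.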
FAQ
What embedding model should I use for semantic search?
Start with all-MiniLM-L6-v2 for general English text. It offers an excellent quality-to-speed ratio with 384 dimensions. For higher accuracy at the cost of speed, use all-mpnet-base-v2 (768 dimensions). For domain-specific needs like legal or medical text, fine-tune a base model on your domain corpus.
How does semantic search handle exact keyword matches?
Pure semantic search can sometimes miss exact matches that keyword search catches easily. The recommended approach is hybrid search: combine BM25 keyword scores with vector similarity scores using reciprocal rank fusion. This gives you the best of both worlds.
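Reciprocal rank fusion is simple enough to sketch here: each document scores 1/(k + rank) in every ranked list it appears in, with k = 60 as the commonly used constant, and the sums decide the final order. The function name and the tiny example rankings below are illustrative, not from any particular library:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc IDs: score(d) = sum of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


bm25 = ["d3", "d1", "d7"]    # keyword (BM25) ranking
vector = ["d1", "d5", "d3"]  # semantic (vector) ranking
fused = reciprocal_rank_fusion([bm25, vector])
print(fused)  # → ['d1', 'd3', 'd5', 'd7']
```

Note that d1 wins despite topping neither list: appearing near the top of both rankings beats a single first place, which is exactly the behavior you want from hybrid search.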
How many documents can FAISS handle on a single machine?
A flat index comfortably handles up to one million 384-dimensional vectors in about 1.5 GB of RAM. With product quantization (IndexIVFPQ), a single machine with 64 GB of RAM can index over 100 million documents while maintaining sub-10ms query latency.
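Both figures fall out of simple arithmetic, since a flat float32 index stores every vector verbatim:

```python
# Flat index: 1M vectors x 384 dims x 4 bytes per float32
n_docs, dim, bytes_per_float = 1_000_000, 384, 4
flat_gb = n_docs * dim * bytes_per_float / 1e9
print(flat_gb)  # → 1.536 (GB), the ~1.5 GB figure above

# IndexIVFPQ at 48 bytes per vector: 100M documents
pq_gb = 100_000_000 * 48 / 1e9
print(pq_gb)  # → 4.8 (GB), leaving ample headroom on a 64 GB machine
```

The PQ codes themselves fit easily; the remaining RAM budget goes to the IVF cluster structure, document metadata, and the OS.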
CallSphere Team
Expert insights on AI voice agents and customer communication automation.