Semantic Search Autocomplete: AI-Powered Query Suggestions and Completion
Build an intelligent autocomplete system that suggests semantically relevant queries as users type, combining query embeddings with popularity weighting and user personalization for a superior search experience.
Beyond Prefix Matching
Traditional autocomplete systems use prefix matching: type "mach" and get "machine learning," "machine vision," "machining." This works for exact prefixes but fails when users phrase things differently. Typing "how to train" will never suggest "fine-tuning a neural network" with prefix matching, even though they express the same intent.
Semantic autocomplete uses embeddings to suggest queries that are semantically related to what the user has typed so far, regardless of prefix overlap. Combined with popularity signals and personalization, this creates an autocomplete experience that genuinely anticipates what users are looking for.
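To see why embeddings bridge the prefix gap, here is a toy illustration. The vectors are hand-made 3-d stand-ins for real model embeddings (a model like all-MiniLM-L6-v2 actually produces 384-d vectors), so the numbers are invented for demonstration only:

```python
import numpy as np

def unit(v):
    """Normalize a vector to unit length so dot product = cosine similarity."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

# Toy embeddings: related queries point in similar directions
queries = {
    "how to train":                 unit([0.9, 0.4, 0.1]),
    "fine-tuning a neural network": unit([0.8, 0.5, 0.2]),
    "chocolate cake recipe":        unit([0.1, 0.2, 0.9]),
}

partial = queries["how to train"]
for text, emb in queries.items():
    print(f"{text:32s} cosine={float(np.dot(partial, emb)):.3f}")
```

"fine-tuning a neural network" scores far above "chocolate cake recipe" even though neither shares a prefix with "how to train" — exactly the behavior prefix matching cannot provide.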
Building the Suggestion Index
The suggestion index stores previously successful queries along with their embeddings and popularity scores.
```python
from dataclasses import dataclass, field
from typing import List, Optional, Dict
from sentence_transformers import SentenceTransformer
import numpy as np
import time


@dataclass
class QuerySuggestion:
    text: str
    count: int = 0            # how many times this query was searched
    click_rate: float = 0.0   # fraction of searches that led to a click
    last_used: float = 0.0
    categories: List[str] = field(default_factory=list)


class SuggestionIndex:
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)
        self.suggestions: List[QuerySuggestion] = []
        self.embeddings: Optional[np.ndarray] = None
        self.text_to_idx: Dict[str, int] = {}

    def build(self, suggestions: List[QuerySuggestion]):
        """Embed and index all suggestions."""
        self.suggestions = suggestions
        self.text_to_idx = {
            s.text.lower(): i for i, s in enumerate(suggestions)
        }
        texts = [s.text for s in suggestions]
        self.embeddings = self.model.encode(
            texts,
            normalize_embeddings=True,
            batch_size=128,
            show_progress_bar=True,
        )
        print(f"Indexed {len(suggestions)} suggestions")

    def add_suggestion(self, suggestion: QuerySuggestion):
        """Add a single new suggestion to the index."""
        embedding = self.model.encode(
            [suggestion.text], normalize_embeddings=True
        )
        idx = len(self.suggestions)
        self.suggestions.append(suggestion)
        self.text_to_idx[suggestion.text.lower()] = idx
        if self.embeddings is None:
            self.embeddings = embedding
        else:
            self.embeddings = np.vstack([self.embeddings, embedding])

    def record_search(self, query_text: str, had_click: bool):
        """Update statistics when a query is executed."""
        key = query_text.lower().strip()
        if key in self.text_to_idx:
            idx = self.text_to_idx[key]
            s = self.suggestions[idx]
            s.count += 1
            # Running mean of clicks over all searches of this query
            s.click_rate = (
                s.click_rate * (s.count - 1) + (1.0 if had_click else 0.0)
            ) / s.count
            s.last_used = time.time()
        else:
            self.add_suggestion(QuerySuggestion(
                text=query_text.strip(),
                count=1,
                click_rate=1.0 if had_click else 0.0,
                last_used=time.time(),
            ))
```
The Autocomplete Engine
The engine combines semantic similarity with popularity and recency signals to rank suggestions.
```python
class SemanticAutocomplete:
    def __init__(
        self,
        index: SuggestionIndex,
        semantic_weight: float = 0.5,
        popularity_weight: float = 0.3,
        recency_weight: float = 0.1,
        click_rate_weight: float = 0.1,
    ):
        self.index = index
        self.semantic_weight = semantic_weight
        self.popularity_weight = popularity_weight
        self.recency_weight = recency_weight
        self.click_rate_weight = click_rate_weight

    def suggest(
        self,
        partial_query: str,
        top_k: int = 8,
        prefix_boost: float = 0.2,
    ) -> List[Dict]:
        """Generate autocomplete suggestions for a partial query."""
        if len(partial_query.strip()) < 2:
            return self._popular_suggestions(top_k)
        if self.index.embeddings is None:
            return []  # nothing indexed yet

        query_emb = self.index.model.encode(
            [partial_query], normalize_embeddings=True
        )
        semantic_scores = np.dot(
            self.index.embeddings, query_emb.T
        ).flatten()

        # Normalize popularity scores
        counts = np.array(
            [s.count for s in self.index.suggestions], dtype=float
        )
        max_count = max(counts.max(), 1)
        popularity_scores = counts / max_count

        # Recency: exponential decay with a half-life of 7 days
        now = time.time()
        recency_scores = np.array([
            np.exp(-np.log(2) * (now - s.last_used) / (7 * 86400))
            if s.last_used > 0 else 0.0
            for s in self.index.suggestions
        ])

        click_scores = np.array(
            [s.click_rate for s in self.index.suggestions]
        )

        # Combined score
        combined = (
            self.semantic_weight * semantic_scores
            + self.popularity_weight * popularity_scores
            + self.recency_weight * recency_scores
            + self.click_rate_weight * click_scores
        )

        # Prefix boost for suggestions that start with the partial query
        partial_lower = partial_query.lower().strip()
        for i, s in enumerate(self.index.suggestions):
            if s.text.lower().startswith(partial_lower):
                combined[i] += prefix_boost

        top_indices = np.argsort(combined)[::-1][:top_k]
        results = []
        for idx in top_indices:
            s = self.index.suggestions[idx]
            results.append({
                "text": s.text,
                "score": float(combined[idx]),
                "semantic_score": float(semantic_scores[idx]),
                "popularity": int(s.count),
                "categories": s.categories,
            })
        return results

    def _popular_suggestions(self, top_k: int) -> List[Dict]:
        """Return the most popular suggestions when the query is too short."""
        sorted_suggestions = sorted(
            self.index.suggestions,
            key=lambda s: s.count,
            reverse=True,
        )
        return [
            {
                "text": s.text,
                "score": 0.0,
                "popularity": s.count,
                "categories": s.categories,
            }
            for s in sorted_suggestions[:top_k]
        ]
```
Personalized Suggestions
Personalization uses the user's search history to boost suggestions that align with their interests.
```python
class PersonalizedAutocomplete:
    def __init__(self, base_engine: SemanticAutocomplete):
        self.base = base_engine
        self.user_profiles: Dict[str, np.ndarray] = {}

    def update_profile(self, user_id: str, query: str):
        """Update a user's profile with their latest query."""
        query_emb = self.base.index.model.encode(
            [query], normalize_embeddings=True
        )[0]
        if user_id in self.user_profiles:
            # Exponential moving average
            alpha = 0.3
            self.user_profiles[user_id] = (
                alpha * query_emb
                + (1 - alpha) * self.user_profiles[user_id]
            )
            # Re-normalize to keep the profile a unit vector
            norm = np.linalg.norm(self.user_profiles[user_id])
            self.user_profiles[user_id] /= norm
        else:
            self.user_profiles[user_id] = query_emb

    def suggest(
        self,
        partial_query: str,
        user_id: Optional[str] = None,
        top_k: int = 8,
        personalization_weight: float = 0.15,
    ) -> List[Dict]:
        """Suggest with optional personalization."""
        base_results = self.base.suggest(partial_query, top_k=top_k * 2)
        if user_id and user_id in self.user_profiles:
            profile = self.user_profiles[user_id]
            index = self.base.index
            for result in base_results:
                # Reuse the precomputed suggestion embedding instead of
                # re-encoding the text on every request
                idx = index.text_to_idx[result["text"].lower()]
                sugg_emb = index.embeddings[idx]
                personal_score = float(np.dot(profile, sugg_emb))
                result["score"] += personalization_weight * personal_score
                result["personalized"] = True
            base_results.sort(key=lambda r: r["score"], reverse=True)
        return base_results[:top_k]
```
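The exponential-moving-average profile update can be demonstrated with toy 2-d unit vectors (real profiles would be model-sized embeddings): repeated queries about a new topic gradually rotate the profile toward that topic while re-normalization keeps it a unit vector.

```python
import numpy as np

def ema_update(profile, query_emb, alpha=0.3):
    """Blend the newest query into the profile, then re-normalize."""
    blended = alpha * query_emb + (1 - alpha) * profile
    return blended / np.linalg.norm(blended)

profile = np.array([1.0, 0.0])  # the user's history so far (toy 2-d)
query   = np.array([0.0, 1.0])  # a new, orthogonal interest

for _ in range(5):
    profile = ema_update(profile, query)
print(profile)  # drifted toward the new interest, still unit length
```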
Building the Fast API Endpoint
Autocomplete must be fast — users expect suggestions within 50-100ms. Here is a FastAPI endpoint that serves suggestions efficiently.
```python
from typing import Optional

from fastapi import FastAPI, Query
from fastapi.responses import JSONResponse

app = FastAPI()

# Initialize at startup
suggestion_index = SuggestionIndex()
autocomplete = SemanticAutocomplete(suggestion_index)
personalized = PersonalizedAutocomplete(autocomplete)


@app.get("/api/suggest")
async def get_suggestions(
    q: str = Query(..., min_length=1, max_length=200),
    user_id: Optional[str] = Query(None),
    limit: int = Query(8, ge=1, le=20),
):
    suggestions = personalized.suggest(
        partial_query=q,
        user_id=user_id,
        top_k=limit,
    )
    return JSONResponse(
        content={"suggestions": suggestions},
        headers={"Cache-Control": "public, max-age=60"},
    )


@app.post("/api/search-event")
async def record_search(
    query: str, user_id: Optional[str] = None, clicked: bool = False
):
    """Record a search execution for popularity tracking."""
    suggestion_index.record_search(query, clicked)
    if user_id:
        personalized.update_profile(user_id, query)
    return {"status": "recorded"}
```
Performance Optimizations
For sub-50ms response times:
- Cache embeddings — cache the partial query embedding for debounced requests where the user is still typing.
- Quantize the index — use int8 quantization for suggestion embeddings to reduce memory and speed up dot products.
- Limit candidate pool — only score the top 1000 suggestions by a cheap pre-filter (prefix match + popularity), then apply semantic scoring.
- Precompute popular — cache the top-10 popular suggestions so empty-query requests are instant.
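As a sketch of the quantization point above: symmetric int8 quantization maps each float32 coordinate into [-127, 127] with a single shared scale, cutting memory 4x while keeping dot products close. The sizes and random data below are illustrative, not tuned:

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 384)).astype(np.float32)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # unit vectors

# Symmetric int8 quantization: [-max_abs, max_abs] -> [-127, 127]
scale = float(np.abs(emb).max()) / 127.0
emb_i8 = np.round(emb / scale).astype(np.int8)

query = emb[0]
query_i8 = np.round(query / scale).astype(np.int32)

exact = emb @ query
# Accumulate in int32 to avoid overflow, then rescale back to float
approx = (emb_i8.astype(np.int32) @ query_i8) * scale**2

print(float(np.max(np.abs(exact - approx))))  # small approximation error
```

In production you would typically store `emb_i8` plus the scale, and only dequantize the final scores.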
FAQ
How do I prevent low-quality or offensive suggestions from appearing?
Maintain a blocklist of terms and patterns that should never appear in suggestions. Before adding any new query to the suggestion index, run it through a content filter. Additionally, set a minimum search count threshold (e.g., 3 searches) before a query becomes eligible for suggestions. This prevents one-off typos or adversarial queries from polluting the suggestion pool.
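The blocklist-plus-threshold gate described above is a few lines of code. The terms and threshold below are placeholders; real deployments use much larger curated lists and often a classifier as well:

```python
BLOCKLIST = {"badword"}  # placeholder; real lists are far larger
MIN_COUNT = 3            # searches required before a query is suggestible

def eligible(text: str, count: int) -> bool:
    """Gate a query before it may appear as a suggestion."""
    tokens = set(text.lower().split())
    if tokens & BLOCKLIST:
        return False          # contains a blocked term
    return count >= MIN_COUNT  # filters one-off typos and adversarial queries

print(eligible("machine learning", 5))  # eligible
print(eligible("machine learning", 1))  # below threshold
print(eligible("badword stuff", 10))    # blocked term
```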
How often should I rebuild the suggestion index vs updating it incrementally?
Use incremental updates (add_suggestion and record_search) for real-time responsiveness, and schedule a full rebuild weekly. The rebuild recalculates all embeddings (catching model improvements), prunes suggestions with zero searches in the last 30 days, and recomputes normalized popularity scores. This keeps the index clean and the scores well-calibrated without disrupting service.
How do I handle misspelled partial queries?
Combine semantic autocomplete with a lightweight spell-correction layer. Before embedding the partial query, check if it has a close match in your suggestion vocabulary using edit distance. If the corrected form has significantly higher popularity, use the corrected embedding. Libraries like symspellpy provide fast spell correction that adds under 1ms of latency. The semantic embedding itself is somewhat robust to minor typos since transformer tokenizers handle subword variations.
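For a dependency-free sketch of that pre-embedding correction step, the standard library's difflib can stand in for symspellpy (symspellpy is much faster at scale; the vocabulary here is a toy example):

```python
import difflib

# Toy suggestion vocabulary; in practice this comes from the index
vocabulary = ["machine learning", "neural network", "fine tuning"]
words = {w for phrase in vocabulary for w in phrase.split()}

def correct_partial(partial: str) -> str:
    """Snap each word of the partial query to its closest vocabulary word."""
    out = []
    for w in partial.lower().split():
        match = difflib.get_close_matches(w, words, n=1, cutoff=0.8)
        out.append(match[0] if match else w)  # keep the word if no close match
    return " ".join(out)

print(correct_partial("machne lerning"))  # -> "machine learning"
```

In the full pipeline you would compare the popularity of the corrected form against the raw form before deciding which one to embed.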
#Autocomplete #QuerySuggestions #SearchUX #SemanticSearch #Personalization #AgenticAI #LearnAI #AIEngineering