
Semantic Search Autocomplete: AI-Powered Query Suggestions and Completion

Build an intelligent autocomplete system that suggests semantically relevant queries as users type, combining query embeddings with popularity weighting and user personalization for a superior search experience.

Beyond Prefix Matching

Traditional autocomplete systems use prefix matching: type "mach" and get "machine learning," "machine vision," "machining." This works for exact prefixes but fails when users phrase things differently. Typing "how to train" will never suggest "fine-tuning a neural network" with prefix matching, even though they express the same intent.
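The gap is easy to demonstrate with a toy prefix matcher (the saved queries here are illustrative, not from any real index):

```python
# Minimal sketch of why prefix matching falls short: only literal string
# prefixes match, so semantically related intents are never surfaced.
SAVED_QUERIES = [
    "machine learning",
    "machine vision",
    "fine-tuning a neural network",
]

def prefix_suggest(partial: str) -> list[str]:
    """Return saved queries that literally start with the typed text."""
    p = partial.lower()
    return [q for q in SAVED_QUERIES if q.lower().startswith(p)]

print(prefix_suggest("mach"))          # the two "machine ..." queries
print(prefix_suggest("how to train"))  # nothing, despite related intent
```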

Semantic autocomplete uses embeddings to suggest queries that are semantically related to what the user has typed so far, regardless of prefix overlap. Combined with popularity signals and personalization, this creates an autocomplete experience that genuinely anticipates what users are looking for.

Building the Suggestion Index

The suggestion index stores previously successful queries along with their embeddings and popularity scores.

from dataclasses import dataclass, field
from typing import List, Optional, Dict
from sentence_transformers import SentenceTransformer
import numpy as np
import time

@dataclass
class QuerySuggestion:
    text: str
    count: int = 0          # how many times this query was searched
    click_rate: float = 0.0  # fraction of searches that led to a click
    last_used: float = 0.0
    categories: List[str] = field(default_factory=list)

class SuggestionIndex:
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)
        self.suggestions: List[QuerySuggestion] = []
        self.embeddings: Optional[np.ndarray] = None
        self.text_to_idx: Dict[str, int] = {}

    def build(self, suggestions: List[QuerySuggestion]):
        """Embed and index all suggestions."""
        self.suggestions = suggestions
        self.text_to_idx = {
            s.text.lower(): i for i, s in enumerate(suggestions)
        }
        texts = [s.text for s in suggestions]
        self.embeddings = self.model.encode(
            texts,
            normalize_embeddings=True,
            batch_size=128,
            show_progress_bar=True,
        )
        print(f"Indexed {len(suggestions)} suggestions")

    def add_suggestion(self, suggestion: QuerySuggestion):
        """Add a single new suggestion to the index."""
        embedding = self.model.encode(
            [suggestion.text], normalize_embeddings=True
        )
        idx = len(self.suggestions)
        self.suggestions.append(suggestion)
        self.text_to_idx[suggestion.text.lower()] = idx

        if self.embeddings is None:
            self.embeddings = embedding
        else:
            self.embeddings = np.vstack([self.embeddings, embedding])

    def record_search(self, query_text: str, had_click: bool):
        """Update statistics when a query is executed."""
        key = query_text.lower().strip()
        if key in self.text_to_idx:
            idx = self.text_to_idx[key]
            s = self.suggestions[idx]
            s.count += 1
            total = s.count
            s.click_rate = (
                (s.click_rate * (total - 1) + (1.0 if had_click else 0.0))
                / total
            )
            s.last_used = time.time()
        else:
            self.add_suggestion(QuerySuggestion(
                text=query_text.strip(),
                count=1,
                click_rate=1.0 if had_click else 0.0,
                last_used=time.time(),
            ))
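
The running-mean update in record_search can be checked in isolation. This standalone sketch (with a made-up event stream) confirms the incremental formula matches recomputing the mean from scratch:

```python
# Standalone check of the incremental click_rate update used in record_search:
# after n events, the running mean should equal the batch-computed mean.
def running_mean_update(mean: float, n: int, new_value: float) -> float:
    # Same formula as in record_search: mean over n events after adding one
    return (mean * (n - 1) + new_value) / n

events = [1, 0, 1, 1, 0]  # 1 = search led to a click, 0 = no click (toy data)
mean, count = 0.0, 0
for e in events:
    count += 1
    mean = running_mean_update(mean, count, float(e))

assert abs(mean - sum(events) / len(events)) < 1e-9  # 3 clicks / 5 searches
```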

The Autocomplete Engine

The engine combines semantic similarity with popularity and recency signals to rank suggestions.

class SemanticAutocomplete:
    def __init__(
        self,
        index: SuggestionIndex,
        semantic_weight: float = 0.5,
        popularity_weight: float = 0.3,
        recency_weight: float = 0.1,
        click_rate_weight: float = 0.1,
    ):
        self.index = index
        self.semantic_weight = semantic_weight
        self.popularity_weight = popularity_weight
        self.recency_weight = recency_weight
        self.click_rate_weight = click_rate_weight

    def suggest(
        self,
        partial_query: str,
        top_k: int = 8,
        prefix_boost: float = 0.2,
    ) -> List[Dict]:
        """Generate autocomplete suggestions for a partial query."""
        if len(partial_query.strip()) < 2:
            return self._popular_suggestions(top_k)
        if self.index.embeddings is None or not self.index.suggestions:
            return []

        query_emb = self.index.model.encode(
            [partial_query], normalize_embeddings=True
        )
        semantic_scores = np.dot(
            self.index.embeddings, query_emb.T
        ).flatten()

        # Normalize popularity scores
        counts = np.array([
            s.count for s in self.index.suggestions
        ], dtype=float)
        max_count = max(counts.max(), 1)
        popularity_scores = counts / max_count

        # Recency: exponential decay with a 7-day half-life
        now = time.time()
        half_life = 7 * 86400
        recency_scores = np.array([
            np.exp(-np.log(2) * (now - s.last_used) / half_life)
            if s.last_used > 0 else 0.0
            for s in self.index.suggestions
        ])

        click_scores = np.array([
            s.click_rate for s in self.index.suggestions
        ])

        # Combined score
        combined = (
            self.semantic_weight * semantic_scores
            + self.popularity_weight * popularity_scores
            + self.recency_weight * recency_scores
            + self.click_rate_weight * click_scores
        )

        # Prefix boost for suggestions that start with the partial query
        partial_lower = partial_query.lower().strip()
        for i, s in enumerate(self.index.suggestions):
            if s.text.lower().startswith(partial_lower):
                combined[i] += prefix_boost

        top_indices = np.argsort(combined)[::-1][:top_k]

        results = []
        for idx in top_indices:
            s = self.index.suggestions[idx]
            results.append({
                "text": s.text,
                "score": float(combined[idx]),
                "semantic_score": float(semantic_scores[idx]),
                "popularity": int(s.count),
                "categories": s.categories,
            })
        return results

    def _popular_suggestions(self, top_k: int) -> List[Dict]:
        """Return most popular suggestions when query is too short."""
        sorted_suggestions = sorted(
            enumerate(self.index.suggestions),
            key=lambda x: x[1].count,
            reverse=True,
        )
        return [
            {
                "text": s.text,
                "score": 0.0,
                "popularity": s.count,
                "categories": s.categories,
            }
            for _, s in sorted_suggestions[:top_k]
        ]
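
The recency signal deserves a quick standalone check: for the score to halve every 7 days, the exponent needs a log 2 factor (exp(-age / half_life) alone decays to 1/e, not 1/2, at the half-life):

```python
import math

# Sanity check of a true 7-day half-life decay: score is 1.0 when just used,
# 0.5 at 7 days, and 0.25 at 14 days.
HALF_LIFE = 7 * 86400  # seconds

def recency_score(age_seconds: float) -> float:
    return math.exp(-math.log(2) * age_seconds / HALF_LIFE)

assert abs(recency_score(0) - 1.0) < 1e-9
assert abs(recency_score(7 * 86400) - 0.5) < 1e-9
assert abs(recency_score(14 * 86400) - 0.25) < 1e-9
```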

Personalized Suggestions

Personalization uses the user's search history to boost suggestions that align with their interests.


class PersonalizedAutocomplete:
    def __init__(self, base_engine: SemanticAutocomplete):
        self.base = base_engine
        self.user_profiles: Dict[str, np.ndarray] = {}

    def update_profile(self, user_id: str, query: str):
        """Update user profile with their latest query."""
        query_emb = self.base.index.model.encode(
            [query], normalize_embeddings=True
        )[0]

        if user_id in self.user_profiles:
            # Exponential moving average
            alpha = 0.3
            self.user_profiles[user_id] = (
                alpha * query_emb
                + (1 - alpha) * self.user_profiles[user_id]
            )
            # Re-normalize
            norm = np.linalg.norm(self.user_profiles[user_id])
            self.user_profiles[user_id] /= norm
        else:
            self.user_profiles[user_id] = query_emb

    def suggest(
        self,
        partial_query: str,
        user_id: Optional[str] = None,
        top_k: int = 8,
        personalization_weight: float = 0.15,
    ) -> List[Dict]:
        """Suggest with optional personalization."""
        base_results = self.base.suggest(partial_query, top_k=top_k * 2)

        if user_id and user_id in self.user_profiles:
            profile = self.user_profiles[user_id]
            for result in base_results:
                # Reuse the embedding already stored in the index instead of
                # re-encoding every suggestion on each keystroke
                idx = self.base.index.text_to_idx.get(result["text"].lower())
                if idx is None:
                    continue
                sugg_emb = self.base.index.embeddings[idx]
                personal_score = float(np.dot(profile, sugg_emb))
                result["score"] += personalization_weight * personal_score
                result["personalized"] = True

            base_results.sort(key=lambda r: r["score"], reverse=True)

        return base_results[:top_k]
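
The profile update reduces to a normalized exponential moving average. A toy sketch on 2-D unit vectors (real profiles have the embedding model's dimensionality) shows the two properties that matter: the profile drifts toward new queries and stays unit length.

```python
import math

# Toy EMA profile update on 2-D unit vectors, mirroring update_profile:
# blend the new query in with weight alpha, then renormalize.
def ema_update(profile, query_emb, alpha=0.3):
    blended = [alpha * q + (1 - alpha) * p for p, q in zip(profile, query_emb)]
    norm = math.sqrt(sum(x * x for x in blended))
    return [x / norm for x in blended]

profile = [1.0, 0.0]    # history so far points entirely along one topic axis
new_query = [0.0, 1.0]  # a query about an orthogonal topic
profile = ema_update(profile, new_query)

assert abs(math.hypot(*profile) - 1.0) < 1e-9  # still a unit vector
assert profile[0] > profile[1] > 0             # drifted toward, not onto, the new topic
```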

Building the FastAPI Endpoint

Autocomplete must be fast — users expect suggestions within 50-100ms. Here is a FastAPI endpoint that serves suggestions efficiently.

from fastapi import FastAPI, Query
from fastapi.responses import JSONResponse

app = FastAPI()

# Initialize at startup
suggestion_index = SuggestionIndex()
autocomplete = SemanticAutocomplete(suggestion_index)
personalized = PersonalizedAutocomplete(autocomplete)

@app.get("/api/suggest")
async def get_suggestions(
    q: str = Query(..., min_length=1, max_length=200),
    user_id: Optional[str] = Query(None),
    limit: int = Query(8, ge=1, le=20),
):
    suggestions = personalized.suggest(
        partial_query=q,
        user_id=user_id,
        top_k=limit,
    )
    return JSONResponse(
        content={"suggestions": suggestions},
        headers={"Cache-Control": "public, max-age=60"},
    )

@app.post("/api/search-event")
async def record_search(query: str, user_id: Optional[str] = None, clicked: bool = False):
    """Record search execution for popularity tracking."""
    suggestion_index.record_search(query, clicked)
    if user_id:
        personalized.update_profile(user_id, query)
    return {"status": "recorded"}

Performance Optimizations

For sub-50ms response times:

  1. Cache embeddings — cache the partial query embedding for debounced requests where the user is still typing.
  2. Quantize the index — use int8 quantization for suggestion embeddings to reduce memory and speed up dot products.
  3. Limit candidate pool — only score the top 1000 suggestions by a cheap pre-filter (prefix match + popularity), then apply semantic scoring.
  4. Precompute popular — cache the top-10 popular suggestions so empty-query requests are instant.
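
Optimization 2 can be sketched with symmetric per-vector int8 quantization, assuming NumPy is available; the sizes and error threshold here are illustrative, not tuned:

```python
import numpy as np

# Sketch of int8 quantization for suggestion embeddings: store int8 values
# plus one float scale per vector (4x smaller than float32), and rescale the
# integer dot products back to approximate cosine scores.
rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 384)).astype(np.float32)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # unit vectors, as in the index

scales = np.abs(emb).max(axis=1, keepdims=True) / 127.0
emb_i8 = np.round(emb / scales).astype(np.int8)

query = emb[0]  # pretend the first suggestion is also the typed query
q_scale = np.abs(query).max() / 127.0
q_i8 = np.round(query / q_scale).astype(np.int8)

exact = emb @ query
approx = (emb_i8.astype(np.int32) @ q_i8.astype(np.int32)) * scales.ravel() * q_scale

assert np.argmax(approx) == np.argmax(exact) == 0  # same top result
assert float(np.max(np.abs(approx - exact))) < 0.1  # small quantization error
```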

FAQ

How do I prevent low-quality or offensive suggestions from appearing?

Maintain a blocklist of terms and patterns that should never appear in suggestions. Before adding any new query to the suggestion index, run it through a content filter. Additionally, set a minimum search count threshold (e.g., 3 searches) before a query becomes eligible for suggestions. This prevents one-off typos or adversarial queries from polluting the suggestion pool.
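
The gate described above can be sketched as a single predicate (the blocklist entries and threshold are placeholders, not a real content filter):

```python
# Eligibility gate for new suggestions: reject blocklisted phrases and
# queries that have not yet been searched often enough.
BLOCKLIST = {"spam term", "offensive phrase"}  # placeholder entries
MIN_SEARCH_COUNT = 3

def is_suggestion_eligible(text: str, search_count: int) -> bool:
    normalized = text.lower().strip()
    if any(blocked in normalized for blocked in BLOCKLIST):
        return False
    return search_count >= MIN_SEARCH_COUNT

assert not is_suggestion_eligible("buy spam term now", 10)  # blocked phrase
assert not is_suggestion_eligible("machine lerning", 1)     # likely a one-off typo
assert is_suggestion_eligible("machine learning", 5)
```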

How often should I rebuild the suggestion index vs updating it incrementally?

Use incremental updates (add_suggestion and record_search) for real-time responsiveness, and schedule a full rebuild weekly. The rebuild recalculates all embeddings (catching model improvements), prunes suggestions with zero searches in the last 30 days, and recomputes normalized popularity scores. This keeps the index clean and the scores well-calibrated without disrupting service.

How do I handle misspelled partial queries?

Combine semantic autocomplete with a lightweight spell-correction layer. Before embedding the partial query, check if it has a close match in your suggestion vocabulary using edit distance. If the corrected form has significantly higher popularity, use the corrected embedding. Libraries like symspellpy provide fast spell correction that adds under 1ms of latency. The semantic embedding itself is somewhat robust to minor typos since transformer tokenizers handle subword variations.
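
A minimal version of this layer, using a pure-Python edit distance for illustration (symspellpy would replace it in production) and a simplified rule that picks the closest vocabulary term, breaking ties by popularity:

```python
# Sketch of a pre-embedding spell check: correct the partial query to a
# nearby vocabulary term before encoding it. Vocabulary and counts are toy data.
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def maybe_correct(partial: str, vocab: dict[str, int], max_dist: int = 2) -> str:
    """Return the closest vocabulary term if it is within max_dist edits."""
    best = min(vocab, key=lambda term: (edit_distance(partial, term), -vocab[term]))
    if edit_distance(partial, best) <= max_dist:
        return best
    return partial

vocab = {"machine learning": 500, "machine vision": 120}
assert maybe_correct("machne learning", vocab) == "machine learning"
assert maybe_correct("quantum computing", vocab) == "quantum computing"
```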


#Autocomplete #QuerySuggestions #SearchUX #SemanticSearch #Personalization #AgenticAI #LearnAI #AIEngineering

CallSphere Team

Expert insights on AI voice agents and customer communication automation.

Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.