Learn Agentic AI · 13 min read

Multi-Language Semantic Search: Cross-Lingual Retrieval with Multilingual Embeddings

Implement cross-lingual semantic search that lets users query in one language and retrieve results in any language, using multilingual embedding models that map all languages into a shared vector space.

Building search for a multilingual corpus traditionally requires maintaining separate indexes per language, implementing language detection, and often translating queries at runtime. This approach is fragile — translation introduces errors, language detection fails on short queries, and maintaining N separate pipelines is expensive.

Multilingual embedding models offer an elegant alternative: they map text from any supported language into the same vector space. A question in Japanese and its answer in English end up near each other, enabling true cross-lingual retrieval without any translation step.

Choosing a Multilingual Embedding Model

from sentence_transformers import SentenceTransformer
import numpy as np

# Model comparison for multilingual semantic search
MULTILINGUAL_MODELS = {
    "paraphrase-multilingual-MiniLM-L12-v2": {
        "languages": 50,
        "dimensions": 384,
        "speed": "fast",
        "quality": "good",
    },
    "paraphrase-multilingual-mpnet-base-v2": {
        "languages": 50,
        "dimensions": 768,
        "speed": "medium",
        "quality": "excellent",
    },
    "distiluse-base-multilingual-cased-v2": {
        "languages": 15,
        "dimensions": 512,
        "speed": "fast",
        "quality": "moderate",
    },
}

# For most use cases, this is the best balance
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

The paraphrase-multilingual-MiniLM-L12-v2 model supports 50 languages, produces 384-dimensional vectors, and runs efficiently on CPU. It maps semantically equivalent sentences in different languages to nearby points in vector space.
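Because indexing below uses normalize_embeddings=True, cosine similarity reduces to a plain dot product. A toy numpy sketch with made-up 3-dimensional vectors (real model output is 384-dimensional) illustrates why nearby points mean cross-lingual matches:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity; for unit-length vectors this is just the dot product."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings standing in for model output: an English
# sentence and its Japanese translation land close together in the
# shared space, so their cosine similarity is near 1.0.
en = np.array([0.80, 0.50, 0.33])
ja = np.array([0.78, 0.52, 0.35])
en /= np.linalg.norm(en)  # normalize_embeddings=True does this for you
ja /= np.linalg.norm(ja)

print(float(np.dot(en, ja)))  # dot product equals cosine similarity here
```

Normalizing at index time means the search step can use a single matrix-vector product instead of recomputing norms per query.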

Cross-Lingual Search Engine

from typing import List, Dict, Optional
import numpy as np
from sentence_transformers import SentenceTransformer

class MultilingualSearchEngine:
    def __init__(
        self, model_name: str = "paraphrase-multilingual-MiniLM-L12-v2"
    ):
        self.model = SentenceTransformer(model_name)
        self.documents: List[Dict] = []
        self.embeddings: Optional[np.ndarray] = None

    def index_documents(self, documents: List[Dict]):
        """Index documents in any language."""
        self.documents = documents
        texts = [
            f"{d.get('title', '')}. {d.get('body', '')}" for d in documents
        ]
        self.embeddings = self.model.encode(
            texts,
            normalize_embeddings=True,
            batch_size=64,
            show_progress_bar=True,
        )
        print(f"Indexed {len(documents)} documents across languages")

    def search(
        self,
        query: str,
        top_k: int = 10,
        language_filter: Optional[str] = None,
    ) -> List[Dict]:
        """Search in any language, retrieve results from all languages."""
        query_emb = self.model.encode(
            [query], normalize_embeddings=True
        )
        scores = np.dot(self.embeddings, query_emb.T).flatten()
        top_indices = np.argsort(scores)[::-1]

        results = []
        for idx in top_indices:
            if len(results) >= top_k:
                break
            doc = self.documents[idx]
            if language_filter and doc.get("language") != language_filter:
                continue
            result = doc.copy()
            result["score"] = float(scores[idx])
            results.append(result)
        return results

Demonstrating Cross-Lingual Retrieval

# Documents in multiple languages
documents = [
    {
        "title": "How to make pasta carbonara",
        "body": "Cook spaghetti, mix eggs with pecorino, combine with guanciale.",
        "language": "en",
    },
    {
        "title": "Comment faire des crêpes",
        "body": "Mélanger farine, œufs, lait. Cuire dans une poêle chaude.",
        "language": "fr",
    },
    {
        "title": "Wie man Brot backt",
        "body": "Mehl, Wasser, Hefe und Salz mischen. Teig kneten und backen.",
        "language": "de",
    },
    {
        "title": "Cómo hacer tortillas",
        "body": "Mezclar harina de maíz con agua y sal. Formar discos y cocinar.",
        "language": "es",
    },
]

engine = MultilingualSearchEngine()
engine.index_documents(documents)

# Search in English, find results in all languages
results = engine.search("recipe for bread")
for r in results:
    print(f"[{r['language']}] {r['score']:.3f} — {r['title']}")
# Output:
# [de] 0.742 — Wie man Brot backt
# [en] 0.531 — How to make pasta carbonara
# ...

The German bread-baking document ranks highest for the English query "recipe for bread" — no translation needed.
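Note the filter semantics in search(): language_filter skips non-matching documents while walking the full ranked list, rather than truncating the top-k, so you still get up to top_k matching results. That logic in isolation, applied to hypothetical pre-scored results:

```python
from typing import Dict, List, Optional

def filter_ranked(
    ranked: List[Dict], top_k: int, language: Optional[str] = None
) -> List[Dict]:
    """Keep the first top_k results, skipping non-matching languages."""
    out = []
    for doc in ranked:
        if len(out) >= top_k:
            break
        if language and doc["language"] != language:
            continue  # skip, don't stop -- later matches can still qualify
        out.append(doc)
    return out

# Hypothetical ranked output for the "recipe for bread" query
ranked = [
    {"title": "Wie man Brot backt", "language": "de", "score": 0.74},
    {"title": "How to make pasta carbonara", "language": "en", "score": 0.53},
    {"title": "Comment faire des crêpes", "language": "fr", "score": 0.41},
]
print(filter_ranked(ranked, top_k=2, language="en"))
# Only the English document survives; the German top hit is skipped.
```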

Translation vs Cross-Lingual Embeddings

When should you translate queries versus use cross-lingual embeddings directly?


from dataclasses import dataclass
from typing import List
@dataclass
class ApproachComparison:
    approach: str
    pros: List[str]
    cons: List[str]
    best_for: str

approaches = [
    ApproachComparison(
        approach="Cross-lingual embeddings (no translation)",
        pros=[
            "No translation API cost or latency",
            "Works for low-resource languages",
            "Single unified index",
        ],
        cons=[
            "5-10% quality drop vs same-language search",
            "Struggles with domain-specific terminology",
        ],
        best_for="General-purpose multilingual search",
    ),
    ApproachComparison(
        approach="Translate query, then monolingual search",
        pros=[
            "Highest retrieval quality per language",
            "Leverages best monolingual models",
        ],
        cons=[
            "Translation adds 100-500ms latency",
            "Translation errors propagate to search",
            "Requires separate index per language",
        ],
        best_for="High-stakes search where precision is critical",
    ),
    ApproachComparison(
        approach="Hybrid: cross-lingual + translate and re-rank",
        pros=[
            "Best of both approaches",
            "Cross-lingual provides recall, translation improves precision",
        ],
        cons=[
            "Most complex to implement and maintain",
            "Higher latency from translation step",
        ],
        best_for="Production systems with quality requirements",
    ),
]
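The hybrid approach can be sketched as: over-fetch candidates with cross-lingual retrieval for recall, translate the query once, then re-score and re-rank the candidates. Everything here is a placeholder: translate stands in for whatever translation API you use, rescore for your monolingual scorer, and blend_weight is an assumed tuning knob:

```python
from typing import Callable, Dict, List

def hybrid_search(
    query: str,
    retrieve: Callable[[str, int], List[Dict]],  # cross-lingual retriever
    rescore: Callable[[str, Dict], float],       # scores a doc for the translated query
    translate: Callable[[str], str],             # stand-in for a translation API
    top_k: int = 10,
    blend_weight: float = 0.5,                   # assumed blending knob
) -> List[Dict]:
    """Cross-lingual retrieval for recall, translated-query re-scoring for precision."""
    candidates = retrieve(query, top_k * 3)      # over-fetch so re-ranking has room
    translated = translate(query)
    for doc in candidates:
        doc["score"] = (
            (1 - blend_weight) * doc["score"]
            + blend_weight * rescore(translated, doc)
        )
    candidates.sort(key=lambda d: d["score"], reverse=True)
    return candidates[:top_k]

# Hypothetical stubs: retrieval favors "b", translated re-scoring favors "a".
docs = [{"title": "a", "score": 0.4}, {"title": "b", "score": 0.6}]
reranked = hybrid_search(
    "recette de pain",
    retrieve=lambda q, k: [dict(d) for d in docs],
    rescore=lambda t, d: 1.0 if d["title"] == "a" else 0.0,
    translate=lambda q: "bread recipe",
    top_k=2,
)
print([d["title"] for d in reranked])  # "a" overtakes "b" after blending
```

Translating the query once and re-scoring a small candidate set keeps the translation cost bounded, unlike translating every document at index time.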

Language-Aware Scoring

For better results, boost documents that match the query language while still returning cross-lingual results.

from langdetect import detect

def language_aware_search(
    engine: MultilingualSearchEngine,
    query: str,
    top_k: int = 10,
    same_language_boost: float = 0.1,
) -> List[Dict]:
    """Boost same-language results while preserving cross-lingual ones."""
    try:
        query_language = detect(query)
    except Exception:
        query_language = None

    results = engine.search(query, top_k=top_k * 2)

    for result in results:
        if query_language and result.get("language") == query_language:
            result["score"] += same_language_boost
            result["language_boosted"] = True

    results.sort(key=lambda r: r["score"], reverse=True)
    return results[:top_k]
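To see the boost reorder results, here is the boost-and-sort step applied to hypothetical pre-scored results (the engine and the langdetect call are omitted; the query is assumed French):

```python
results = [
    {"title": "Bread baking", "language": "en", "score": 0.62},
    {"title": "Comment faire des crêpes", "language": "fr", "score": 0.58},
]
query_language = "fr"        # what detect() would return for a French query
same_language_boost = 0.1

for r in results:
    if r["language"] == query_language:
        r["score"] += same_language_boost
        r["language_boosted"] = True

results.sort(key=lambda r: r["score"], reverse=True)
print([r["title"] for r in results])
# The boosted French document (0.68) now outranks the English one (0.62).
```

An additive boost preserves cross-lingual results that score well above the margin, whereas a hard filter would drop them entirely.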

FAQ

How well do multilingual models handle languages with non-Latin scripts like Chinese, Arabic, or Korean?

The paraphrase-multilingual-MiniLM-L12-v2 model handles these well because it was trained on parallel sentence pairs across 50 languages including Chinese, Arabic, Korean, Japanese, Hindi, and Thai. Performance is slightly lower for very low-resource languages like Swahili or Yoruba, but still usable for general-purpose search.

Can I mix languages within a single document?

Yes, multilingual models handle code-switched text (e.g., "I want to order ラーメン for dinner") reasonably well. The model captures the semantic meaning regardless of which languages are mixed. However, very long documents with extensive code-switching may lose some accuracy — in that case, consider splitting by language segment.
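Splitting by language segment can be roughly approximated with a script-based heuristic — the sketch below separates Latin from non-Latin runs using Unicode character names. It is deliberately crude (a real system would run per-sentence language detection instead), but it shows the shape of the preprocessing step:

```python
import unicodedata

def split_by_script(text: str) -> list:
    """Group consecutive characters into Latin vs. non-Latin segments."""
    segments, current, current_latin = [], [], None
    for ch in text:
        if not ch.isalpha():
            current.append(ch)  # punctuation/whitespace stays with its segment
            continue
        is_latin = "LATIN" in unicodedata.name(ch, "")
        if current_latin is not None and is_latin != current_latin:
            segments.append("".join(current).strip())
            current = []
        current.append(ch)
        current_latin = is_latin
    if current:
        segments.append("".join(current).strip())
    return [s for s in segments if s]

print(split_by_script("Order ラーメン for dinner"))
# Each segment can then be embedded separately.
```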

What is the embedding quality difference between multilingual and monolingual models?

On same-language benchmarks, monolingual English models like all-MiniLM-L6-v2 score about 5-10% higher than their multilingual counterparts on English text. The multilingual model sacrifices some per-language quality to achieve cross-lingual alignment. For most applications, this tradeoff is worthwhile because you get a single unified system.
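To quantify that tradeoff on your own corpus, a simple recall@k over labeled query-document pairs is enough. A minimal sketch, assuming you have relevance judgments; the document ids and labels below are made up:

```python
from typing import Dict, List, Set

def recall_at_k(
    rankings: Dict[str, List[str]],   # query -> ranked document ids
    relevant: Dict[str, Set[str]],    # query -> relevant document ids
    k: int = 10,
) -> float:
    """Fraction of relevant documents recovered in the top k, averaged over queries."""
    scores = []
    for query, ranked in rankings.items():
        hits = len(set(ranked[:k]) & relevant[query])
        scores.append(hits / len(relevant[query]))
    return sum(scores) / len(scores)

# Hypothetical labels: compare two systems on the same query set.
relevant = {"bread recipe": {"de-1"}}
multilingual_run = {"bread recipe": ["de-1", "en-3"]}
translated_run = {"bread recipe": ["en-3", "de-1"]}
print(recall_at_k(multilingual_run, relevant, k=1))
print(recall_at_k(translated_run, relevant, k=1))
```

A few dozen labeled queries per language are usually enough to tell whether the 5-10% gap matters for your domain.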


#Multilingual #CrossLingualSearch #SemanticSearch #NLP #Embeddings #AgenticAI #LearnAI #AIEngineering


CallSphere Team

Expert insights on AI voice agents and customer communication automation.
