Open-Source Embedding Models: Sentence-Transformers and BGE for RAG Agents
Select, deploy, and optimize open-source embedding models for RAG-powered agents. Compare Sentence-Transformers, BGE, and E5 models with benchmarks, fine-tuning strategies, and deployment patterns.
Why Embedding Models Matter for Agents
Retrieval-Augmented Generation (RAG) is the most common pattern for building agents that work with private data. The embedding model is the backbone of RAG — it converts text into vectors that enable semantic search. A poor embedding model means your agent retrieves irrelevant documents, and no amount of LLM quality can compensate for bad retrieval.
Open-source embedding models have caught up to and often surpassed proprietary offerings. The MTEB (Massive Text Embedding Benchmark) leaderboard shows open models like BGE, E5, and GTE consistently competing with OpenAI's Ada and Cohere's embedding APIs, while running locally at zero cost.
Top Open-Source Embedding Models
BAAI/bge-large-en-v1.5 — 335M parameters, 1024-dimensional embeddings. Currently one of the best-performing open models on MTEB. Excellent for English-language RAG.
intfloat/e5-large-v2 — 335M parameters, 1024 dimensions. Strong alternative to BGE with slightly different strengths across benchmark categories. Requires a "query: " or "passage: " prefix on inputs (see the example after this list).
BAAI/bge-m3 — A multilingual model supporting 100+ languages with dense, sparse, and multi-vector retrieval in a single model. Ideal for multilingual agent deployments.
nomic-ai/nomic-embed-text-v1.5 — 137M parameters, 768 dimensions. Excellent quality-to-size ratio with a Matryoshka representation that allows flexible dimensionality.
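E5's asymmetric prefixes are easy to get wrong, so here is a minimal sketch of the convention from the e5-large-v2 model card: prepend "passage: " to documents at indexing time and "query: " to search queries (example texts here are just illustrations).

from sentence_transformers import SentenceTransformer

# E5 expects asymmetric prefixes: "passage: " on indexed documents,
# "query: " on search queries.
e5 = SentenceTransformer("intfloat/e5-large-v2")

passages = ["passage: " + t for t in [
    "AI agents can autonomously plan and execute tasks.",
    "Vector databases store high-dimensional embeddings efficiently.",
]]
passage_embs = e5.encode(passages, normalize_embeddings=True)

query_emb = e5.encode(["query: How do AI agents plan tasks?"],
                      normalize_embeddings=True)
scores = passage_embs @ query_emb.T  # cosine similarity (vectors are normalized)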
Getting Started with Sentence-Transformers
The sentence-transformers library is the standard way to load and use embedding models:
from sentence_transformers import SentenceTransformer
import numpy as np

# Load the model (downloads ~1.3 GB on first run)
model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# Encode documents
documents = [
    "AI agents can autonomously plan and execute tasks.",
    "Retrieval-augmented generation improves factual accuracy.",
    "Vector databases store high-dimensional embeddings efficiently.",
    "The weather in Paris is mild in spring.",
]
doc_embeddings = model.encode(documents, normalize_embeddings=True)
print(f"Embedding shape: {doc_embeddings.shape}")  # (4, 1024)

# Encode a query
query = "How do AI agents use retrieval?"
query_embedding = model.encode([query], normalize_embeddings=True)

# Compute cosine similarity (dot product since embeddings are normalized)
similarities = np.dot(doc_embeddings, query_embedding.T).flatten()

# Rank results
ranked = sorted(zip(documents, similarities), key=lambda x: x[1], reverse=True)
for doc, score in ranked:
    print(f"{score:.4f}: {doc}")
Optimizing Embedding Performance
For production agents processing thousands of documents, performance matters. Here are the key optimizations:
Batch encoding — Always encode in batches rather than one document at a time:
# Slow: encoding one by one
for doc in documents:
    embedding = model.encode([doc])

# Fast: batch encoding with GPU
embeddings = model.encode(
    documents,
    batch_size=64,              # Process 64 documents at once
    show_progress_bar=True,
    normalize_embeddings=True,
    device="cuda",              # Use GPU
)
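If a single GPU is still the bottleneck, sentence-transformers can distribute encoding across several devices with a worker pool. A minimal sketch, assuming two CUDA devices are available and reusing the model and documents from above:

# Spread encoding across multiple GPUs with a process pool
pool = model.start_multi_process_pool(target_devices=["cuda:0", "cuda:1"])
embeddings = model.encode_multi_process(documents, pool, batch_size=64)
model.stop_multi_process_pool(pool)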
Quantized embeddings — Reduce storage and search costs by quantizing float32 vectors to int8 or binary:
from sentence_transformers.quantization import quantize_embeddings

# Full precision: 1024 dimensions x 4 bytes = 4 KB per document
float_embeddings = model.encode(documents, normalize_embeddings=True)

# Int8 quantization: 1024 bytes per document (75% smaller)
int8_embeddings = quantize_embeddings(float_embeddings, precision="int8")

# Binary quantization: 128 bytes per document (97% smaller)
binary_embeddings = quantize_embeddings(float_embeddings, precision="binary")
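Binary codes are compared with Hamming distance rather than cosine similarity, and most of the lost accuracy can be recovered by rescoring a binary shortlist with the original float vectors. A rough sketch of that two-stage search, reusing the variables above (function names and the top_k value are illustrative):

import numpy as np

def binary_shortlist(query_code, doc_codes, top_k=50):
    # Hamming distance = number of differing bits between packed binary codes
    xor = np.bitwise_xor(doc_codes.view(np.uint8), query_code.view(np.uint8))
    distances = np.unpackbits(xor, axis=1).sum(axis=1)
    return np.argsort(distances)[:top_k]

# Stage 1: cheap binary filter. Stage 2: exact rescoring with float vectors.
query_float = model.encode(["How do AI agents use retrieval?"], normalize_embeddings=True)
query_code = quantize_embeddings(query_float, precision="binary")[0]

candidates = binary_shortlist(query_code, binary_embeddings)
scores = float_embeddings[candidates] @ query_float[0]
reranked = candidates[np.argsort(-scores)]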
Building a RAG Pipeline with Local Embeddings
Here is a complete RAG pipeline using local models for both embedding and generation:
from sentence_transformers import SentenceTransformer
from openai import OpenAI
import chromadb

# Local embedding model
embedder = SentenceTransformer("BAAI/bge-large-en-v1.5")

# Local LLM via Ollama
llm = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# Local vector database
db = chromadb.PersistentClient(path="./vectordb")
collection = db.get_or_create_collection(
    "knowledge_base",
    metadata={"hnsw:space": "cosine"},
)

def ingest(doc_id: str, text: str):
    embedding = embedder.encode([text], normalize_embeddings=True)[0]
    collection.upsert(
        ids=[doc_id],
        embeddings=[embedding.tolist()],
        documents=[text],
    )

def retrieve(query: str, top_k: int = 5) -> list[str]:
    query_emb = embedder.encode([query], normalize_embeddings=True)[0]
    results = collection.query(
        query_embeddings=[query_emb.tolist()],
        n_results=top_k,
    )
    return results["documents"][0]

def rag_query(user_question: str) -> str:
    # Retrieve relevant context
    docs = retrieve(user_question)
    context = "\n\n---\n\n".join(docs)
    # Generate answer with local LLM
    response = llm.chat.completions.create(
        model="llama3.1:8b",
        messages=[
            {"role": "system", "content":
                f"Answer based on this context. If the context does not contain "
                f"the answer, say so.\n\nContext:\n{context}"},
            {"role": "user", "content": user_question},
        ],
        temperature=0.2,
    )
    return response.choices[0].message.content

# Index some documents
ingest("doc1", "AI agents use tool calling to interact with external systems.")
ingest("doc2", "RAG improves LLM accuracy by providing relevant context.")
ingest("doc3", "ChromaDB is an open-source vector database for embeddings.")

# Query
answer = rag_query("How do agents interact with external systems?")
print(answer)
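This example ingests short strings directly. Real documents should be split into chunks before embedding, because bge-large truncates input at 512 tokens. A simple word-window chunker you could slot in front of ingest (the chunk and overlap sizes are placeholders to tune for your corpus):

def chunk_text(text: str, chunk_words: int = 200, overlap: int = 40) -> list[str]:
    # Sliding word window with overlap, so sentences cut at one boundary
    # still appear intact in the neighboring chunk
    words = text.split()
    step = chunk_words - overlap
    return [" ".join(words[i:i + chunk_words]) for i in range(0, max(len(words), 1), step)]

def ingest_document(doc_id: str, text: str):
    for i, chunk in enumerate(chunk_text(text)):
        ingest(f"{doc_id}-chunk{i}", chunk)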
Fine-Tuning Embeddings for Your Domain
Generic embedding models work well out of the box, but fine-tuning on your domain data can improve retrieval quality by 5-15%. Sentence-Transformers makes this straightforward:
from sentence_transformers import SentenceTransformer, losses, InputExample
from torch.utils.data import DataLoader

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# Training data: (query, positive_document) pairs
train_examples = [
    InputExample(texts=["What are AI agents?",
                        "AI agents are autonomous systems that perceive, reason, and act."]),
    InputExample(texts=["How does RAG work?",
                        "RAG retrieves relevant documents and includes them in the LLM prompt."]),
    # Add hundreds more domain-specific pairs
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
# In-batch negatives: every other positive in the batch serves as a negative
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
    output_path="./fine-tuned-embeddings",
)
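Before deploying a fine-tuned model, check that it actually beats the base model on held-out queries. sentence-transformers includes an InformationRetrievalEvaluator for exactly this; a minimal sketch with hypothetical held-out data:

from sentence_transformers.evaluation import InformationRetrievalEvaluator

# Held-out queries, corpus, and relevance judgments (ids are illustrative)
queries = {"q1": "What are AI agents?"}
corpus = {"d1": "AI agents are autonomous systems that perceive, reason, and act.",
          "d2": "The weather in Paris is mild in spring."}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name="heldout")
print(evaluator(model))  # reports metrics such as Recall@k and nDCG@k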
FAQ
Should I use BGE, E5, or nomic-embed for my RAG agent?
For English-only applications, BGE-large-en-v1.5 is the safest default — it consistently ranks near the top of MTEB's English retrieval results. For multilingual needs, use BGE-M3. If you need a smaller model for edge deployment, nomic-embed-text-v1.5 offers the best quality-per-parameter ratio.
How many dimensions should my embeddings have?
1024 dimensions (BGE-large, E5-large) provide the best retrieval quality. If storage or search speed is a concern, nomic-embed supports Matryoshka dimensionality — you can truncate to 256 or 512 dimensions with only minor quality loss. Binary quantization of 1024-dim vectors is another effective way to reduce storage.
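If you take the Matryoshka route, sentence-transformers can truncate embeddings at load time. A sketch assuming nomic-embed-text-v1.5's conventions (its model card calls for trust_remote_code and task prefixes such as "search_document: "):

from sentence_transformers import SentenceTransformer

# Load nomic-embed truncated to 256 dimensions via Matryoshka truncation
small = SentenceTransformer(
    "nomic-ai/nomic-embed-text-v1.5",
    truncate_dim=256,
    trust_remote_code=True,
)
emb = small.encode(["search_document: AI agents use tool calling."],
                   normalize_embeddings=True)
print(emb.shape)  # (1, 256)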
Do I need to re-embed all documents when switching embedding models?
Yes. Embeddings from different models are not compatible — they exist in different vector spaces. When you upgrade your embedding model, you must re-encode your entire document corpus and rebuild the vector index. Plan for this in your deployment strategy.
#Embeddings #SentenceTransformers #BGE #RAG #VectorSearch #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.