Embedding Models for RAG: Choosing Between OpenAI, Cohere, and Open-Source
Compare embedding models for RAG pipelines across vector dimensions, retrieval quality, latency, and cost — including OpenAI text-embedding-3, Cohere embed-v3, and open-source sentence-transformers alternatives.
Why the Embedding Model Is Your RAG Ceiling
The embedding model determines the quality ceiling of your entire RAG pipeline. If the embedding model fails to capture the semantic relationship between a user's question and the relevant document chunk, no amount of prompt engineering on the generation side will fix it. The wrong chunk gets retrieved, and the LLM produces a confident but incorrect answer.
Choosing an embedding model involves balancing four factors: retrieval quality, vector dimensions (affects storage and search speed), latency, and cost.
OpenAI Embedding Models
OpenAI offers two tiers of embedding models, both accessed via the same API:
```python
from openai import OpenAI

client = OpenAI()

# text-embedding-3-small — best balance of quality and cost
response = client.embeddings.create(
    input="What is the refund policy for enterprise customers?",
    model="text-embedding-3-small"
)
embedding = response.data[0].embedding
print(f"Dimensions: {len(embedding)}")  # 1536
print(f"First 5 values: {embedding[:5]}")

# text-embedding-3-large — highest quality, larger vectors
response_large = client.embeddings.create(
    input="What is the refund policy for enterprise customers?",
    model="text-embedding-3-large"
)
embedding_large = response_large.data[0].embedding
print(f"Dimensions: {len(embedding_large)}")  # 3072
```
OpenAI also supports dimension reduction via the dimensions parameter. You can shrink text-embedding-3-large from 3072 to 1024 dimensions with minimal quality loss:
```python
response = client.embeddings.create(
    input="What is the refund policy?",
    model="text-embedding-3-large",
    dimensions=1024  # reduce from 3072
)
print(f"Reduced dimensions: {len(response.data[0].embedding)}")  # 1024
```
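The same idea also works client-side: because the text-embedding-3 models are trained so that the leading dimensions carry most of the signal, truncating an already-stored full-length vector and re-normalizing it approximates what the `dimensions` parameter does server-side. A minimal sketch (the input vector here is random, standing in for a real API response):

```python
import numpy as np

def truncate_embedding(vec, dims=1024):
    """Truncate a full-length embedding and re-normalize to unit length.

    For models whose leading dimensions carry most of the signal (as with
    text-embedding-3), this approximates the server-side `dimensions` option
    without re-embedding the corpus.
    """
    truncated = np.asarray(vec[:dims], dtype=np.float64)
    return truncated / np.linalg.norm(truncated)

# Hypothetical full-length 3072-dim vector standing in for an API response
full = np.random.default_rng(0).normal(size=3072)
reduced = truncate_embedding(full, dims=1024)
print(len(reduced))  # 1024
print(round(float(np.linalg.norm(reduced)), 6))  # 1.0
```

This is useful when you already have full-size vectors stored and want to experiment with smaller indexes before committing to a re-embed.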
| Model | Dimensions | MTEB Score | Price per 1M tokens |
|---|---|---|---|
| text-embedding-3-small | 1536 | 62.3 | $0.02 |
| text-embedding-3-large | 3072 | 64.6 | $0.13 |
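To make the per-token pricing concrete, here is a back-of-the-envelope cost estimate for a one-time corpus embedding run (the corpus size and chunk length are hypothetical):

```python
# Hypothetical corpus: 100,000 chunks of ~500 tokens each
chunks = 100_000
tokens_per_chunk = 500
total_tokens = chunks * tokens_per_chunk  # 50M tokens

price_per_1m = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
}
for model, price in price_per_1m.items():
    cost = total_tokens / 1_000_000 * price
    print(f"{model}: ${cost:.2f} one-time embedding cost")
# text-embedding-3-small: $1.00
# text-embedding-3-large: $6.50
```

Even at 50M tokens, embedding cost is rarely the bottleneck; storage and query-time latency of the resulting vectors usually matter more.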
Cohere Embed v3
Cohere's embed-v3 models are specifically optimized for search and retrieval tasks. A unique feature is the input_type parameter that tells the model whether you are embedding a document or a query, allowing asymmetric embeddings:
```python
import cohere

co = cohere.Client("your-cohere-api-key")

# Embed documents (use "search_document" input_type)
doc_response = co.embed(
    texts=["Refund policy: Enterprise customers can request..."],
    model="embed-english-v3.0",
    input_type="search_document",
    embedding_types=["float"]
)
doc_embedding = doc_response.embeddings.float[0]
print(f"Document embedding dimensions: {len(doc_embedding)}")  # 1024

# Embed queries (use "search_query" input_type)
query_response = co.embed(
    texts=["What is the refund policy?"],
    model="embed-english-v3.0",
    input_type="search_query",
    embedding_types=["float"]
)
query_embedding = query_response.embeddings.float[0]
```
| Model | Dimensions | MTEB Score | Price per 1M tokens |
|---|---|---|---|
| embed-english-v3.0 | 1024 | 64.5 | $0.10 |
| embed-multilingual-v3.0 | 1024 | 66.3 | $0.10 |
Pros: Asymmetric embeddings improve retrieval. Strong multilingual support. Compact 1024-dimension vectors.
Cons: Requires separate API key and billing. Smaller ecosystem than OpenAI.
Open-Source: Sentence Transformers
For teams that need full control, data privacy, or zero per-query cost, open-source models from the sentence-transformers library run locally on your hardware:
```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Load a high-quality open-source model
model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# Embed documents
documents = [
    "Refund policy: Enterprise customers can request a full refund...",
    "Billing cycles run from the 1st to the last day of each month...",
]
doc_embeddings = model.encode(documents, normalize_embeddings=True)
print(f"Shape: {doc_embeddings.shape}")  # (2, 1024)

# Embed a query (bge models expect this instruction prefix on queries only)
query = "Represent this sentence for searching relevant passages: What is the refund policy?"
query_embedding = model.encode([query], normalize_embeddings=True)

# Cosine similarity reduces to a dot product on normalized vectors
similarities = np.dot(doc_embeddings, query_embedding.T).flatten()
for i, sim in enumerate(similarities):
    print(f"Doc {i}: similarity = {sim:.4f}")
```
Top open-source models for RAG:
| Model | Dimensions | MTEB Score | Size |
|---|---|---|---|
| BAAI/bge-large-en-v1.5 | 1024 | 63.6 | 1.3 GB |
| BAAI/bge-small-en-v1.5 | 384 | 62.2 | 130 MB |
| nomic-ai/nomic-embed-text-v1.5 | 768 | 62.3 | 550 MB |
Pros: No API costs. Data never leaves your infrastructure. Full control over model updates. Can fine-tune on your domain.
Cons: Requires GPU for fast inference at scale. You manage model serving infrastructure. Slightly lower quality than top commercial models.
Practical Decision Framework
Choose based on your constraints:
Use OpenAI text-embedding-3-small when: You want the simplest integration, already use OpenAI for generation, and your data volume is moderate (under 10M tokens/month — costs under $0.20/month).
Use Cohere embed-v3 when: You need multilingual support, your retrieval quality is critical, or you want asymmetric document/query embeddings.
Use open-source when: You have strict data privacy requirements, high embedding volumes that would make API costs prohibitive, or you want to fine-tune the embedding model on your specific domain.
Benchmarking on Your Data
Never rely solely on MTEB leaderboard scores. Always benchmark on your actual data:
```python
def evaluate_retrieval(model_name, queries, expected_docs, vectorstore):
    """Measure how often the correct document is in top-k results."""
    hits = 0
    for query, expected_id in zip(queries, expected_docs):
        results = vectorstore.similarity_search(query, k=5)
        retrieved_ids = [r.metadata.get("doc_id") for r in results]
        if expected_id in retrieved_ids:
            hits += 1
    recall_at_5 = hits / len(queries)
    print(f"{model_name}: Recall@5 = {recall_at_5:.2%}")
    return recall_at_5
```
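The function above assumes a LangChain-style vectorstore with a `similarity_search` method. If you want to compare embedding models before wiring up a store at all, the same Recall@k measurement can be done with brute-force numpy search over normalized vectors. A self-contained sketch with synthetic data (the embeddings here are random, just to show the mechanics):

```python
import numpy as np

def recall_at_k(query_embs, doc_embs, expected_ids, doc_ids, k=5):
    """Brute-force Recall@k: fraction of queries whose expected document
    appears among the k nearest documents by cosine similarity."""
    hits = 0
    for q, expected in zip(query_embs, expected_ids):
        sims = doc_embs @ q  # cosine similarity, assuming unit-norm vectors
        top_k = [doc_ids[i] for i in np.argsort(-sims)[:k]]
        if expected in top_k:
            hits += 1
    return hits / len(query_embs)

# Synthetic sanity check: each query is identical to its target document,
# so a correct implementation must score a perfect Recall@5
rng = np.random.default_rng(42)
docs = rng.normal(size=(20, 64))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
doc_ids = list(range(20))
queries = docs[:5]
print(recall_at_k(queries, docs, expected_ids=[0, 1, 2, 3, 4], doc_ids=doc_ids))  # 1.0
```

Swap the synthetic arrays for real embeddings from each candidate model and the number you get back is directly comparable across providers.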
FAQ
Does the embedding model need to match the generation model?
No. The embedding model and the generation LLM are completely independent. You can use Cohere embeddings for retrieval and GPT-4o for generation, or open-source embeddings with Claude. The only requirement is that documents and queries are embedded with the same model.
Should I use the largest embedding model available?
Not necessarily. Larger models (more dimensions) produce slightly better retrieval quality but increase storage costs and slow down similarity search. For most RAG applications, mid-sized models like text-embedding-3-small (1536 dimensions) or bge-large-en-v1.5 (1024 dimensions) offer the best quality-to-cost ratio.
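The storage side of that trade-off is easy to quantify. Assuming float32 vectors (4 bytes per dimension) and a hypothetical index of one million chunks:

```python
chunks = 1_000_000
bytes_per_dim = 4  # float32

for dims in (384, 1024, 1536, 3072):
    gb = chunks * dims * bytes_per_dim / 1024**3
    print(f"{dims} dims: {gb:.2f} GB")
# 384 dims: 1.43 GB
# 1024 dims: 3.81 GB
# 1536 dims: 5.72 GB
# 3072 dims: 11.44 GB
```

Doubling dimensions doubles both the index size and the work per similarity comparison, which is why the 3072-dimension option is worth it only when retrieval quality is the binding constraint.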
Can I fine-tune an embedding model for my domain?
Yes, and it often provides significant quality improvements. Sentence-transformers supports fine-tuning with your own query-document pairs. Even 1,000 labeled pairs can measurably improve retrieval quality on domain-specific content. Commercial models like OpenAI and Cohere do not currently support embedding model fine-tuning.
#RAG #Embeddings #OpenAI #Cohere #SentenceTransformers #VectorSearch #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.