
Embedding Models for RAG: Choosing Between OpenAI, Cohere, and Open-Source

Compare embedding models for RAG pipelines across dimensions, retrieval quality, latency, and cost — including OpenAI text-embedding-3, Cohere embed-v3, and open-source sentence-transformers alternatives.

Why the Embedding Model Is Your RAG Ceiling

The embedding model determines the quality ceiling of your entire RAG pipeline. If the embedding model fails to capture the semantic relationship between a user's question and the relevant document chunk, no amount of prompt engineering on the generation side will fix it. The wrong chunk gets retrieved, and the LLM produces a confident but incorrect answer.

Choosing an embedding model involves balancing four factors: retrieval quality, vector dimensionality (which affects storage and search speed), latency, and cost.

OpenAI Embedding Models

OpenAI offers two tiers of embedding models, both accessed via the same API:

from openai import OpenAI

client = OpenAI()

# text-embedding-3-small — best balance of quality and cost
response = client.embeddings.create(
    input="What is the refund policy for enterprise customers?",
    model="text-embedding-3-small"
)

embedding = response.data[0].embedding
print(f"Dimensions: {len(embedding)}")  # 1536
print(f"First 5 values: {embedding[:5]}")

# text-embedding-3-large — highest quality, larger vectors
response_large = client.embeddings.create(
    input="What is the refund policy for enterprise customers?",
    model="text-embedding-3-large"
)

embedding_large = response_large.data[0].embedding
print(f"Dimensions: {len(embedding_large)}")  # 3072

OpenAI also supports dimension reduction via the dimensions parameter. You can shrink text-embedding-3-large from 3072 to 1024 dimensions with minimal quality loss:

response = client.embeddings.create(
    input="What is the refund policy?",
    model="text-embedding-3-large",
    dimensions=1024  # reduce from 3072
)
print(f"Reduced dimensions: {len(response.data[0].embedding)}")  # 1024

Model | Dimensions | MTEB Score | Price per 1M tokens
text-embedding-3-small | 1536 | 62.3 | $0.02
text-embedding-3-large | 3072 | 64.6 | $0.13
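The dimensions parameter relies on Matryoshka-style training, so the same effect can be approximated client-side by slicing the full vector and re-normalizing to unit length. A minimal sketch (the deterministic random vector stands in for a real 3072-dimension embedding):

```python
import numpy as np

def truncate_embedding(embedding: list[float], dim: int) -> np.ndarray:
    """Truncate a Matryoshka-style embedding and re-normalize to unit length."""
    v = np.asarray(embedding[:dim], dtype=np.float32)
    return v / np.linalg.norm(v)

# Simulated full-size vector standing in for a real 3072-dim embedding
full = np.random.default_rng(0).normal(size=3072).tolist()
short = truncate_embedding(full, 1024)
print(len(short), round(float(np.linalg.norm(short)), 4))  # 1024 1.0
```

Re-normalizing matters: cosine similarity on truncated but un-normalized vectors will drift from the scores the full vectors would produce.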

Cohere Embed v3

Cohere's embed-v3 models are specifically optimized for search and retrieval tasks. A unique feature is the input_type parameter that tells the model whether you are embedding a document or a query, allowing asymmetric embeddings:

import cohere

co = cohere.Client("your-cohere-api-key")

# Embed documents (use "search_document" input_type)
doc_response = co.embed(
    texts=["Refund policy: Enterprise customers can request..."],
    model="embed-english-v3.0",
    input_type="search_document",
    embedding_types=["float"]
)

doc_embedding = doc_response.embeddings.float[0]
print(f"Document embedding dimensions: {len(doc_embedding)}")  # 1024

# Embed queries (use "search_query" input_type)
query_response = co.embed(
    texts=["What is the refund policy?"],
    model="embed-english-v3.0",
    input_type="search_query",
    embedding_types=["float"]
)

query_embedding = query_response.embeddings.float[0]

Model | Dimensions | MTEB Score | Price per 1M tokens
embed-english-v3.0 | 1024 | 64.5 | $0.10
embed-multilingual-v3.0 | 1024 | 66.3 | $0.10

Pros: Asymmetric embeddings improve retrieval. Strong multilingual support. Compact 1024-dimension vectors.

Cons: Requires separate API key and billing. Smaller ecosystem than OpenAI.
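Once both sides are embedded with matching input_type values, retrieval reduces to cosine similarity between the query vector and each document vector. A minimal numpy sketch (the toy four-element vectors stand in for the doc_embedding and query_embedding produced above):

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    """Cosine similarity between two vectors."""
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for the real 1024-dim Cohere embeddings
doc_embedding = [0.12, -0.08, 0.33, 0.05]
query_embedding = [0.10, -0.07, 0.30, 0.01]
print(f"similarity = {cosine_similarity(doc_embedding, query_embedding):.4f}")
```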

Open-Source: Sentence Transformers

For teams that need full control, data privacy, or zero per-query cost, open-source models from the sentence-transformers library run locally on your hardware:

from sentence_transformers import SentenceTransformer
import numpy as np

# Load a high-quality open-source model
model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# Embed documents
documents = [
    "Refund policy: Enterprise customers can request a full refund...",
    "Billing cycles run from the 1st to the last day of each month...",
]
doc_embeddings = model.encode(documents, normalize_embeddings=True)
print(f"Shape: {doc_embeddings.shape}")  # (2, 1024)

# Embed a query (prepend instruction for bge models)
query = "Represent this sentence for searching relevant passages: What is the refund policy?"
query_embedding = model.encode([query], normalize_embeddings=True)

# Compute cosine similarity
similarities = np.dot(doc_embeddings, query_embedding.T).flatten()
for i, sim in enumerate(similarities):
    print(f"Doc {i}: similarity = {sim:.4f}")

Top open-source models for RAG:

Model | Dimensions | MTEB Score | Size
BAAI/bge-large-en-v1.5 | 1024 | 63.6 | 1.3 GB
BAAI/bge-small-en-v1.5 | 384 | 62.2 | 130 MB
nomic-ai/nomic-embed-text-v1.5 | 768 | 62.3 | 550 MB

Pros: No API costs. Data never leaves your infrastructure. Full control over model updates. Can fine-tune on your domain.

Cons: Requires GPU for fast inference at scale. You manage model serving infrastructure. Slightly lower quality than top commercial models.

Practical Decision Framework

Choose based on your constraints:

Use OpenAI text-embedding-3-small when: You want the simplest integration, already use OpenAI for generation, and your data volume is moderate (under 10M tokens/month — costs under $0.20/month).

Use Cohere embed-v3 when: You need multilingual support, your retrieval quality is critical, or you want asymmetric document/query embeddings.

Use open-source when: You have strict data privacy requirements, high embedding volumes that would make API costs prohibitive, or you want to fine-tune the embedding model on your specific domain.
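A back-of-the-envelope comparison makes the cost axis concrete. This sketch applies the per-token prices listed above to a hypothetical volume of 50M tokens per month (self-hosting compute costs are not included):

```python
# Hypothetical volume: 50M tokens embedded per month
tokens_per_month = 50_000_000

# Prices per 1M tokens, from the tables above
price_per_1m = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
    "embed-english-v3.0": 0.10,
}

monthly_cost = {m: tokens_per_month / 1_000_000 * p for m, p in price_per_1m.items()}
for model, cost in monthly_cost.items():
    print(f"{model}: ${cost:.2f}/month")
```

Even at this volume, API embedding costs stay in the single-digit dollars per month; the open-source case is usually driven by privacy or fine-tuning needs rather than raw API spend.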

Benchmarking on Your Data

Never rely solely on MTEB leaderboard scores. Always benchmark on your actual data:

def evaluate_retrieval(model_name, queries, expected_docs, vectorstore):
    """Measure how often the correct document is in top-k results."""
    hits = 0
    for query, expected_id in zip(queries, expected_docs):
        results = vectorstore.similarity_search(query, k=5)
        retrieved_ids = [r.metadata.get("doc_id") for r in results]
        if expected_id in retrieved_ids:
            hits += 1
    recall_at_5 = hits / len(queries)
    print(f"{model_name}: Recall@5 = {recall_at_5:.2%}")
    return recall_at_5
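You can exercise evaluate_retrieval without standing up a real vector store. The sketch below wires it to a keyword-overlap stand-in (FakeDoc and KeywordStore are illustrative stand-ins, not a real library; the evaluation function is repeated so the sketch runs standalone):

```python
class FakeDoc:
    """Stand-in for a vector-store document with metadata."""
    def __init__(self, doc_id, text):
        self.metadata = {"doc_id": doc_id}
        self.page_content = text

class KeywordStore:
    """Toy retriever: ranks documents by word overlap with the query."""
    def __init__(self, docs):
        self.docs = docs

    def similarity_search(self, query, k=5):
        words = set(query.lower().split())
        ranked = sorted(
            self.docs,
            key=lambda d: -len(words & set(d.page_content.lower().split())),
        )
        return ranked[:k]

def evaluate_retrieval(model_name, queries, expected_docs, vectorstore):
    """Measure how often the correct document is in top-k results (as above)."""
    hits = 0
    for query, expected_id in zip(queries, expected_docs):
        results = vectorstore.similarity_search(query, k=5)
        retrieved_ids = [r.metadata.get("doc_id") for r in results]
        if expected_id in retrieved_ids:
            hits += 1
    recall_at_5 = hits / len(queries)
    print(f"{model_name}: Recall@5 = {recall_at_5:.2%}")
    return recall_at_5

store = KeywordStore([
    FakeDoc("refunds", "refund policy enterprise customers full refund"),
    FakeDoc("billing", "billing cycles run monthly from the first"),
])
recall = evaluate_retrieval(
    "keyword-baseline",
    ["what is the refund policy", "how do billing cycles work"],
    ["refunds", "billing"],
    store,
)
```

Swap KeywordStore for a real vector store built with each candidate embedding model, keep the query set fixed, and the Recall@5 numbers become directly comparable.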

FAQ

Does the embedding model need to match the generation model?

No. The embedding model and the generation LLM are completely independent. You can use Cohere embeddings for retrieval and GPT-4o for generation, or open-source embeddings with Claude. The only requirement is that documents and queries are embedded with the same model.

Should I use the largest embedding model available?

Not necessarily. Larger models (more dimensions) produce slightly better retrieval quality but increase storage costs and slow down similarity search. For most RAG applications, mid-size models in the 1024-1536 dimension range, such as text-embedding-3-small (1536) or bge-large-en-v1.5 (1024), offer the best quality-to-cost ratio.
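The storage side of that trade-off is easy to quantify: raw float32 vector storage grows linearly with dimensions (index overhead, such as HNSW graph links, comes on top). A quick sketch:

```python
def index_size_gb(num_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw storage for float32 vectors, ignoring vector-index overhead."""
    return num_vectors * dims * bytes_per_value / 1024**3

# Raw storage per 1M stored vectors at common dimensionalities
for dims in (384, 1024, 1536, 3072):
    print(f"{dims} dims: {index_size_gb(1_000_000, dims):.2f} GB per 1M vectors")
```

At 1M chunks, moving from 1536 to 3072 dimensions roughly doubles both storage and the per-query distance computations.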

Can I fine-tune an embedding model for my domain?

Yes, and it often provides significant quality improvements. Sentence-transformers supports fine-tuning with your own query-document pairs. Even 1,000 labeled pairs can measurably improve retrieval quality on domain-specific content. Commercial models like OpenAI and Cohere do not currently support embedding model fine-tuning.


#RAG #Embeddings #OpenAI #Cohere #SentenceTransformers #VectorSearch #AgenticAI #LearnAI #AIEngineering

CallSphere Team