Embedding Cost Optimization: When to Re-Embed, Cache, or Use Smaller Models
Optimize embedding costs for AI agent systems with practical strategies for caching embeddings, selecting cost-effective models, batch sizing, and storage optimization. Reduce embedding spend by 60-80%.
The Hidden Cost of Embeddings
Embedding costs fly under the radar because individual embedding calls are cheap — $0.02 per million tokens for OpenAI’s text-embedding-3-small. But agents that perform RAG on every request, re-embed documents on every update, and store high-dimensional vectors in expensive vector databases can accumulate significant embedding-related costs. A system processing 500,000 queries daily with an average of 1,000 tokens per query spends about $10/day just on query embeddings — and that does not include document embeddings or vector storage.
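That estimate is easy to reproduce; the volumes are the illustrative figures above and the price is OpenAI's published rate for text-embedding-3-small:

```python
def daily_query_embedding_cost(
    queries_per_day: int,
    avg_tokens_per_query: int,
    cost_per_million_tokens: float,
) -> float:
    """Daily spend on query embeddings alone; document embeddings
    and vector storage are extra."""
    total_tokens = queries_per_day * avg_tokens_per_query
    return total_tokens / 1_000_000 * cost_per_million_tokens

# 500,000 queries/day, ~1,000 tokens each, text-embedding-3-small at $0.02/M:
print(daily_query_embedding_cost(500_000, 1_000, 0.02))  # → 10.0
```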
Embedding Caching
The most impactful optimization is caching embeddings. Query embeddings and document embeddings should never be computed twice for the same input.
```python
import hashlib
import json
from typing import Optional, List

import redis


class EmbeddingCache:
    def __init__(self, redis_url: str = "redis://localhost:6379/1"):
        self.redis_client = redis.from_url(redis_url)
        self.hits = 0
        self.misses = 0

    def _cache_key(self, text: str, model: str) -> str:
        # Key on model + normalized text so identical queries collide.
        content = f"{model}:{text.strip().lower()}"
        return f"emb:{hashlib.sha256(content.encode()).hexdigest()}"

    def get(self, text: str, model: str) -> Optional[List[float]]:
        key = self._cache_key(text, model)
        cached = self.redis_client.get(key)
        if cached:
            self.hits += 1
            return json.loads(cached)
        self.misses += 1
        return None

    def store(self, text: str, model: str, embedding: List[float], ttl: int = 604800):
        # Default TTL is 7 days (604,800 seconds).
        key = self._cache_key(text, model)
        self.redis_client.setex(key, ttl, json.dumps(embedding))

    def get_or_compute(self, text: str, model: str, compute_fn) -> List[float]:
        cached = self.get(text, model)
        if cached is not None:
            return cached
        embedding = compute_fn(text, model)
        self.store(text, model, embedding)
        return embedding

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total > 0 else 0.0
```
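The payoff scales directly with the hit rate: only misses reach the API, so a cache with hit rate h cuts embedding spend by a factor of (1 - h). A quick sketch:

```python
def effective_cost_per_million(api_cost_per_million: float, hit_rate: float) -> float:
    """Expected embedding cost per million tokens once a cache
    absorbs repeated inputs."""
    return api_cost_per_million * (1.0 - hit_rate)

# At a 70% hit rate, $0.02/M tokens effectively becomes ~$0.006/M.
for h in (0.0, 0.5, 0.7, 0.9):
    print(h, effective_cost_per_million(0.02, h))
```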
Model Selection by Use Case
Not every use case needs the highest-quality embedding model. Match the model to the task requirements.
```python
from dataclasses import dataclass
from enum import Enum


class EmbeddingUseCase(Enum):
    SEMANTIC_SEARCH = "semantic_search"
    CLASSIFICATION = "classification"
    CLUSTERING = "clustering"
    DUPLICATE_DETECTION = "duplicate_detection"
    CACHING_KEYS = "caching_keys"


@dataclass
class EmbeddingModelConfig:
    model: str
    dimensions: int
    cost_per_million_tokens: float
    quality_tier: str


MODEL_RECOMMENDATIONS = {
    EmbeddingUseCase.SEMANTIC_SEARCH: EmbeddingModelConfig(
        model="text-embedding-3-large",
        dimensions=3072,
        cost_per_million_tokens=0.13,
        quality_tier="high",
    ),
    EmbeddingUseCase.CLASSIFICATION: EmbeddingModelConfig(
        model="text-embedding-3-small",
        dimensions=1536,
        cost_per_million_tokens=0.02,
        quality_tier="medium",
    ),
    EmbeddingUseCase.CLUSTERING: EmbeddingModelConfig(
        model="text-embedding-3-small",
        dimensions=512,
        cost_per_million_tokens=0.02,
        quality_tier="medium",
    ),
    EmbeddingUseCase.DUPLICATE_DETECTION: EmbeddingModelConfig(
        model="text-embedding-3-small",
        dimensions=256,
        cost_per_million_tokens=0.02,
        quality_tier="low",
    ),
    EmbeddingUseCase.CACHING_KEYS: EmbeddingModelConfig(
        model="text-embedding-3-small",
        dimensions=256,
        cost_per_million_tokens=0.02,
        quality_tier="low",
    ),
}


def select_model(use_case: EmbeddingUseCase) -> EmbeddingModelConfig:
    return MODEL_RECOMMENDATIONS[use_case]
```
Dimension Reduction for Storage Savings
OpenAI’s text-embedding-3 models support native dimension reduction via the dimensions parameter. Reducing from 3072 to 1024 dimensions cuts storage by 67% with only a small quality loss on most benchmarks.
```python
import openai


class OptimizedEmbedder:
    def __init__(self, client: openai.OpenAI, cache: EmbeddingCache):
        self.client = client
        self.cache = cache

    def embed(
        self,
        texts: List[str],
        use_case: EmbeddingUseCase,
    ) -> List[List[float]]:
        config = select_model(use_case)
        uncached_texts = []
        uncached_indices = []
        results: dict[int, List[float]] = {}

        # Serve everything we can from the cache first.
        for i, text in enumerate(texts):
            cached = self.cache.get(text, config.model)
            if cached is not None:
                results[i] = cached
            else:
                uncached_texts.append(text)
                uncached_indices.append(i)

        # One batched API call for the remainder.
        if uncached_texts:
            response = self.client.embeddings.create(
                model=config.model,
                input=uncached_texts,
                dimensions=config.dimensions,
            )
            for j, emb_data in enumerate(response.data):
                idx = uncached_indices[j]
                embedding = emb_data.embedding
                results[idx] = embedding
                self.cache.store(uncached_texts[j], config.model, embedding)

        return [results[i] for i in range(len(texts))]
```
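When a provider lacks a native dimensions parameter, a similar reduction can sometimes be approximated client-side: truncate the vector and re-normalize it to unit length so cosine similarity stays meaningful. This only works well for models trained to tolerate truncation, as the text-embedding-3 family is; treat it as a sketch to validate against your own retrieval metrics:

```python
import math
from typing import List

def truncate_embedding(embedding: List[float], target_dims: int) -> List[float]:
    """Keep the leading dimensions, then L2-normalize the result."""
    vec = embedding[:target_dims]
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        return vec
    return [x / norm for x in vec]

reduced = truncate_embedding([0.1] * 3072, 1024)
print(len(reduced))                                      # → 1024
print(round(math.sqrt(sum(x * x for x in reduced)), 6))  # → 1.0
```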
Batch Sizing for Throughput
Process embeddings in optimal batch sizes to maximize throughput and minimize overhead.
```python
def batch_embed(
    client: openai.OpenAI,
    texts: List[str],
    model: str = "text-embedding-3-small",
    batch_size: int = 100,
    dimensions: int = 1536,
) -> List[List[float]]:
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = client.embeddings.create(
            model=model,
            input=batch,
            dimensions=dimensions,
        )
        # The API returns embeddings in input order, so extend preserves order.
        batch_embeddings = [d.embedding for d in response.data]
        all_embeddings.extend(batch_embeddings)
    return all_embeddings
```
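Fixed-size batches ignore text length: 100 long documents can blow past a request's token limit while 100 short queries waste round trips. A token-aware variant packs inputs under both an item cap and a token budget (the caps below are illustrative, not the API's published limits):

```python
from typing import Iterator, List, Tuple

def batch_by_token_budget(
    items: List[Tuple[str, int]],  # (text, estimated token count)
    max_tokens: int = 8_000,
    max_items: int = 100,
) -> Iterator[List[str]]:
    """Greedily pack texts into batches that respect both limits."""
    batch: List[str] = []
    used = 0
    for text, n_tokens in items:
        # Flush the current batch before it would exceed either cap.
        if batch and (used + n_tokens > max_tokens or len(batch) >= max_items):
            yield batch
            batch, used = [], 0
        batch.append(text)
        used += n_tokens
    if batch:
        yield batch

batches = list(batch_by_token_budget([("a", 3_000), ("b", 3_000), ("c", 3_000)]))
print([len(b) for b in batches])  # → [2, 1]
```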
When to Re-Embed
Re-embedding your entire document corpus is expensive. Only re-embed when you change the embedding model, when documents have been significantly updated, or when your retrieval quality metrics show degradation. For incremental updates, embed only the changed documents and update the vector index incrementally.
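A content hash stored alongside each vector makes "significantly updated" cheap to detect: hash the document at ingest, then on later syncs skip anything whose hash is unchanged. A minimal sketch (where the hash lives, e.g. in vector metadata, depends on your store):

```python
import hashlib
from typing import Optional, Tuple

def needs_reembedding(doc_text: str, stored_hash: Optional[str]) -> Tuple[bool, str]:
    """Return (should_reembed, current_hash). Persist current_hash next to
    the vector after embedding so the next sync can compare against it."""
    current = hashlib.sha256(doc_text.encode()).hexdigest()
    return current != stored_hash, current

changed, h = needs_reembedding("hello world", None)
print(changed)                                 # → True (never embedded)
print(needs_reembedding("hello world", h)[0])  # → False (unchanged, skip)
```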
FAQ
How much storage does an embedding require?
A single 1536-dimensional float32 embedding uses 6,144 bytes (about 6 KB). For 1 million documents, that is approximately 6 GB of raw embedding storage. Using float16 cuts this in half, and reducing dimensions to 512 brings it down to about 1 GB for the same corpus. Factor in vector database overhead (indexes, metadata), which typically adds 30–50% to the raw storage.
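The arithmetic behind those figures as a reusable helper; the 40% index overhead used here is the midpoint of the 30–50% range quoted above:

```python
def embedding_storage_gb(
    num_vectors: int,
    dimensions: int,
    bytes_per_dimension: int = 4,   # float32; use 2 for float16
    index_overhead: float = 0.4,    # vector-DB indexes + metadata
) -> float:
    """Estimated storage in decimal GB, overhead included."""
    raw = num_vectors * dimensions * bytes_per_dimension
    return raw * (1 + index_overhead) / 1e9

# 1M docs at 1536 dims, float32:
print(round(embedding_storage_gb(1_000_000, 1536, index_overhead=0.0), 2))  # → 6.14
print(round(embedding_storage_gb(1_000_000, 1536), 2))                      # → 8.6
```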
Should I use a self-hosted embedding model to save costs?
Self-hosted models like all-MiniLM-L6-v2 from Sentence Transformers are free per token, but you pay for compute infrastructure. The breakeven is simple arithmetic: divide your monthly infrastructure cost by the API's price per million tokens. At $0.02 per million tokens, even a modest $50/month instance only breaks even around 2.5 billion tokens per month, so API-based embedding is cheaper at most volumes once you include instance costs. Above the breakeven, self-hosting provides both cost savings and lower latency.
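The breakeven is worth computing for your own numbers; the $50/month instance below is purely an illustrative assumption:

```python
def breakeven_tokens_per_month(
    api_cost_per_million: float,
    monthly_infra_cost: float,
) -> float:
    """Token volume at which a fixed monthly infra bill matches the API
    bill (ignores marginal compute cost on the self-hosted side)."""
    return monthly_infra_cost / api_cost_per_million * 1_000_000

# $50/month instance vs text-embedding-3-small at $0.02/M tokens:
print(breakeven_tokens_per_month(0.02, 50))  # ≈ 2.5e9 tokens/month
```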
How do I handle embedding model migrations?
Never mix embeddings from different models in the same vector index — their vector spaces are incompatible. Plan migrations by creating a new index, batch-embedding all documents with the new model, switching the search to the new index, and then deleting the old index. Run both indexes in parallel during the transition to validate quality.
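While both indexes run in parallel, a cheap parity check is top-k overlap: send the same query to each index and measure how much of the old result set the new one reproduces. The threshold is yours to pick, and ranking-aware metrics such as Recall@k or nDCG are stricter alternatives:

```python
from typing import List

def topk_overlap(old_ids: List[str], new_ids: List[str], k: int = 10) -> float:
    """Fraction of the old index's top-k document ids that also appear
    in the new index's top-k for the same query."""
    old_top = set(old_ids[:k])
    if not old_top:
        return 1.0
    return len(old_top & set(new_ids[:k])) / len(old_top)

print(topk_overlap(["d1", "d2", "d3"], ["d2", "d3", "d9"], k=3))  # → 0.6666666666666666
```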
#Embeddings #CostOptimization #VectorDatabase #RAG #ModelSelection #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.