Embedding Cost Optimization: When to Re-Embed, Cache, or Use Smaller Models
Optimize embedding costs for AI agent systems with practical strategies for caching embeddings, selecting cost-effective models, batch sizing, and storage optimization. Reduce embedding spend by 60-80%.
The Hidden Cost of Embeddings
Embedding costs fly under the radar because individual embedding calls are cheap — $0.02 per million tokens for OpenAI’s text-embedding-3-small. But agents that perform RAG on every request, re-embed documents on every update, and store high-dimensional vectors in expensive vector databases can accumulate significant embedding-related costs. A system processing 500,000 queries daily with an average of 1,000 tokens per query spends about $10/day just on query embeddings — and that does not include document embeddings or vector storage.
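That estimate is easy to reproduce; the volumes are the illustrative figures above and the price is OpenAI's published rate for text-embedding-3-small:

```python
def daily_query_embedding_cost(
    queries_per_day: int,
    avg_tokens_per_query: int,
    cost_per_million_tokens: float,
) -> float:
    """Daily spend on query embeddings alone; document embeddings
    and vector storage are extra."""
    total_tokens = queries_per_day * avg_tokens_per_query
    return total_tokens / 1_000_000 * cost_per_million_tokens

# 500,000 queries/day, ~1,000 tokens each, text-embedding-3-small at $0.02/M:
print(daily_query_embedding_cost(500_000, 1_000, 0.02))  # → 10.0
```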
Embedding Caching
The most impactful optimization is caching embeddings. Query embeddings and document embeddings should never be computed twice for the same input.
```python
import hashlib
import json
from typing import Optional, List

import redis


class EmbeddingCache:
    def __init__(self, redis_url: str = "redis://localhost:6379/1"):
        self.redis_client = redis.from_url(redis_url)
        self.hits = 0
        self.misses = 0

    def _cache_key(self, text: str, model: str) -> str:
        # Key on model + normalized text so identical queries collide.
        content = f"{model}:{text.strip().lower()}"
        return f"emb:{hashlib.sha256(content.encode()).hexdigest()}"

    def get(self, text: str, model: str) -> Optional[List[float]]:
        key = self._cache_key(text, model)
        cached = self.redis_client.get(key)
        if cached:
            self.hits += 1
            return json.loads(cached)
        self.misses += 1
        return None

    def store(self, text: str, model: str, embedding: List[float], ttl: int = 604800):
        # Default TTL is 7 days (604,800 seconds).
        key = self._cache_key(text, model)
        self.redis_client.setex(key, ttl, json.dumps(embedding))

    def get_or_compute(self, text: str, model: str, compute_fn) -> List[float]:
        cached = self.get(text, model)
        if cached is not None:
            return cached
        embedding = compute_fn(text, model)
        self.store(text, model, embedding)
        return embedding

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total > 0 else 0.0
```
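The payoff scales directly with the hit rate: only misses reach the API, so a cache with hit rate h cuts embedding spend by a factor of (1 - h). A quick sketch:

```python
def effective_cost_per_million(api_cost_per_million: float, hit_rate: float) -> float:
    """Expected embedding cost per million tokens once a cache
    absorbs repeated inputs."""
    return api_cost_per_million * (1.0 - hit_rate)

# At a 70% hit rate, $0.02/M tokens effectively becomes ~$0.006/M.
for h in (0.0, 0.5, 0.7, 0.9):
    print(h, effective_cost_per_million(0.02, h))
```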
Model Selection by Use Case
Not every use case needs the highest-quality embedding model. Match the model to the task requirements.
```python
from dataclasses import dataclass
from enum import Enum


class EmbeddingUseCase(Enum):
    SEMANTIC_SEARCH = "semantic_search"
    CLASSIFICATION = "classification"
    CLUSTERING = "clustering"
    DUPLICATE_DETECTION = "duplicate_detection"
    CACHING_KEYS = "caching_keys"


@dataclass
class EmbeddingModelConfig:
    model: str
    dimensions: int
    cost_per_million_tokens: float
    quality_tier: str


MODEL_RECOMMENDATIONS = {
    EmbeddingUseCase.SEMANTIC_SEARCH: EmbeddingModelConfig(
        model="text-embedding-3-large",
        dimensions=3072,
        cost_per_million_tokens=0.13,
        quality_tier="high",
    ),
    EmbeddingUseCase.CLASSIFICATION: EmbeddingModelConfig(
        model="text-embedding-3-small",
        dimensions=1536,
        cost_per_million_tokens=0.02,
        quality_tier="medium",
    ),
    EmbeddingUseCase.CLUSTERING: EmbeddingModelConfig(
        model="text-embedding-3-small",
        dimensions=512,
        cost_per_million_tokens=0.02,
        quality_tier="medium",
    ),
    EmbeddingUseCase.DUPLICATE_DETECTION: EmbeddingModelConfig(
        model="text-embedding-3-small",
        dimensions=256,
        cost_per_million_tokens=0.02,
        quality_tier="low",
    ),
    EmbeddingUseCase.CACHING_KEYS: EmbeddingModelConfig(
        model="text-embedding-3-small",
        dimensions=256,
        cost_per_million_tokens=0.02,
        quality_tier="low",
    ),
}


def select_model(use_case: EmbeddingUseCase) -> EmbeddingModelConfig:
    return MODEL_RECOMMENDATIONS[use_case]
```
Dimension Reduction for Storage Savings
OpenAI’s text-embedding-3 models support native dimension reduction via the dimensions parameter. Reducing from 3072 to 1024 dimensions cuts storage by 67% with only a small quality loss on most benchmarks.
```python
import openai


class OptimizedEmbedder:
    def __init__(self, client: openai.OpenAI, cache: EmbeddingCache):
        self.client = client
        self.cache = cache

    def embed(
        self,
        texts: List[str],
        use_case: EmbeddingUseCase,
    ) -> List[List[float]]:
        config = select_model(use_case)
        uncached_texts = []
        uncached_indices = []
        results: dict[int, List[float]] = {}

        # Serve everything we can from the cache first.
        for i, text in enumerate(texts):
            cached = self.cache.get(text, config.model)
            if cached is not None:
                results[i] = cached
            else:
                uncached_texts.append(text)
                uncached_indices.append(i)

        # One batched API call for the remainder.
        if uncached_texts:
            response = self.client.embeddings.create(
                model=config.model,
                input=uncached_texts,
                dimensions=config.dimensions,
            )
            for j, emb_data in enumerate(response.data):
                idx = uncached_indices[j]
                embedding = emb_data.embedding
                results[idx] = embedding
                self.cache.store(uncached_texts[j], config.model, embedding)

        return [results[i] for i in range(len(texts))]
```
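When a provider lacks a native dimensions parameter, a similar reduction can sometimes be approximated client-side: truncate the vector and re-normalize it to unit length so cosine similarity stays meaningful. This only works well for models trained to tolerate truncation, as the text-embedding-3 family is; treat it as a sketch to validate against your own retrieval metrics:

```python
import math
from typing import List

def truncate_embedding(embedding: List[float], target_dims: int) -> List[float]:
    """Keep the leading dimensions, then L2-normalize the result."""
    vec = embedding[:target_dims]
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        return vec
    return [x / norm for x in vec]

reduced = truncate_embedding([0.1] * 3072, 1024)
print(len(reduced))                                      # → 1024
print(round(math.sqrt(sum(x * x for x in reduced)), 6))  # → 1.0
```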
Batch Sizing for Throughput
Process embeddings in optimal batch sizes to maximize throughput and minimize overhead.
```python
def batch_embed(
    client: openai.OpenAI,
    texts: List[str],
    model: str = "text-embedding-3-small",
    batch_size: int = 100,
    dimensions: int = 1536,
) -> List[List[float]]:
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = client.embeddings.create(
            model=model,
            input=batch,
            dimensions=dimensions,
        )
        # The API returns embeddings in input order, so extend preserves order.
        batch_embeddings = [d.embedding for d in response.data]
        all_embeddings.extend(batch_embeddings)
    return all_embeddings
```
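Fixed-size batches ignore text length: 100 long documents can blow past a request's token limit while 100 short queries waste round trips. A token-aware variant packs inputs under both an item cap and a token budget (the caps below are illustrative, not the API's published limits):

```python
from typing import Iterator, List, Tuple

def batch_by_token_budget(
    items: List[Tuple[str, int]],  # (text, estimated token count)
    max_tokens: int = 8_000,
    max_items: int = 100,
) -> Iterator[List[str]]:
    """Greedily pack texts into batches that respect both limits."""
    batch: List[str] = []
    used = 0
    for text, n_tokens in items:
        # Flush the current batch before it would exceed either cap.
        if batch and (used + n_tokens > max_tokens or len(batch) >= max_items):
            yield batch
            batch, used = [], 0
        batch.append(text)
        used += n_tokens
    if batch:
        yield batch

batches = list(batch_by_token_budget([("a", 3_000), ("b", 3_000), ("c", 3_000)]))
print([len(b) for b in batches])  # → [2, 1]
```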
When to Re-Embed
Re-embedding your entire document corpus is expensive. Only re-embed when you change the embedding model, when documents have been significantly updated, or when your retrieval quality metrics show degradation. For incremental updates, embed only the changed documents and update the vector index incrementally.
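A content hash stored alongside each vector makes "significantly updated" cheap to detect: hash the document at ingest, then on later syncs skip anything whose hash is unchanged. A minimal sketch (where the hash lives, e.g. in vector metadata, depends on your store):

```python
import hashlib
from typing import Optional, Tuple

def needs_reembedding(doc_text: str, stored_hash: Optional[str]) -> Tuple[bool, str]:
    """Return (should_reembed, current_hash). Persist current_hash next to
    the vector after embedding so the next sync can compare against it."""
    current = hashlib.sha256(doc_text.encode()).hexdigest()
    return current != stored_hash, current

changed, h = needs_reembedding("hello world", None)
print(changed)                                 # → True (never embedded)
print(needs_reembedding("hello world", h)[0])  # → False (unchanged, skip)
```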
FAQ
How much storage does an embedding require?
A single 1536-dimensional float32 embedding uses 6,144 bytes (about 6 KB). For 1 million documents, that is approximately 6 GB of raw embedding storage. Using float16 cuts this in half, and reducing dimensions to 512 brings it down to about 1 GB for the same corpus. Factor in vector database overhead (indexes, metadata), which typically adds 30–50% to the raw storage.
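The arithmetic behind those figures as a reusable helper; the 40% index overhead used here is the midpoint of the 30–50% range quoted above:

```python
def embedding_storage_gb(
    num_vectors: int,
    dimensions: int,
    bytes_per_dimension: int = 4,   # float32; use 2 for float16
    index_overhead: float = 0.4,    # vector-DB indexes + metadata
) -> float:
    """Estimated storage in decimal GB, overhead included."""
    raw = num_vectors * dimensions * bytes_per_dimension
    return raw * (1 + index_overhead) / 1e9

# 1M docs at 1536 dims, float32:
print(round(embedding_storage_gb(1_000_000, 1536, index_overhead=0.0), 2))  # → 6.14
print(round(embedding_storage_gb(1_000_000, 1536), 2))                      # → 8.6
```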
Should I use a self-hosted embedding model to save costs?
Self-hosted models like all-MiniLM-L6-v2 from Sentence Transformers are free per token, but you pay for compute infrastructure. The breakeven is simple arithmetic: divide your monthly infrastructure cost by the API's price per million tokens. At $0.02 per million tokens, even a modest $50/month instance only breaks even around 2.5 billion tokens per month, so API-based embedding is cheaper at most volumes once you include instance costs. Above the breakeven, self-hosting provides both cost savings and lower latency.
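The breakeven is worth computing for your own numbers; the $50/month instance below is purely an illustrative assumption:

```python
def breakeven_tokens_per_month(
    api_cost_per_million: float,
    monthly_infra_cost: float,
) -> float:
    """Token volume at which a fixed monthly infra bill matches the API
    bill (ignores marginal compute cost on the self-hosted side)."""
    return monthly_infra_cost / api_cost_per_million * 1_000_000

# $50/month instance vs text-embedding-3-small at $0.02/M tokens:
print(breakeven_tokens_per_month(0.02, 50))  # ≈ 2.5e9 tokens/month
```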
How do I handle embedding model migrations?
Never mix embeddings from different models in the same vector index — their vector spaces are incompatible. Plan migrations by creating a new index, batch-embedding all documents with the new model, switching the search to the new index, and then deleting the old index. Run both indexes in parallel during the transition to validate quality.
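While both indexes run in parallel, a cheap parity check is top-k overlap: send the same query to each index and measure how much of the old result set the new one reproduces. The threshold is yours to pick, and ranking-aware metrics such as Recall@k or nDCG are stricter alternatives:

```python
from typing import List

def topk_overlap(old_ids: List[str], new_ids: List[str], k: int = 10) -> float:
    """Fraction of the old index's top-k document ids that also appear
    in the new index's top-k for the same query."""
    old_top = set(old_ids[:k])
    if not old_top:
        return 1.0
    return len(old_top & set(new_ids[:k])) / len(old_top)

print(topk_overlap(["d1", "d2", "d3"], ["d2", "d3", "d9"], k=3))  # → 0.6666666666666666
```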
#Embeddings #CostOptimization #VectorDatabase #RAG #ModelSelection #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.