
Caching Strategies That Cut AI Agent Costs: Semantic, Exact, and Hybrid Caching

Learn how to implement exact-match, semantic, and hybrid caching for AI agent responses. Achieve 30–60% cost reduction with proper cache architecture, hit rate optimization, and smart invalidation strategies.

Why Standard Caching Falls Short for AI Agents

Traditional exact-match caching works well for deterministic APIs, but AI agents present a unique challenge: semantically identical questions get asked in different ways. "What are your hours?" and "When are you open?" should return the same cached response, but a hash-based cache treats them as completely different keys.

To solve this, you need a caching strategy that combines exact matching for high-frequency identical queries with semantic matching for paraphrased queries.

Exact-Match Caching with Redis

Start with exact-match caching: it delivers the easiest wins, because many agent systems receive large volumes of literally identical queries.

import hashlib
import json
import time
from typing import Optional
import redis

class ExactMatchCache:
    def __init__(self, redis_url: str = "redis://localhost:6379/0", ttl: int = 3600):
        self.redis_client = redis.from_url(redis_url)
        self.ttl = ttl
        self.hits = 0
        self.misses = 0

    def _make_key(self, prompt: str, model: str) -> str:
        normalized = prompt.strip().lower()
        content = f"{model}:{normalized}"
        return f"llm_cache:{hashlib.sha256(content.encode()).hexdigest()}"

    def get(self, prompt: str, model: str) -> Optional[dict]:
        key = self._make_key(prompt, model)
        cached = self.redis_client.get(key)
        if cached:
            self.hits += 1
            return json.loads(cached)
        self.misses += 1
        return None

    def set(self, prompt: str, model: str, response: dict):
        key = self._make_key(prompt, model)
        self.redis_client.setex(key, self.ttl, json.dumps(response))

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total > 0 else 0.0
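The normalization step in `_make_key` means trivially different phrasings of the same text collapse to one cache entry. A standalone check of the same hashing scheme (`gpt-4o` is just an example model name):

```python
import hashlib

def make_key(prompt: str, model: str) -> str:
    # Same scheme as ExactMatchCache._make_key: normalize, then hash
    normalized = prompt.strip().lower()
    content = f"{model}:{normalized}"
    return f"llm_cache:{hashlib.sha256(content.encode()).hexdigest()}"

# Whitespace and case differences collapse to the same key
print(make_key("What are your hours?", "gpt-4o") ==
      make_key("  what are your HOURS?  ", "gpt-4o"))  # True
```

Note that normalization only catches superficial variants; paraphrases still miss, which is exactly the gap semantic caching fills.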

Semantic Caching with Embeddings

Semantic caching matches queries by meaning rather than exact text. Compute an embedding for each query, then search for similar cached queries within a distance threshold.

import time
import numpy as np
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class CacheEntry:
    query: str
    embedding: np.ndarray
    response: dict
    created_at: float
    access_count: int = 0

class SemanticCache:
    def __init__(
        self,
        similarity_threshold: float = 0.92,
        max_entries: int = 10000,
    ):
        self.threshold = similarity_threshold
        self.max_entries = max_entries
        self.entries: List[CacheEntry] = []

    def _cosine_similarity(self, a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def search(self, query_embedding: np.ndarray) -> Optional[dict]:
        # Linear scan over all entries; for large caches, replace this with
        # an approximate-nearest-neighbor index (e.g., FAISS or a vector DB)
        best_score = 0.0
        best_entry = None
        for entry in self.entries:
            score = self._cosine_similarity(query_embedding, entry.embedding)
            if score > best_score:
                best_score = score
                best_entry = entry
        if best_entry and best_score >= self.threshold:
            best_entry.access_count += 1
            return best_entry.response
        return None

    def store(self, query: str, embedding: np.ndarray, response: dict):
        if len(self.entries) >= self.max_entries:
            # Evict the least-accessed entry to make room
            self.entries.sort(key=lambda e: e.access_count)
            self.entries.pop(0)
        self.entries.append(CacheEntry(
            query=query,
            embedding=embedding,
            response=response,
            created_at=time.time(),
        ))
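To build intuition for the 0.92 threshold, here is a toy check with hand-made 3-dimensional vectors (real embeddings come from a sentence-embedding model and have hundreds of dimensions):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-made "embeddings" for illustration only
hours_q  = np.array([0.90, 0.40, 0.10])
open_q   = np.array([0.85, 0.45, 0.15])  # paraphrase: points the same way
refund_q = np.array([0.10, 0.20, 0.95])  # different intent entirely

print(cosine(hours_q, open_q) > 0.92)    # True: paraphrase clears the bar
print(cosine(hours_q, refund_q) > 0.92)  # False: different intent does not
```

The threshold is doing real work here: too low and the refund query would wrongly reuse the hours answer; too high and the paraphrase would miss.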

Hybrid Caching: Best of Both

Combine exact and semantic caching in a layered architecture. Check exact match first (fastest), then semantic match, and only call the LLM on a full miss.


class HybridCache:
    def __init__(self, exact_cache: ExactMatchCache, semantic_cache: SemanticCache):
        self.exact = exact_cache
        self.semantic = semantic_cache
        self.stats = {"exact_hits": 0, "semantic_hits": 0, "misses": 0}

    def get(self, query: str, model: str, query_embedding: np.ndarray) -> Optional[dict]:
        exact_result = self.exact.get(query, model)
        if exact_result:
            self.stats["exact_hits"] += 1
            return exact_result
        semantic_result = self.semantic.search(query_embedding)
        if semantic_result:
            self.stats["semantic_hits"] += 1
            # Promote the paraphrase so future identical queries hit the fast path
            self.exact.set(query, model, semantic_result)
            return semantic_result
        self.stats["misses"] += 1
        return None

    def store(self, query: str, model: str, embedding: np.ndarray, response: dict):
        self.exact.set(query, model, response)
        self.semantic.store(query, embedding, response)

    def cost_savings_report(self, avg_cost_per_call: float) -> dict:
        total_hits = self.stats["exact_hits"] + self.stats["semantic_hits"]
        total = total_hits + self.stats["misses"]
        return {
            "total_requests": total,
            "cache_hit_rate": round(total_hits / total * 100, 1) if total else 0,
            "estimated_savings": round(total_hits * avg_cost_per_call, 2),
            "breakdown": self.stats.copy(),
        }
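The savings figure in `cost_savings_report` is simple arithmetic; plugging in hypothetical volumes makes the scale concrete:

```python
def estimated_savings(total_requests: int, hit_rate: float,
                      avg_cost_per_call: float) -> float:
    # Every cache hit is one LLM call avoided
    hits = int(total_requests * hit_rate)
    return round(hits * avg_cost_per_call, 2)

# Hypothetical: 100k requests/month, 45% combined hit rate, $0.01 per call
print(estimated_savings(100_000, 0.45, 0.01))  # 450.0
```

At that volume a 45% hit rate saves roughly $450 per month per $0.01 of per-call cost, before counting latency wins.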

Cache Invalidation Strategies

Stale caches are worse than no cache at all for agent systems. Implement time-based TTL for general freshness, event-driven invalidation when underlying data changes, and version-based invalidation when system prompts or tools are updated.

class VersionedCache(ExactMatchCache):
    def __init__(self, version: str, **kwargs):
        super().__init__(**kwargs)
        self.version = version

    def _make_key(self, prompt: str, model: str) -> str:
        normalized = prompt.strip().lower()
        content = f"{self.version}:{model}:{normalized}"
        return f"llm_cache:{hashlib.sha256(content.encode()).hexdigest()}"
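Bumping the version string changes every derived key, so stale entries are simply never looked up again (they age out via TTL). A standalone check of that property, using the same key scheme as `VersionedCache._make_key` (`prompt-v1` and `gpt-4o` are example values):

```python
import hashlib

def make_versioned_key(version: str, model: str, prompt: str) -> str:
    normalized = prompt.strip().lower()
    content = f"{version}:{model}:{normalized}"
    return f"llm_cache:{hashlib.sha256(content.encode()).hexdigest()}"

k_v1 = make_versioned_key("prompt-v1", "gpt-4o", "What are your hours?")
k_v2 = make_versioned_key("prompt-v2", "gpt-4o", "What are your hours?")
print(k_v1 != k_v2)  # True: v1 entries are unreachable after the bump
```

This trades a burst of cold-cache misses after each deploy for guaranteed consistency with the new prompt or toolset.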

FAQ

What similarity threshold should I use for semantic caching?

Start with 0.92–0.95 cosine similarity. Below 0.90, you risk returning incorrect cached answers for queries that are similar but have different intents. Above 0.96, the cache rarely hits because the threshold is too strict. Monitor cache hit rate and error rate to tune this value for your domain.

How do I handle personalized responses with caching?

Separate the cacheable components from personalized components. Cache the factual content (product info, policies, documentation) and inject personalization at response assembly time. For example, cache the answer to "How do I reset my password?" but inject the user’s name and account type dynamically.
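One minimal sketch of that split (the name, greeting format, and settings path here are made up for illustration):

```python
def assemble_response(cached_answer: str, user: dict) -> str:
    # The cache holds only the shared factual content;
    # personalization is injected at assembly time, never stored
    return f"Hi {user['name']}, {cached_answer}"

cached = "you can reset your password under Settings > Security."
print(assemble_response(cached, {"name": "Dana"}))
# Hi Dana, you can reset your password under Settings > Security.
```

Because the cached portion contains no user data, one entry serves every user, and personalization never leaks across accounts.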

What is a good cache hit rate target for AI agents?

A 30–50% hit rate is typical for customer support agents where many users ask similar questions. Internal knowledge assistants may achieve 50–70%. If your hit rate is below 20%, check whether your semantic similarity threshold is too strict or your cache TTL is too short.


#Caching #SemanticCache #CostReduction #Redis #AIArchitecture #AgenticAI #LearnAI #AIEngineering

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
