Caching Strategies That Cut AI Agent Costs: Semantic, Exact, and Hybrid Caching
Learn how to implement exact-match, semantic, and hybrid caching for AI agent responses. Achieve 30-60% cost reduction with proper cache architecture, hit rate optimization, and smart invalidation strategies.
Why Standard Caching Falls Short for AI Agents
Traditional exact-match caching works well for deterministic APIs, but AI agents present a unique challenge: semantically identical questions get asked in different ways. "What are your hours?" and "When are you open?" should return the same cached response, but a hash-based cache treats them as completely different keys.
To solve this, you need a caching strategy that combines exact matching for high-frequency identical queries with semantic matching for paraphrased queries.
Exact-Match Caching with Redis
Start with exact-match caching for the cheapest wins. Many agent systems receive large volumes of identical queries.
```python
import hashlib
import json
import time
from typing import Optional

import redis


class ExactMatchCache:
    def __init__(self, redis_url: str = "redis://localhost:6379/0", ttl: int = 3600):
        self.redis_client = redis.from_url(redis_url)
        self.ttl = ttl
        self.hits = 0
        self.misses = 0

    def _make_key(self, prompt: str, model: str) -> str:
        # Normalize so trivial whitespace/case differences share one key
        normalized = prompt.strip().lower()
        content = f"{model}:{normalized}"
        return f"llm_cache:{hashlib.sha256(content.encode()).hexdigest()}"

    def get(self, prompt: str, model: str) -> Optional[dict]:
        key = self._make_key(prompt, model)
        cached = self.redis_client.get(key)
        if cached:
            self.hits += 1
            return json.loads(cached)
        self.misses += 1
        return None

    def set(self, prompt: str, model: str, response: dict):
        key = self._make_key(prompt, model)
        # setex stores the value with an expiry, giving us TTL-based freshness
        self.redis_client.setex(key, self.ttl, json.dumps(response))

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total > 0 else 0.0
```
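The normalization in `_make_key` can be sanity-checked without a Redis server; this standalone sketch repeats the same hashing logic (the model name "gpt-4o-mini" here is just a placeholder):

```python
import hashlib

def make_key(prompt: str, model: str) -> str:
    # Same normalization as ExactMatchCache._make_key
    normalized = prompt.strip().lower()
    content = f"{model}:{normalized}"
    return f"llm_cache:{hashlib.sha256(content.encode()).hexdigest()}"

key_a = make_key("What are your hours?", "gpt-4o-mini")
key_b = make_key("  what are your hours?  ", "gpt-4o-mini")
print(key_a == key_b)  # True: whitespace and case differences collapse to one key
```

Note that the model name is part of the key, so upgrading models never serves responses generated by the old model.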
Semantic Caching with Embeddings
Semantic caching matches queries by meaning rather than exact text. Compute an embedding for each query, then search for similar cached queries within a distance threshold.
```python
import time
from dataclasses import dataclass
from typing import List, Optional

import numpy as np


@dataclass
class CacheEntry:
    query: str
    embedding: np.ndarray
    response: dict
    created_at: float
    access_count: int = 0


class SemanticCache:
    def __init__(
        self,
        similarity_threshold: float = 0.92,
        max_entries: int = 10000,
    ):
        self.threshold = similarity_threshold
        self.max_entries = max_entries
        self.entries: List[CacheEntry] = []

    def _cosine_similarity(self, a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def search(self, query_embedding: np.ndarray) -> Optional[dict]:
        # Linear scan: fine for small caches; swap in a vector index at scale
        best_score = 0.0
        best_entry = None
        for entry in self.entries:
            score = self._cosine_similarity(query_embedding, entry.embedding)
            if score > best_score:
                best_score = score
                best_entry = entry
        if best_entry and best_score >= self.threshold:
            best_entry.access_count += 1
            return best_entry.response
        return None

    def store(self, query: str, embedding: np.ndarray, response: dict):
        if len(self.entries) >= self.max_entries:
            # Evict the least-accessed entry when the cache is full
            self.entries.sort(key=lambda e: e.access_count)
            self.entries.pop(0)
        self.entries.append(CacheEntry(
            query=query,
            embedding=embedding,
            response=response,
            created_at=time.time(),
        ))
```
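To see why the threshold matters, here is a toy illustration with hand-made 3-dimensional vectors standing in for real embedding-model outputs (production embeddings have hundreds of dimensions; these specific numbers are invented for the example):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings": a cached query, a paraphrase, and an unrelated query
cached = np.array([0.9, 0.1, 0.0])
paraphrase = np.array([0.85, 0.2, 0.05])   # points in nearly the same direction
unrelated = np.array([0.0, 0.1, 0.9])      # nearly orthogonal

threshold = 0.92
print(cosine_similarity(cached, paraphrase) >= threshold)  # True: cache hit
print(cosine_similarity(cached, unrelated) >= threshold)   # False: cache miss
```

The paraphrase clears the 0.92 threshold while the unrelated query falls far below it, which is exactly the separation the threshold is tuned to enforce.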
Hybrid Caching: Best of Both
Combine exact and semantic caching in a layered architecture. Check exact match first (fastest), then semantic match, and only call the LLM on a full miss.
```python
class HybridCache:
    def __init__(self, exact_cache: ExactMatchCache, semantic_cache: SemanticCache):
        self.exact = exact_cache
        self.semantic = semantic_cache
        self.stats = {"exact_hits": 0, "semantic_hits": 0, "misses": 0}

    def get(self, query: str, model: str, query_embedding: np.ndarray) -> Optional[dict]:
        # Layer 1: exact match (fastest, no embedding comparison needed)
        exact_result = self.exact.get(query, model)
        if exact_result:
            self.stats["exact_hits"] += 1
            return exact_result
        # Layer 2: semantic match
        semantic_result = self.semantic.search(query_embedding)
        if semantic_result:
            self.stats["semantic_hits"] += 1
            # Promote the semantic hit so this exact phrasing hits layer 1 next time
            self.exact.set(query, model, semantic_result)
            return semantic_result
        self.stats["misses"] += 1
        return None

    def store(self, query: str, model: str, embedding: np.ndarray, response: dict):
        self.exact.set(query, model, response)
        self.semantic.store(query, embedding, response)

    def cost_savings_report(self, avg_cost_per_call: float) -> dict:
        total_hits = self.stats["exact_hits"] + self.stats["semantic_hits"]
        total = total_hits + self.stats["misses"]
        return {
            "total_requests": total,
            "cache_hit_rate": round(total_hits / total * 100, 1) if total else 0,
            "estimated_savings": round(total_hits * avg_cost_per_call, 2),
            "breakdown": self.stats.copy(),
        }
```
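With assumed numbers (1,000 requests, 300 exact hits, 150 semantic hits, $0.02 per avoided LLM call), the report's arithmetic works out like this:

```python
# Standalone version of the cost_savings_report arithmetic, with invented stats
exact_hits, semantic_hits, misses = 300, 150, 550
avg_cost_per_call = 0.02

total_hits = exact_hits + semantic_hits
total_requests = total_hits + misses
hit_rate = round(total_hits / total_requests * 100, 1)   # percent of requests served from cache
savings = round(total_hits * avg_cost_per_call, 2)       # dollars not spent on LLM calls

print(hit_rate, savings)  # 45.0 9.0
```

A 45% hit rate on these numbers saves $9 per 1,000 requests, which compounds quickly at production volumes.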
Cache Invalidation Strategies
Stale caches are worse than no cache at all for agent systems. Implement time-based TTL for general freshness, event-driven invalidation when underlying data changes, and version-based invalidation when system prompts or tools are updated.
```python
class VersionedCache(ExactMatchCache):
    def __init__(self, version: str, **kwargs):
        super().__init__(**kwargs)
        self.version = version

    def _make_key(self, prompt: str, model: str) -> str:
        # Bumping the version string invalidates every existing key at once
        normalized = prompt.strip().lower()
        content = f"{self.version}:{model}:{normalized}"
        return f"llm_cache:{hashlib.sha256(content.encode()).hexdigest()}"
```
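Event-driven invalidation can be sketched in-memory by indexing cache keys under data-source tags, so that one change event evicts every dependent entry. This `EventInvalidatedCache` is a hypothetical illustration, not a library API; in production, a Redis set per tag would serve the same role as the dictionary index here:

```python
from collections import defaultdict

class EventInvalidatedCache:
    """Sketch: tag each cached entry with the data sources it depends on,
    then evict all dependents when a source changes."""
    def __init__(self):
        self.store = {}
        self.tag_index = defaultdict(set)  # tag -> set of dependent cache keys

    def set(self, key: str, value: dict, tags: list):
        self.store[key] = value
        for tag in tags:
            self.tag_index[tag].add(key)

    def get(self, key: str):
        return self.store.get(key)

    def invalidate_tag(self, tag: str):
        # Called from an event handler when the underlying data changes
        for key in self.tag_index.pop(tag, set()):
            self.store.pop(key, None)

cache = EventInvalidatedCache()
cache.set("hours_answer", {"text": "We're open 9am-5pm."}, tags=["business_hours"])
cache.invalidate_tag("business_hours")  # fired when hours change upstream
print(cache.get("hours_answer"))  # None
```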
FAQ
What similarity threshold should I use for semantic caching?
Start with 0.92–0.95 cosine similarity. Below 0.90, you risk returning incorrect cached answers for queries that are similar but have different intents. Above 0.96, the cache rarely hits because the threshold is too strict. Monitor cache hit rate and error rate to tune this value for your domain.
How do I handle personalized responses with caching?
Separate the cacheable components from personalized components. Cache the factual content (product info, policies, documentation) and inject personalization at response assembly time. For example, cache the answer to "How do I reset my password?" but inject the user’s name and account type dynamically.
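A minimal sketch of that split, using a hypothetical `assemble_response` helper (the cached answer text and user fields are invented for the example):

```python
def assemble_response(cached_answer: str, user: dict) -> str:
    # The cached text is user-agnostic; personalization is injected at assembly time
    greeting = f"Hi {user['name']},"
    return f"{greeting} {cached_answer}"

# One cached entry serves every user who asks about password resets
cached = "to reset your password, open Settings > Security and choose 'Reset'."
print(assemble_response(cached, {"name": "Ada"}))
print(assemble_response(cached, {"name": "Grace"}))
```

Because the personalized wrapper is computed locally, the expensive LLM-generated portion stays fully cacheable.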
What is a good cache hit rate target for AI agents?
A 30–50% hit rate is typical for customer support agents where many users ask similar questions. Internal knowledge assistants may achieve 50–70%. If your hit rate is below 20%, check whether your semantic similarity threshold is too strict or your cache TTL is too short.
#Caching #SemanticCache #CostReduction #Redis #AIArchitecture #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.