
Caching Strategies That Cut AI Agent Costs: Semantic, Exact, and Hybrid Caching

Learn how to implement exact-match, semantic, and hybrid caching for AI agent responses. Achieve 30–60% cost reduction with proper cache architecture, hit rate optimization, and smart invalidation strategies.

Why Standard Caching Falls Short for AI Agents

Traditional exact-match caching works well for deterministic APIs, but AI agents present a unique challenge: semantically identical questions get asked in different ways. "What are your hours?" and "When are you open?" should return the same cached response, but a hash-based cache treats them as completely different keys.

To solve this, you need a caching strategy that combines exact matching for high-frequency identical queries with semantic matching for paraphrased queries.

Exact-Match Caching with Redis

Start with exact-match caching: it delivers the easiest wins, because many agent systems receive large volumes of literally identical queries.

import hashlib
import json
import time
from typing import Optional
import redis

class ExactMatchCache:
    def __init__(self, redis_url: str = "redis://localhost:6379/0", ttl: int = 3600):
        self.redis_client = redis.from_url(redis_url)
        self.ttl = ttl
        self.hits = 0
        self.misses = 0

    def _make_key(self, prompt: str, model: str) -> str:
        normalized = prompt.strip().lower()
        content = f"{model}:{normalized}"
        return f"llm_cache:{hashlib.sha256(content.encode()).hexdigest()}"

    def get(self, prompt: str, model: str) -> Optional[dict]:
        key = self._make_key(prompt, model)
        cached = self.redis_client.get(key)
        if cached:
            self.hits += 1
            return json.loads(cached)
        self.misses += 1
        return None

    def set(self, prompt: str, model: str, response: dict):
        key = self._make_key(prompt, model)
        self.redis_client.setex(key, self.ttl, json.dumps(response))

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total > 0 else 0.0
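The normalization step in `_make_key` means trivially different phrasings of the same text collapse to one cache entry. A standalone check of the same hashing scheme (`gpt-4o` is just an example model name):

```python
import hashlib

def make_key(prompt: str, model: str) -> str:
    # Same scheme as ExactMatchCache._make_key: normalize, then hash
    normalized = prompt.strip().lower()
    content = f"{model}:{normalized}"
    return f"llm_cache:{hashlib.sha256(content.encode()).hexdigest()}"

# Whitespace and case differences collapse to the same key
print(make_key("What are your hours?", "gpt-4o") ==
      make_key("  what are your HOURS?  ", "gpt-4o"))  # True
```

Note that normalization only catches superficial variants; paraphrases still miss, which is exactly the gap semantic caching fills.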

Semantic Caching with Embeddings

Semantic caching matches queries by meaning rather than exact text. Compute an embedding for each query, then search for similar cached queries within a distance threshold.

import time
import numpy as np
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class CacheEntry:
    query: str
    embedding: np.ndarray
    response: dict
    created_at: float
    access_count: int = 0

class SemanticCache:
    def __init__(
        self,
        similarity_threshold: float = 0.92,
        max_entries: int = 10000,
    ):
        self.threshold = similarity_threshold
        self.max_entries = max_entries
        self.entries: List[CacheEntry] = []

    def _cosine_similarity(self, a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def search(self, query_embedding: np.ndarray) -> Optional[dict]:
        # Linear scan over all entries; for large caches, replace this with
        # an approximate-nearest-neighbor index (e.g., FAISS or a vector DB)
        best_score = 0.0
        best_entry = None
        for entry in self.entries:
            score = self._cosine_similarity(query_embedding, entry.embedding)
            if score > best_score:
                best_score = score
                best_entry = entry
        if best_entry and best_score >= self.threshold:
            best_entry.access_count += 1
            return best_entry.response
        return None

    def store(self, query: str, embedding: np.ndarray, response: dict):
        if len(self.entries) >= self.max_entries:
            # Evict the least-accessed entry to make room
            self.entries.sort(key=lambda e: e.access_count)
            self.entries.pop(0)
        self.entries.append(CacheEntry(
            query=query,
            embedding=embedding,
            response=response,
            created_at=time.time(),
        ))
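To build intuition for the 0.92 threshold, here is a toy check with hand-made 3-dimensional vectors (real embeddings come from a sentence-embedding model and have hundreds of dimensions):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-made "embeddings" for illustration only
hours_q  = np.array([0.90, 0.40, 0.10])
open_q   = np.array([0.85, 0.45, 0.15])  # paraphrase: points the same way
refund_q = np.array([0.10, 0.20, 0.95])  # different intent entirely

print(cosine(hours_q, open_q) > 0.92)    # True: paraphrase clears the bar
print(cosine(hours_q, refund_q) > 0.92)  # False: different intent does not
```

The threshold is doing real work here: too low and the refund query would wrongly reuse the hours answer; too high and the paraphrase would miss.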

Hybrid Caching: Best of Both

Combine exact and semantic caching in a layered architecture. Check exact match first (fastest), then semantic match, and only call the LLM on a full miss.


class HybridCache:
    def __init__(self, exact_cache: ExactMatchCache, semantic_cache: SemanticCache):
        self.exact = exact_cache
        self.semantic = semantic_cache
        self.stats = {"exact_hits": 0, "semantic_hits": 0, "misses": 0}

    def get(self, query: str, model: str, query_embedding: np.ndarray) -> Optional[dict]:
        exact_result = self.exact.get(query, model)
        if exact_result:
            self.stats["exact_hits"] += 1
            return exact_result
        semantic_result = self.semantic.search(query_embedding)
        if semantic_result:
            self.stats["semantic_hits"] += 1
            # Promote the paraphrase so future identical queries hit the fast path
            self.exact.set(query, model, semantic_result)
            return semantic_result
        self.stats["misses"] += 1
        return None

    def store(self, query: str, model: str, embedding: np.ndarray, response: dict):
        self.exact.set(query, model, response)
        self.semantic.store(query, embedding, response)

    def cost_savings_report(self, avg_cost_per_call: float) -> dict:
        total_hits = self.stats["exact_hits"] + self.stats["semantic_hits"]
        total = total_hits + self.stats["misses"]
        return {
            "total_requests": total,
            "cache_hit_rate": round(total_hits / total * 100, 1) if total else 0,
            "estimated_savings": round(total_hits * avg_cost_per_call, 2),
            "breakdown": self.stats.copy(),
        }
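The savings figure in `cost_savings_report` is simple arithmetic; plugging in hypothetical volumes makes the scale concrete:

```python
def estimated_savings(total_requests: int, hit_rate: float,
                      avg_cost_per_call: float) -> float:
    # Every cache hit is one LLM call avoided
    hits = int(total_requests * hit_rate)
    return round(hits * avg_cost_per_call, 2)

# Hypothetical: 100k requests/month, 45% combined hit rate, $0.01 per call
print(estimated_savings(100_000, 0.45, 0.01))  # 450.0
```

At that volume a 45% hit rate saves roughly $450 per month per $0.01 of per-call cost, before counting latency wins.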

Cache Invalidation Strategies

Stale caches are worse than no cache at all for agent systems. Implement time-based TTL for general freshness, event-driven invalidation when underlying data changes, and version-based invalidation when system prompts or tools are updated.

class VersionedCache(ExactMatchCache):
    def __init__(self, version: str, **kwargs):
        super().__init__(**kwargs)
        self.version = version

    def _make_key(self, prompt: str, model: str) -> str:
        normalized = prompt.strip().lower()
        content = f"{self.version}:{model}:{normalized}"
        return f"llm_cache:{hashlib.sha256(content.encode()).hexdigest()}"
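Bumping the version string changes every derived key, so stale entries are simply never looked up again (they age out via TTL). A standalone check of that property, using the same key scheme as `VersionedCache._make_key` (`prompt-v1` and `gpt-4o` are example values):

```python
import hashlib

def make_versioned_key(version: str, model: str, prompt: str) -> str:
    normalized = prompt.strip().lower()
    content = f"{version}:{model}:{normalized}"
    return f"llm_cache:{hashlib.sha256(content.encode()).hexdigest()}"

k_v1 = make_versioned_key("prompt-v1", "gpt-4o", "What are your hours?")
k_v2 = make_versioned_key("prompt-v2", "gpt-4o", "What are your hours?")
print(k_v1 != k_v2)  # True: v1 entries are unreachable after the bump
```

This trades a burst of cold-cache misses after each deploy for guaranteed consistency with the new prompt or toolset.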

FAQ

What similarity threshold should I use for semantic caching?

Start with 0.92–0.95 cosine similarity. Below 0.90, you risk returning incorrect cached answers for queries that are similar but have different intents. Above 0.96, the cache rarely hits because the threshold is too strict. Monitor cache hit rate and error rate to tune this value for your domain.

How do I handle personalized responses with caching?

Separate the cacheable components from personalized components. Cache the factual content (product info, policies, documentation) and inject personalization at response assembly time. For example, cache the answer to "How do I reset my password?" but inject the user’s name and account type dynamically.
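One minimal sketch of that split (the name, greeting format, and settings path here are made up for illustration):

```python
def assemble_response(cached_answer: str, user: dict) -> str:
    # The cache holds only the shared factual content;
    # personalization is injected at assembly time, never stored
    return f"Hi {user['name']}, {cached_answer}"

cached = "you can reset your password under Settings > Security."
print(assemble_response(cached, {"name": "Dana"}))
# Hi Dana, you can reset your password under Settings > Security.
```

Because the cached portion contains no user data, one entry serves every user, and personalization never leaks across accounts.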

What is a good cache hit rate target for AI agents?

A 30–50% hit rate is typical for customer support agents where many users ask similar questions. Internal knowledge assistants may achieve 50–70%. If your hit rate is below 20%, check whether your semantic similarity threshold is too strict or your cache TTL is too short.


#Caching #SemanticCache #CostReduction #Redis #AIArchitecture #AgenticAI #LearnAI #AIEngineering

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
