Cache Strategies for AI Agents: Avoiding Redundant LLM Calls
Master caching strategies for AI agents — from response caching and embedding caching to tool result caching and smart invalidation — to reduce latency, cut API costs, and improve throughput.
The Cost of Redundant LLM Calls
Every LLM call costs money and time. A GPT-4o call takes 1-5 seconds and costs roughly $2.50 per million input tokens and $10 per million output tokens. When an agent repeatedly asks the same question, reformats the same data, or re-embeds identical text, those costs compound quickly. In production systems handling thousands of requests, redundant calls can account for 30-50% of total LLM spend.
Caching solves this by storing the results of expensive operations and returning the cached result when the same (or sufficiently similar) input appears again.
Layer 1: Exact Response Caching
The simplest cache matches inputs exactly. If the same prompt produces the same response, serve the cached version.
import hashlib
import json
import time
from typing import Optional, Dict, Any
class LLMResponseCache:
    def __init__(self, ttl_seconds: int = 3600, max_size: int = 1000):
        self._cache: Dict[str, Dict[str, Any]] = {}
        self.ttl = ttl_seconds
        self.max_size = max_size

    def _make_key(self, model: str, messages: list, **kwargs) -> str:
        """Create a deterministic cache key from the request parameters."""
        key_data = {
            "model": model,
            "messages": messages,
            "temperature": kwargs.get("temperature", 1.0),
            "max_tokens": kwargs.get("max_tokens"),
        }
        serialized = json.dumps(key_data, sort_keys=True)
        return hashlib.sha256(serialized.encode()).hexdigest()

    def get(self, model: str, messages: list, **kwargs) -> Optional[str]:
        key = self._make_key(model, messages, **kwargs)
        entry = self._cache.get(key)
        if entry is None:
            return None
        if time.time() - entry["timestamp"] > self.ttl:
            del self._cache[key]
            return None
        entry["hits"] += 1
        return entry["response"]

    def set(self, model: str, messages: list, response: str, **kwargs):
        if len(self._cache) >= self.max_size:
            self._evict_oldest()
        key = self._make_key(model, messages, **kwargs)
        self._cache[key] = {
            "response": response,
            "timestamp": time.time(),
            "hits": 0,
        }

    def _evict_oldest(self):
        oldest_key = min(self._cache, key=lambda k: self._cache[k]["timestamp"])
        del self._cache[oldest_key]
# Usage wrapper
cache = LLMResponseCache(ttl_seconds=1800)

async def cached_llm_call(client, model: str, messages: list, **kwargs) -> str:
    cached = cache.get(model, messages, **kwargs)
    if cached is not None:
        return cached
    response = await client.chat.completions.create(
        model=model, messages=messages, **kwargs
    )
    result = response.choices[0].message.content
    cache.set(model, messages, result, **kwargs)
    return result
Important: Only cache calls with temperature=0 or very low temperature. High-temperature calls are intentionally non-deterministic, and caching defeats the purpose.
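The key construction above is order-insensitive because json.dumps(..., sort_keys=True) canonicalizes the request before hashing. A small standalone check (reproducing the same key derivation as _make_key) illustrates this:

```python
import hashlib
import json

def make_key(model: str, messages: list, **kwargs) -> str:
    # Same key derivation as LLMResponseCache._make_key above
    key_data = {
        "model": model,
        "messages": messages,
        "temperature": kwargs.get("temperature", 1.0),
        "max_tokens": kwargs.get("max_tokens"),
    }
    return hashlib.sha256(json.dumps(key_data, sort_keys=True).encode()).hexdigest()

messages = [{"role": "user", "content": "What is the capital of France?"}]

# sort_keys=True makes the key independent of keyword-argument order
k1 = make_key("gpt-4o", messages, temperature=0, max_tokens=100)
k2 = make_key("gpt-4o", messages, max_tokens=100, temperature=0)
assert k1 == k2

# Any change to the request parameters produces a different key
k3 = make_key("gpt-4o", messages, temperature=0, max_tokens=200)
assert k1 != k3
```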
Layer 2: Semantic Cache
Exact matching misses opportunities. "What is the capital of France?" and "Tell me France's capital city" should return the same cached answer. A semantic cache uses embeddings to find similar past queries.
import numpy as np
class SemanticLLMCache:
    def __init__(self, similarity_threshold: float = 0.95, ttl_seconds: int = 3600):
        self.threshold = similarity_threshold
        self.ttl = ttl_seconds
        self.entries: list = []  # list of (embedding, response, timestamp)
        self._client = None  # lazily created OpenAI client, reused across calls

    def _embed(self, text: str) -> list:
        """Generate an embedding for the cache key text."""
        if self._client is None:
            import openai
            self._client = openai.OpenAI()
        resp = self._client.embeddings.create(
            model="text-embedding-3-small", input=text
        )
        return resp.data[0].embedding

    def _similarity(self, a: list, b: list) -> float:
        a_arr, b_arr = np.array(a), np.array(b)
        return float(np.dot(a_arr, b_arr) / (np.linalg.norm(a_arr) * np.linalg.norm(b_arr)))

    def get(self, query: str) -> Optional[str]:
        query_emb = self._embed(query)
        now = time.time()
        best_match = None
        best_score = 0.0
        # Linear scan is fine for small caches; use a vector index at scale
        for emb, response, ts in self.entries:
            if now - ts > self.ttl:
                continue
            score = self._similarity(query_emb, emb)
            if score > best_score and score >= self.threshold:
                best_score = score
                best_match = response
        return best_match

    def set(self, query: str, response: str):
        emb = self._embed(query)
        self.entries.append((emb, response, time.time()))
Set the similarity threshold high (0.93-0.97) to avoid returning cached responses for genuinely different questions.
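To build intuition for where that threshold sits, here is a standalone check of cosine similarity on toy vectors (real embeddings have hundreds of dimensions, but the math is identical):

```python
import numpy as np

def cosine_similarity(a, b):
    a, b = np.array(a, dtype=float), np.array(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Nearly parallel vectors (like embeddings of paraphrases) score close to 1.0
close = cosine_similarity([1.0, 0.9, 0.1], [1.0, 0.85, 0.15])

# Orthogonal vectors (unrelated questions) score far below any sane threshold
far = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])

assert close > 0.95
assert far < 0.5
```

A threshold of 0.95 would serve the cached answer in the first case and correctly fall through to a fresh LLM call in the second.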
Layer 3: Tool Result Caching
Agents frequently call tools — web searches, API lookups, database queries — and many of these return the same results for identical inputs within a short time window.
import functools
def cached_tool(ttl_seconds: int = 300):
    """Decorator that caches tool results based on input arguments."""
    def decorator(func):
        _cache: Dict[str, Dict[str, Any]] = {}

        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            # Build cache key from function name + arguments
            key_data = {
                "func": func.__name__,
                "args": str(args),
                "kwargs": json.dumps(kwargs, sort_keys=True, default=str),
            }
            key = hashlib.sha256(
                json.dumps(key_data, sort_keys=True).encode()
            ).hexdigest()
            if key in _cache:
                entry = _cache[key]
                if time.time() - entry["ts"] < ttl_seconds:
                    return entry["result"]
            result = await func(*args, **kwargs)
            _cache[key] = {"result": result, "ts": time.time()}
            return result

        wrapper.clear_cache = lambda: _cache.clear()
        return wrapper
    return decorator
# Apply to agent tools
@cached_tool(ttl_seconds=600)
async def search_web(query: str) -> dict:
    """Search the web — results cached for 10 minutes."""
    # ... actual web search implementation
    pass

@cached_tool(ttl_seconds=60)
async def get_stock_price(symbol: str) -> float:
    """Fetch stock price — cached for 1 minute due to volatility."""
    # ... actual API call
    pass
Layer 4: Embedding Cache
If your agent embeds the same texts repeatedly (for memory retrieval, deduplication checks, etc.), an embedding cache avoids redundant API calls.
class EmbeddingCache:
    def __init__(self, max_size: int = 10000):
        self._cache: Dict[str, list] = {}
        self.max_size = max_size
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, text: str, embed_fn) -> list:
        key = hashlib.md5(text.encode()).hexdigest()
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        embedding = embed_fn(text)
        if len(self._cache) >= self.max_size:
            # Evict the oldest-inserted entry (simple FIFO strategy)
            self._cache.pop(next(iter(self._cache)))
        self._cache[key] = embedding
        return embedding

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total > 0 else 0.0
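A quick way to verify the cache is paying off is to count calls into embed_fn with a stub. Here a condensed dict-backed version of the same idea (no eviction, for brevity) shows repeated texts hitting the cache:

```python
import hashlib

class MiniEmbeddingCache:
    """Condensed version of EmbeddingCache above, for illustration."""
    def __init__(self):
        self._cache, self.hits, self.misses = {}, 0, 0

    def get_or_compute(self, text, embed_fn):
        key = hashlib.md5(text.encode()).hexdigest()
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        self._cache[key] = embed_fn(text)
        return self._cache[key]

embed_calls = 0

def fake_embed(text):
    # Stand-in for a real embedding API call
    global embed_calls
    embed_calls += 1
    return [float(len(text))]

cache = MiniEmbeddingCache()
for text in ["hello", "world", "hello", "hello"]:
    cache.get_or_compute(text, fake_embed)

assert embed_calls == 2  # only the two unique texts were embedded
assert cache.hits == 2 and cache.misses == 2
```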
Cache Invalidation Strategies
Caching stale data causes agents to act on outdated information. Use these strategies:
- TTL (Time-To-Live): Set expiry times appropriate to data volatility. Stock prices: 1 minute. Company info: 1 hour. Geographic facts: 24 hours.
- Event-Based: Invalidate specific cache entries when you know the underlying data changed.
- Version Keys: Include a version number in the cache key. Increment it when you deploy new tools or update prompts.
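The version-key strategy can be sketched in a few lines: bumping the version on deploy makes every old key unreachable, which amounts to a full invalidation with no explicit deletion. The names here (PROMPT_VERSION, versioned_key) are illustrative, not from a library:

```python
import hashlib
import json

PROMPT_VERSION = 3  # bump this when prompts or tools change

def versioned_key(model: str, messages: list) -> str:
    # The version participates in the hash, so old entries are orphaned on bump
    payload = {"v": PROMPT_VERSION, "model": model, "messages": messages}
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

messages = [{"role": "user", "content": "Summarize this ticket"}]
old_key = versioned_key("gpt-4o", messages)

PROMPT_VERSION = 4  # simulate a deploy that updates the prompt
new_key = versioned_key("gpt-4o", messages)

assert old_key != new_key  # entries cached under the old key are never hit again
```

The orphaned entries still occupy space until their TTL expires or they are evicted, so version keys pair naturally with the TTL strategy above.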
FAQ
Does caching LLM responses risk serving outdated information?
Yes, if the underlying data changes frequently. Use short TTLs for dynamic content and longer TTLs for stable knowledge. Never cache responses that depend on real-time data (stock prices, weather) with long TTLs.
How much can caching actually save on LLM costs?
It depends on the repetition in your workload. Customer support agents handling common questions can see 40-60% cache hit rates, reducing costs proportionally. Research agents with unique queries might only see 5-10% hit rates. Monitor your cache hit rate and adjust TTLs accordingly.
Should I use Redis or an in-memory cache?
For single-process agents, in-memory caches (like the examples above) are the fastest option. For multi-process or distributed agents, use Redis — it provides shared caching across instances, persistence across restarts, and built-in TTL support with minimal overhead.
#Caching #Performance #LLMOptimization #CostReduction #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.