Caching Architecture for AI Agents: Redis, Memcached, and Application-Level Caching
Design a multi-layer caching architecture for AI agent systems using Redis, application-level caches, and TTL strategies to reduce latency and LLM API costs while preventing cache stampedes and stale data problems.
The Case for Aggressive Caching in AI Agent Systems
AI agent systems have a unique cost profile: every LLM call costs money and adds latency. A single agent turn might involve a tool call that fetches the same reference data that 100 other concurrent sessions also need. Without caching, you pay the database query cost and network latency for every identical request.
Effective caching in AI agent platforms operates at three layers: application-level in-process caching for hot configuration data, Redis for shared session and response caching across pods, and semantic caching for similar (not identical) LLM queries.
Layer 1: Application-Level Caching
Use in-process caching for data that changes infrequently and is read on every agent turn — prompt templates, tool definitions, model configurations:
import time


class TTLCache:
    """Simple TTL cache for configuration data."""

    def __init__(self, ttl_seconds: int = 300):
        self._cache: dict = {}
        self._expiry: dict = {}
        self._ttl = ttl_seconds

    def get(self, key: str):
        if key in self._cache:
            if time.time() < self._expiry[key]:
                return self._cache[key]
            # Expired: drop the stale entry before reporting a miss.
            del self._cache[key]
            del self._expiry[key]
        return None

    def set(self, key: str, value):
        self._cache[key] = value
        self._expiry[key] = time.time() + self._ttl


# Global instance shared across requests in one process
config_cache = TTLCache(ttl_seconds=300)


async def get_prompt_template(template_id: str) -> str:
    cached = config_cache.get(f"prompt:{template_id}")
    if cached is not None:
        return cached
    template = await db.fetch_prompt_template(template_id)
    config_cache.set(f"prompt:{template_id}", template)
    return template
This avoids a database round-trip on every single agent turn for data that only changes when an admin updates a template. The five-minute TTL ensures updates propagate without requiring cache invalidation signals.
Layer 2: Redis for Shared State
Redis caches data that multiple pods need access to — session context, user preferences, frequently accessed knowledge base entries:
import redis.asyncio as redis
import json
import hashlib

redis_client = redis.Redis(
    host="redis-cluster",
    port=6379,
    decode_responses=True,
)


async def cached_tool_result(tool_name: str, params: dict) -> dict | None:
    """Return a cached tool result, or None on a miss."""
    cache_key = f"tool:{tool_name}:{_hash_params(params)}"
    cached = await redis_client.get(cache_key)
    if cached:
        return json.loads(cached)
    return None


async def store_tool_result(
    tool_name: str, params: dict, result: dict, ttl: int = 600
):
    """Cache a tool result; only safe for deterministic tools."""
    cache_key = f"tool:{tool_name}:{_hash_params(params)}"
    await redis_client.setex(cache_key, ttl, json.dumps(result))


def _hash_params(params: dict) -> str:
    """Stable hash so identical params always map to the same key."""
    serialized = json.dumps(params, sort_keys=True)
    return hashlib.sha256(serialized.encode()).hexdigest()[:16]
For AI agents, caching tool call results is extremely high-value. If 50 concurrent sessions all ask "What are our business hours?" and the agent calls a get_business_info tool, only the first call actually executes — the other 49 get the cached result instantly.
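Wiring the lookup and store helpers together gives a classic read-through pattern. A minimal self-contained sketch of that flow (an in-memory dict stands in for Redis here, and get_business_info is a hypothetical tool):

```python
import json
import hashlib
import time

_store: dict[str, tuple[float, str]] = {}  # key -> (expiry timestamp, payload)


def _hash_params(params: dict) -> str:
    serialized = json.dumps(params, sort_keys=True)
    return hashlib.sha256(serialized.encode()).hexdigest()[:16]


def run_tool_cached(tool_name: str, params: dict, executor, ttl: int = 600):
    """Read-through cache: return a cached result or execute and store."""
    key = f"tool:{tool_name}:{_hash_params(params)}"
    entry = _store.get(key)
    if entry and time.time() < entry[0]:
        return json.loads(entry[1])  # cache hit
    result = executor(params)  # cache miss: actually run the tool
    _store[key] = (time.time() + ttl, json.dumps(result))
    return result


calls = 0


def get_business_info(params: dict) -> dict:
    global calls
    calls += 1  # count real executions to show the cache working
    return {"hours": "9-5"}


first = run_tool_cached("get_business_info", {"loc": "hq"}, get_business_info)
second = run_tool_cached("get_business_info", {"loc": "hq"}, get_business_info)
# Both calls return the same result, but the tool only executed once.
```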
Layer 3: Semantic Caching for LLM Responses
Semantic caching goes beyond exact-match caching. If one user asks "What is your return policy?" and another asks "How do I return an item?", the underlying LLM call is essentially the same. Use embedding similarity to match semantically equivalent queries:
SIMILARITY_THRESHOLD = 0.95


async def semantic_cache_lookup(
    query: str, namespace: str = "default"
) -> str | None:
    # get_embedding and vector_search are assumed helpers: an embedding
    # API client and a vector index (e.g. Redis with vector search).
    query_embedding = await get_embedding(query)
    # Search for the closest previously cached query
    results = await vector_search(
        namespace=namespace,
        vector=query_embedding,
        top_k=1,
    )
    if results and results[0]["score"] >= SIMILARITY_THRESHOLD:
        return results[0]["metadata"]["response"]
    return None


async def semantic_cache_store(
    query: str, response: str, namespace: str = "default", ttl: int = 3600
):
    query_embedding = await get_embedding(query)
    cache_key = _hash_params({"query": query, "ns": namespace})
    await store_vector(
        namespace=namespace,
        key=cache_key,
        vector=query_embedding,
        metadata={"response": response},
        ttl=ttl,
    )
This can reduce LLM API calls by 30 to 60 percent for customer-facing agents where many users ask similar questions.
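The score compared against SIMILARITY_THRESHOLD is typically cosine similarity between the two query embeddings. A minimal sketch with toy 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


SIMILARITY_THRESHOLD = 0.95

# Hypothetical embeddings: two phrasings of a returns question, one about hours.
returns_q1 = [0.9, 0.1, 0.2]   # "What is your return policy?"
returns_q2 = [0.85, 0.15, 0.25]  # "How do I return an item?"
hours_q = [0.1, 0.9, 0.1]      # "What are your business hours?"

sim_same = cosine_similarity(returns_q1, returns_q2)  # above the threshold
sim_diff = cosine_similarity(returns_q1, hours_q)     # well below it
```

The two returns questions clear the 0.95 threshold and would share a cached response; the hours question would not.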
Preventing Cache Stampedes
A cache stampede occurs when a popular cache entry expires and hundreds of concurrent requests all try to regenerate it simultaneously. For AI agents, this means hundreds of identical LLM calls or database queries firing at once:
import asyncio

# One lock per cache key. Creating the lock with setdefault is safe in
# asyncio because there is no await between the check and the insert.
_locks: dict[str, asyncio.Lock] = {}


async def get_with_lock(key: str, generator, ttl: int = 600):
    """Fetch from cache with single-flight protection."""
    cached = await redis_client.get(key)
    if cached:
        return json.loads(cached)
    lock = _locks.setdefault(key, asyncio.Lock())
    async with lock:
        # Double-check after acquiring the lock: another coroutine
        # may have populated the cache while we waited.
        cached = await redis_client.get(key)
        if cached:
            return json.loads(cached)
        result = await generator()
        await redis_client.setex(key, ttl, json.dumps(result))
        return result
The lock ensures only one coroutine per process generates the value while the others wait. Because asyncio locks are process-local, each pod can still trigger one regeneration; for strict cross-pod single-flight, add a distributed lock (for example, Redis SET with the NX option and an expiry). Combined with early expiration (refreshing entries shortly before they expire), this reduces a stampede to at most one regeneration per pod.
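Early expiration itself can be implemented with probabilistic early recomputation (often called XFetch): each reader volunteers to refresh shortly before the TTL runs out, with a probability that rises as expiry approaches, so refreshes spread out instead of piling up at the deadline. A minimal sketch under that assumption:

```python
import math
import random
import time


def should_refresh_early(expiry: float, delta: float, beta: float = 1.0) -> bool:
    """XFetch-style early refresh decision.

    expiry -- absolute expiry timestamp of the cached entry
    delta  -- how long the last recomputation took, in seconds
    beta   -- > 1 favors earlier refreshes, < 1 later ones
    """
    # log(random()) is negative, so the left side is pushed slightly into
    # the future; the jitter scales with how expensive recomputation is.
    return time.time() - delta * beta * math.log(random.random()) >= expiry


# Far from expiry the refresh essentially never fires; past expiry it always does.
far = should_refresh_early(time.time() + 3600, delta=0.05)
due = should_refresh_early(time.time() - 1, delta=0.05)
```

A caller that gets True recomputes the value and resets the TTL; everyone else keeps serving the still-valid cached copy.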
FAQ
What TTL should I use for cached LLM responses?
It depends on data volatility. For static knowledge base answers, use 1 to 24 hours. For responses that depend on real-time data (stock prices, appointment availability), use 30 to 60 seconds or skip caching entirely. For tool call results, match the TTL to how often the underlying data changes.
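That guidance is easiest to enforce as a per-tool TTL policy rather than one global default. A small sketch (the tool names are hypothetical):

```python
# TTLs in seconds, matched to how fast the underlying data changes.
TOOL_TTLS: dict[str, int] = {
    "get_business_info": 24 * 3600,   # static reference data
    "search_knowledge_base": 3600,    # updated occasionally
    "check_appointment_slots": 30,    # near-real-time availability
    "get_stock_price": 0,             # 0 means: do not cache at all
}


def ttl_for(tool_name: str) -> int:
    """Return the TTL for a tool, with a conservative default for unknown tools."""
    return TOOL_TTLS.get(tool_name, 60)
```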
Should I use Redis or Memcached for AI agent caching?
Use Redis. It supports data structures (sorted sets for leaderboards, lists for conversation history), pub/sub for cache invalidation, and persistence for surviving restarts. Memcached is simpler but lacks these features that AI agent platforms commonly need.
How do I invalidate cached tool results when underlying data changes?
Use a cache key prefix that includes a version number or timestamp. When the underlying data changes, increment the version in the key namespace. Alternatively, publish an invalidation event via Redis pub/sub that all pods subscribe to, and delete the specific cache keys.
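A minimal sketch of the versioned-key approach (a dict stands in for Redis; in production the version counter would itself live in Redis so all pods see the bump):

```python
store: dict[str, str] = {}      # stands in for Redis
versions: dict[str, int] = {}   # one version counter per key namespace


def versioned_key(namespace: str, key: str) -> str:
    """Embed the current namespace version in every cache key."""
    v = versions.get(namespace, 1)
    return f"{namespace}:v{v}:{key}"


def invalidate(namespace: str) -> None:
    """Bump the version: old keys become unreachable and expire via TTL."""
    versions[namespace] = versions.get(namespace, 1) + 1


store[versioned_key("tool:get_business_info", "hq")] = '{"hours": "9-5"}'
hit = store.get(versioned_key("tool:get_business_info", "hq"))

invalidate("tool:get_business_info")
miss = store.get(versioned_key("tool:get_business_info", "hq"))  # now a miss
```

Nothing is deleted eagerly; stale entries simply age out under their TTL, which keeps invalidation a single counter increment.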
CallSphere Team
Expert insights on AI voice agents and customer communication automation.