Caching Architecture for AI Agents: Redis, Memcached, and Application-Level Caching
Design a multi-layer caching architecture for AI agent systems using Redis, application-level caches, and TTL strategies to reduce latency and LLM API costs while preventing cache stampedes and stale data problems.
The Case for Aggressive Caching in AI Agent Systems
AI agent systems have a unique cost profile: every LLM call costs money and adds latency. A single agent turn might involve a tool call that fetches the same reference data that 100 other concurrent sessions also need. Without caching, you pay the database query cost and network latency for every identical request.
Effective caching in AI agent platforms operates at three layers: application-level in-process caching for hot configuration data, Redis for shared session and response caching across pods, and semantic caching for similar (not identical) LLM queries.
Layer 1: Application-Level Caching
Use in-process caching for data that changes infrequently and is read on every agent turn — prompt templates, tool definitions, model configurations:
import time


class TTLCache:
    """Simple TTL cache for configuration data."""

    def __init__(self, ttl_seconds: int = 300):
        self._cache: dict = {}
        self._expiry: dict = {}
        self._ttl = ttl_seconds

    def get(self, key: str):
        if key in self._cache:
            if time.time() < self._expiry[key]:
                return self._cache[key]
            # Expired: drop the stale entry before reporting a miss.
            del self._cache[key]
            del self._expiry[key]
        return None

    def set(self, key: str, value):
        self._cache[key] = value
        self._expiry[key] = time.time() + self._ttl


# Global instance shared across requests in one process
config_cache = TTLCache(ttl_seconds=300)


async def get_prompt_template(template_id: str) -> str:
    cached = config_cache.get(f"prompt:{template_id}")
    if cached is not None:
        return cached
    template = await db.fetch_prompt_template(template_id)
    config_cache.set(f"prompt:{template_id}", template)
    return template
This avoids a database round-trip on every single agent turn for data that only changes when an admin updates a template. The five-minute TTL ensures updates propagate without requiring cache invalidation signals.
Layer 2: Redis for Shared State
Redis caches data that multiple pods need access to — session context, user preferences, frequently accessed knowledge base entries:
import redis.asyncio as redis
import json
import hashlib

redis_client = redis.Redis(
    host="redis-cluster",
    port=6379,
    decode_responses=True,
)


async def cached_tool_result(tool_name: str, params: dict) -> dict | None:
    """Return a cached tool result, or None on a miss."""
    cache_key = f"tool:{tool_name}:{_hash_params(params)}"
    cached = await redis_client.get(cache_key)
    if cached:
        return json.loads(cached)
    return None


async def store_tool_result(
    tool_name: str, params: dict, result: dict, ttl: int = 600
):
    """Cache a tool result; only safe for deterministic tools."""
    cache_key = f"tool:{tool_name}:{_hash_params(params)}"
    await redis_client.setex(cache_key, ttl, json.dumps(result))


def _hash_params(params: dict) -> str:
    """Stable hash so identical params always map to the same key."""
    serialized = json.dumps(params, sort_keys=True)
    return hashlib.sha256(serialized.encode()).hexdigest()[:16]
For AI agents, caching tool call results is extremely high-value. If 50 concurrent sessions all ask "What are our business hours?" and the agent calls a get_business_info tool, only the first call actually executes — the other 49 get the cached result instantly.
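Wiring the lookup and store helpers together gives a classic read-through pattern. A minimal self-contained sketch of that flow (an in-memory dict stands in for Redis here, and get_business_info is a hypothetical tool):

```python
import json
import hashlib
import time

_store: dict[str, tuple[float, str]] = {}  # key -> (expiry timestamp, payload)


def _hash_params(params: dict) -> str:
    serialized = json.dumps(params, sort_keys=True)
    return hashlib.sha256(serialized.encode()).hexdigest()[:16]


def run_tool_cached(tool_name: str, params: dict, executor, ttl: int = 600):
    """Read-through cache: return a cached result or execute and store."""
    key = f"tool:{tool_name}:{_hash_params(params)}"
    entry = _store.get(key)
    if entry and time.time() < entry[0]:
        return json.loads(entry[1])  # cache hit
    result = executor(params)  # cache miss: actually run the tool
    _store[key] = (time.time() + ttl, json.dumps(result))
    return result


calls = 0


def get_business_info(params: dict) -> dict:
    global calls
    calls += 1  # count real executions to show the cache working
    return {"hours": "9-5"}


first = run_tool_cached("get_business_info", {"loc": "hq"}, get_business_info)
second = run_tool_cached("get_business_info", {"loc": "hq"}, get_business_info)
# Both calls return the same result, but the tool only executed once.
```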
Layer 3: Semantic Caching for LLM Responses
Semantic caching goes beyond exact-match caching. If one user asks "What is your return policy?" and another asks "How do I return an item?", the underlying LLM call is essentially the same. Use embedding similarity to match semantically equivalent queries:
SIMILARITY_THRESHOLD = 0.95


async def semantic_cache_lookup(
    query: str, namespace: str = "default"
) -> str | None:
    # get_embedding and vector_search are assumed helpers: an embedding
    # API client and a vector index (e.g. Redis with vector search).
    query_embedding = await get_embedding(query)
    # Search for the closest previously cached query
    results = await vector_search(
        namespace=namespace,
        vector=query_embedding,
        top_k=1,
    )
    if results and results[0]["score"] >= SIMILARITY_THRESHOLD:
        return results[0]["metadata"]["response"]
    return None


async def semantic_cache_store(
    query: str, response: str, namespace: str = "default", ttl: int = 3600
):
    query_embedding = await get_embedding(query)
    cache_key = _hash_params({"query": query, "ns": namespace})
    await store_vector(
        namespace=namespace,
        key=cache_key,
        vector=query_embedding,
        metadata={"response": response},
        ttl=ttl,
    )
This can reduce LLM API calls by 30 to 60 percent for customer-facing agents where many users ask similar questions.
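The score compared against SIMILARITY_THRESHOLD is typically cosine similarity between the two query embeddings. A minimal sketch with toy 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


SIMILARITY_THRESHOLD = 0.95

# Hypothetical embeddings: two phrasings of a returns question, one about hours.
returns_q1 = [0.9, 0.1, 0.2]   # "What is your return policy?"
returns_q2 = [0.85, 0.15, 0.25]  # "How do I return an item?"
hours_q = [0.1, 0.9, 0.1]      # "What are your business hours?"

sim_same = cosine_similarity(returns_q1, returns_q2)  # above the threshold
sim_diff = cosine_similarity(returns_q1, hours_q)     # well below it
```

The two returns questions clear the 0.95 threshold and would share a cached response; the hours question would not.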
Preventing Cache Stampedes
A cache stampede occurs when a popular cache entry expires and hundreds of concurrent requests all try to regenerate it simultaneously. For AI agents, this means hundreds of identical LLM calls or database queries firing at once:
import asyncio

# One lock per cache key. Creating the lock with setdefault is safe in
# asyncio because there is no await between the check and the insert.
_locks: dict[str, asyncio.Lock] = {}


async def get_with_lock(key: str, generator, ttl: int = 600):
    """Fetch from cache with single-flight protection."""
    cached = await redis_client.get(key)
    if cached:
        return json.loads(cached)
    lock = _locks.setdefault(key, asyncio.Lock())
    async with lock:
        # Double-check after acquiring the lock: another coroutine
        # may have populated the cache while we waited.
        cached = await redis_client.get(key)
        if cached:
            return json.loads(cached)
        result = await generator()
        await redis_client.setex(key, ttl, json.dumps(result))
        return result
The lock ensures only one coroutine per process generates the value while the others wait. Because asyncio locks are process-local, each pod can still trigger one regeneration; for strict cross-pod single-flight, add a distributed lock (for example, Redis SET with the NX option and an expiry). Combined with early expiration (refreshing entries shortly before they expire), this reduces a stampede to at most one regeneration per pod.
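Early expiration itself can be implemented with probabilistic early recomputation (often called XFetch): each reader volunteers to refresh shortly before the TTL runs out, with a probability that rises as expiry approaches, so refreshes spread out instead of piling up at the deadline. A minimal sketch under that assumption:

```python
import math
import random
import time


def should_refresh_early(expiry: float, delta: float, beta: float = 1.0) -> bool:
    """XFetch-style early refresh decision.

    expiry -- absolute expiry timestamp of the cached entry
    delta  -- how long the last recomputation took, in seconds
    beta   -- > 1 favors earlier refreshes, < 1 later ones
    """
    # log(random()) is negative, so the left side is pushed slightly into
    # the future; the jitter scales with how expensive recomputation is.
    return time.time() - delta * beta * math.log(random.random()) >= expiry


# Far from expiry the refresh essentially never fires; past expiry it always does.
far = should_refresh_early(time.time() + 3600, delta=0.05)
due = should_refresh_early(time.time() - 1, delta=0.05)
```

A caller that gets True recomputes the value and resets the TTL; everyone else keeps serving the still-valid cached copy.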
FAQ
What TTL should I use for cached LLM responses?
It depends on data volatility. For static knowledge base answers, use 1 to 24 hours. For responses that depend on real-time data (stock prices, appointment availability), use 30 to 60 seconds or skip caching entirely. For tool call results, match the TTL to how often the underlying data changes.
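That guidance is easiest to enforce as a per-tool TTL policy rather than one global default. A small sketch (the tool names are hypothetical):

```python
# TTLs in seconds, matched to how fast the underlying data changes.
TOOL_TTLS: dict[str, int] = {
    "get_business_info": 24 * 3600,   # static reference data
    "search_knowledge_base": 3600,    # updated occasionally
    "check_appointment_slots": 30,    # near-real-time availability
    "get_stock_price": 0,             # 0 means: do not cache at all
}


def ttl_for(tool_name: str) -> int:
    """Return the TTL for a tool, with a conservative default for unknown tools."""
    return TOOL_TTLS.get(tool_name, 60)
```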
Should I use Redis or Memcached for AI agent caching?
Use Redis. It supports data structures (sorted sets for leaderboards, lists for conversation history), pub/sub for cache invalidation, and persistence for surviving restarts. Memcached is simpler but lacks these features that AI agent platforms commonly need.
How do I invalidate cached tool results when underlying data changes?
Use a cache key prefix that includes a version number or timestamp. When the underlying data changes, increment the version in the key namespace. Alternatively, publish an invalidation event via Redis pub/sub that all pods subscribe to, and delete the specific cache keys.
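A minimal sketch of the versioned-key approach (a dict stands in for Redis; in production the version counter would itself live in Redis so all pods see the bump):

```python
store: dict[str, str] = {}      # stands in for Redis
versions: dict[str, int] = {}   # one version counter per key namespace


def versioned_key(namespace: str, key: str) -> str:
    """Embed the current namespace version in every cache key."""
    v = versions.get(namespace, 1)
    return f"{namespace}:v{v}:{key}"


def invalidate(namespace: str) -> None:
    """Bump the version: old keys become unreachable and expire via TTL."""
    versions[namespace] = versions.get(namespace, 1) + 1


store[versioned_key("tool:get_business_info", "hq")] = '{"hours": "9-5"}'
hit = store.get(versioned_key("tool:get_business_info", "hq"))

invalidate("tool:get_business_info")
miss = store.get(versioned_key("tool:get_business_info", "hq"))  # now a miss
```

Nothing is deleted eagerly; stale entries simply age out under their TTL, which keeps invalidation a single counter increment.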
CallSphere Team
Expert insights on AI voice agents and customer communication automation.