
Cache Strategies for AI Agents: Avoiding Redundant LLM Calls

Master caching strategies for AI agents — from response caching and embedding caching to tool result caching and smart invalidation — to reduce latency, cut API costs, and improve throughput.

The Cost of Redundant LLM Calls

Every LLM call costs money and time. A GPT-4o call typically takes 1-5 seconds and costs $2.50 per million input tokens and $10 per million output tokens. When an agent repeatedly asks the same question, reformats the same data, or re-embeds identical text, those costs compound quickly. In production systems handling thousands of requests, redundant calls can account for 30-50% of total LLM spend.

Caching solves this by storing the results of expensive operations and returning the cached result when the same (or sufficiently similar) input appears again.

Layer 1: Exact Response Caching

The simplest cache matches inputs exactly. If the same prompt produces the same response, serve the cached version.

import hashlib
import json
import time
from typing import Optional, Dict, Any

class LLMResponseCache:
    def __init__(self, ttl_seconds: int = 3600, max_size: int = 1000):
        self._cache: Dict[str, Dict[str, Any]] = {}
        self.ttl = ttl_seconds
        self.max_size = max_size

    def _make_key(self, model: str, messages: list, **kwargs) -> str:
        """Create a deterministic cache key from the request parameters."""
        key_data = {
            "model": model,
            "messages": messages,
            "temperature": kwargs.get("temperature", 1.0),
            "max_tokens": kwargs.get("max_tokens"),
        }
        serialized = json.dumps(key_data, sort_keys=True)
        return hashlib.sha256(serialized.encode()).hexdigest()

    def get(self, model: str, messages: list, **kwargs) -> Optional[str]:
        key = self._make_key(model, messages, **kwargs)
        entry = self._cache.get(key)
        if entry is None:
            return None
        if time.time() - entry["timestamp"] > self.ttl:
            del self._cache[key]
            return None
        entry["hits"] += 1
        return entry["response"]

    def set(self, model: str, messages: list, response: str, **kwargs):
        if len(self._cache) >= self.max_size:
            self._evict_oldest()
        key = self._make_key(model, messages, **kwargs)
        self._cache[key] = {
            "response": response,
            "timestamp": time.time(),
            "hits": 0,
        }

    def _evict_oldest(self):
        oldest_key = min(self._cache, key=lambda k: self._cache[k]["timestamp"])
        del self._cache[oldest_key]

# Usage wrapper
cache = LLMResponseCache(ttl_seconds=1800)

async def cached_llm_call(client, model: str, messages: list, **kwargs) -> str:
    cached = cache.get(model, messages, **kwargs)
    if cached is not None:
        return cached

    response = await client.chat.completions.create(
        model=model, messages=messages, **kwargs
    )
    result = response.choices[0].message.content
    cache.set(model, messages, result, **kwargs)
    return result

Important: Only cache calls with temperature=0 or very low temperature. High-temperature calls are intentionally non-deterministic, and caching defeats the purpose.
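The cache key construction above relies on deterministic serialization: `json.dumps(..., sort_keys=True)` guarantees that the order in which keyword arguments are passed never changes the key, while any change to the actual request parameters does. A quick standalone check (mirroring the `_make_key` logic, with hypothetical argument values):

```python
import hashlib
import json

def make_key(model: str, messages: list, **kwargs) -> str:
    # Mirrors _make_key above: sort_keys makes serialization deterministic
    key_data = {
        "model": model,
        "messages": messages,
        "temperature": kwargs.get("temperature", 1.0),
        "max_tokens": kwargs.get("max_tokens"),
    }
    serialized = json.dumps(key_data, sort_keys=True)
    return hashlib.sha256(serialized.encode()).hexdigest()

msgs = [{"role": "user", "content": "What is the capital of France?"}]

# kwargs order must not affect the key...
k1 = make_key("gpt-4o", msgs, temperature=0, max_tokens=100)
k2 = make_key("gpt-4o", msgs, max_tokens=100, temperature=0)
assert k1 == k2

# ...but any change to the request must produce a different key
k3 = make_key("gpt-4o", msgs, temperature=0.7, max_tokens=100)
assert k1 != k3
```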

Layer 2: Semantic Cache

Exact matching misses opportunities. "What is the capital of France?" and "Tell me France's capital city" should return the same cached answer. A semantic cache uses embeddings to find similar past queries.

import numpy as np

class SemanticLLMCache:
    def __init__(self, similarity_threshold: float = 0.95, ttl_seconds: int = 3600):
        self.threshold = similarity_threshold
        self.ttl = ttl_seconds
        self.entries: list = []  # list of (embedding, response, timestamp)

    def _embed(self, text: str) -> list:
        """Generate embedding for the cache key text."""
        import openai
        if not hasattr(self, "_client"):
            self._client = openai.OpenAI()  # create the client once, then reuse it
        resp = self._client.embeddings.create(model="text-embedding-3-small", input=text)
        return resp.data[0].embedding

    def _similarity(self, a: list, b: list) -> float:
        a_arr, b_arr = np.array(a), np.array(b)
        return float(np.dot(a_arr, b_arr) / (np.linalg.norm(a_arr) * np.linalg.norm(b_arr)))

    def get(self, query: str) -> Optional[str]:
        query_emb = self._embed(query)
        now = time.time()
        best_match = None
        best_score = 0.0

        for emb, response, ts in self.entries:
            if now - ts > self.ttl:
                continue
            score = self._similarity(query_emb, emb)
            if score > best_score and score >= self.threshold:
                best_score = score
                best_match = response

        return best_match

    def set(self, query: str, response: str):
        emb = self._embed(query)
        self.entries.append((emb, response, time.time()))

Set the similarity threshold high (0.93-0.97) to avoid returning cached responses for genuinely different questions. Note that each lookup itself costs an embedding call — typically orders of magnitude cheaper than the LLM call it replaces, but worth tracking alongside your hit rate.
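The threshold compares cosine similarities, which measure the angle between embedding vectors. The geometry is easy to verify on toy vectors (real embeddings from text-embedding-3-small have 1,536 dimensions, but the math is identical):

```python
import numpy as np

def cosine_similarity(a: list, b: list) -> float:
    # Same computation as _similarity above
    a_arr, b_arr = np.array(a), np.array(b)
    return float(np.dot(a_arr, b_arr) / (np.linalg.norm(a_arr) * np.linalg.norm(b_arr)))

# Identical direction -> 1.0; orthogonal -> 0.0; 45 degrees apart -> ~0.707
assert cosine_similarity([1, 0], [1, 0]) == 1.0
assert cosine_similarity([1, 0], [0, 1]) == 0.0
assert round(cosine_similarity([1, 1], [1, 0]), 3) == 0.707

# A 0.95 threshold rejects even moderately similar vectors
threshold = 0.95
assert cosine_similarity([1, 1], [1, 0]) < threshold
```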


Layer 3: Tool Result Caching

Agents frequently call tools — web searches, API lookups, database queries — and many of these return the same results for identical inputs within a short time window.

import functools
# json, hashlib, and time are reused from the imports in Layer 1

def cached_tool(ttl_seconds: int = 300):
    """Decorator that caches tool results based on input arguments."""
    def decorator(func):
        _cache: Dict[str, Dict[str, Any]] = {}

        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            # Build cache key from function name + arguments
            key_data = {
                "func": func.__name__,
                "args": str(args),
                "kwargs": json.dumps(kwargs, sort_keys=True, default=str),
            }
            key = hashlib.sha256(
                json.dumps(key_data, sort_keys=True).encode()
            ).hexdigest()

            if key in _cache:
                entry = _cache[key]
                if time.time() - entry["ts"] < ttl_seconds:
                    return entry["result"]

            result = await func(*args, **kwargs)
            _cache[key] = {"result": result, "ts": time.time()}
            return result

        wrapper.clear_cache = lambda: _cache.clear()
        return wrapper
    return decorator

# Apply to agent tools
@cached_tool(ttl_seconds=600)
async def search_web(query: str) -> dict:
    """Search the web — results cached for 10 minutes."""
    # ... actual web search implementation
    pass

@cached_tool(ttl_seconds=60)
async def get_stock_price(symbol: str) -> float:
    """Fetch stock price — cached for 1 minute due to volatility."""
    # ... actual API call
    pass
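To see the decorator's effect, a trimmed, self-contained version of it with a counter proves that repeated calls within the TTL hit the underlying function only once (fake_search is a hypothetical stand-in for a real tool):

```python
import asyncio
import time

def cached_tool(ttl_seconds: int = 300):
    # Trimmed version of the decorator above, for a runnable demo
    def decorator(func):
        _cache = {}
        async def wrapper(*args):
            key = (func.__name__, args)
            entry = _cache.get(key)
            if entry and time.time() - entry["ts"] < ttl_seconds:
                return entry["result"]
            result = await func(*args)
            _cache[key] = {"result": result, "ts": time.time()}
            return result
        return wrapper
    return decorator

call_count = 0

@cached_tool(ttl_seconds=60)
async def fake_search(query: str) -> str:
    # Hypothetical stand-in for a real web search
    global call_count
    call_count += 1
    return f"results for {query}"

async def main():
    await fake_search("python caching")
    await fake_search("python caching")     # served from cache
    await fake_search("redis vs memcached")
    assert call_count == 2                  # only two real "searches" ran

asyncio.run(main())
```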

Layer 4: Embedding Cache

If your agent embeds the same texts repeatedly (for memory retrieval, deduplication checks, etc.), an embedding cache avoids redundant API calls.

class EmbeddingCache:
    def __init__(self, max_size: int = 10000):
        self._cache: Dict[str, list] = {}
        self.max_size = max_size
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, text: str, embed_fn) -> list:
        key = hashlib.md5(text.encode()).hexdigest()
        if key in self._cache:
            self.hits += 1
            return self._cache[key]

        self.misses += 1
        embedding = embed_fn(text)

        if len(self._cache) >= self.max_size:
            # Evict the oldest entry (Python dicts preserve insertion order)
            self._cache.pop(next(iter(self._cache)))

        self._cache[key] = embedding
        return embedding

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total > 0 else 0.0
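
To see the hit-rate accounting in action, here is a compact version of the class with a stubbed embedding function (fake_embed is a hypothetical stand-in for a real API call):

```python
import hashlib

class EmbeddingCache:
    # Compact version of the class above, for a runnable demo
    def __init__(self):
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, text, embed_fn):
        key = hashlib.md5(text.encode()).hexdigest()
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        self._cache[key] = embed_fn(text)
        return self._cache[key]

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

embed_calls = 0

def fake_embed(text):
    # Hypothetical stand-in for a real embedding API call
    global embed_calls
    embed_calls += 1
    return [float(len(text))]

cache = EmbeddingCache()
for text in ["hello", "world", "hello", "hello"]:
    cache.get_or_compute(text, fake_embed)

assert embed_calls == 2        # "hello" embedded once, "world" once
assert cache.hit_rate == 0.5   # 2 hits out of 4 lookups
```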

Cache Invalidation Strategies

Caching stale data causes agents to act on outdated information. Use these strategies:

  1. TTL (Time-To-Live): Set expiry times appropriate to data volatility. Stock prices: 1 minute. Company info: 1 hour. Geographic facts: 24 hours.
  2. Event-Based: Invalidate specific cache entries when you know the underlying data changed.
  3. Version Keys: Include a version number in the cache key. Increment it when you deploy new tools or update prompts.
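The version-key strategy can be sketched in a few lines: fold a version counter into the cache key, and every old entry silently misses after a bump. PROMPT_VERSION is a hypothetical counter you would increment on each prompt or tool deployment:

```python
import hashlib
import json

PROMPT_VERSION = 3  # hypothetical counter, bumped on each prompt/tool change

def make_versioned_key(model: str, messages: list) -> str:
    # Folding the version into the key invalidates all prior entries on bump
    key_data = {"v": PROMPT_VERSION, "model": model, "messages": messages}
    return hashlib.sha256(json.dumps(key_data, sort_keys=True).encode()).hexdigest()

msgs = [{"role": "user", "content": "Summarize this ticket"}]
old_key = make_versioned_key("gpt-4o", msgs)

PROMPT_VERSION = 4  # deploy a new prompt version
new_key = make_versioned_key("gpt-4o", msgs)

assert old_key != new_key  # stale entries are never served again
```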

FAQ

Does caching LLM responses risk serving outdated information?

Yes, if the underlying data changes frequently. Use short TTLs for dynamic content and longer TTLs for stable knowledge. Never cache responses that depend on real-time data (stock prices, weather) with long TTLs.

How much can caching actually save on LLM costs?

It depends on the repetition in your workload. Customer support agents handling common questions can see 40-60% cache hit rates, reducing costs proportionally. Research agents with unique queries might only see 5-10% hit rates. Monitor your cache hit rate and adjust TTLs accordingly.

Should I use Redis or an in-memory cache?

For single-process agents, in-memory caches (like the examples above) are the fastest option. For multi-process or distributed agents, use Redis — it provides shared caching across instances, persistence across restarts, and built-in TTL support with minimal overhead.
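Swapping the backend is mostly a matter of wrapping a Redis-style client. The sketch below assumes any client exposing get and setex (redis-py provides both); the FakeRedis stub is a stand-in so the example runs without a server — in production you would pass a real redis.Redis() client instead:

```python
import json
import time

class SharedResponseCache:
    """TTL cache over any client exposing Redis-style get/setex.

    In production, pass redis.Redis(host=..., port=...) as the client.
    """
    def __init__(self, client, ttl_seconds: int = 1800):
        self.client = client
        self.ttl = ttl_seconds

    def get(self, key: str):
        raw = self.client.get(key)
        return json.loads(raw) if raw is not None else None

    def set(self, key: str, value):
        # SETEX stores the value with a server-side TTL
        self.client.setex(key, self.ttl, json.dumps(value))

class FakeRedis:
    # Minimal in-memory stand-in so the example runs without a server
    def __init__(self):
        self._data = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry is None or time.time() > entry[1]:
            return None
        return entry[0]

    def setex(self, key, ttl, value):
        self._data[key] = (value, time.time() + ttl)

cache = SharedResponseCache(FakeRedis(), ttl_seconds=60)
cache.set("prompt-hash-abc", {"response": "Paris"})
assert cache.get("prompt-hash-abc") == {"response": "Paris"}
assert cache.get("missing-key") is None
```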


#Caching #Performance #LLMOptimization #CostReduction #AgenticAI #LearnAI #AIEngineering

CallSphere Team

Expert insights on AI voice agents and customer communication automation.

Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.