Prompt Caching Strategies: Reducing Latency and Cost with Cached Prefixes
Learn how to leverage prompt caching features from OpenAI and Anthropic to dramatically reduce latency and cost by reusing cached prompt prefixes across requests.
The Hidden Cost of Repeated Prefixes
In production LLM applications, the same text gets sent to the model thousands of times per day. Your system prompt, few-shot examples, tool definitions, and retrieval context templates are largely identical across requests. Every time you send this prefix, the model processes it from scratch — computing attention over the same tokens it processed moments ago.
Prompt caching eliminates this redundancy. Both OpenAI and Anthropic now offer server-side caching where the model stores the computed key-value (KV) cache for prompt prefixes. When a subsequent request shares the same prefix, the model skips recomputation and starts generating immediately.
The impact is substantial: OpenAI's prompt caching gives a 50 percent discount on cached input tokens and up to 80 percent latency reduction on long prompts. Anthropic's caching charges a 25 percent premium to write a prefix into the cache on the first request, then discounts cached reads by 90 percent.
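To make the economics concrete, here is a back-of-the-envelope calculation for a 2,000-token prefix reused across 10,000 requests. The base price is an illustrative placeholder, not a current rate-card value:

```python
# Savings estimate for a cached 2,000-token prefix under OpenAI-style
# pricing (cached tokens billed at 50% of the base input price).
PRICE_PER_TOKEN = 2.50 / 1_000_000  # assumed base input price ($/token)

def prefix_cost(prefix_tokens: int, requests: int, cached: bool) -> float:
    """Cost of sending the prefix; cached reads are billed at half price."""
    rate = 0.5 if cached else 1.0
    return prefix_tokens * requests * PRICE_PER_TOKEN * rate

uncached = prefix_cost(2_000, 10_000, cached=False)
# First request misses the cache; the remaining 9,999 hit it.
with_cache = prefix_cost(2_000, 1, cached=False) + prefix_cost(2_000, 9_999, cached=True)
print(f"uncached: ${uncached:.2f}, cached: ${with_cache:.2f}")
# → uncached: $50.00, cached: $25.00
```

At this scale the prefix cost roughly halves; the absolute numbers change with your provider's actual prices, but the 2x ratio on the cached portion holds.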
How OpenAI Prompt Caching Works
OpenAI's prompt caching is automatic for supported models. When a request shares a prefix of at least 1024 tokens with a recent request, the cached portion is served at half price:
import openai
client = openai.OpenAI()
# This long system prompt will be cached after the first request
SYSTEM_PROMPT = """You are a financial analysis assistant for Acme Corp.
## Company Context
Acme Corp is a mid-cap technology company with the following key metrics:
- Revenue: $2.4B (FY 2025)
- Operating margin: 18.3%
- Employee count: 12,400
...
(imagine 2000+ tokens of company context, policies, and instructions)
## Analysis Guidelines
1. Always cite specific numbers from the provided data
2. Compare metrics to industry benchmarks
3. Flag any year-over-year changes exceeding 15%
4. Present findings in order of business impact
"""
def analyze_financial_data(user_query: str) -> dict:
    """Query with cached system prompt prefix."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_query},
        ],
    )
    usage = response.usage
    # prompt_tokens_details is an object, not a dict, so read the
    # attribute defensively in case the SDK omits it.
    details = getattr(usage, "prompt_tokens_details", None)
    return {
        "answer": response.choices[0].message.content,
        "cached_tokens": getattr(details, "cached_tokens", 0) or 0,
        "total_prompt_tokens": usage.prompt_tokens,
    }
# First call: full processing (no cache)
result1 = analyze_financial_data("What is the revenue trend?")
print(f"Cached: {result1['cached_tokens']} / {result1['total_prompt_tokens']}")
# Subsequent calls: prefix is cached
result2 = analyze_financial_data("Analyze operating margins.")
print(f"Cached: {result2['cached_tokens']} / {result2['total_prompt_tokens']}")
Designing Cache-Friendly Prompts
The critical insight is that caching works on prefixes — the matching starts from the first token. Any change at the beginning of the prompt invalidates the entire cache. This means you should structure your prompts with static content first and dynamic content last:
def build_cache_friendly_prompt(
    static_instructions: str,
    static_examples: list[str],
    dynamic_context: str,
    user_query: str,
) -> list[dict]:
    """Structure prompt for maximum cache reuse."""
    # Static prefix — identical across requests, cached
    system_content = (
        f"{static_instructions}\n\n"
        "## Examples\n\n"
        + "\n\n".join(static_examples)
    )
    # Dynamic content — changes per request, not cached
    user_content = (
        f"## Context\n\n{dynamic_context}\n\n"
        f"## Question\n\n{user_query}"
    )
    return [
        {"role": "system", "content": system_content},
        {"role": "user", "content": user_content},
    ]

# Anti-pattern: dynamic content in system prompt breaks cache
def bad_prompt_design(user_id: str, query: str) -> list[dict]:
    """This breaks caching because user_id changes per request."""
    return [
        {"role": "system", "content": f"User ID: {user_id}\n{SYSTEM_PROMPT}"},
        {"role": "user", "content": query},
    ]

# Better: move dynamic content after the static prefix
def good_prompt_design(user_id: str, query: str) -> list[dict]:
    """Static prefix stays cacheable, dynamic content is appended."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"[User: {user_id}] {query}"},
    ]
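A quick sanity check for cache-friendliness is that the serialized prefix is byte-identical across requests. This self-contained sketch (with a placeholder SYSTEM_PROMPT and the two designs above inlined) compares the system message produced for different users:

```python
# Verify the property prompt caching relies on: the static prefix must be
# byte-identical across requests. SYSTEM_PROMPT is a placeholder here.
SYSTEM_PROMPT = "You are a financial analysis assistant for Acme Corp."

def bad_prompt_design(user_id: str, query: str) -> list[dict]:
    # Dynamic user_id at the front invalidates the whole prefix.
    return [
        {"role": "system", "content": f"User ID: {user_id}\n{SYSTEM_PROMPT}"},
        {"role": "user", "content": query},
    ]

def good_prompt_design(user_id: str, query: str) -> list[dict]:
    # Static system prompt first; dynamic content appended afterward.
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"[User: {user_id}] {query}"},
    ]

bad_a = bad_prompt_design("u1", "q")[0]["content"]
bad_b = bad_prompt_design("u2", "q")[0]["content"]
good_a = good_prompt_design("u1", "q")[0]["content"]
good_b = good_prompt_design("u2", "q")[0]["content"]

print(bad_a == bad_b)    # False: prefix differs per user, cache misses
print(good_a == good_b)  # True: prefix identical, cache can hit
```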
Anthropic's Explicit Cache Control
Anthropic takes a different approach with explicit cache breakpoints. You mark exactly where in the prompt the cache should apply:
import anthropic

anthropic_client = anthropic.Anthropic()

def cached_anthropic_query(
    static_context: str,
    user_query: str,
) -> dict:
    """Use Anthropic's explicit cache control."""
    response = anthropic_client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": static_context,
                "cache_control": {"type": "ephemeral"},
            },
        ],
        messages=[
            {"role": "user", "content": user_query},
        ],
    )
    return {
        "answer": response.content[0].text,
        "input_tokens": response.usage.input_tokens,
        "cache_read_tokens": getattr(
            response.usage, "cache_read_input_tokens", 0
        ),
        "cache_write_tokens": getattr(
            response.usage, "cache_creation_input_tokens", 0
        ),
    }
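Anthropic also supports multiple cache breakpoints (up to four per request), which lets you cache tool definitions and system instructions as separate segments. The sketch below only builds the request payload rather than sending it, so you can inspect where the breakpoints land; the tool schema is a hypothetical example:

```python
# Build a Messages API payload with cache breakpoints on both the tool
# definitions and the system prompt. The tool schema is illustrative.
def build_cached_request(system_text: str, user_query: str) -> dict:
    tools = [
        {
            "name": "get_financials",  # hypothetical tool
            "description": "Fetch financial metrics for a company.",
            "input_schema": {
                "type": "object",
                "properties": {"ticker": {"type": "string"}},
                "required": ["ticker"],
            },
            # Breakpoint 1: caches everything up to and including the tools
            "cache_control": {"type": "ephemeral"},
        },
    ]
    system = [
        {
            "type": "text",
            "text": system_text,
            # Breakpoint 2: caches tools + system together
            "cache_control": {"type": "ephemeral"},
        },
    ]
    return {
        "model": "claude-sonnet-4-20250514",
        "max_tokens": 1024,
        "tools": tools,
        "system": system,
        "messages": [{"role": "user", "content": user_query}],
    }

payload = build_cached_request("You are a financial analyst.", "Revenue trend?")
```

Because the first breakpoint covers the tools segment alone, a request that reuses the tools but changes the system text can still get a partial cache hit.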
Measuring Cache Effectiveness
Track your cache hit rate to validate that your prompt design is actually benefiting from caching:
class CacheMetrics:
    """Track prompt caching effectiveness over time."""

    def __init__(self):
        self.total_requests = 0
        self.total_prompt_tokens = 0
        self.total_cached_tokens = 0

    def record(self, prompt_tokens: int, cached_tokens: int):
        self.total_requests += 1
        self.total_prompt_tokens += prompt_tokens
        self.total_cached_tokens += cached_tokens

    @property
    def cache_hit_rate(self) -> float:
        if self.total_prompt_tokens == 0:
            return 0.0
        return self.total_cached_tokens / self.total_prompt_tokens

    @property
    def estimated_savings(self) -> float:
        """Estimated cost savings from caching (50% on cached tokens)."""
        return self.total_cached_tokens * 0.5

    def report(self) -> dict:
        return {
            "total_requests": self.total_requests,
            "cache_hit_rate": f"{self.cache_hit_rate:.1%}",
            "total_tokens_cached": self.total_cached_tokens,
            "estimated_token_savings": self.estimated_savings,
        }
A well-designed caching strategy achieves 60 to 80 percent cache hit rates on the prompt prefix. If your hit rate is below 40 percent, audit your prompt construction to find dynamic content that is breaking the prefix match.
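As a concrete check of those thresholds, suppose a 2,000-token static prefix plus roughly 500 dynamic tokens per request, with the first request writing the cache and the rest reading it (the numbers are illustrative):

```python
# Token-level hit rate for a 2,000-token cached prefix over 100 requests,
# where only the first request misses the cache (illustrative numbers).
PREFIX, DYNAMIC, REQUESTS = 2_000, 500, 100

total_prompt_tokens = (PREFIX + DYNAMIC) * REQUESTS
total_cached_tokens = PREFIX * (REQUESTS - 1)  # first request is a miss

hit_rate = total_cached_tokens / total_prompt_tokens
print(f"{hit_rate:.1%}")  # → 79.2%
```

A prefix that dominates the prompt, reused at steady volume, lands near the top of the 60 to 80 percent range; shrinking the prefix or inflating the dynamic portion pulls the rate down even when every request hits the cache.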
FAQ
How long do cached prefixes persist?
OpenAI caches typically persist for 5 to 10 minutes of inactivity. Anthropic's ephemeral caches have a roughly 5-minute lifetime that refreshes each time the cached prefix is read. Neither provider guarantees cache persistence — your application should work correctly whether the cache hits or misses. Design for caching but do not depend on it for correctness.
What is the minimum prefix length for caching?
OpenAI requires at least 1024 tokens in the matching prefix. Anthropic requires at least 1024 tokens of content up to the cache breakpoint for most models (some smaller models require 2048). Short system prompts do not benefit from caching. If your system prompt is under 1024 tokens, consider prepending static context like tool definitions or few-shot examples to reach the threshold.
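Before relying on caching, it is worth a rough check that your static prefix clears the threshold. This sketch uses the common ~4-characters-per-token heuristic rather than a real tokenizer, so treat the result as an estimate:

```python
# Rough check of whether a prefix reaches the 1024-token cache minimum.
# Uses the ~4 chars/token rule of thumb; a real tokenizer (e.g. tiktoken)
# would give an exact count.
def estimated_tokens(text: str) -> int:
    return len(text) // 4

def likely_cacheable(prefix: str, minimum: int = 1024) -> bool:
    return estimated_tokens(prefix) >= minimum

short_prompt = "You are a helpful assistant."
long_prompt = "x" * 8_000  # ~2,000 estimated tokens

print(likely_cacheable(short_prompt))  # False
print(likely_cacheable(long_prompt))   # True
```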
Can I cache tool definitions and function schemas?
Yes, and this is one of the highest-value caching targets. Tool schemas are identical across requests and can be very long — 20 tools with detailed schemas easily exceed 2000 tokens. Place tool definitions in the system prompt before any dynamic content to maximize cache reuse.
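With OpenAI, tool schemas count toward the cached prefix, so the practical rule is to keep the tools list a stable module-level constant rather than rebuilding or reordering it per request. A minimal sketch, with a hypothetical tool schema, that builds request kwargs without sending them:

```python
# Keep tool definitions as a stable constant so every request sends a
# byte-identical schema list (rebuilding or reordering breaks the prefix).
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_financials",  # hypothetical tool
            "description": "Fetch financial metrics for a company.",
            "parameters": {
                "type": "object",
                "properties": {"ticker": {"type": "string"}},
                "required": ["ticker"],
            },
        },
    },
]

def build_request(user_query: str) -> dict:
    """Return kwargs for chat.completions.create with a stable prefix."""
    return {
        "model": "gpt-4o",
        "tools": TOOLS,  # identical schema list on every call
        "messages": [
            {"role": "system", "content": "You are a financial analyst."},
            {"role": "user", "content": user_query},  # dynamic part last
        ],
    }

req_a = build_request("Revenue trend?")
req_b = build_request("Margin analysis?")
```

Everything up to the user message is identical between the two requests, so the tools and system prompt can be served from cache.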
CallSphere Team
Expert insights on AI voice agents and customer communication automation.