Prompt Caching Strategies: Reducing Latency and Cost with Cached Prefixes
Learn how to leverage prompt caching features from OpenAI and Anthropic to dramatically reduce latency and cost by reusing cached prompt prefixes across requests.
The Hidden Cost of Repeated Prefixes
In production LLM applications, the same text gets sent to the model thousands of times per day. Your system prompt, few-shot examples, tool definitions, and retrieval context templates are largely identical across requests. Every time you send this prefix, the model processes it from scratch — computing attention over the same tokens it processed moments ago.
Prompt caching eliminates this redundancy. Both OpenAI and Anthropic now offer server-side caching where the model stores the computed key-value (KV) cache for prompt prefixes. When a subsequent request shares the same prefix, the model skips recomputation and starts generating immediately.
The impact is substantial: OpenAI's prompt caching gives a 50 percent discount on cached input tokens and up to 80 percent latency reduction on long prompts. Anthropic's caching charges a 25 percent premium to write a prefix into the cache on the first request, then discounts cached reads by 90 percent.
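To make the economics concrete, here is a back-of-the-envelope calculation for a 2,000-token prefix reused across 10,000 requests. The base price is an illustrative placeholder, not a current rate-card value:

```python
# Savings estimate for a cached 2,000-token prefix under OpenAI-style
# pricing (cached tokens billed at 50% of the base input price).
PRICE_PER_TOKEN = 2.50 / 1_000_000  # assumed base input price ($/token)

def prefix_cost(prefix_tokens: int, requests: int, cached: bool) -> float:
    """Cost of sending the prefix; cached reads are billed at half price."""
    rate = 0.5 if cached else 1.0
    return prefix_tokens * requests * PRICE_PER_TOKEN * rate

uncached = prefix_cost(2_000, 10_000, cached=False)
# First request misses the cache; the remaining 9,999 hit it.
with_cache = prefix_cost(2_000, 1, cached=False) + prefix_cost(2_000, 9_999, cached=True)
print(f"uncached: ${uncached:.2f}, cached: ${with_cache:.2f}")
# → uncached: $50.00, cached: $25.00
```

At this scale the prefix cost roughly halves; the absolute numbers change with your provider's actual prices, but the 2x ratio on the cached portion holds.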
How OpenAI Prompt Caching Works
OpenAI's prompt caching is automatic for supported models. When a request shares a prefix of at least 1024 tokens with a recent request, the cached portion is served at half price:
import openai
client = openai.OpenAI()
# This long system prompt will be cached after the first request
SYSTEM_PROMPT = """You are a financial analysis assistant for Acme Corp.
## Company Context
Acme Corp is a mid-cap technology company with the following key metrics:
- Revenue: $2.4B (FY 2025)
- Operating margin: 18.3%
- Employee count: 12,400
...
(imagine 2000+ tokens of company context, policies, and instructions)
## Analysis Guidelines
1. Always cite specific numbers from the provided data
2. Compare metrics to industry benchmarks
3. Flag any year-over-year changes exceeding 15%
4. Present findings in order of business impact
"""
def analyze_financial_data(user_query: str) -> dict:
    """Query with cached system prompt prefix."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_query},
        ],
    )
    usage = response.usage
    # prompt_tokens_details is an object, not a dict, so read the
    # attribute defensively in case the SDK omits it.
    details = getattr(usage, "prompt_tokens_details", None)
    return {
        "answer": response.choices[0].message.content,
        "cached_tokens": getattr(details, "cached_tokens", 0) or 0,
        "total_prompt_tokens": usage.prompt_tokens,
    }
# First call: full processing (no cache)
result1 = analyze_financial_data("What is the revenue trend?")
print(f"Cached: {result1['cached_tokens']} / {result1['total_prompt_tokens']}")
# Subsequent calls: prefix is cached
result2 = analyze_financial_data("Analyze operating margins.")
print(f"Cached: {result2['cached_tokens']} / {result2['total_prompt_tokens']}")
Designing Cache-Friendly Prompts
The critical insight is that caching works on prefixes — the matching starts from the first token. Any change at the beginning of the prompt invalidates the entire cache. This means you should structure your prompts with static content first and dynamic content last:
def build_cache_friendly_prompt(
    static_instructions: str,
    static_examples: list[str],
    dynamic_context: str,
    user_query: str,
) -> list[dict]:
    """Structure prompt for maximum cache reuse."""
    # Static prefix — identical across requests, cached
    system_content = (
        f"{static_instructions}\n\n"
        "## Examples\n\n"
        + "\n\n".join(static_examples)
    )
    # Dynamic content — changes per request, not cached
    user_content = (
        f"## Context\n\n{dynamic_context}\n\n"
        f"## Question\n\n{user_query}"
    )
    return [
        {"role": "system", "content": system_content},
        {"role": "user", "content": user_content},
    ]

# Anti-pattern: dynamic content in system prompt breaks cache
def bad_prompt_design(user_id: str, query: str) -> list[dict]:
    """This breaks caching because user_id changes per request."""
    return [
        {"role": "system", "content": f"User ID: {user_id}\n{SYSTEM_PROMPT}"},
        {"role": "user", "content": query},
    ]

# Better: move dynamic content after the static prefix
def good_prompt_design(user_id: str, query: str) -> list[dict]:
    """Static prefix stays cacheable, dynamic content is appended."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"[User: {user_id}] {query}"},
    ]
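A quick sanity check for cache-friendliness is that the serialized prefix is byte-identical across requests. This self-contained sketch (with a placeholder SYSTEM_PROMPT and the two designs above inlined) compares the system message produced for different users:

```python
# Verify the property prompt caching relies on: the static prefix must be
# byte-identical across requests. SYSTEM_PROMPT is a placeholder here.
SYSTEM_PROMPT = "You are a financial analysis assistant for Acme Corp."

def bad_prompt_design(user_id: str, query: str) -> list[dict]:
    # Dynamic user_id at the front invalidates the whole prefix.
    return [
        {"role": "system", "content": f"User ID: {user_id}\n{SYSTEM_PROMPT}"},
        {"role": "user", "content": query},
    ]

def good_prompt_design(user_id: str, query: str) -> list[dict]:
    # Static system prompt first; dynamic content appended afterward.
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"[User: {user_id}] {query}"},
    ]

bad_a = bad_prompt_design("u1", "q")[0]["content"]
bad_b = bad_prompt_design("u2", "q")[0]["content"]
good_a = good_prompt_design("u1", "q")[0]["content"]
good_b = good_prompt_design("u2", "q")[0]["content"]

print(bad_a == bad_b)    # False: prefix differs per user, cache misses
print(good_a == good_b)  # True: prefix identical, cache can hit
```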
Anthropic's Explicit Cache Control
Anthropic takes a different approach with explicit cache breakpoints. You mark exactly where in the prompt the cache should apply:
import anthropic

anthropic_client = anthropic.Anthropic()

def cached_anthropic_query(
    static_context: str,
    user_query: str,
) -> dict:
    """Use Anthropic's explicit cache control."""
    response = anthropic_client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": static_context,
                "cache_control": {"type": "ephemeral"},
            },
        ],
        messages=[
            {"role": "user", "content": user_query},
        ],
    )
    return {
        "answer": response.content[0].text,
        "input_tokens": response.usage.input_tokens,
        "cache_read_tokens": getattr(
            response.usage, "cache_read_input_tokens", 0
        ),
        "cache_write_tokens": getattr(
            response.usage, "cache_creation_input_tokens", 0
        ),
    }
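Anthropic also supports multiple cache breakpoints (up to four per request), which lets you cache tool definitions and system instructions as separate segments. The sketch below only builds the request payload rather than sending it, so you can inspect where the breakpoints land; the tool schema is a hypothetical example:

```python
# Build a Messages API payload with cache breakpoints on both the tool
# definitions and the system prompt. The tool schema is illustrative.
def build_cached_request(system_text: str, user_query: str) -> dict:
    tools = [
        {
            "name": "get_financials",  # hypothetical tool
            "description": "Fetch financial metrics for a company.",
            "input_schema": {
                "type": "object",
                "properties": {"ticker": {"type": "string"}},
                "required": ["ticker"],
            },
            # Breakpoint 1: caches everything up to and including the tools
            "cache_control": {"type": "ephemeral"},
        },
    ]
    system = [
        {
            "type": "text",
            "text": system_text,
            # Breakpoint 2: caches tools + system together
            "cache_control": {"type": "ephemeral"},
        },
    ]
    return {
        "model": "claude-sonnet-4-20250514",
        "max_tokens": 1024,
        "tools": tools,
        "system": system,
        "messages": [{"role": "user", "content": user_query}],
    }

payload = build_cached_request("You are a financial analyst.", "Revenue trend?")
```

Because the first breakpoint covers the tools segment alone, a request that reuses the tools but changes the system text can still get a partial cache hit.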
Measuring Cache Effectiveness
Track your cache hit rate to validate that your prompt design is actually benefiting from caching:
class CacheMetrics:
    """Track prompt caching effectiveness over time."""

    def __init__(self):
        self.total_requests = 0
        self.total_prompt_tokens = 0
        self.total_cached_tokens = 0

    def record(self, prompt_tokens: int, cached_tokens: int):
        self.total_requests += 1
        self.total_prompt_tokens += prompt_tokens
        self.total_cached_tokens += cached_tokens

    @property
    def cache_hit_rate(self) -> float:
        if self.total_prompt_tokens == 0:
            return 0.0
        return self.total_cached_tokens / self.total_prompt_tokens

    @property
    def estimated_savings(self) -> float:
        """Estimated cost savings from caching (50% on cached tokens)."""
        return self.total_cached_tokens * 0.5

    def report(self) -> dict:
        return {
            "total_requests": self.total_requests,
            "cache_hit_rate": f"{self.cache_hit_rate:.1%}",
            "total_tokens_cached": self.total_cached_tokens,
            "estimated_token_savings": self.estimated_savings,
        }
A well-designed caching strategy achieves 60 to 80 percent cache hit rates on the prompt prefix. If your hit rate is below 40 percent, audit your prompt construction to find dynamic content that is breaking the prefix match.
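As a concrete check of those thresholds, suppose a 2,000-token static prefix plus roughly 500 dynamic tokens per request, with the first request writing the cache and the rest reading it (the numbers are illustrative):

```python
# Token-level hit rate for a 2,000-token cached prefix over 100 requests,
# where only the first request misses the cache (illustrative numbers).
PREFIX, DYNAMIC, REQUESTS = 2_000, 500, 100

total_prompt_tokens = (PREFIX + DYNAMIC) * REQUESTS
total_cached_tokens = PREFIX * (REQUESTS - 1)  # first request is a miss

hit_rate = total_cached_tokens / total_prompt_tokens
print(f"{hit_rate:.1%}")  # → 79.2%
```

A prefix that dominates the prompt, reused at steady volume, lands near the top of the 60 to 80 percent range; shrinking the prefix or inflating the dynamic portion pulls the rate down even when every request hits the cache.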
FAQ
How long do cached prefixes persist?
OpenAI caches typically persist for 5 to 10 minutes of inactivity. Anthropic's ephemeral caches have a roughly 5-minute lifetime that refreshes each time the cached prefix is read. Neither provider guarantees cache persistence — your application should work correctly whether the cache hits or misses. Design for caching but do not depend on it for correctness.
What is the minimum prefix length for caching?
OpenAI requires at least 1024 tokens in the matching prefix. Anthropic requires at least 1024 tokens of content up to the cache breakpoint for most models (some smaller models require 2048). Short system prompts do not benefit from caching. If your system prompt is under 1024 tokens, consider prepending static context like tool definitions or few-shot examples to reach the threshold.
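Before relying on caching, it is worth a rough check that your static prefix clears the threshold. This sketch uses the common ~4-characters-per-token heuristic rather than a real tokenizer, so treat the result as an estimate:

```python
# Rough check of whether a prefix reaches the 1024-token cache minimum.
# Uses the ~4 chars/token rule of thumb; a real tokenizer (e.g. tiktoken)
# would give an exact count.
def estimated_tokens(text: str) -> int:
    return len(text) // 4

def likely_cacheable(prefix: str, minimum: int = 1024) -> bool:
    return estimated_tokens(prefix) >= minimum

short_prompt = "You are a helpful assistant."
long_prompt = "x" * 8_000  # ~2,000 estimated tokens

print(likely_cacheable(short_prompt))  # False
print(likely_cacheable(long_prompt))   # True
```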
Can I cache tool definitions and function schemas?
Yes, and this is one of the highest-value caching targets. Tool schemas are identical across requests and can be very long — 20 tools with detailed schemas easily exceed 2000 tokens. Place tool definitions in the system prompt before any dynamic content to maximize cache reuse.
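With OpenAI, tool schemas count toward the cached prefix, so the practical rule is to keep the tools list a stable module-level constant rather than rebuilding or reordering it per request. A minimal sketch, with a hypothetical tool schema, that builds request kwargs without sending them:

```python
# Keep tool definitions as a stable constant so every request sends a
# byte-identical schema list (rebuilding or reordering breaks the prefix).
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_financials",  # hypothetical tool
            "description": "Fetch financial metrics for a company.",
            "parameters": {
                "type": "object",
                "properties": {"ticker": {"type": "string"}},
                "required": ["ticker"],
            },
        },
    },
]

def build_request(user_query: str) -> dict:
    """Return kwargs for chat.completions.create with a stable prefix."""
    return {
        "model": "gpt-4o",
        "tools": TOOLS,  # identical schema list on every call
        "messages": [
            {"role": "system", "content": "You are a financial analyst."},
            {"role": "user", "content": user_query},  # dynamic part last
        ],
    }

req_a = build_request("Revenue trend?")
req_b = build_request("Margin analysis?")
```

Everything up to the user message is identical between the two requests, so the tools and system prompt can be served from cache.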
CallSphere Team
Expert insights on AI voice agents and customer communication automation.