Prompt Caching in Claude: How It Cuts Latency and Cost by 90%
Technical deep dive into Claude's prompt caching feature. Learn how it works, when to use it, implementation patterns for both Python and TypeScript, and real-world cost savings analysis.
The Problem Prompt Caching Solves
Every Claude API call processes input tokens from scratch. If your system prompt is 3,000 tokens and you make 1,000 calls per day, you are paying to process the same 3,000 tokens 1,000 times. That is 3 million redundant tokens per day.
Prompt caching eliminates this waste. It tells the API to cache specific portions of the input so that subsequent requests can skip processing those tokens. The result: up to 90% reduction in input token costs and significant latency improvements on cached portions.
How Prompt Caching Works
When you mark a section of your prompt with a cache control breakpoint, the API:
- First request: Processes all tokens normally and caches the marked section. You pay a small write premium (25% more than base input price for the cached tokens).
- Subsequent requests: Reads the cached tokens instead of reprocessing them. Cached reads cost 90% less than regular input tokens.
- Cache expiration: Cached content has a TTL of 5 minutes. Each cache hit resets the TTL. If no requests hit the cache for 5 minutes, it expires.
Pricing Breakdown (Claude Sonnet)
| Token Type | Price per Million Tokens |
|---|---|
| Regular input | $3.00 |
| Cache write (first time) | $3.75 |
| Cache read (subsequent) | $0.30 |
| Output | $15.00 |
The math works out quickly: a single cache hit already recoups the write premium ($3.75 + $0.30 = $4.05 for a write plus one read, versus $6.00 for two uncached requests). After 10 hits you have cut input costs by roughly 80%, and savings approach 90% as the hit count grows.
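The break-even point can be sanity-checked with a few lines of arithmetic; the sketch below hard-codes the Sonnet prices from the table above:

```python
# Rough cost model for Claude Sonnet prompt caching (prices per million tokens).
BASE_INPUT = 3.00   # regular input
CACHE_WRITE = 3.75  # first request: base price + 25% write premium
CACHE_READ = 0.30   # subsequent requests: 90% discount

def cost_with_caching(tokens_millions: float, hits: int) -> float:
    """One cache write followed by `hits` cache reads."""
    return tokens_millions * (CACHE_WRITE + hits * CACHE_READ)

def cost_without_caching(tokens_millions: float, hits: int) -> float:
    """The same (hits + 1) requests, all paying the regular input rate."""
    return tokens_millions * BASE_INPUT * (hits + 1)

# Caching is already cheaper after one hit: 3.75 + 0.30 = 4.05 vs 2 * 3.00 = 6.00
for hits in (1, 10, 100):
    cached = cost_with_caching(1.0, hits)
    uncached = cost_without_caching(1.0, hits)
    print(f"{hits:>3} hits: ${cached:.2f} vs ${uncached:.2f} "
          f"({1 - cached / uncached:.0%} saved)")
```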
Implementation in Python
```python
from anthropic import Anthropic

client = Anthropic()

# Define a large system prompt that should be cached
SYSTEM_PROMPT = """You are an expert financial analyst. You have access to the
following regulatory framework that governs all of your responses...

[Imagine 2,000+ tokens of regulatory guidelines, company policies,
formatting requirements, and domain-specific instructions here]
"""

# Large reference document to include in context
REFERENCE_DOC = """
[Imagine a 50-page financial report, approximately 15,000 tokens]
"""

response = client.messages.create(
    model="claude-sonnet-4-5-20250514",
    max_tokens=4096,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # Cache this
        }
    ],
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": REFERENCE_DOC,
                    "cache_control": {"type": "ephemeral"},  # Cache this too
                },
                {
                    "type": "text",
                    "text": "Summarize the key risk factors in this report.",
                },
            ],
        }
    ],
)

# Check cache performance in the response
print(f"Cache creation tokens: {response.usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {response.usage.cache_read_input_tokens}")
print(f"Regular input tokens: {response.usage.input_tokens}")
```
Implementation in TypeScript
```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// Placeholders for the same large, stable strings as in the Python example
const SYSTEM_PROMPT = "[2,000+ tokens of analyst instructions...]";
const REFERENCE_DOC = "[~15,000 tokens of financial report...]";

const response = await client.messages.create({
  model: "claude-sonnet-4-5-20250514",
  max_tokens: 4096,
  system: [
    {
      type: "text",
      text: SYSTEM_PROMPT,
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [
    {
      role: "user",
      content: [
        {
          type: "text",
          text: REFERENCE_DOC,
          cache_control: { type: "ephemeral" },
        },
        {
          type: "text",
          text: "Summarize the key risk factors.",
        },
      ],
    },
  ],
});

console.log("Cache write:", response.usage.cache_creation_input_tokens);
console.log("Cache read:", response.usage.cache_read_input_tokens);
```
What Can Be Cached
You can place cache breakpoints on:
- System prompts -- The most common and highest-ROI caching target
- User messages -- Large documents, reference materials, conversation history
- Tool definitions -- If you have many tools that do not change between calls
- Images -- Base64-encoded images in the prompt
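For tool caching specifically, the breakpoint goes on the last tool in the `tools` array, which caches the whole array as a prefix. A minimal sketch (the tool names and schemas are invented for illustration):

```python
# Sketch: caching tool definitions by marking the LAST tool in the list.
# The tool names and schemas here are illustrative, not from a real project.
tools = [
    {
        "name": "get_stock_price",
        "description": "Look up the latest closing price for a ticker symbol.",
        "input_schema": {
            "type": "object",
            "properties": {"ticker": {"type": "string"}},
            "required": ["ticker"],
        },
    },
    {
        "name": "get_filing",
        "description": "Fetch a company's most recent regulatory filing.",
        "input_schema": {
            "type": "object",
            "properties": {"ticker": {"type": "string"}},
            "required": ["ticker"],
        },
        # Marking the last tool caches the entire tools array as a prefix.
        "cache_control": {"type": "ephemeral"},
    },
]

# This list would be passed as `tools=tools` to client.messages.create(...).
print("cached tools:", [t["name"] for t in tools])
```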
Minimum Cache Size
The content being cached must meet a minimum token threshold:
| Model | Minimum Tokens for Caching |
|---|---|
| Claude Opus | 1,024 tokens |
| Claude Sonnet | 1,024 tokens |
| Claude Haiku | 2,048 tokens |
Content below these thresholds will not be cached, even with the cache_control marker.
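If you want a cheap pre-flight check before adding a breakpoint, a rough character-based estimate can catch obviously-too-small content. The 4-characters-per-token ratio below is a crude heuristic, not the real tokenizer; use the API's token counting endpoint for exact numbers:

```python
# Rough pre-flight check against the per-model caching minimums.
CACHE_MINIMUMS = {"opus": 1024, "sonnet": 1024, "haiku": 2048}

def estimate_tokens(text: str) -> int:
    """Very rough estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def likely_cacheable(text: str, model_family: str) -> bool:
    """Return True if the text probably clears the caching minimum."""
    return estimate_tokens(text) >= CACHE_MINIMUMS[model_family]

print(likely_cacheable("short prompt", "sonnet"))  # False: far below 1,024 tokens
print(likely_cacheable("x" * 10_000, "haiku"))     # True: ~2,500 estimated tokens
```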
Cache Placement Strategy
The order of content matters. A cache breakpoint marks the end of a cached prefix: everything from the start of the prompt up to and including the marked block is cached; everything after it is processed fresh.
Optimal ordering for multi-turn conversations:
```
[Tool definitions - CACHED]
[System prompt - CACHED]
[Static reference documents - CACHED]
[Conversation history turns 1-N - CACHED at turn N]
[Latest user message - NOT cached, always fresh]
```
(Tool definitions come first because the API assembles the prompt prefix in a fixed order: tools, then system, then messages.)
Multiple cache breakpoints:
You can set up to 4 cache breakpoints per request. Use them strategically:
```python
# Assumes client, system_instructions, reference_doc, previous_response,
# conversation_context, and current_question are defined elsewhere.
response = client.messages.create(
    model="claude-sonnet-4-5-20250514",
    max_tokens=4096,
    system=[
        {
            "type": "text",
            "text": system_instructions,
            "cache_control": {"type": "ephemeral"},  # Breakpoint 1
        }
    ],
    messages=[
        {"role": "user", "content": [
            {"type": "text", "text": reference_doc,
             "cache_control": {"type": "ephemeral"}},  # Breakpoint 2
        ]},
        {"role": "assistant", "content": previous_response},
        {"role": "user", "content": [
            {"type": "text", "text": conversation_context,
             "cache_control": {"type": "ephemeral"}},  # Breakpoint 3
            {"type": "text", "text": current_question},  # Fresh input
        ]},
    ],
)
```
Real-World Cost Analysis
Consider a customer support chatbot with:
- 2,500-token system prompt
- 10,000-token product knowledge base
- Average 8-turn conversations
- 10,000 conversations per day
Without caching:
- Static content (system prompt + knowledge base) reprocessed every turn: 12,500 × 8 = 100,000 tokens per conversation
- Fresh conversation tokens: roughly 20,000 per conversation
- Daily input tokens: ~1.2 billion
- Daily input cost (Sonnet): ~$3,600
With caching:
- Turn 1 writes the 12,500-token cache; turns 2 through 8 read it
- Daily cache writes: 125M tokens at $3.75/M ≈ $470
- Daily cache reads: 875M tokens at $0.30/M ≈ $260
- Daily fresh input: ~200M tokens at $3/M = $600
- Daily input cost: ~$1,330
Savings: roughly $2,270 per day (a 63% reduction)
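The arithmetic above can be reproduced with a small cost model; the token counts and prices are the ones from this scenario:

```python
# Cost model for the chatbot scenario above (Sonnet prices per million tokens).
WRITE, READ, FRESH = 3.75, 0.30, 3.00

def daily_cost(static_tokens, fresh_per_conv, turns, conversations, cached):
    """Daily input cost in dollars; token counts are raw tokens."""
    m = conversations / 1e6  # scale per-conversation tokens to daily millions
    if not cached:
        return (static_tokens * turns + fresh_per_conv) * m * FRESH
    writes = static_tokens * m * WRITE              # turn 1 writes the cache
    reads = static_tokens * (turns - 1) * m * READ  # turns 2..N read it
    fresh = fresh_per_conv * m * FRESH
    return writes + reads + fresh

uncached = daily_cost(12_500, 20_000, 8, 10_000, cached=False)
cached = daily_cost(12_500, 20_000, 8, 10_000, cached=True)
print(f"${uncached:,.0f} -> ${cached:,.0f} ({1 - cached / uncached:.0%} saved)")
```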
Latency Benefits
Prompt caching does not just save money -- it reduces time to first token (TTFT). Cached tokens are processed significantly faster than fresh tokens.
In practice, applications with large system prompts or reference documents see TTFT improvements of 40-60% on cached requests. For real-time applications like customer support chatbots, this improvement is immediately noticeable to users.
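To measure TTFT in your own application, time the arrival of the first streamed event. The sketch below uses a stand-in generator in place of a real streaming response:

```python
import time

def time_to_first_token(event_iter):
    """Seconds until the first event arrives from any streaming iterator.

    Works on the Anthropic streaming response or, as here, a stand-in
    generator; returns (ttft_seconds, first_event).
    """
    start = time.monotonic()
    first = next(event_iter)
    return time.monotonic() - start, first

# Stand-in for a streaming API response: the first token arrives after 50 ms.
def fake_stream():
    time.sleep(0.05)
    yield "first-token"
    yield "second-token"

ttft, first = time_to_first_token(fake_stream())
print(f"TTFT: {ttft * 1000:.0f} ms, first event: {first}")
```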
Common Pitfalls
Pitfall 1: Caching dynamic content. If the cached content changes on every request, you pay the cache write premium without ever getting a cache read. Only cache content that is stable across multiple requests.
Pitfall 2: Not monitoring cache hit rates. Use the cache_creation_input_tokens and cache_read_input_tokens fields in the response to track your cache performance. A healthy cache should have a read-to-write ratio above 5:1.
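A minimal way to track that ratio from the usage fields (plain dicts stand in for `response.usage` objects here):

```python
# Track cache efficiency across requests using the usage fields on responses.
def cache_read_write_ratio(usages):
    """Cache-read tokens per cache-write token (None if no writes yet)."""
    writes = sum(u["cache_creation_input_tokens"] for u in usages)
    reads = sum(u["cache_read_input_tokens"] for u in usages)
    return reads / writes if writes else None

usages = [
    {"cache_creation_input_tokens": 12_500, "cache_read_input_tokens": 0},
    {"cache_creation_input_tokens": 0, "cache_read_input_tokens": 12_500},
    {"cache_creation_input_tokens": 0, "cache_read_input_tokens": 12_500},
]
ratio = cache_read_write_ratio(usages)
print(f"read/write ratio: {ratio:.1f}")  # 2.0 so far; a healthy cache exceeds 5.0
```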
Pitfall 3: Cache invalidation from content changes. Even a single character change in the cached prefix invalidates the entire cache. If you need to update a knowledge base, batch the updates rather than making frequent small changes.
Pitfall 4: Exceeding the 5-minute TTL. If your application has bursty traffic with quiet periods longer than 5 minutes, the cache will expire between bursts. Consider implementing keep-alive requests during low-traffic periods if the savings justify it.
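The timing side of a keep-alive scheduler can be sketched independently of the API call itself; the actual ping would be a minimal request that reuses the same cached prefix:

```python
import time

CACHE_TTL_SECONDS = 5 * 60  # ephemeral cache lifetime; each hit resets it

def needs_keepalive(last_request_monotonic, now=None, margin=30.0):
    """True if the cache is about to expire and a cheap ping would preserve it.

    `margin` is a safety buffer (seconds) so the keep-alive lands before expiry.
    """
    now = time.monotonic() if now is None else now
    return (now - last_request_monotonic) >= (CACHE_TTL_SECONDS - margin)

# 4.5 minutes of silence -> time to send a minimal request hitting the cache.
print(needs_keepalive(last_request_monotonic=0.0, now=271.0))  # True
print(needs_keepalive(last_request_monotonic=0.0, now=100.0))  # False
```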