
Claude Prompt Caching: Reducing Costs and Latency with Cached Context

Cut Claude API costs by up to 90% with prompt caching. Learn how to use cache_control breakpoints, design cache-friendly prompts, and measure cost savings in production agent systems.

The Cost Problem with Large Contexts

Agent systems frequently send large, repetitive context with every API call: system prompts, tool definitions, reference documents, conversation history. A typical agent call might include a 3,000-token system prompt, 2,000 tokens of tool definitions, and 10,000 tokens of reference documentation — all of which are identical across requests. You pay full input token pricing every time.

Prompt caching solves this by letting Anthropic's servers cache the static portions of your prompt. Subsequent requests that share the same prefix pay a dramatically reduced rate for cached tokens — roughly 90% less than standard input pricing — and enjoy lower latency because the cached tokens do not need to be reprocessed.
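To put numbers on that discount, here is a back-of-the-envelope comparison for a single request with a large static prefix. The per-million-token prices below are assumptions modeled on published Sonnet rates (cache writes cost roughly 25% more than standard input, cache reads roughly 10% of it); check Anthropic's current pricing page before relying on them:

```python
# Cost of one request with a 15,000-token static prefix + 200 dynamic tokens.
# Illustrative prices in USD per million tokens:
INPUT_PER_MTOK = 3.00        # standard input rate
CACHE_WRITE_PER_MTOK = 3.75  # first request pays a premium to write the cache
CACHE_READ_PER_MTOK = 0.30   # later requests read cached tokens at ~10% of base

prefix, dynamic = 15_000, 200

uncached = (prefix + dynamic) / 1e6 * INPUT_PER_MTOK
first = prefix / 1e6 * CACHE_WRITE_PER_MTOK + dynamic / 1e6 * INPUT_PER_MTOK
warm = prefix / 1e6 * CACHE_READ_PER_MTOK + dynamic / 1e6 * INPUT_PER_MTOK

print(f"uncached: ${uncached:.4f}  first call: ${first:.4f}  warm hit: ${warm:.4f}")
# Cached tokens cost 0.30 / 3.00 = 10% of the standard rate: a 90% discount.
```

Note that the very first call costs slightly more than an uncached one, because writing the cache carries a premium; the payoff starts with the second request.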

How Prompt Caching Works

You add cache_control markers to your message content to tell the API where to create cache breakpoints. Everything before a breakpoint is eligible for caching:

import anthropic

client = anthropic.Anthropic()

# The system prompt with a cache breakpoint
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are an expert legal assistant specializing in contract analysis. "
                    "You help users understand complex legal documents, identify risks, "
                    "and suggest improvements. Always cite specific clauses when giving advice.",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {"role": "user", "content": "What should I look for in an NDA?"}
    ]
)

# Check cache usage in the response
print(f"Input tokens: {response.usage.input_tokens}")
print(f"Cache creation tokens: {response.usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {response.usage.cache_read_input_tokens}")

The first request creates the cache (cache_creation_input_tokens > 0). Subsequent requests with the same prefix hit the cache (cache_read_input_tokens > 0), paying the reduced rate.

Caching Large Reference Documents

The biggest cost savings come from caching large static documents that accompany every request:

import anthropic
from pathlib import Path

client = anthropic.Anthropic()

# Load a large reference document (e.g., product documentation)
product_docs = Path("product_manual.txt").read_text()

def ask_about_product(question: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        system=[
            {
                "type": "text",
                "text": "You are a product support agent. Answer questions using only the provided documentation."
            }
        ],
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": f"Reference documentation:\n\n{product_docs}",
                        "cache_control": {"type": "ephemeral"}
                    },
                    {
                        "type": "text",
                        "text": question
                    }
                ]
            }
        ]
    )
    return response.content[0].text

# First call creates the cache
answer1 = ask_about_product("How do I reset my password?")

# Subsequent calls hit the cache - faster and cheaper
answer2 = ask_about_product("What are the system requirements?")
answer3 = ask_about_product("How do I export my data?")

If product_docs is 15,000 tokens, the second and third calls save roughly 90% on those 15,000 tokens. Over hundreds of support requests, this adds up to substantial savings.
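That arithmetic can be rolled into a small estimator. The default rates here are assumptions (standard-input, cache-write, and cache-read prices per million tokens for a Sonnet-class model); substitute your model's actual pricing:

```python
def estimated_savings(prefix_tokens: int, requests: int,
                      input_rate: float = 3.00, write_rate: float = 3.75,
                      read_rate: float = 0.30) -> float:
    """USD saved on the static prefix over `requests` calls that all hit the cache.

    Rates are per million tokens and are illustrative assumptions.
    """
    without_cache = requests * prefix_tokens / 1e6 * input_rate
    with_cache = (prefix_tokens / 1e6 * write_rate                      # first call writes
                  + (requests - 1) * prefix_tokens / 1e6 * read_rate)   # the rest read
    return without_cache - with_cache

# A 15,000-token manual across 300 support requests:
print(f"~${estimated_savings(15_000, 300):.2f} saved")
```

One subtlety the estimator surfaces: with a single request and no reuse, caching loses money (the write premium is never recouped), so only mark prefixes you expect to reuse within the cache lifetime.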

Cache-Friendly Prompt Design

The cache works on prefix matching — everything before the cache breakpoint must be identical across requests. Structure your prompts accordingly:


import anthropic

client = anthropic.Anthropic()

# GOOD: Static content first, then dynamic content
# The system prompt and tools are identical across requests
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": """You are a data analysis agent for Acme Corp.

## Available Databases
- analytics_db: User behavior and events
- billing_db: Subscriptions and payments
- support_db: Tickets and customer interactions

## Rules
- Always use parameterized queries
- Limit results to 100 rows unless asked otherwise
- Format currency as USD with two decimal places""",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    tools=[
        {
            "name": "run_query",
            "description": "Execute a SQL query against a specified database.",
            "input_schema": {
                "type": "object",
                "properties": {
                    "database": {"type": "string", "enum": ["analytics_db", "billing_db", "support_db"]},
                    "query": {"type": "string"}
                },
                "required": ["database", "query"]
            },
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {"role": "user", "content": "How many new signups did we get last week?"}
    ]
)

Because the API assembles the prompt prefix in the order tools, then system, then messages, placing cache_control on the last tool definition and on the system prompt caches everything static. Only the user message changes between requests.
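One practical consequence of prefix matching: anything dynamic placed before a breakpoint — a timestamp, a request ID, a user name interpolated into the system prompt — changes the prefix and misses the cache on every request. The sketch below simulates this with content hashes (the server does not literally expose a hash-based key; this is just a mental model of exact-prefix matching):

```python
import hashlib
from itertools import count

request_id = count(1)  # stand-in for any per-request dynamic value

def cache_key(prefix: str) -> str:
    """Simulated cache key: the server effectively keys on the exact prefix content."""
    return hashlib.sha256(prefix.encode()).hexdigest()

STATIC_SYSTEM = "You are a data analysis agent for Acme Corp."

# BAD: a per-request value before the breakpoint makes every prefix unique,
# so no two requests ever share a cache entry.
bad_1 = cache_key(f"Request #{next(request_id)}\n{STATIC_SYSTEM}")
bad_2 = cache_key(f"Request #{next(request_id)}\n{STATIC_SYSTEM}")

# GOOD: keep dynamic values out of the cached prefix; pass them in the user turn.
good_1 = cache_key(STATIC_SYSTEM)
good_2 = cache_key(STATIC_SYSTEM)

print(bad_1 == bad_2)    # False - cache miss every time
print(good_1 == good_2)  # True - identical prefix, cache hit
```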

Multiple Cache Breakpoints

You can set up to four cache breakpoints to cache different layers of context:

import anthropic

client = anthropic.Anthropic()

conversation_history = [
    {"role": "user", "content": "I need help analyzing our Q4 sales data."},
    {"role": "assistant", "content": "I would be happy to help with Q4 sales analysis. What specific metrics are you interested in?"},
    {"role": "user", "content": "Start with revenue by region."},
    {"role": "assistant", "content": "Here is the Q4 revenue breakdown by region..."},
]

# Cache the system prompt (breakpoint 1)
# Cache the conversation history (breakpoint 2)
# Only the new user message is uncached
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=2048,
    system=[
        {
            "type": "text",
            "text": "You are a sales analytics agent with access to Acme Corp's data warehouse.",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        *conversation_history[:-1],
        # Re-wrap the final history turn as a content block so it can carry
        # cache_control (breakpoint 2: caches everything up to this point)
        {
            "role": conversation_history[-1]["role"],
            "content": [
                {
                    "type": "text",
                    "text": conversation_history[-1]["content"],
                    "cache_control": {"type": "ephemeral"},
                }
            ],
        },
        {"role": "user", "content": "Now break it down by product category within each region."}
    ]
)

This caches the system prompt separately from the conversation history, so even as the conversation grows, the system prompt cache persists.
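Why this matters for agents: conversations grow by appending to the end, so the previously cached prefix keeps matching. Here is a toy model at message granularity (the real cache operates on tokens and uses explicit breakpoints, but the prefix logic is the same):

```python
def split_by_cache(prev_cached: list, current: list) -> tuple:
    """Mental model of prefix caching over a growing conversation.

    `prev_cached` is the message prefix cached by the previous request;
    `current` is this request's full message list. Returns
    (cached_messages, uncached_messages): the shared leading run hits the
    cache, everything from the first difference onward is reprocessed.
    """
    shared = 0
    for old, new in zip(prev_cached, current):
        if old != new:
            break
        shared += 1
    return shared, len(current) - shared

history = ["sys", "u1", "a1", "u2", "a2"]

# Appending a new turn keeps the whole previous prefix intact:
print(split_by_cache(history, history + ["u3"]))  # (5, 1)

# Editing an early message invalidates everything after it:
print(split_by_cache(history, ["sys", "u1-edited", "a1", "u2", "a2", "u3"]))  # (1, 5)
```

This is why append-only conversation histories cache well, while strategies that rewrite or summarize earlier turns in place pay full price from the first changed message onward.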

Measuring Cost Savings

Track cache hit rates to quantify savings:

import anthropic

client = anthropic.Anthropic()

class CacheTracker:
    def __init__(self):
        self.total_input = 0
        self.cache_created = 0
        self.cache_read = 0
        self.requests = 0

    def record(self, usage):
        self.total_input += usage.input_tokens
        self.cache_created += getattr(usage, "cache_creation_input_tokens", 0)
        self.cache_read += getattr(usage, "cache_read_input_tokens", 0)
        self.requests += 1

    def report(self):
        # input_tokens excludes cached tokens, so total prompt tokens are the
        # sum of uncached input, cache writes, and cache reads
        total = self.total_input + self.cache_created + self.cache_read
        hit_rate = self.cache_read / total * 100 if total else 0.0
        print(f"Requests: {self.requests}")
        print(f"Cache hit rate: {hit_rate:.1f}%")
        print(f"Tokens served from cache: {self.cache_read:,}")
        # ~90% discount vs. an assumed $3/MTok input rate:
        # (3.00 - 0.30) / 1e6 = 0.0000027 USD saved per cached token
        print(f"Estimated savings: ~{self.cache_read * 0.0000027:.4f} USD")

tracker = CacheTracker()

# Run multiple requests and track
for question in ["What is X?", "How does Y work?", "Compare X and Y"]:
    resp = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=512,
        # NOTE: a prompt this short is below the minimum cacheable prefix size
        # (1,024 tokens for Sonnet models), so no cache would actually be
        # created here - in practice, cache a large system prompt or document
        system=[{"type": "text", "text": "You are a helpful assistant.", "cache_control": {"type": "ephemeral"}}],
        messages=[{"role": "user", "content": question}]
    )
    tracker.record(resp.usage)

tracker.report()

FAQ

How long do cached prompts persist?

Cached prompts have a time-to-live (TTL) of approximately 5 minutes, which resets each time the cache is hit. In practice, if your agent handles at least one request every few minutes, the cache stays warm indefinitely. For low-traffic applications, consider implementing a periodic cache-warming request.
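A cache-warming loop can be sketched as follows. The `ping` callable is a placeholder for whatever cheap request shares your cached prefix (for example, a one-word user message against the same system prompt and tools); the injectable `sleep` exists only so the schedule is testable:

```python
import time

def keep_cache_warm(ping, cycles: int, ttl_seconds: int = 300,
                    margin: int = 60, sleep=time.sleep) -> None:
    """Re-ping the cached prompt before its TTL lapses.

    `ping` issues a minimal request that reuses the cached prefix. Each ping
    resets the ~5-minute TTL, so firing every (ttl - margin) seconds keeps the
    cache warm with time to spare. Names and structure here are illustrative.
    """
    interval = ttl_seconds - margin  # e.g. every 240s for a 300s TTL
    for _ in range(cycles):
        ping()
        sleep(interval)
```

In production this would run in a background thread for as long as low traffic is expected, and each ping must reuse the exact cached prefix byte-for-byte; weigh the cost of the warming pings against the cache-write premium you avoid.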

Is there a minimum size for cached content?

Yes. The cacheable prefix must meet a per-model minimum — at least 1,024 tokens for Sonnet and Opus models and 2,048 tokens for Haiku models. Shorter content is processed normally but not cached. This means caching is most effective for agents with substantial system prompts, tool definitions, or reference documents.
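A quick pre-flight check can tell you whether a prefix is even worth marking. The ~4-characters-per-token ratio below is a rough heuristic for English text, not an exact count; the API also provides a token-counting endpoint if you need precision:

```python
def worth_caching(text: str, min_tokens: int = 1024,
                  chars_per_token: float = 4.0) -> bool:
    """Rough pre-check: is this prefix likely above the minimum cacheable size?

    Uses the common ~4-characters-per-token heuristic for English text.
    """
    return len(text) / chars_per_token >= min_tokens

short_system = "You are a helpful assistant."
long_docs = "Section 1. Terms and definitions... " * 200  # ~7,200 chars

print(worth_caching(short_system))  # False - well below the 1,024-token minimum
print(worth_caching(long_docs))     # True
```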

Does prompt caching work with streaming and tool use?

Yes. Prompt caching is fully compatible with streaming, tool use, and extended thinking. The cache applies to the input processing phase, so it reduces both cost and time-to-first-token regardless of the output mode you choose.


#Anthropic #Claude #PromptCaching #CostOptimization #Performance #AgenticAI #LearnAI #AIEngineering

CallSphere Team