
Claude Prompt Caching: Reducing Costs and Latency with Cached Context

Cut Claude API costs by up to 90% with prompt caching. Learn how to use cache_control breakpoints, design cache-friendly prompts, and measure cost savings in production agent systems.

The Cost Problem with Large Contexts

Agent systems frequently send large, repetitive context with every API call: system prompts, tool definitions, reference documents, conversation history. A typical agent call might include a 3,000-token system prompt, 2,000 tokens of tool definitions, and 10,000 tokens of reference documentation — all of which are identical across requests. You pay full input token pricing every time.

Prompt caching solves this by letting Anthropic's servers cache the static portions of your prompt. Subsequent requests that share the same prefix pay a dramatically reduced rate for cached tokens — roughly 90% less than standard input pricing — and enjoy lower latency because the cached tokens do not need to be reprocessed.
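To put numbers on that discount, here is a back-of-the-envelope comparison for a single request with a large static prefix. The per-million-token prices below are assumptions modeled on published Sonnet rates (cache writes cost roughly 25% more than standard input, cache reads roughly 10% of it); check Anthropic's current pricing page before relying on them:

```python
# Cost of one request with a 15,000-token static prefix + 200 dynamic tokens.
# Illustrative prices in USD per million tokens:
INPUT_PER_MTOK = 3.00        # standard input rate
CACHE_WRITE_PER_MTOK = 3.75  # first request pays a premium to write the cache
CACHE_READ_PER_MTOK = 0.30   # later requests read cached tokens at ~10% of base

prefix, dynamic = 15_000, 200

uncached = (prefix + dynamic) / 1e6 * INPUT_PER_MTOK
first = prefix / 1e6 * CACHE_WRITE_PER_MTOK + dynamic / 1e6 * INPUT_PER_MTOK
warm = prefix / 1e6 * CACHE_READ_PER_MTOK + dynamic / 1e6 * INPUT_PER_MTOK

print(f"uncached: ${uncached:.4f}  first call: ${first:.4f}  warm hit: ${warm:.4f}")
# Cached tokens cost 0.30 / 3.00 = 10% of the standard rate: a 90% discount.
```

Note that the very first call costs slightly more than an uncached one, because writing the cache carries a premium; the payoff starts with the second request.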

How Prompt Caching Works

You add cache_control markers to your message content to tell the API where to create cache breakpoints. Everything before a breakpoint is eligible for caching:

import anthropic

client = anthropic.Anthropic()

# The system prompt with a cache breakpoint
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are an expert legal assistant specializing in contract analysis. "
                    "You help users understand complex legal documents, identify risks, "
                    "and suggest improvements. Always cite specific clauses when giving advice.",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {"role": "user", "content": "What should I look for in an NDA?"}
    ]
)

# Check cache usage in the response
print(f"Input tokens: {response.usage.input_tokens}")
print(f"Cache creation tokens: {response.usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {response.usage.cache_read_input_tokens}")

The first request creates the cache (cache_creation_input_tokens > 0). Subsequent requests with the same prefix hit the cache (cache_read_input_tokens > 0), paying the reduced rate.

Caching Large Reference Documents

The biggest cost savings come from caching large static documents that accompany every request:

import anthropic
from pathlib import Path

client = anthropic.Anthropic()

# Load a large reference document (e.g., product documentation)
product_docs = Path("product_manual.txt").read_text()

def ask_about_product(question: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        system=[
            {
                "type": "text",
                "text": "You are a product support agent. Answer questions using only the provided documentation."
            }
        ],
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": f"Reference documentation:\n\n{product_docs}",
                        "cache_control": {"type": "ephemeral"}
                    },
                    {
                        "type": "text",
                        "text": question
                    }
                ]
            }
        ]
    )
    return response.content[0].text

# First call creates the cache
answer1 = ask_about_product("How do I reset my password?")

# Subsequent calls hit the cache - faster and cheaper
answer2 = ask_about_product("What are the system requirements?")
answer3 = ask_about_product("How do I export my data?")

If product_docs is 15,000 tokens, the second and third calls save roughly 90% on those 15,000 tokens. Over hundreds of support requests, this adds up to substantial savings.
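That arithmetic can be rolled into a small estimator. The default rates here are assumptions (standard-input, cache-write, and cache-read prices per million tokens for a Sonnet-class model); substitute your model's actual pricing:

```python
def estimated_savings(prefix_tokens: int, requests: int,
                      input_rate: float = 3.00, write_rate: float = 3.75,
                      read_rate: float = 0.30) -> float:
    """USD saved on the static prefix over `requests` calls that all hit the cache.

    Rates are per million tokens and are illustrative assumptions.
    """
    without_cache = requests * prefix_tokens / 1e6 * input_rate
    with_cache = (prefix_tokens / 1e6 * write_rate                      # first call writes
                  + (requests - 1) * prefix_tokens / 1e6 * read_rate)   # the rest read
    return without_cache - with_cache

# A 15,000-token manual across 300 support requests:
print(f"~${estimated_savings(15_000, 300):.2f} saved")
```

One subtlety the estimator surfaces: with a single request and no reuse, caching loses money (the write premium is never recouped), so only mark prefixes you expect to reuse within the cache lifetime.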

Cache-Friendly Prompt Design

The cache works on prefix matching — everything before the cache breakpoint must be identical across requests. Structure your prompts accordingly:


import anthropic

client = anthropic.Anthropic()

# GOOD: Static content first, then dynamic content
# The system prompt and tools are identical across requests
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": """You are a data analysis agent for Acme Corp.

## Available Databases
- analytics_db: User behavior and events
- billing_db: Subscriptions and payments
- support_db: Tickets and customer interactions

## Rules
- Always use parameterized queries
- Limit results to 100 rows unless asked otherwise
- Format currency as USD with two decimal places""",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    tools=[
        {
            "name": "run_query",
            "description": "Execute a SQL query against a specified database.",
            "input_schema": {
                "type": "object",
                "properties": {
                    "database": {"type": "string", "enum": ["analytics_db", "billing_db", "support_db"]},
                    "query": {"type": "string"}
                },
                "required": ["database", "query"]
            },
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {"role": "user", "content": "How many new signups did we get last week?"}
    ]
)

Because the API assembles the prompt prefix in the order tools, then system, then messages, placing cache_control on the last tool definition and on the system prompt caches everything static. Only the user message changes between requests.
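One practical consequence of prefix matching: anything dynamic placed before a breakpoint — a timestamp, a request ID, a user name interpolated into the system prompt — changes the prefix and misses the cache on every request. The sketch below simulates this with content hashes (the server does not literally expose a hash-based key; this is just a mental model of exact-prefix matching):

```python
import hashlib
from itertools import count

request_id = count(1)  # stand-in for any per-request dynamic value

def cache_key(prefix: str) -> str:
    """Simulated cache key: the server effectively keys on the exact prefix content."""
    return hashlib.sha256(prefix.encode()).hexdigest()

STATIC_SYSTEM = "You are a data analysis agent for Acme Corp."

# BAD: a per-request value before the breakpoint makes every prefix unique,
# so no two requests ever share a cache entry.
bad_1 = cache_key(f"Request #{next(request_id)}\n{STATIC_SYSTEM}")
bad_2 = cache_key(f"Request #{next(request_id)}\n{STATIC_SYSTEM}")

# GOOD: keep dynamic values out of the cached prefix; pass them in the user turn.
good_1 = cache_key(STATIC_SYSTEM)
good_2 = cache_key(STATIC_SYSTEM)

print(bad_1 == bad_2)    # False - cache miss every time
print(good_1 == good_2)  # True - identical prefix, cache hit
```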

Multiple Cache Breakpoints

You can set up to four cache breakpoints to cache different layers of context:

import anthropic

client = anthropic.Anthropic()

conversation_history = [
    {"role": "user", "content": "I need help analyzing our Q4 sales data."},
    {"role": "assistant", "content": "I would be happy to help with Q4 sales analysis. What specific metrics are you interested in?"},
    {"role": "user", "content": "Start with revenue by region."},
    {"role": "assistant", "content": "Here is the Q4 revenue breakdown by region..."},
]

# Cache the system prompt (breakpoint 1)
# Cache the conversation history (breakpoint 2)
# Only the new user message is uncached
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=2048,
    system=[
        {
            "type": "text",
            "text": "You are a sales analytics agent with access to Acme Corp's data warehouse.",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        *conversation_history[:-1],
        # Re-wrap the final history turn as a content block so it can carry
        # cache_control (breakpoint 2: caches everything up to this point)
        {
            "role": conversation_history[-1]["role"],
            "content": [
                {
                    "type": "text",
                    "text": conversation_history[-1]["content"],
                    "cache_control": {"type": "ephemeral"},
                }
            ],
        },
        {"role": "user", "content": "Now break it down by product category within each region."}
    ]
)

This caches the system prompt separately from the conversation history, so even as the conversation grows, the system prompt cache persists.
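Why this matters for agents: conversations grow by appending to the end, so the previously cached prefix keeps matching. Here is a toy model at message granularity (the real cache operates on tokens and uses explicit breakpoints, but the prefix logic is the same):

```python
def split_by_cache(prev_cached: list, current: list) -> tuple:
    """Mental model of prefix caching over a growing conversation.

    `prev_cached` is the message prefix cached by the previous request;
    `current` is this request's full message list. Returns
    (cached_messages, uncached_messages): the shared leading run hits the
    cache, everything from the first difference onward is reprocessed.
    """
    shared = 0
    for old, new in zip(prev_cached, current):
        if old != new:
            break
        shared += 1
    return shared, len(current) - shared

history = ["sys", "u1", "a1", "u2", "a2"]

# Appending a new turn keeps the whole previous prefix intact:
print(split_by_cache(history, history + ["u3"]))  # (5, 1)

# Editing an early message invalidates everything after it:
print(split_by_cache(history, ["sys", "u1-edited", "a1", "u2", "a2", "u3"]))  # (1, 5)
```

This is why append-only conversation histories cache well, while strategies that rewrite or summarize earlier turns in place pay full price from the first changed message onward.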

Measuring Cost Savings

Track cache hit rates to quantify savings:

import anthropic

client = anthropic.Anthropic()

class CacheTracker:
    def __init__(self):
        self.total_input = 0
        self.cache_created = 0
        self.cache_read = 0
        self.requests = 0

    def record(self, usage):
        self.total_input += usage.input_tokens
        self.cache_created += getattr(usage, "cache_creation_input_tokens", 0)
        self.cache_read += getattr(usage, "cache_read_input_tokens", 0)
        self.requests += 1

    def report(self):
        # input_tokens excludes cached tokens, so total prompt tokens are the
        # sum of uncached input, cache writes, and cache reads
        total = self.total_input + self.cache_created + self.cache_read
        hit_rate = self.cache_read / total * 100 if total else 0.0
        print(f"Requests: {self.requests}")
        print(f"Cache hit rate: {hit_rate:.1f}%")
        print(f"Tokens served from cache: {self.cache_read:,}")
        # ~90% discount vs. an assumed $3/MTok input rate:
        # (3.00 - 0.30) / 1e6 = 0.0000027 USD saved per cached token
        print(f"Estimated savings: ~{self.cache_read * 0.0000027:.4f} USD")

tracker = CacheTracker()

# Run multiple requests and track
for question in ["What is X?", "How does Y work?", "Compare X and Y"]:
    resp = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=512,
        # NOTE: a prompt this short is below the minimum cacheable prefix size
        # (1,024 tokens for Sonnet models), so no cache would actually be
        # created here - in practice, cache a large system prompt or document
        system=[{"type": "text", "text": "You are a helpful assistant.", "cache_control": {"type": "ephemeral"}}],
        messages=[{"role": "user", "content": question}]
    )
    tracker.record(resp.usage)

tracker.report()

FAQ

How long do cached prompts persist?

Cached prompts have a time-to-live (TTL) of approximately 5 minutes, which resets each time the cache is hit. In practice, if your agent handles at least one request every few minutes, the cache stays warm indefinitely. For low-traffic applications, consider implementing a periodic cache-warming request.
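A cache-warming loop can be sketched as follows. The `ping` callable is a placeholder for whatever cheap request shares your cached prefix (for example, a one-word user message against the same system prompt and tools); the injectable `sleep` exists only so the schedule is testable:

```python
import time

def keep_cache_warm(ping, cycles: int, ttl_seconds: int = 300,
                    margin: int = 60, sleep=time.sleep) -> None:
    """Re-ping the cached prompt before its TTL lapses.

    `ping` issues a minimal request that reuses the cached prefix. Each ping
    resets the ~5-minute TTL, so firing every (ttl - margin) seconds keeps the
    cache warm with time to spare. Names and structure here are illustrative.
    """
    interval = ttl_seconds - margin  # e.g. every 240s for a 300s TTL
    for _ in range(cycles):
        ping()
        sleep(interval)
```

In production this would run in a background thread for as long as low traffic is expected, and each ping must reuse the exact cached prefix byte-for-byte; weigh the cost of the warming pings against the cache-write premium you avoid.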

Is there a minimum size for cached content?

Yes. The cacheable prefix must meet a per-model minimum — at least 1,024 tokens for Sonnet and Opus models and 2,048 tokens for Haiku models. Shorter content is processed normally but not cached. This means caching is most effective for agents with substantial system prompts, tool definitions, or reference documents.
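A quick pre-flight check can tell you whether a prefix is even worth marking. The ~4-characters-per-token ratio below is a rough heuristic for English text, not an exact count; the API also provides a token-counting endpoint if you need precision:

```python
def worth_caching(text: str, min_tokens: int = 1024,
                  chars_per_token: float = 4.0) -> bool:
    """Rough pre-check: is this prefix likely above the minimum cacheable size?

    Uses the common ~4-characters-per-token heuristic for English text.
    """
    return len(text) / chars_per_token >= min_tokens

short_system = "You are a helpful assistant."
long_docs = "Section 1. Terms and definitions... " * 200  # ~7,200 chars

print(worth_caching(short_system))  # False - well below the 1,024-token minimum
print(worth_caching(long_docs))     # True
```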

Does prompt caching work with streaming and tool use?

Yes. Prompt caching is fully compatible with streaming, tool use, and extended thinking. The cache applies to the input processing phase, so it reduces both cost and time-to-first-token regardless of the output mode you choose.


#Anthropic #Claude #PromptCaching #CostOptimization #Performance #AgenticAI #LearnAI #AIEngineering

CallSphere Team