Prompt Caching in Claude: How It Cuts Latency and Cost by 90%
Technical deep dive into Claude's prompt caching feature. Learn how it works, when to use it, implementation patterns for both Python and TypeScript, and real-world cost savings analysis.
The Problem Prompt Caching Solves
Every Claude API call processes input tokens from scratch. If your system prompt is 3,000 tokens and you make 1,000 calls per day, you are paying to process the same 3,000 tokens 1,000 times. That is 3 million redundant tokens per day.
Prompt caching eliminates this waste. It tells the API to cache specific portions of the input so that subsequent requests can skip processing those tokens. The result: up to 90% reduction in input token costs and significant latency improvements on cached portions.
How Prompt Caching Works
When you mark a section of your prompt with a cache control breakpoint, the API:
- First request: Processes all tokens normally and caches the marked section. You pay a small write premium (25% more than base input price for the cached tokens).
- Subsequent requests: Reads the cached tokens instead of reprocessing them. Cached reads cost 90% less than regular input tokens.
- Cache expiration: Cached content has a TTL of 5 minutes. Each cache hit resets the TTL. If no requests hit the cache for 5 minutes, it expires.
Pricing Breakdown (Claude Sonnet)
| Token Type | Price per Million Tokens |
|---|---|
| Regular input | $3.00 |
| Cache write (first time) | $3.75 |
| Cache read (subsequent) | $0.30 |
| Output | $15.00 |
The math works out quickly: a single cache hit already recoups the write premium ($3.75 + $0.30 = $4.05 for a write plus one read, versus $6.00 for two uncached requests). After 10 hits you have cut input costs by roughly 80%, and savings approach 90% as the hit count grows.
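The break-even point can be sanity-checked with a few lines of arithmetic; the sketch below hard-codes the Sonnet prices from the table above:

```python
# Rough cost model for Claude Sonnet prompt caching (prices per million tokens).
BASE_INPUT = 3.00   # regular input
CACHE_WRITE = 3.75  # first request: base price + 25% write premium
CACHE_READ = 0.30   # subsequent requests: 90% discount

def cost_with_caching(tokens_millions: float, hits: int) -> float:
    """One cache write followed by `hits` cache reads."""
    return tokens_millions * (CACHE_WRITE + hits * CACHE_READ)

def cost_without_caching(tokens_millions: float, hits: int) -> float:
    """The same (hits + 1) requests, all paying the regular input rate."""
    return tokens_millions * BASE_INPUT * (hits + 1)

# Caching is already cheaper after one hit: 3.75 + 0.30 = 4.05 vs 2 * 3.00 = 6.00
for hits in (1, 10, 100):
    cached = cost_with_caching(1.0, hits)
    uncached = cost_without_caching(1.0, hits)
    print(f"{hits:>3} hits: ${cached:.2f} vs ${uncached:.2f} "
          f"({1 - cached / uncached:.0%} saved)")
```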
Implementation in Python
```python
from anthropic import Anthropic

client = Anthropic()

# Define a large system prompt that should be cached
SYSTEM_PROMPT = """You are an expert financial analyst. You have access to the
following regulatory framework that governs all of your responses...

[Imagine 2,000+ tokens of regulatory guidelines, company policies,
formatting requirements, and domain-specific instructions here]
"""

# Large reference document to include in context
REFERENCE_DOC = """
[Imagine a 50-page financial report, approximately 15,000 tokens]
"""

response = client.messages.create(
    model="claude-sonnet-4-5-20250514",
    max_tokens=4096,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # Cache this
        }
    ],
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": REFERENCE_DOC,
                    "cache_control": {"type": "ephemeral"},  # Cache this too
                },
                {
                    "type": "text",
                    "text": "Summarize the key risk factors in this report.",
                },
            ],
        }
    ],
)

# Check cache performance in the response
print(f"Cache creation tokens: {response.usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {response.usage.cache_read_input_tokens}")
print(f"Regular input tokens: {response.usage.input_tokens}")
```
Implementation in TypeScript
```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// Placeholders for the same large, stable strings as in the Python example
const SYSTEM_PROMPT = "[2,000+ tokens of analyst instructions...]";
const REFERENCE_DOC = "[~15,000 tokens of financial report...]";

const response = await client.messages.create({
  model: "claude-sonnet-4-5-20250514",
  max_tokens: 4096,
  system: [
    {
      type: "text",
      text: SYSTEM_PROMPT,
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [
    {
      role: "user",
      content: [
        {
          type: "text",
          text: REFERENCE_DOC,
          cache_control: { type: "ephemeral" },
        },
        {
          type: "text",
          text: "Summarize the key risk factors.",
        },
      ],
    },
  ],
});

console.log("Cache write:", response.usage.cache_creation_input_tokens);
console.log("Cache read:", response.usage.cache_read_input_tokens);
```
What Can Be Cached
You can place cache breakpoints on:
- System prompts -- The most common and highest-ROI caching target
- User messages -- Large documents, reference materials, conversation history
- Tool definitions -- If you have many tools that do not change between calls
- Images -- Base64-encoded images in the prompt
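For tool caching specifically, the breakpoint goes on the last tool in the `tools` array, which caches the whole array as a prefix. A minimal sketch (the tool names and schemas are invented for illustration):

```python
# Sketch: caching tool definitions by marking the LAST tool in the list.
# The tool names and schemas here are illustrative, not from a real project.
tools = [
    {
        "name": "get_stock_price",
        "description": "Look up the latest closing price for a ticker symbol.",
        "input_schema": {
            "type": "object",
            "properties": {"ticker": {"type": "string"}},
            "required": ["ticker"],
        },
    },
    {
        "name": "get_filing",
        "description": "Fetch a company's most recent regulatory filing.",
        "input_schema": {
            "type": "object",
            "properties": {"ticker": {"type": "string"}},
            "required": ["ticker"],
        },
        # Marking the last tool caches the entire tools array as a prefix.
        "cache_control": {"type": "ephemeral"},
    },
]

# This list would be passed as `tools=tools` to client.messages.create(...).
print("cached tools:", [t["name"] for t in tools])
```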
Minimum Cache Size
The content being cached must meet a minimum token threshold:
| Model | Minimum Tokens for Caching |
|---|---|
| Claude Opus | 1,024 tokens |
| Claude Sonnet | 1,024 tokens |
| Claude Haiku | 2,048 tokens |
Content below these thresholds will not be cached, even with the cache_control marker.
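If you want a cheap pre-flight check before adding a breakpoint, a rough character-based estimate can catch obviously-too-small content. The 4-characters-per-token ratio below is a crude heuristic, not the real tokenizer; use the API's token counting endpoint for exact numbers:

```python
# Rough pre-flight check against the per-model caching minimums.
CACHE_MINIMUMS = {"opus": 1024, "sonnet": 1024, "haiku": 2048}

def estimate_tokens(text: str) -> int:
    """Very rough estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def likely_cacheable(text: str, model_family: str) -> bool:
    """Return True if the text probably clears the caching minimum."""
    return estimate_tokens(text) >= CACHE_MINIMUMS[model_family]

print(likely_cacheable("short prompt", "sonnet"))  # False: far below 1,024 tokens
print(likely_cacheable("x" * 10_000, "haiku"))     # True: ~2,500 estimated tokens
```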
Cache Placement Strategy
The order of content matters. A cache breakpoint marks the end of a cached prefix: everything from the start of the prompt up to and including the marked block is cached; everything after it is processed fresh.
Optimal ordering for multi-turn conversations:
```
[Tool definitions - CACHED]
[System prompt - CACHED]
[Static reference documents - CACHED]
[Conversation history turns 1-N - CACHED at turn N]
[Latest user message - NOT cached, always fresh]
```
(Tool definitions come first because the API assembles the prompt prefix in a fixed order: tools, then system, then messages.)
Multiple cache breakpoints:
You can set up to 4 cache breakpoints per request. Use them strategically:
```python
# Assumes client, system_instructions, reference_doc, previous_response,
# conversation_context, and current_question are defined elsewhere.
response = client.messages.create(
    model="claude-sonnet-4-5-20250514",
    max_tokens=4096,
    system=[
        {
            "type": "text",
            "text": system_instructions,
            "cache_control": {"type": "ephemeral"},  # Breakpoint 1
        }
    ],
    messages=[
        {"role": "user", "content": [
            {"type": "text", "text": reference_doc,
             "cache_control": {"type": "ephemeral"}},  # Breakpoint 2
        ]},
        {"role": "assistant", "content": previous_response},
        {"role": "user", "content": [
            {"type": "text", "text": conversation_context,
             "cache_control": {"type": "ephemeral"}},  # Breakpoint 3
            {"type": "text", "text": current_question},  # Fresh input
        ]},
    ],
)
```
Real-World Cost Analysis
Consider a customer support chatbot with:
- 2,500-token system prompt
- 10,000-token product knowledge base
- Average 8-turn conversations
- 10,000 conversations per day
Without caching:
- Static content (system prompt + knowledge base) reprocessed every turn: 12,500 × 8 = 100,000 tokens per conversation
- Fresh conversation tokens: roughly 20,000 per conversation
- Daily input tokens: ~1.2 billion
- Daily input cost (Sonnet): ~$3,600
With caching:
- Turn 1 writes the 12,500-token cache; turns 2 through 8 read it
- Daily cache writes: 125M tokens at $3.75/M ≈ $470
- Daily cache reads: 875M tokens at $0.30/M ≈ $260
- Daily fresh input: ~200M tokens at $3/M = $600
- Daily input cost: ~$1,330
Savings: roughly $2,270 per day (a 63% reduction)
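The arithmetic above can be reproduced with a small cost model; the token counts and prices are the ones from this scenario:

```python
# Cost model for the chatbot scenario above (Sonnet prices per million tokens).
WRITE, READ, FRESH = 3.75, 0.30, 3.00

def daily_cost(static_tokens, fresh_per_conv, turns, conversations, cached):
    """Daily input cost in dollars; token counts are raw tokens."""
    m = conversations / 1e6  # scale per-conversation tokens to daily millions
    if not cached:
        return (static_tokens * turns + fresh_per_conv) * m * FRESH
    writes = static_tokens * m * WRITE              # turn 1 writes the cache
    reads = static_tokens * (turns - 1) * m * READ  # turns 2..N read it
    fresh = fresh_per_conv * m * FRESH
    return writes + reads + fresh

uncached = daily_cost(12_500, 20_000, 8, 10_000, cached=False)
cached = daily_cost(12_500, 20_000, 8, 10_000, cached=True)
print(f"${uncached:,.0f} -> ${cached:,.0f} ({1 - cached / uncached:.0%} saved)")
```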
Latency Benefits
Prompt caching does not just save money -- it reduces time to first token (TTFT). Cached tokens are processed significantly faster than fresh tokens.
In practice, applications with large system prompts or reference documents see TTFT improvements of 40-60% on cached requests. For real-time applications like customer support chatbots, this improvement is immediately noticeable to users.
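To measure TTFT in your own application, time the arrival of the first streamed event. The sketch below uses a stand-in generator in place of a real streaming response:

```python
import time

def time_to_first_token(event_iter):
    """Seconds until the first event arrives from any streaming iterator.

    Works on the Anthropic streaming response or, as here, a stand-in
    generator; returns (ttft_seconds, first_event).
    """
    start = time.monotonic()
    first = next(event_iter)
    return time.monotonic() - start, first

# Stand-in for a streaming API response: the first token arrives after 50 ms.
def fake_stream():
    time.sleep(0.05)
    yield "first-token"
    yield "second-token"

ttft, first = time_to_first_token(fake_stream())
print(f"TTFT: {ttft * 1000:.0f} ms, first event: {first}")
```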
Common Pitfalls
Pitfall 1: Caching dynamic content. If the cached content changes on every request, you pay the cache write premium without ever getting a cache read. Only cache content that is stable across multiple requests.
Pitfall 2: Not monitoring cache hit rates. Use the cache_creation_input_tokens and cache_read_input_tokens fields in the response to track your cache performance. A healthy cache should have a read-to-write ratio above 5:1.
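A minimal way to track that ratio from the usage fields (plain dicts stand in for `response.usage` objects here):

```python
# Track cache efficiency across requests using the usage fields on responses.
def cache_read_write_ratio(usages):
    """Cache-read tokens per cache-write token (None if no writes yet)."""
    writes = sum(u["cache_creation_input_tokens"] for u in usages)
    reads = sum(u["cache_read_input_tokens"] for u in usages)
    return reads / writes if writes else None

usages = [
    {"cache_creation_input_tokens": 12_500, "cache_read_input_tokens": 0},
    {"cache_creation_input_tokens": 0, "cache_read_input_tokens": 12_500},
    {"cache_creation_input_tokens": 0, "cache_read_input_tokens": 12_500},
]
ratio = cache_read_write_ratio(usages)
print(f"read/write ratio: {ratio:.1f}")  # 2.0 so far; a healthy cache exceeds 5.0
```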
Pitfall 3: Cache invalidation from content changes. Even a single character change in the cached prefix invalidates the entire cache. If you need to update a knowledge base, batch the updates rather than making frequent small changes.
Pitfall 4: Exceeding the 5-minute TTL. If your application has bursty traffic with quiet periods longer than 5 minutes, the cache will expire between bursts. Consider implementing keep-alive requests during low-traffic periods if the savings justify it.
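The timing side of a keep-alive scheduler can be sketched independently of the API call itself; the actual ping would be a minimal request that reuses the same cached prefix:

```python
import time

CACHE_TTL_SECONDS = 5 * 60  # ephemeral cache lifetime; each hit resets it

def needs_keepalive(last_request_monotonic, now=None, margin=30.0):
    """True if the cache is about to expire and a cheap ping would preserve it.

    `margin` is a safety buffer (seconds) so the keep-alive lands before expiry.
    """
    now = time.monotonic() if now is None else now
    return (now - last_request_monotonic) >= (CACHE_TTL_SECONDS - margin)

# 4.5 minutes of silence -> time to send a minimal request hitting the cache.
print(needs_keepalive(last_request_monotonic=0.0, now=271.0))  # True
print(needs_keepalive(last_request_monotonic=0.0, now=100.0))  # False
```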