Learn Agentic AI

Claude Prompt Caching for Agent Systems: Reducing Costs by 90% on Repeated Contexts

Learn how to use Claude's prompt caching to dramatically reduce costs in agent systems by caching system prompts, tool definitions, and reference documents across multiple requests.

The Cost Problem in Agent Systems

Agent systems are expensive because every turn in the agent loop resends the entire conversation context — system prompt, tool definitions, previous messages, and tool results. A 10-turn agent interaction with a 4,000-token system prompt and 10 tool definitions means sending those same tokens 10 times. For high-volume agent systems processing thousands of conversations daily, this repetition dominates your API bill.

Claude's prompt caching solves this by allowing you to mark content that should be cached on Anthropic's servers. Cached content is read at 90% lower cost than fresh input tokens, and once cached, it persists for 5 minutes (extended each time it is used).
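To make the savings concrete, here is a back-of-envelope calculation for the 10-turn example above. The prices are illustrative per-million-token figures, so check Anthropic's current pricing for your model; the 1.25x write surcharge and 0.1x read discount are the documented multipliers.

```python
# Back-of-envelope savings for the 10-turn example above.
# Prices are illustrative $/million input tokens -- check current
# Anthropic pricing for your model. Cache writes cost ~1.25x input;
# cache reads cost 0.1x.
PRICE_IN = 3.00
PRICE_CACHE_WRITE = 3.75
PRICE_CACHE_READ = 0.30

static_tokens = 6_000   # system prompt + tool definitions
turns = 10

# Without caching: the full static prefix is billed fresh on every turn
without_cache = turns * static_tokens / 1e6 * PRICE_IN

# With caching: one cache write on turn 1, cache reads on turns 2-10
with_cache = (static_tokens / 1e6 * PRICE_CACHE_WRITE
              + (turns - 1) * static_tokens / 1e6 * PRICE_CACHE_READ)

print(f"without caching: ${without_cache:.4f}")
print(f"with caching:    ${with_cache:.4f}")
print(f"savings:         {(1 - with_cache / without_cache) * 100:.0f}%")
```

With these numbers the static prefix drops from $0.18 to roughly $0.04 per 10-turn conversation, and the gap widens as the prefix grows.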

How Prompt Caching Works

You mark content for caching by adding cache_control annotations to your message blocks. Anthropic caches everything up to the annotated block, and subsequent requests that match the cached prefix get the discount.

import anthropic

client = anthropic.Anthropic()

# System prompt with caching enabled
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a customer support agent for TechCorp. You handle billing inquiries, technical issues, and account management. Always verify customer identity before making account changes. Follow the escalation matrix for issues you cannot resolve...",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {"role": "user", "content": "I need help with my billing"}
    ]
)

The cache_control: {"type": "ephemeral"} marker tells Anthropic to cache this content. The first request pays the full input-token price plus a 25% cache-write surcharge. Every subsequent request within 5 minutes that starts with the exact same text pays only 10% of the input-token price for the cached portion.

Caching Tool Definitions

For agents with many tools, caching tool definitions provides the biggest savings because tool schemas are often large and identical across every request:

# Large tool definitions — perfect for caching
tools_with_cache = [
    {
        "name": "search_database",
        "description": "Search the product database by various criteria",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "category": {"type": "string"},
                "price_min": {"type": "number"},
                "price_max": {"type": "number"},
                "in_stock": {"type": "boolean"}
            },
            "required": ["query"]
        }
    },
    {
        "name": "create_ticket",
        "description": "Create a support ticket in the ticketing system",
        "input_schema": {
            "type": "object",
            "properties": {
                "subject": {"type": "string"},
                "priority": {"type": "string", "enum": ["low", "medium", "high", "critical"]},
                "description": {"type": "string"},
                "customer_id": {"type": "string"}
            },
            "required": ["subject", "priority", "description"]
        },
        # Put cache_control on the LAST tool in the list -- it caches
        # every tool definition before it as part of the prefix
        "cache_control": {"type": "ephemeral"},
    },
    # ... more tools (keep the marker on the final tool)
]

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": long_system_prompt,
            "cache_control": {"type": "ephemeral"},
        }
    ],
    tools=tools_with_cache,
    messages=messages,
)

When you send the same system prompt and tools across multiple conversations, the cached prefix is reused. The prefix is assembled in a fixed order — tools first, then system blocks, then messages — so a breakpoint on the system prompt also covers every tool definition before it. The more tools and the longer the system prompt, the more you save.

Caching Reference Documents

Agent systems that reference static documents — product catalogs, policy documents, knowledge bases — benefit enormously from caching:


# Load reference document once, cache it across all queries
with open("product_catalog.txt") as f:
    catalog_text = f.read()

def answer_product_question(question: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": "You are a product specialist. Answer questions using the product catalog below.",
            },
            {
                "type": "text",
                "text": catalog_text,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[
            {"role": "user", "content": question}
        ]
    )
    return response.content[0].text

A 50,000-token product catalog costs full price on the first call but only 10% on every subsequent call within the cache window. For a support system handling 100 queries per hour, this turns a substantial input cost into a rounding error.

Cache-Friendly Architecture

Design your agent's message structure to maximize cache hit rates:

def build_agent_messages(system_prompt: str, tools: list,
                         reference_docs: list[str],
                         conversation_history: list) -> dict:
    """Structure messages for optimal caching.

    Static content (system prompt, reference docs) goes first so the
    cached prefix is longest; the API assembles the prompt in a fixed
    order: tools -> system blocks -> conversation messages.
    """
    system_blocks = [
        {
            "type": "text",
            "text": system_prompt,
        }
    ]

    # Add reference documents
    for i, doc in enumerate(reference_docs):
        block = {"type": "text", "text": doc}
        # Cache after the last reference doc
        if i == len(reference_docs) - 1:
            block["cache_control"] = {"type": "ephemeral"}
        system_blocks.append(block)

    return {
        "system": system_blocks,
        "tools": tools,
        "messages": conversation_history,
    }

The key principle is prefix matching — caching works from the beginning of the content forward. Put static content (system prompt, reference docs) first, and dynamic content (conversation history) last.
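Prefix matching can also be extended into the conversation itself: move a cache breakpoint onto the most recent message each turn, so that on the next request the system prompt, tools, and all earlier turns are read from cache and only the newest message is fresh input. A minimal sketch — the helper name is ours, not part of the SDK:

```python
def add_incremental_cache_breakpoint(conversation_history: list) -> list:
    """Mark the newest message so the whole conversation so far is cached.

    On the next turn the prefix (system, tools, all earlier messages)
    is read from cache; only the new message is billed as fresh input.
    """
    if not conversation_history:
        return conversation_history

    # Strip stale breakpoints from earlier messages (the API limits
    # how many cache_control markers a single request may carry)
    for msg in conversation_history[:-1]:
        if isinstance(msg["content"], list):
            for block in msg["content"]:
                block.pop("cache_control", None)

    last = conversation_history[-1]
    content = last["content"]
    if isinstance(content, str):
        # Normalize plain-string content to block form so it can
        # carry a cache_control annotation
        content = [{"type": "text", "text": content}]
    content[-1] = {**content[-1], "cache_control": {"type": "ephemeral"}}
    conversation_history[-1] = {**last, "content": content}
    return conversation_history
```

Call this on the history before each messages.create call; the breakpoint slides forward one turn at a time, so each request only pays fresh-input price for what changed since the last one.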

Monitoring Cache Performance

Track cache hit rates to verify your caching strategy works:

def log_cache_metrics(response):
    usage = response.usage
    cached = getattr(usage, "cache_read_input_tokens", 0) or 0
    cache_created = getattr(usage, "cache_creation_input_tokens", 0) or 0
    fresh = usage.input_tokens  # input_tokens excludes cached tokens

    total = cached + cache_created + fresh
    if total > 0:
        cache_rate = cached / total * 100
        print(f"Cache hit rate: {cache_rate:.1f}%")
        print(f"Cached tokens: {cached}, Fresh tokens: {fresh}")
        if cache_created > 0:
            print(f"New cache created: {cache_created} tokens")

A healthy agent system should show 80-95% cache hit rates on the system prompt and tool definitions after the initial warm-up request.

FAQ

How long does the cache last?

Cached content has a 5-minute TTL that resets every time the cache is hit. In practice, any system handling more than one request per 5 minutes keeps the cache warm indefinitely. If your traffic is bursty with long gaps, consider sending a lightweight "keep-alive" request to prevent cache expiration before a burst.
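A keep-alive can be as simple as re-sending the cached prefix with a one-token request on a timer shorter than the TTL. A rough sketch — keep_cache_warm is a hypothetical helper; it assumes an anthropic.Anthropic client and the same system blocks and tools you cache elsewhere:

```python
import threading

def keep_cache_warm(client, system_blocks, tools, interval=240):
    """Re-send the cached prefix every `interval` seconds (under the
    5-minute TTL) so the cache never expires between bursts.
    Returns the Timer so the caller can cancel it."""
    client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1,          # minimal output; the input prefix is what matters
        system=system_blocks,  # must match the cached prefix exactly
        tools=tools,
        messages=[{"role": "user", "content": "ping"}],
    )
    timer = threading.Timer(interval, keep_cache_warm,
                            args=(client, system_blocks, tools, interval))
    timer.daemon = True
    timer.start()
    return timer
```

Weigh the keep-alive's cost (cache reads at 10% price every 4 minutes) against the cache-write surcharge you would pay when the burst arrives cold; for large prefixes and predictable bursts it usually pays for itself.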

Is there a minimum content size for caching?

Yes. The prompt must be at least 1,024 tokens for Claude Sonnet and Opus models (2,048 tokens for Claude Haiku) to be eligible for caching. Short system prompts below these thresholds will not be cached even with the cache_control annotation. Combine your system prompt with reference documents to meet the minimum.

Does caching work across different conversations?

Yes, as long as the cached prefix is identical. Two different users asking different questions but sharing the same system prompt and tools will share the cache. This makes caching especially powerful for multi-tenant agent systems where every conversation uses the same base configuration.


#Claude #PromptCaching #CostOptimization #Performance #Python #AgenticAI #LearnAI #AIEngineering


Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
