Claude Prompt Caching for Agent Systems: Reducing Costs by 90% on Repeated Contexts
Learn how to use Claude's prompt caching to dramatically reduce costs in agent systems by caching system prompts, tool definitions, and reference documents across multiple requests.
The Cost Problem in Agent Systems
Agent systems are expensive because every turn in the agent loop resends the entire conversation context — system prompt, tool definitions, previous messages, and tool results. A 10-turn agent interaction with a 4,000-token system prompt and 10 tool definitions means sending those same tokens 10 times. For high-volume agent systems processing thousands of conversations daily, this repetition dominates your API bill.
Claude's prompt caching solves this by allowing you to mark content that should be cached on Anthropic's servers. Cached content is read at 90% lower cost than fresh input tokens, and once cached, it persists for 5 minutes (extended each time it is used).
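To make the savings concrete, here is a back-of-envelope calculation for the 10-turn scenario above. The dollar price is an assumption for illustration; the multipliers reflect Anthropic's documented cache pricing (cache writes cost about 25% more than fresh input, cache reads about 10% of it):

```python
# Back-of-envelope savings for a 10-turn loop that resends a 4,000-token
# static prefix each turn. PRICE_PER_MTOK is illustrative, not current pricing.
PRICE_PER_MTOK = 3.00   # assumed $ per million input tokens
WRITE_MULT = 1.25       # cache writes cost ~25% more than fresh input
READ_MULT = 0.10        # cache reads cost ~10% of fresh input

prefix_tokens = 4_000
turns = 10

def cost(tokens: int, mult: float = 1.0) -> float:
    return tokens / 1_000_000 * PRICE_PER_MTOK * mult

uncached = turns * cost(prefix_tokens)
cached = cost(prefix_tokens, WRITE_MULT) + (turns - 1) * cost(prefix_tokens, READ_MULT)

print(f"uncached ${uncached:.4f} vs cached ${cached:.4f}")
# -> uncached $0.1200 vs cached $0.0258, roughly a 78% saving on the prefix
```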
How Prompt Caching Works
You mark content for caching by adding cache_control annotations to your message blocks. Anthropic caches everything up to the annotated block, and subsequent requests that match the cached prefix get the discount.
import anthropic

client = anthropic.Anthropic()

# System prompt with caching enabled
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a customer support agent for TechCorp. You handle billing inquiries, technical issues, and account management. Always verify customer identity before making account changes. Follow the escalation matrix for issues you cannot resolve...",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {"role": "user", "content": "I need help with my billing"}
    ]
)
The cache_control: {"type": "ephemeral"} marker tells Anthropic to cache everything up to and including the annotated block. The first request pays the full input token price plus a cache write premium (about 25% over the base input rate). Every subsequent request within the 5-minute window whose prefix matches exactly pays only 10% of the input token cost for the cached portion.
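Given those multipliers, caching pays for itself almost immediately. A quick sketch of the break-even point, using a unit price and the write/read multipliers above:

```python
# Cached cost as a fraction of uncached cost for n requests sharing a prefix,
# assuming a 1.25x write on the first request and 0.1x reads afterwards.
def relative_cost(n_requests: int) -> float:
    uncached = float(n_requests)
    cached = 1.25 + (n_requests - 1) * 0.10
    return cached / uncached

print(relative_cost(1))   # 1.25: a one-off request costs 25% more
print(relative_cost(2))   # 0.675: already cheaper by the second request
print(relative_cost(10))  # 0.215
```

In other words, a single cache read is enough to recoup the write premium; caching only loses money on prefixes that are never reused.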
Caching Tool Definitions
For agents with many tools, caching tool definitions provides the biggest savings because tool schemas are often large and identical across every request:
# Large tool definitions — perfect for caching
tools_with_cache = [
    {
        "name": "search_database",
        "description": "Search the product database by various criteria",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "category": {"type": "string"},
                "price_min": {"type": "number"},
                "price_max": {"type": "number"},
                "in_stock": {"type": "boolean"}
            },
            "required": ["query"]
        }
    },
    {
        "name": "create_ticket",
        "description": "Create a support ticket in the ticketing system",
        "input_schema": {
            "type": "object",
            "properties": {
                "subject": {"type": "string"},
                "priority": {"type": "string", "enum": ["low", "medium", "high", "critical"]},
                "description": {"type": "string"},
                "customer_id": {"type": "string"}
            },
            "required": ["subject", "priority", "description"]
        }
    },
    # ... more tools
]
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": long_system_prompt,
            # The cached prefix is assembled as tools -> system -> messages,
            # so this breakpoint covers the tool definitions as well
            "cache_control": {"type": "ephemeral"},
        }
    ],
    tools=tools_with_cache,
    messages=messages,
)
When you send the same system prompt and tools across multiple conversations, the cached prefix is reused. The more tools and the longer the system prompt, the more you save.
Caching Reference Documents
Agent systems that reference static documents — product catalogs, policy documents, knowledge bases — benefit enormously from caching:
# Load reference document once, cache it across all queries
with open("product_catalog.txt") as f:
    catalog_text = f.read()

def answer_product_question(question: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": "You are a product specialist. Answer questions using the product catalog below.",
            },
            {
                "type": "text",
                "text": catalog_text,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[
            {"role": "user", "content": question}
        ]
    )
    return response.content[0].text
A 50,000-token product catalog costs full price on the first call but only 10% on every subsequent call within the cache window. For a support system handling 100 queries per hour, this turns a substantial input cost into a rounding error.
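A quick sanity check on that claim, again with an assumed base price of $3 per million input tokens and one conservative cache write per hour:

```python
# Hourly input cost of a 50,000-token catalog at 100 queries/hour,
# with one cache write (1.25x) and 99 cache reads (0.1x). Price is assumed.
PRICE = 3.00 / 1_000_000   # $ per input token
catalog_tokens = 50_000
queries = 100

uncached_hourly = queries * catalog_tokens * PRICE
cached_hourly = catalog_tokens * PRICE * (1.25 + (queries - 1) * 0.10)

print(f"${uncached_hourly:.2f}/hr uncached vs ${cached_hourly:.2f}/hr cached")
```

At these numbers, roughly $15/hr of catalog tokens drops to under $2/hr, and in steady state the write cost amortizes further because a continuously hit cache never expires.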
Cache-Friendly Architecture
Design your agent's message structure to maximize cache hit rates:
def build_agent_messages(system_prompt: str, tools: list,
                         reference_docs: list[str],
                         conversation_history: list) -> dict:
    """Structure messages for optimal caching.

    The API assembles the cached prefix as tools -> system -> messages,
    so a breakpoint on the last system block covers the tools too.
    Static content comes first so the cached prefix is longest.
    """
    system_blocks = [
        {
            "type": "text",
            "text": system_prompt,
        }
    ]

    # Add reference documents, with the cache breakpoint after the last one
    for i, doc in enumerate(reference_docs):
        block = {"type": "text", "text": doc}
        if i == len(reference_docs) - 1:
            block["cache_control"] = {"type": "ephemeral"}
        system_blocks.append(block)

    # No reference docs: fall back to caching the system prompt itself
    if not reference_docs:
        system_blocks[-1]["cache_control"] = {"type": "ephemeral"}

    return {
        "system": system_blocks,
        "tools": tools,
        "messages": conversation_history,
    }
The key principle is prefix matching — caching works from the beginning of the content forward. Put static content (system prompt, reference docs) first, and dynamic content (conversation history) last.
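A small sketch of why that ordering matters: the cache applies only to the longest identical leading run of content, so any dynamic field near the front invalidates everything behind it. The block contents here are hypothetical stand-ins for real requests:

```python
def shared_prefix_blocks(a: list[dict], b: list[dict]) -> int:
    """Count how many leading content blocks two requests have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count

# Static content first: two requests share a long, cacheable prefix
good_a = [{"text": "system prompt"}, {"text": "catalog"}, {"text": "user: hi"}]
good_b = [{"text": "system prompt"}, {"text": "catalog"}, {"text": "user: bye"}]

# A timestamp up front: prefixes diverge immediately, so nothing is reused
bad_a = [{"text": "now: 12:00"}, {"text": "system prompt"}, {"text": "catalog"}]
bad_b = [{"text": "now: 12:05"}, {"text": "system prompt"}, {"text": "catalog"}]

print(shared_prefix_blocks(good_a, good_b))  # 2
print(shared_prefix_blocks(bad_a, bad_b))    # 0
```

A common way to break caching by accident is injecting the current date or a request ID into the system prompt; move per-request values into the user message instead.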
Monitoring Cache Performance
Track cache hit rates to verify your caching strategy works:
def log_cache_metrics(response):
    usage = response.usage
    cached = getattr(usage, "cache_read_input_tokens", 0) or 0
    cache_created = getattr(usage, "cache_creation_input_tokens", 0) or 0
    total_input = usage.input_tokens

    if total_input > 0:
        cache_rate = cached / (cached + total_input) * 100
        print(f"Cache hit rate: {cache_rate:.1f}%")
        print(f"Cached tokens: {cached}, Fresh tokens: {total_input}")
    if cache_created > 0:
        print(f"New cache created: {cache_created} tokens")
A healthy agent system should show 80-95% cache hit rates on the system prompt and tool definitions after the initial warm-up request.
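Per-response prints get noisy in production; a small aggregator (a sketch, reading the same usage fields as above) gives a rolling hit rate you can alert on:

```python
from dataclasses import dataclass

@dataclass
class CacheStats:
    """Accumulates cache reads vs fresh input tokens across responses."""
    cached: int = 0
    fresh: int = 0

    def record(self, usage) -> None:
        # Missing attributes count as zero, as in log_cache_metrics
        self.cached += getattr(usage, "cache_read_input_tokens", 0) or 0
        self.fresh += usage.input_tokens

    @property
    def hit_rate(self) -> float:
        total = self.cached + self.fresh
        return self.cached / total if total else 0.0
```

Call stats.record(response.usage) after each request and flag any sustained drop below your expected 80-95% band, which usually signals an accidental cache-busting change to the prefix.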
FAQ
How long does the cache last?
Cached content has a 5-minute TTL that resets every time the cache is hit. In practice, any system handling more than one request per 5 minutes keeps the cache warm indefinitely. If your traffic is bursty with long gaps, consider sending a lightweight "keep-alive" request to prevent cache expiration before a burst.
Is there a minimum content size for caching?
Yes. The prompt must contain at least 1,024 tokens on Claude Opus and Sonnet models (2,048 tokens on Claude Haiku models) to be eligible for caching. Short system prompts below the threshold will not be cached even with the cache_control annotation. Combine your system prompt with tool definitions or reference documents to meet the minimum.
Does caching work across different conversations?
Yes, as long as the cached prefix is identical. Two different users asking different questions but sharing the same system prompt and tools will share the cache. This makes caching especially powerful for multi-tenant agent systems where every conversation uses the same base configuration.
#Claude #PromptCaching #CostOptimization #Performance #Python #AgenticAI #LearnAI #AIEngineering
Written by
CallSphere Team
Expert insights on AI voice agents and customer communication automation.