Claude's Extended Thinking: When to Use It and When Not To
Understand Claude's extended thinking feature, how it improves reasoning quality for complex tasks, when it adds value vs. unnecessary cost, and implementation patterns for production applications.
What Is Extended Thinking?
Extended thinking is a Claude feature that allocates dedicated reasoning tokens before generating the final response. When enabled, Claude produces a chain-of-thought "thinking" block where it reasons through the problem step by step, then generates its answer based on that reasoning.
This is different from simply asking Claude to "think step by step" in the prompt. Extended thinking uses a separate token budget and processing phase specifically designed for deep reasoning, and the thinking content is returned separately from the response so you can inspect Claude's reasoning process.
How to Enable Extended Thinking
from anthropic import Anthropic

client = Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000,  # At least 1,024 and less than max_tokens
    },
    messages=[{
        "role": "user",
        "content": "A farmer needs to cross a river with a wolf, a goat, and a cabbage. The boat can only carry the farmer and one item. If left alone, the wolf will eat the goat, and the goat will eat the cabbage. How can the farmer get everything across safely?"
    }]
)

# The response contains both thinking and text blocks
for block in response.content:
    if block.type == "thinking":
        print("=== THINKING ===")
        print(block.thinking)
    elif block.type == "text":
        print("\n=== RESPONSE ===")
        print(block.text)
When Extended Thinking Adds Value
Complex Mathematical Reasoning
Extended thinking dramatically improves accuracy on multi-step math problems. Without it, Claude might skip steps or make arithmetic errors. With it, Claude works through each step methodically.
Benchmark improvement: On the MATH benchmark, extended thinking can improve accuracy by 10-20 percentage points compared to standard responses.
Code Architecture Decisions
When designing complex systems, extended thinking helps Claude consider more alternatives, evaluate tradeoffs, and arrive at better-reasoned recommendations:
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{
        "role": "user",
        "content": """Design the database schema for a multi-tenant SaaS application that needs:
- Per-tenant data isolation
- Shared resources for common configurations
- Audit logging for compliance
- Support for 10,000+ tenants with varying data volumes
- Sub-100ms query latency for dashboard queries

Consider row-level security, partitioning strategies, and caching layers."""
    }]
)
Ambiguous Requirements Analysis
When requirements are vague or contradictory, extended thinking helps Claude identify and reason through the ambiguities:
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 5000},
    messages=[{
        "role": "user",
        "content": """Our client wants a 'fast, secure, and cheap' authentication system
that supports 'millions of users' with 'zero downtime' and must be built in '2 weeks.'
Identify the tradeoffs and propose a realistic architecture."""
    }]
)
Multi-Step Planning
Extended thinking excels at tasks that require planning multiple steps with dependencies:
- Migration planning for large codebases
- Incident response procedures
- Project decomposition and scheduling
- Complex SQL query construction
When NOT to Use Extended Thinking
Simple Factual Questions
"What is the capital of France?" does not benefit from extended thinking. The answer is immediate and certain. Thinking tokens are wasted.
Template-Based Generation
Generating emails, form letters, or structured outputs from templates does not require deep reasoning. The overhead of thinking tokens adds cost without improving quality.
Classification Tasks
Binary or multi-class classification is typically a pattern-matching task that does not benefit from extended reasoning:
# DON'T use extended thinking for this
response = client.messages.create(
    model="claude-haiku-4-5",  # Use Haiku, no thinking
    max_tokens=100,
    messages=[{
        "role": "user",
        "content": "Classify this email as spam or not spam: 'You won $1M! Click here...'"
    }]
)
High-Volume, Low-Latency Applications
Extended thinking adds latency (the thinking phase runs before the response begins) and cost (thinking tokens are billed as output tokens). For chatbots handling thousands of concurrent conversations, the overhead is unjustified for routine queries.
Cost and Latency Impact
Token Costs
Thinking tokens are billed as output tokens. At Claude Sonnet rates:
| Budget | Thinking Cost | Typical Response Cost | Total |
|---|---|---|---|
| 1,000 tokens | $0.015 | $0.015 | $0.030 |
| 5,000 tokens | $0.075 | $0.015 | $0.090 |
| 10,000 tokens | $0.150 | $0.015 | $0.165 |
| 50,000 tokens | $0.750 | $0.015 | $0.765 |
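The table above can be reproduced with a small estimator. The $15-per-million output rate and the 1,000-token typical response are the figures assumed in the table; adjust both for your model and workload.

```python
def thinking_cost(budget_tokens: int, response_tokens: int = 1000,
                  output_rate_per_mtok: float = 15.0) -> dict:
    """Estimate per-request cost, assuming thinking tokens bill as output tokens."""
    thinking = budget_tokens / 1_000_000 * output_rate_per_mtok
    response = response_tokens / 1_000_000 * output_rate_per_mtok
    return {"thinking": thinking, "response": response, "total": thinking + response}

for budget in (1_000, 5_000, 10_000, 50_000):
    cost = thinking_cost(budget)
    print(f"{budget:>6} tokens: ${cost['thinking']:.3f} thinking "
          f"+ ${cost['response']:.3f} response = ${cost['total']:.3f}")
```

Running this against your own response-length statistics gives a quick per-request budget check before you commit to a thinking budget in production.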
Latency Impact
Thinking tokens must be generated before the response begins, which directly increases time to first token (TTFT):
- 1,000 thinking tokens: +1-2 seconds TTFT
- 5,000 thinking tokens: +5-10 seconds TTFT
- 10,000 thinking tokens: +10-20 seconds TTFT
For interactive applications, keep thinking budgets modest (1,000-5,000 tokens). For offline analysis, larger budgets (10,000-50,000) are acceptable.
Streaming with Extended Thinking
You can stream both the thinking and response phases:
with client.messages.stream(
    model="claude-sonnet-4-5",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 5000},
    messages=[{"role": "user", "content": "Design a rate limiter for a distributed system."}],
) as stream:
    for event in stream:
        if event.type == "content_block_start":
            if event.content_block.type == "thinking":
                print("\n[Thinking...]")
            elif event.content_block.type == "text":
                print("\n[Response]")
        elif event.type == "content_block_delta":
            if event.delta.type == "thinking_delta":
                pass  # Optionally show thinking to the user
            elif event.delta.type == "text_delta":
                print(event.delta.text, end="", flush=True)
Practical Decision Framework
Use this flowchart to decide whether to enable extended thinking:
- Is the task time-sensitive (< 3 second response needed)? -> No extended thinking
- Is the answer deterministic or template-based? -> No extended thinking
- Does the task involve multi-step reasoning? -> Yes, use 3,000-5,000 budget
- Does the task involve complex analysis with tradeoffs? -> Yes, use 5,000-10,000 budget
- Is this an offline analysis or batch job? -> Yes, use 10,000-50,000 budget
- Is correctness critical (financial, medical, legal)? -> Yes, use maximum budget
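The checklist above folds naturally into a small helper. The budget tiers below sit inside the ranges the list suggests; they are this article's guidance, not API limits, so tune them for your workload.

```python
def choose_thinking_budget(
    time_sensitive: bool = False,
    deterministic: bool = False,
    multi_step: bool = False,
    complex_tradeoffs: bool = False,
    batch_job: bool = False,
    correctness_critical: bool = False,
    max_budget: int = 50_000,
) -> int:
    """Return a thinking budget in tokens; 0 means leave thinking disabled."""
    if time_sensitive or deterministic:
        return 0           # latency or templates: thinking is wasted
    if correctness_critical:
        return max_budget  # financial/medical/legal: spend the tokens
    if batch_job:
        return 25_000      # offline analysis tolerates large budgets
    if complex_tradeoffs:
        return 8_000       # analysis with tradeoffs
    if multi_step:
        return 4_000       # multi-step reasoning
    return 0
```

A budget of 0 maps to omitting the thinking parameter entirely; any positive value maps to {"type": "enabled", "budget_tokens": value}.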
Multi-Turn Conversations with Thinking
In multi-turn conversations, you can pass previous assistant turns back into the history verbatim, thinking blocks included. The API strips thinking blocks from earlier turns before they reach the model, so in ordinary conversations they are not billed as input tokens. The exception is tool use: the thinking block from the current assistant turn must be preserved while the tool-result loop completes. Passing response.content back unmodified, as below, handles both cases.
# First turn with thinking
messages = [{"role": "user", "content": "Design a caching strategy for our API."}]

response1 = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 5000},
    messages=messages,
)

# Second turn -- pass the full content (thinking included) back in history
messages.append({"role": "assistant", "content": response1.content})
messages.append({"role": "user", "content": "Now consider how this works with database read replicas."})

response2 = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 5000},
    messages=messages,
)
Redacting Thinking in Production
In some applications, you may want to use extended thinking for quality but not expose the thinking process to end users. The thinking content is returned in a separate block, making it easy to filter:
def get_response_only(response) -> str:
    """Extract only the text response, discarding thinking blocks."""
    return "".join(
        block.text for block in response.content if block.type == "text"
    )

def get_thinking_only(response) -> str:
    """Extract only thinking blocks for debugging/logging."""
    return "".join(
        block.thinking for block in response.content if block.type == "thinking"
    )
Log the thinking content for debugging and quality analysis, but only return the text response to users.
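The filters are easy to sanity-check without an API call. The sketch below uses SimpleNamespace stand-ins that mimic the shape of the SDK's content blocks (type, text, thinking attributes), and redefines the helpers so the snippet runs standalone; the block contents are invented for the example.

```python
from types import SimpleNamespace

def get_response_only(response) -> str:
    """Extract only the text response, discarding thinking blocks."""
    return "".join(b.text for b in response.content if b.type == "text")

def get_thinking_only(response) -> str:
    """Extract only thinking blocks for debugging/logging."""
    return "".join(b.thinking for b in response.content if b.type == "thinking")

# Stand-in for a response with one thinking block and one text block.
fake_response = SimpleNamespace(content=[
    SimpleNamespace(type="thinking", thinking="Consider cache invalidation..."),
    SimpleNamespace(type="text", text="Use a write-through cache."),
])

print(get_response_only(fake_response))   # only the user-facing text
print(get_thinking_only(fake_response))   # only the reasoning, for logs
```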