Claude's Extended Thinking: When to Use It and When Not To
Understand Claude's extended thinking feature, how it improves reasoning quality for complex tasks, when it adds value vs. unnecessary cost, and implementation patterns for production applications.
What Is Extended Thinking?
Extended thinking is a Claude feature that allocates dedicated reasoning tokens before generating the final response. When enabled, Claude produces a chain-of-thought "thinking" block where it reasons through the problem step by step, then generates its answer based on that reasoning.
This is different from simply asking Claude to "think step by step" in the prompt. Extended thinking uses a separate token budget and processing phase specifically designed for deep reasoning, and the thinking content is returned separately from the response so you can inspect Claude's reasoning process.
How to Enable Extended Thinking
from anthropic import Anthropic

client = Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000,  # At least 1,024 and less than max_tokens
    },
    messages=[{
        "role": "user",
        "content": "A farmer needs to cross a river with a wolf, a goat, and a cabbage. The boat can only carry the farmer and one item. If left alone, the wolf will eat the goat, and the goat will eat the cabbage. How can the farmer get everything across safely?"
    }]
)

# The response contains both thinking and text blocks
for block in response.content:
    if block.type == "thinking":
        print("=== THINKING ===")
        print(block.thinking)
    elif block.type == "text":
        print("\n=== RESPONSE ===")
        print(block.text)
When Extended Thinking Adds Value
Complex Mathematical Reasoning
Extended thinking dramatically improves accuracy on multi-step math problems. Without it, Claude might skip steps or make arithmetic errors. With it, Claude works through each step methodically.
Benchmark improvement: On the MATH benchmark, extended thinking can improve accuracy by 10-20 percentage points compared to standard responses.
Code Architecture Decisions
When designing complex systems, extended thinking helps Claude consider more alternatives, evaluate tradeoffs, and arrive at better-reasoned recommendations:
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{
        "role": "user",
        "content": """Design the database schema for a multi-tenant SaaS application that needs:
- Per-tenant data isolation
- Shared resources for common configurations
- Audit logging for compliance
- Support for 10,000+ tenants with varying data volumes
- Sub-100ms query latency for dashboard queries

Consider row-level security, partitioning strategies, and caching layers."""
    }]
)
Ambiguous Requirements Analysis
When requirements are vague or contradictory, extended thinking helps Claude identify and reason through the ambiguities:
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 5000},
    messages=[{
        "role": "user",
        "content": """Our client wants a 'fast, secure, and cheap' authentication system
that supports 'millions of users' with 'zero downtime' and must be built in '2 weeks.'
Identify the tradeoffs and propose a realistic architecture."""
    }]
)
Multi-Step Planning
Extended thinking excels at tasks that require planning multiple steps with dependencies:
- Migration planning for large codebases
- Incident response procedures
- Project decomposition and scheduling
- Complex SQL query construction
When NOT to Use Extended Thinking
Simple Factual Questions
"What is the capital of France?" does not benefit from extended thinking. The answer is immediate and certain. Thinking tokens are wasted.
Template-Based Generation
Generating emails, form letters, or structured outputs from templates does not require deep reasoning. The overhead of thinking tokens adds cost without improving quality.
Classification Tasks
Binary or multi-class classification is typically a pattern-matching task that does not benefit from extended reasoning:
# DON'T use extended thinking for this
response = client.messages.create(
    model="claude-haiku-4-5",  # Use Haiku, no thinking
    max_tokens=100,
    messages=[{
        "role": "user",
        "content": "Classify this email as spam or not spam: 'You won $1M! Click here...'"
    }]
)
High-Volume, Low-Latency Applications
Extended thinking adds latency (the thinking phase runs before the response begins) and cost (thinking tokens are billed as output tokens). For chatbots handling thousands of concurrent conversations, the overhead is unjustified for routine queries.
Cost and Latency Impact
Token Costs
Thinking tokens are billed as output tokens. At Claude Sonnet rates:
| Budget | Thinking Cost | Typical Response Cost | Total |
|---|---|---|---|
| 1,000 tokens | $0.015 | $0.015 | $0.030 |
| 5,000 tokens | $0.075 | $0.015 | $0.090 |
| 10,000 tokens | $0.150 | $0.015 | $0.165 |
| 50,000 tokens | $0.750 | $0.015 | $0.765 |
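The table above can be reproduced with a small estimator. The $15-per-million output rate and the 1,000-token typical response are the figures assumed in the table; adjust both for your model and workload.

```python
def thinking_cost(budget_tokens: int, response_tokens: int = 1000,
                  output_rate_per_mtok: float = 15.0) -> dict:
    """Estimate per-request cost, assuming thinking tokens bill as output tokens."""
    thinking = budget_tokens / 1_000_000 * output_rate_per_mtok
    response = response_tokens / 1_000_000 * output_rate_per_mtok
    return {"thinking": thinking, "response": response, "total": thinking + response}

for budget in (1_000, 5_000, 10_000, 50_000):
    cost = thinking_cost(budget)
    print(f"{budget:>6} tokens: ${cost['thinking']:.3f} thinking "
          f"+ ${cost['response']:.3f} response = ${cost['total']:.3f}")
```

Running this against your own response-length statistics gives a quick per-request budget check before you commit to a thinking budget in production.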
Latency Impact
Thinking tokens must be generated before the response begins, which directly increases time to first token (TTFT):
- 1,000 thinking tokens: +1-2 seconds TTFT
- 5,000 thinking tokens: +5-10 seconds TTFT
- 10,000 thinking tokens: +10-20 seconds TTFT
For interactive applications, keep thinking budgets modest (1,000-5,000 tokens). For offline analysis, larger budgets (10,000-50,000) are acceptable.
Streaming with Extended Thinking
You can stream both the thinking and response phases:
with client.messages.stream(
    model="claude-sonnet-4-5",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 5000},
    messages=[{"role": "user", "content": "Design a rate limiter for a distributed system."}],
) as stream:
    for event in stream:
        if event.type == "content_block_start":
            if event.content_block.type == "thinking":
                print("\n[Thinking...]")
            elif event.content_block.type == "text":
                print("\n[Response]")
        elif event.type == "content_block_delta":
            if event.delta.type == "thinking_delta":
                pass  # Optionally show thinking to the user
            elif event.delta.type == "text_delta":
                print(event.delta.text, end="", flush=True)
Practical Decision Framework
Use this flowchart to decide whether to enable extended thinking:
- Is the task time-sensitive (< 3 second response needed)? -> No extended thinking
- Is the answer deterministic or template-based? -> No extended thinking
- Does the task involve multi-step reasoning? -> Yes, use 3,000-5,000 budget
- Does the task involve complex analysis with tradeoffs? -> Yes, use 5,000-10,000 budget
- Is this an offline analysis or batch job? -> Yes, use 10,000-50,000 budget
- Is correctness critical (financial, medical, legal)? -> Yes, use maximum budget
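The checklist above folds naturally into a small helper. The budget tiers below sit inside the ranges the list suggests; they are this article's guidance, not API limits, so tune them for your workload.

```python
def choose_thinking_budget(
    time_sensitive: bool = False,
    deterministic: bool = False,
    multi_step: bool = False,
    complex_tradeoffs: bool = False,
    batch_job: bool = False,
    correctness_critical: bool = False,
    max_budget: int = 50_000,
) -> int:
    """Return a thinking budget in tokens; 0 means leave thinking disabled."""
    if time_sensitive or deterministic:
        return 0           # latency or templates: thinking is wasted
    if correctness_critical:
        return max_budget  # financial/medical/legal: spend the tokens
    if batch_job:
        return 25_000      # offline analysis tolerates large budgets
    if complex_tradeoffs:
        return 8_000       # analysis with tradeoffs
    if multi_step:
        return 4_000       # multi-step reasoning
    return 0
```

A budget of 0 maps to omitting the thinking parameter entirely; any positive value maps to {"type": "enabled", "budget_tokens": value}.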
Multi-Turn Conversations with Thinking
In multi-turn conversations, you can pass previous assistant turns back into the history verbatim, thinking blocks included. The API strips thinking blocks from earlier turns before they reach the model, so in ordinary conversations they are not billed as input tokens. The exception is tool use: the thinking block from the current assistant turn must be preserved while the tool-result loop completes. Passing response.content back unmodified, as below, handles both cases.
# First turn with thinking
messages = [{"role": "user", "content": "Design a caching strategy for our API."}]

response1 = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 5000},
    messages=messages,
)

# Second turn -- pass the full content (thinking included) back in history
messages.append({"role": "assistant", "content": response1.content})
messages.append({"role": "user", "content": "Now consider how this works with database read replicas."})

response2 = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 5000},
    messages=messages,
)
Redacting Thinking in Production
In some applications, you may want to use extended thinking for quality but not expose the thinking process to end users. The thinking content is returned in a separate block, making it easy to filter:
def get_response_only(response) -> str:
    """Extract only the text response, discarding thinking blocks."""
    return "".join(
        block.text for block in response.content if block.type == "text"
    )

def get_thinking_only(response) -> str:
    """Extract only thinking blocks for debugging/logging."""
    return "".join(
        block.thinking for block in response.content if block.type == "thinking"
    )
Log the thinking content for debugging and quality analysis, but only return the text response to users.
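The filters are easy to sanity-check without an API call. The sketch below uses SimpleNamespace stand-ins that mimic the shape of the SDK's content blocks (type, text, thinking attributes), and redefines the helpers so the snippet runs standalone; the block contents are invented for the example.

```python
from types import SimpleNamespace

def get_response_only(response) -> str:
    """Extract only the text response, discarding thinking blocks."""
    return "".join(b.text for b in response.content if b.type == "text")

def get_thinking_only(response) -> str:
    """Extract only thinking blocks for debugging/logging."""
    return "".join(b.thinking for b in response.content if b.type == "thinking")

# Stand-in for a response with one thinking block and one text block.
fake_response = SimpleNamespace(content=[
    SimpleNamespace(type="thinking", thinking="Consider cache invalidation..."),
    SimpleNamespace(type="text", text="Use a write-through cache."),
])

print(get_response_only(fake_response))   # only the user-facing text
print(get_thinking_only(fake_response))   # only the reasoning, for logs
```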