Agentic AI · 6 min read

Claude's Extended Thinking: When to Use It and When Not To

Understand Claude's extended thinking feature, how it improves reasoning quality for complex tasks, when it adds value vs. unnecessary cost, and implementation patterns for production applications.

What Is Extended Thinking?

Extended thinking is a Claude feature that allocates dedicated reasoning tokens before generating the final response. When enabled, Claude produces a chain-of-thought "thinking" block where it reasons through the problem step by step, then generates its answer based on that reasoning.

This is different from simply asking Claude to "think step by step" in the prompt. Extended thinking uses a separate token budget and processing phase specifically designed for deep reasoning, and the thinking content is returned separately from the response so you can inspect Claude's reasoning process.

How to Enable Extended Thinking

from anthropic import Anthropic

client = Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000  # Must be >= 1024 and less than max_tokens
    },
    messages=[{
        "role": "user",
        "content": "A farmer needs to cross a river with a wolf, a goat, and a cabbage. The boat can only carry the farmer and one item. If left alone, the wolf will eat the goat, and the goat will eat the cabbage. How can the farmer get everything across safely?"
    }]
)

# The response contains both thinking and text blocks
for block in response.content:
    if block.type == "thinking":
        print("=== THINKING ===")
        print(block.thinking)
    elif block.type == "text":
        print("\n=== RESPONSE ===")
        print(block.text)

When Extended Thinking Adds Value

Complex Mathematical Reasoning

Extended thinking dramatically improves accuracy on multi-step math problems. Without it, Claude might skip steps or make arithmetic errors. With it, Claude works through each step methodically.

Benchmark improvement: On the MATH benchmark, extended thinking improves accuracy by 10-20 percentage points compared to standard responses.

Code Architecture Decisions

When designing complex systems, extended thinking helps Claude consider more alternatives, evaluate tradeoffs, and arrive at better-reasoned recommendations:

response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{
        "role": "user",
        "content": """Design the database schema for a multi-tenant SaaS application that needs:
- Per-tenant data isolation
- Shared resources for common configurations
- Audit logging for compliance
- Support for 10,000+ tenants with varying data volumes
- Sub-100ms query latency for dashboard queries

Consider row-level security, partitioning strategies, and caching layers."""
    }]
)

Ambiguous Requirements Analysis

When requirements are vague or contradictory, extended thinking helps Claude identify and reason through the ambiguities:

response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 5000},
    messages=[{
        "role": "user",
        "content": """Our client wants a 'fast, secure, and cheap' authentication system
that supports 'millions of users' with 'zero downtime' and must be built in '2 weeks.'
Identify the tradeoffs and propose a realistic architecture."""
    }]
)

Multi-Step Planning

Extended thinking excels at tasks that require planning multiple steps with dependencies:

  • Migration planning for large codebases
  • Incident response procedures
  • Project decomposition and scheduling
  • Complex SQL query construction

When NOT to Use Extended Thinking

Simple Factual Questions

"What is the capital of France?" does not benefit from extended thinking. The answer is immediate and certain. Thinking tokens are wasted.

Template-Based Generation

Generating emails, form letters, or structured outputs from templates does not require deep reasoning. The overhead of thinking tokens adds cost without improving quality.

Classification Tasks

Binary or multi-class classification is typically a pattern-matching task that does not benefit from extended reasoning:

# DON'T use extended thinking for this
response = client.messages.create(
    model="claude-haiku-4-5-20251001",  # Use Haiku, no thinking
    max_tokens=100,
    messages=[{
        "role": "user",
        "content": "Classify this email as spam or not spam: 'You won $1M! Click here...'"
    }]
)

High-Volume, Low-Latency Applications

Extended thinking adds latency (the thinking phase runs before the response begins) and cost (thinking tokens are billed as output tokens). For chatbots handling thousands of concurrent conversations, the overhead is unjustified for routine queries.

Cost and Latency Impact

Token Costs

Thinking tokens are billed as output tokens. At Claude Sonnet rates:

Budget          Thinking Cost   Typical Response Cost   Total
1,000 tokens    $0.015          $0.015                  $0.030
5,000 tokens    $0.075          $0.015                  $0.090
10,000 tokens   $0.150          $0.015                  $0.165
50,000 tokens   $0.750          $0.015                  $0.765
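The figures above follow directly from the per-token output rate. A tiny helper makes the arithmetic explicit (the $15 per million output tokens rate and the 1,000-token "typical response" are assumptions consistent with the table, not official pricing):

```python
# Rough per-request cost estimator, assuming thinking tokens are billed
# at the same output rate used in the table above ($15 / 1M tokens).
OUTPUT_RATE_PER_TOKEN = 15.00 / 1_000_000  # assumed Sonnet output rate

def thinking_cost(budget_tokens: int, response_tokens: int = 1_000) -> dict:
    """Estimate the cost of a fully used thinking budget plus a typical response."""
    thinking = budget_tokens * OUTPUT_RATE_PER_TOKEN
    response = response_tokens * OUTPUT_RATE_PER_TOKEN
    return {
        "thinking": round(thinking, 3),
        "response": round(response, 3),
        "total": round(thinking + response, 3),
    }

# Matches the table: a 10,000-token budget costs about $0.165 per request.
print(thinking_cost(10_000))  # -> {'thinking': 0.15, 'response': 0.015, 'total': 0.165}
```

Keep in mind the budget is a ceiling: Claude often stops thinking before exhausting it, so real costs can come in below these estimates.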

Latency Impact

Thinking tokens must be generated before the response begins, which directly increases time to first token (TTFT):

  • 1,000 thinking tokens: +1-2 seconds TTFT
  • 5,000 thinking tokens: +5-10 seconds TTFT
  • 10,000 thinking tokens: +10-20 seconds TTFT

For interactive applications, keep thinking budgets modest (1,000-5,000 tokens). For offline analysis, larger budgets (10,000-50,000) are acceptable.
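The ranges above imply a thinking throughput of roughly 500-1,000 tokens per second (an assumption back-solved from those figures; actual throughput varies by model and load). A sketch of a planning helper on that assumption:

```python
def ttft_overhead(budget_tokens: int,
                  fast_tps: float = 1000.0,
                  slow_tps: float = 500.0) -> tuple[float, float]:
    """Estimated added time-to-first-token (seconds) if the full thinking
    budget is consumed, at assumed fast/slow generation rates."""
    return (budget_tokens / fast_tps, budget_tokens / slow_tps)

print(ttft_overhead(5_000))  # -> (5.0, 10.0)
```

As with cost, budget_tokens is a ceiling, not a guarantee, so real TTFT overhead is frequently below these estimates.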

Streaming with Extended Thinking

You can stream both the thinking and response phases:

with client.messages.stream(
    model="claude-sonnet-4-5-20250514",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 5000},
    messages=[{"role": "user", "content": "Design a rate limiter for a distributed system."}],
) as stream:
    current_phase = None
    for event in stream:
        if event.type == "content_block_start":
            if event.content_block.type == "thinking":
                current_phase = "thinking"
                print("\n[Thinking...]")
            elif event.content_block.type == "text":
                current_phase = "response"
                print("\n[Response]")
        elif event.type == "content_block_delta":
            if event.delta.type == "thinking_delta":
                pass  # Optionally show thinking to user
            elif event.delta.type == "text_delta":
                print(event.delta.text, end="", flush=True)

Practical Decision Framework

Use this flowchart to decide whether to enable extended thinking:

  1. Is the task time-sensitive (< 3 second response needed)? -> No extended thinking
  2. Is the answer deterministic or template-based? -> No extended thinking
  3. Does the task involve multi-step reasoning? -> Yes, use 3,000-5,000 budget
  4. Does the task involve complex analysis with tradeoffs? -> Yes, use 5,000-10,000 budget
  5. Is this an offline analysis or batch job? -> Yes, use 10,000-50,000 budget
  6. Is correctness critical (financial, medical, legal)? -> Yes, use maximum budget
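One way to encode this flowchart is a small helper that maps task characteristics to a budget. The function name, flags, and exact budgets within each suggested range are illustrative choices, not an official API:

```python
def choose_thinking_budget(
    time_sensitive: bool = False,
    deterministic: bool = False,
    multi_step: bool = False,
    complex_tradeoffs: bool = False,
    offline_batch: bool = False,
    correctness_critical: bool = False,
) -> int:
    """Return a thinking budget in tokens, or 0 to disable extended thinking.

    Follows the decision list above. Earlier rules win, so a time-sensitive
    or deterministic task never gets a thinking budget.
    """
    if time_sensitive or deterministic:
        return 0            # rules 1-2: skip extended thinking
    if correctness_critical:
        return 50_000       # rule 6: use the maximum budget
    if offline_batch:
        return 25_000       # rule 5: within the 10,000-50,000 range
    if complex_tradeoffs:
        return 8_000        # rule 4: within the 5,000-10,000 range
    if multi_step:
        return 4_000        # rule 3: within the 3,000-5,000 range
    return 0                # default: no extended thinking

print(choose_thinking_budget(multi_step=True))         # -> 4000
print(choose_thinking_budget(time_sensitive=True))     # -> 0
```

The returned value can be passed directly as budget_tokens, with max_tokens set comfortably above it.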

Multi-Turn Conversations with Thinking

In multi-turn conversations, you pass previous thinking blocks back as part of the conversation history. The API automatically strips thinking blocks from earlier assistant turns, so they are not billed as input tokens; the main exception is tool use, where the thinking block from the most recent assistant turn must be preserved intact.

# First turn with thinking
messages = [{"role": "user", "content": "Design a caching strategy for our API."}]
response1 = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 5000},
    messages=messages,
)

# Second turn -- include previous thinking in history
messages.append({"role": "assistant", "content": response1.content})
messages.append({"role": "user", "content": "Now consider how this works with database read replicas."})

response2 = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 5000},
    messages=messages,
)

Redacting Thinking in Production

In some applications, you may want to use extended thinking for quality but not expose the thinking process to end users. The thinking content is returned in a separate block, making it easy to filter:

def get_response_only(response) -> str:
    """Extract only the text response, discarding thinking blocks."""
    return "".join(
        block.text for block in response.content if block.type == "text"
    )

def get_thinking_only(response) -> str:
    """Extract only thinking blocks for debugging/logging."""
    return "".join(
        block.thinking for block in response.content if block.type == "thinking"
    )

Log the thinking content for debugging and quality analysis, but only return the text response to users.
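A quick way to sanity-check this filtering without an API call is to run the helper against a stand-in response (the SimpleNamespace objects mimic the SDK's content blocks, and get_response_only is redeclared so the snippet runs on its own):

```python
from types import SimpleNamespace

def get_response_only(response) -> str:
    """Extract only the text response, discarding thinking blocks."""
    return "".join(
        block.text for block in response.content if block.type == "text"
    )

# Stand-in for an SDK response: one thinking block, one text block.
mock_response = SimpleNamespace(content=[
    SimpleNamespace(type="thinking", thinking="First, consider cache keys..."),
    SimpleNamespace(type="text", text="Use a write-through cache."),
])

print(get_response_only(mock_response))  # -> Use a write-through cache.
```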
