Learn Agentic AI

Claude Opus 4.6 with 1M Context Window: Complete Developer Guide for Agentic AI

Complete guide to Claude Opus 4.6 GA — 1M context at standard pricing, 128K output tokens, adaptive thinking, and production patterns for building agentic AI systems.

Claude Opus 4.6: The Full Picture

Anthropic released Claude Opus 4.6 to general availability in March 2026, and it represents the most significant capability jump in the Claude model family since Claude 3 Opus. The headline numbers: 1 million token context window at standard pricing ($5 per million input tokens, $25 per million output tokens), 128K output token limit, adaptive thinking that dynamically adjusts reasoning depth, support for up to 600 images or PDF pages per request, and across-the-board improvements in coding, reasoning, and instruction following.

For developers building agentic AI systems, Opus 4.6 changes the calculus on several architectural decisions. The 1M context window means agents can hold entire codebases, long conversation histories, and comprehensive tool result sets without retrieval augmentation. The 128K output limit enables agents to generate complete implementations, not just snippets. And adaptive thinking lets agents automatically allocate more reasoning effort to harder problems.

Getting Started with the Anthropic SDK

The fastest way to start using Opus 4.6 is through the official Anthropic Python or TypeScript SDK. The API is identical to previous Claude models — the new capabilities are accessed through model selection and parameter configuration.

import anthropic

client = anthropic.Anthropic()

# Basic completion with Opus 4.6
response = client.messages.create(
    model="claude-opus-4-6-20260301",
    max_tokens=16384,
    messages=[
        {
            "role": "user",
            "content": "Analyze the architectural tradeoffs between event "
                       "sourcing and CRUD for a high-throughput order "
                       "management system."
        }
    ],
)

print(response.content[0].text)
print(f"Input tokens: {response.usage.input_tokens}")
print(f"Output tokens: {response.usage.output_tokens}")

For agentic use cases, you will typically combine tool use (function calling), system prompts, and multi-turn conversations. Here is a more complete agent setup.

import anthropic
import json

client = anthropic.Anthropic()

# Define tools for the agent
tools = [
    {
        "name": "search_codebase",
        "description": "Search the codebase for files matching a pattern "
                       "or containing specific text.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "Search query (file name pattern or "
                                   "text content to search for)",
                },
                "file_type": {
                    "type": "string",
                    "description": "Filter by file extension (e.g., .py, .ts)",
                },
                "max_results": {
                    "type": "integer",
                    "description": "Maximum number of results to return",
                    "default": 10,
                },
            },
            "required": ["query"],
        },
    },
    {
        "name": "read_file",
        "description": "Read the contents of a file at the given path.",
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {
                    "type": "string",
                    "description": "Absolute path to the file",
                },
            },
            "required": ["path"],
        },
    },
    {
        "name": "write_file",
        "description": "Write content to a file, creating it if it does "
                       "not exist or overwriting if it does.",
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {
                    "type": "string",
                    "description": "Absolute path to the file",
                },
                "content": {
                    "type": "string",
                    "description": "Content to write to the file",
                },
            },
            "required": ["path", "content"],
        },
    },
    {
        "name": "run_command",
        "description": "Execute a shell command and return its output.",
        "input_schema": {
            "type": "object",
            "properties": {
                "command": {
                    "type": "string",
                    "description": "The shell command to execute",
                },
                "timeout": {
                    "type": "integer",
                    "description": "Timeout in seconds",
                    "default": 30,
                },
            },
            "required": ["command"],
        },
    },
]

# Agent loop
messages = [
    {
        "role": "user",
        "content": "Find all API routes in the project that don't have "
                   "authentication middleware, and add it to each one.",
    }
]

while True:
    response = client.messages.create(
        model="claude-opus-4-6-20260301",
        max_tokens=16384,
        system="You are a senior software engineer. Use the available "
               "tools to complete tasks autonomously. Think step by step "
               "about what you need to do before taking action.",
        tools=tools,
        messages=messages,
    )

    # Check if the agent wants to use tools
    if response.stop_reason == "tool_use":
        # Extract tool use blocks
        tool_uses = [
            block for block in response.content
            if block.type == "tool_use"
        ]

        # Add assistant message with tool calls
        messages.append({"role": "assistant", "content": response.content})

        # Execute each tool and collect results
        tool_results = []
        for tool_use in tool_uses:
            result = execute_tool(tool_use.name, tool_use.input)
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": tool_use.id,
                "content": json.dumps(result),
            })

        messages.append({"role": "user", "content": tool_results})
    else:
        # Agent is done — print final response
        print(response.content[0].text)
        break

This agent loop pattern is the foundation of every Claude-powered agentic system. The model decides which tools to call, the application executes them, and the results are fed back for the next iteration.
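The loop above assumes an `execute_tool` dispatcher that maps tool names to local handlers. A minimal sketch might look like the following — the `search_codebase` handler is left as a placeholder, and real systems should sandbox `run_command` rather than passing commands straight to a shell:

```python
import subprocess
from pathlib import Path

def execute_tool(name: str, tool_input: dict) -> dict:
    """Dispatch a tool call from the model to a local handler."""
    try:
        if name == "read_file":
            return {"content": Path(tool_input["path"]).read_text()}
        if name == "write_file":
            Path(tool_input["path"]).write_text(tool_input["content"])
            return {"status": "written", "path": tool_input["path"]}
        if name == "run_command":
            proc = subprocess.run(
                tool_input["command"],
                shell=True,
                capture_output=True,
                text=True,
                timeout=tool_input.get("timeout", 30),
            )
            return {
                "stdout": proc.stdout,
                "stderr": proc.stderr,
                "exit_code": proc.returncode,
            }
        if name == "search_codebase":
            # Placeholder: wire this to ripgrep, an index, or os.walk
            return {"matches": [], "note": "search not implemented"}
        return {"error": f"unknown tool: {name}"}
    except Exception as exc:
        # Surface failures to the model as a result instead of crashing
        # the loop — the model can often recover from a tool error.
        return {"error": str(exc)}
```

Returning errors as ordinary results (rather than raising) matters: it lets the model see the failure and retry with different arguments.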

Leveraging the 1M Context Window

The 1M context window is not just a bigger input buffer — it changes what is architecturally possible. Previous context limits (100K-200K tokens) forced developers to use retrieval-augmented generation (RAG) for anything beyond a single long document. With 1M tokens, you can fit approximately 750,000 words or 3,000 pages of text in a single prompt.

For agentic applications, this means:

Entire codebases in context. A medium-sized project (50,000 lines of code) fits comfortably in the context window. Agents can understand the full codebase without retrieval, making their code modifications more architecturally consistent.

Complete conversation histories. An agent handling a complex multi-day task can keep the entire conversation history in context rather than summarizing or truncating it. This eliminates the information loss that degrades agent performance in long-running tasks.

Rich tool result accumulation. An agent that makes 30 tool calls, each returning 1-2K tokens of results, uses only 30-60K tokens — a fraction of the 1M limit. There is no need to truncate or summarize intermediate results.

# Using 1M context to analyze an entire codebase
import os

def collect_codebase(root_dir: str, extensions: list[str]) -> str:
    """Collect all source files into a single context string."""
    files = []
    total_tokens_estimate = 0

    for dirpath, _, filenames in os.walk(root_dir):
        for filename in filenames:
            if any(filename.endswith(ext) for ext in extensions):
                filepath = os.path.join(dirpath, filename)
                with open(filepath, "r") as f:
                    content = f.read()

                relative_path = os.path.relpath(filepath, root_dir)
                file_block = f"--- {relative_path} ---\n{content}\n"
                files.append(file_block)
                total_tokens_estimate += len(content) // 4

    print(f"Collected {len(files)} files, ~{total_tokens_estimate} tokens")
    return "\n".join(files)

codebase = collect_codebase("./src", [".py", ".ts", ".tsx"])

response = client.messages.create(
    model="claude-opus-4-6-20260301",
    max_tokens=32768,
    messages=[
        {
            "role": "user",
            "content": f"Here is the complete codebase:\n\n"
                       f"{codebase}\n\n"
                       f"Identify all security vulnerabilities, rank them "
                       f"by severity, and provide fixes for the top 5.",
        }
    ],
)

However, there is a cost-performance tradeoff. Processing 1M input tokens at $5/M costs $5 per request. If your agent makes 10 such requests during a task, that is $50 in input tokens alone. Use the full context strategically — for initial codebase analysis and complex reasoning — but use targeted retrieval for routine tool calls where only a small context is needed.
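To keep this tradeoff visible in practice, it helps to compute spend per request from the `usage` object the API returns. A rough helper, with the rates hardcoded from the figures above (adjust the constants if pricing changes):

```python
# Per-request cost at the published Opus 4.6 rates:
# $5 per million input tokens, $25 per million output tokens.
OPUS_INPUT_PER_M = 5.00
OPUS_OUTPUT_PER_M = 25.00

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the dollar cost of a single request."""
    return (input_tokens / 1_000_000 * OPUS_INPUT_PER_M
            + output_tokens / 1_000_000 * OPUS_OUTPUT_PER_M)

# A full-context analysis: 1M input tokens plus 8K output tokens
print(f"${request_cost(1_000_000, 8_000):.2f}")  # → $5.20
```

In an agent loop, accumulate `request_cost(response.usage.input_tokens, response.usage.output_tokens)` per step to enforce a per-task budget.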

Adaptive Thinking: Dynamic Reasoning Depth

Adaptive thinking is perhaps the most architecturally significant new feature in Claude 4.6. Previously, extended thinking had to be configured statically — you either enabled it with a fixed token budget or left it off. Adaptive thinking lets Claude decide dynamically how much reasoning effort to apply based on the complexity of the current step.

# Enabling adaptive thinking
response = client.messages.create(
    model="claude-opus-4-6-20260301",
    max_tokens=16384,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000,  # Max thinking tokens per response
    },
    messages=[
        {
            "role": "user",
            "content": "What is 2 + 2?"
        }
    ],
)

# For simple questions, Claude uses minimal thinking tokens
# For complex questions, it uses more — up to the budget

# Check thinking usage (rough estimate: ~4 characters per token)
for block in response.content:
    if block.type == "thinking":
        print(f"Thinking tokens used: ~{len(block.thinking) // 4}")
    elif block.type == "text":
        print(f"Response: {block.text}")

For agent architectures, adaptive thinking is valuable because agent steps vary dramatically in complexity. A simple file read does not need deep reasoning, but deciding which files to modify and how to refactor them does. With adaptive thinking, the agent automatically allocates reasoning effort where it matters.

# Agent with adaptive thinking for variable-complexity tasks
async def run_adaptive_agent(goal: str, tools: list):
    """Agent that uses adaptive thinking for complex decisions."""
    messages = [{"role": "user", "content": goal}]

    while True:
        # Assumes an async client: client = anthropic.AsyncAnthropic()
        response = await client.messages.create(
            model="claude-opus-4-6-20260301",
            max_tokens=16384,
            thinking={
                "type": "enabled",
                "budget_tokens": 8000,
            },
            system=(
                "You are an autonomous agent. For each step:\n"
                "1. Think about what you need to do next\n"
                "2. Choose the best tool for the job\n"
                "3. Execute and evaluate the result\n"
                "4. Decide if you need more steps or are done\n\n"
                "Use careful reasoning for architectural decisions "
                "and quick action for routine operations."
            ),
            tools=tools,
            messages=messages,
        )

        # Log thinking effort for observability
        thinking_blocks = [
            b for b in response.content if b.type == "thinking"
        ]
        if thinking_blocks:
            thinking_tokens = sum(
                len(b.thinking) // 4 for b in thinking_blocks
            )
            print(f"  Thinking effort: ~{thinking_tokens} tokens")

        if response.stop_reason == "tool_use":
            messages.append({
                "role": "assistant",
                "content": response.content,
            })
            tool_results = await execute_tools(response.content)
            messages.append({"role": "user", "content": tool_results})
        else:
            return extract_final_answer(response)

The observability aspect is important — by logging thinking token usage per step, you can identify which steps the model finds most challenging and potentially optimize your tool design or prompt engineering for those cases.
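One lightweight way to act on this is to accumulate estimated thinking tokens per step type across a run and surface the outliers afterward. A sketch (the step names and token counts here are hypothetical):

```python
from collections import defaultdict

class ThinkingTracker:
    """Accumulate estimated thinking tokens per step type."""

    def __init__(self):
        self.totals = defaultdict(int)
        self.counts = defaultdict(int)

    def record(self, step_type: str, thinking_tokens: int) -> None:
        self.totals[step_type] += thinking_tokens
        self.counts[step_type] += 1

    def hardest_steps(self, n: int = 3) -> list[tuple[str, float]]:
        """Step types ranked by average thinking effort, descending."""
        avgs = {
            step: self.totals[step] / self.counts[step]
            for step in self.totals
        }
        return sorted(avgs.items(), key=lambda kv: kv[1], reverse=True)[:n]

tracker = ThinkingTracker()
tracker.record("read_file", 120)
tracker.record("read_file", 80)
tracker.record("refactor_plan", 6400)
print(tracker.hardest_steps(1))  # → [('refactor_plan', 6400.0)]
```

Step types that consistently rank highest are the ones worth attacking with better tool descriptions, richer context, or model cascading.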

128K Output Tokens: Complete Implementations

The 128K output token limit (approximately 96,000 words) enables agents to generate complete implementations in a single response. Previous models capped at 4K-8K output tokens, forcing developers to split generation across multiple requests and stitch the results together.

For coding agents, this means you can ask for an entire module, a complete test suite, or a full migration script in one response. For document generation agents, entire reports or analyses can be generated without chunking.

# Generating a complete module with 128K output capacity
response = client.messages.create(
    model="claude-opus-4-6-20260301",
    max_tokens=65536,  # Up to 128K, but use what you need
    messages=[
        {
            "role": "user",
            "content": (
                "Generate a complete Python module for an event sourcing "
                "system with the following components:\n"
                "1. Event store (PostgreSQL-backed)\n"
                "2. Aggregate base class with snapshot support\n"
                "3. Event handlers with retry logic\n"
                "4. Projection builder for read models\n"
                "5. Complete test suite with pytest fixtures\n"
                "6. Migration scripts for the PostgreSQL schema\n\n"
                "Include type hints, docstrings, error handling, and "
                "production-ready logging throughout."
            ),
        }
    ],
)

# The response can contain the entire module — thousands of lines
print(f"Output tokens: {response.usage.output_tokens}")

Multimodal Agent Capabilities

Opus 4.6 supports up to 600 images or PDF pages per request, making it possible to build agents that work with visual content at scale. A document processing agent can ingest an entire PDF (hundreds of pages), extract structured data, and take actions based on the content — all in a single conversation turn.

import anthropic
import base64

client = anthropic.Anthropic()

def encode_pdf_pages(pdf_path: str) -> list[dict]:
    """Encode PDF pages as base64 for the API."""
    # Using a PDF library to extract pages as images
    import fitz  # PyMuPDF

    doc = fitz.open(pdf_path)
    pages = []

    for page_num in range(len(doc)):
        page = doc[page_num]
        pix = page.get_pixmap(dpi=150)
        img_bytes = pix.tobytes("png")
        b64 = base64.standard_b64encode(img_bytes).decode("utf-8")

        pages.append({
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": b64,
            },
        })

    return pages

# Build a document analysis agent
pdf_pages = encode_pdf_pages("quarterly_report.pdf")

response = client.messages.create(
    model="claude-opus-4-6-20260301",
    max_tokens=32768,
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Analyze this quarterly report. "
                 "Extract all financial metrics, identify trends, and "
                 "flag any anomalies compared to typical patterns."},
                *pdf_pages,  # Up to 600 pages
            ],
        }
    ],
)

Cost Optimization Strategies

At $5 per million input tokens and $25 per million output tokens, Opus 4.6 is powerful but not cheap for high-volume agent workloads. Here are practical strategies for managing costs.

Use prompt caching. Anthropic's prompt caching reduces costs for repeated prefixes (system prompts, tool definitions, static context). The cached portion costs $0.50/M tokens instead of $5/M — a 90% reduction on the cached portion.
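In the SDK, caching is opted into by placing `cache_control` breakpoints on the static parts of the request — typically the system prompt and the last tool definition, so the whole prefix up to that point is cached. A sketch of building such a request (the prompt, tool list, and messages are placeholders supplied by the caller):

```python
def cached_request_kwargs(system_prompt: str,
                          tools: list[dict],
                          messages: list[dict]) -> dict:
    """Build messages.create kwargs with cache_control breakpoints
    on the static prefix (system prompt and last tool definition)."""
    return {
        "model": "claude-opus-4-6-20260301",
        "max_tokens": 16384,
        "system": [
            {
                "type": "text",
                "text": system_prompt,  # static across requests
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "tools": [
            *tools[:-1],
            # Breakpoint on the last tool caches the entire tool list
            {**tools[-1], "cache_control": {"type": "ephemeral"}},
        ],
        # Only the dynamic suffix is billed at the full input rate
        "messages": messages,
    }

# response = client.messages.create(**cached_request_kwargs(...))
```

Because agents re-send the same system prompt and tool definitions on every loop iteration, this pattern pays off almost immediately in multi-step runs.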

Cascade between models. Use Sonnet 4.6 ($3/$15) for routine agent steps and Opus 4.6 for complex reasoning steps. An agent orchestrator can classify step complexity and route to the appropriate model.

Minimize unnecessary context. Just because you can send 1M tokens does not mean you should. For routine tool calls, send only the relevant context — not the entire codebase. Reserve the full context window for steps that genuinely benefit from comprehensive understanding.

# Model cascading: use Sonnet for simple steps, Opus for complex ones
def select_model(step_type: str, complexity: str) -> str:
    """Route to the appropriate model based on step complexity."""
    if step_type in ("file_read", "simple_search", "status_check"):
        return "claude-sonnet-4-6-20260301"  # $3/$15

    if complexity == "high" or step_type in (
        "architecture_decision",
        "security_review",
        "complex_refactor",
    ):
        return "claude-opus-4-6-20260301"  # $5/$25

    return "claude-sonnet-4-6-20260301"  # Default to Sonnet

FAQ

When should I use Opus 4.6 vs Sonnet 4.6 for agents?

Use Opus 4.6 when your agent handles tasks requiring deep reasoning, complex multi-step planning, or nuanced understanding of large codebases. Use Sonnet 4.6 for agents that primarily execute well-defined workflows with simpler decision points. Many production systems use both — Opus for planning and complex steps, Sonnet for execution and routine operations. The cost difference ($5/$25 vs $3/$15) makes cascading worthwhile at scale.

Does the 1M context window affect latency?

Yes. Time-to-first-token increases with context length. For a 1M token input, expect 10-30 seconds for the first token depending on server load. For a 10K token input, expect 1-3 seconds. If latency matters for your use case, use the minimum context necessary for each step and reserve the full 1M window for steps that genuinely need comprehensive context.

How does adaptive thinking interact with tool use?

When adaptive thinking is enabled, Claude will think before deciding which tools to call and how to interpret tool results. For simple tool calls (reading a file), minimal thinking is used. For complex decisions (which of 5 possible approaches to take), more thinking tokens are consumed. The thinking budget is per-response, not per-tool-call, so a response that calls multiple tools shares the budget across the planning for all of them.

Can I use prompt caching with the 1M context window?

Yes, and you should. Prompt caching works with contexts up to the full 1M token limit. The cached prefix (system prompt, tool definitions, static context) is stored server-side and reused across requests. For a 500K token cached prefix, you save $2.25 per request compared to uncached pricing. The cache has a 5-minute TTL, so it works well for agents that make multiple requests in quick succession.
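The $2.25 figure follows directly from the rates quoted above — a quick check of the arithmetic, assuming cached reads at $0.50/M versus $5/M uncached:

```python
# Per-request savings from a 500K-token cached prefix
CACHED_TOKENS = 500_000
UNCACHED_RATE = 5.00 / 1_000_000    # $/token, standard input
CACHE_READ_RATE = 0.50 / 1_000_000  # $/token, cached read

savings = CACHED_TOKENS * (UNCACHED_RATE - CACHE_READ_RATE)
print(f"${savings:.2f}")  # → $2.25
```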


#ClaudeOpus46 #1MContext #Anthropic #AgenticAI #DeveloperGuide #AdaptiveThinking #128KOutput #AIEngineering


Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
