
Building a Claude Browser Agent: Automated Web Navigation with Anthropic SDK

Step-by-step guide to building a browser automation agent with Claude Computer Use — from SDK setup and screenshot capture to executing click, type, and scroll actions for real web navigation tasks.

Setting Up the Environment

Building a Claude browser agent requires three components: the Anthropic Python SDK, a browser that can be controlled programmatically for screenshot capture, and an input simulation layer. We will use Playwright for browser management (to launch and screenshot) while letting Claude drive all the navigation decisions.

Start by installing the dependencies:

# requirements.txt
anthropic>=0.39.0
playwright>=1.40.0
Pillow>=10.0.0

Initialize the project:

pip install -r requirements.txt
playwright install chromium

Architecture of the Browser Agent

The agent architecture has three layers:

  1. Browser Manager — Launches a headless or headed Chromium instance, navigates to a starting URL, captures screenshots, and executes low-level browser actions
  2. Action Executor — Translates Claude's computer use tool calls into Playwright mouse and keyboard commands
  3. Agent Loop — Orchestrates the screenshot-action cycle and manages the conversation history with Claude

Here is the complete browser manager:

import asyncio
import base64

from playwright.async_api import async_playwright, Browser, Page


class BrowserManager:
    def __init__(self, width: int = 1280, height: int = 800):
        self.width = width
        self.height = height
        self.playwright = None
        self.browser: Browser | None = None
        self.page: Page | None = None

    async def start(self, url: str = "about:blank"):
        self.playwright = await async_playwright().start()
        self.browser = await self.playwright.chromium.launch(headless=False)
        context = await self.browser.new_context(
            viewport={"width": self.width, "height": self.height}
        )
        self.page = await context.new_page()
        await self.page.goto(url)

    async def screenshot(self) -> str:
        """Capture the current viewport as a base64-encoded PNG."""
        img_bytes = await self.page.screenshot(full_page=False)
        return base64.standard_b64encode(img_bytes).decode()

    async def click(self, x: int, y: int, button: str = "left"):
        await self.page.mouse.click(x, y, button=button)

    async def type_text(self, text: str):
        await self.page.keyboard.type(text, delay=50)

    async def press_key(self, key: str):
        await self.page.keyboard.press(key)

    async def scroll(self, x: int, y: int, direction: str):
        """Scroll 300px in the given direction: "up", "down", "left", or "right"."""
        await self.page.mouse.move(x, y)
        dx = {"left": -300, "right": 300}.get(direction, 0)
        dy = {"up": -300, "down": 300}.get(direction, 0)
        await self.page.mouse.wheel(dx, dy)

    async def close(self):
        if self.browser:
            await self.browser.close()
        if self.playwright:
            await self.playwright.stop()
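Before wiring in Claude, it is worth exercising the manager on its own. A small smoke-test coroutine (the name and shape are ours; it accepts any object exposing the `start`/`screenshot`/`close` interface above):

```python
import base64

async def smoke_test(manager) -> int:
    """Start the manager, capture one screenshot, return the decoded PNG size in bytes."""
    await manager.start("https://example.com")
    b64 = await manager.screenshot()
    await manager.close()
    return len(base64.standard_b64decode(b64))

# Run with: asyncio.run(smoke_test(BrowserManager()))
```

A healthy result is a PNG of tens of kilobytes; zero or an exception means the browser layer needs fixing before the agent loop will work.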

The Agent Loop

The agent loop ties everything together. It sends screenshots to Claude, processes tool calls, executes actions, and repeats until the task is done:

import anthropic

class ClaudeBrowserAgent:
    def __init__(self, browser: BrowserManager):
        self.browser = browser
        self.client = anthropic.Anthropic()
        self.messages = []
        self.model = "claude-sonnet-4-20250514"
        # Computer use is a beta feature: the tool version and the beta flag
        # must match (computer_20250124 pairs with computer-use-2025-01-24).
        self.tool = {
            "type": "computer_20250124",
            "name": "computer",
            "display_width_px": self.browser.width,
            "display_height_px": self.browser.height,
        }

    def _screenshot_block(self, b64: str) -> dict:
        return {
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": b64,
            },
        }

    async def run(self, task: str, max_steps: int = 30):
        # The task and the initial screenshot travel together in the first
        # user message; every later screenshot is fed back as tool_result
        # content, the pattern the computer use tool expects.
        self.messages = [{
            "role": "user",
            "content": [
                {"type": "text", "text": task},
                self._screenshot_block(await self.browser.screenshot()),
            ],
        }]

        for step in range(max_steps):
            response = self.client.beta.messages.create(
                model=self.model,
                max_tokens=1024,
                tools=[self.tool],
                betas=["computer-use-2025-01-24"],
                messages=self.messages,
            )
            self.messages.append({"role": "assistant", "content": response.content})

            if response.stop_reason == "end_turn":
                final_text = next(
                    (b.text for b in response.content if hasattr(b, "text")),
                    "Task complete",
                )
                print(f"Done: {final_text}")
                return final_text

            # Execute every tool call, then answer each one with a
            # tool_result carrying a fresh screenshot of the page.
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    await self._execute(block.input)
                    await asyncio.sleep(1)  # Wait for page to render
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": [self._screenshot_block(
                            await self.browser.screenshot()
                        )],
                    })

            if not tool_results:
                return "Stopped without a tool call"
            self.messages.append({"role": "user", "content": tool_results})

        return "Max steps reached"

    async def _execute(self, action: dict):
        action_type = action.get("action")
        if action_type == "left_click":
            x, y = action["coordinate"]
            await self.browser.click(x, y)
        elif action_type == "type":
            await self.browser.type_text(action["text"])
        elif action_type == "key":
            # Claude emits xdotool-style key names ("Return", "ctrl+s");
            # Playwright expects names like "Enter" — map them if needed.
            await self.browser.press_key(action["text"])
        elif action_type == "scroll":
            x, y = action["coordinate"]
            await self.browser.scroll(x, y, action["scroll_direction"])
        # "screenshot" needs no browser action: the loop already returns a
        # fresh screenshot in every tool_result.
Running the Agent

Here is how to use the agent for a real web navigation task:

async def main():
    browser = BrowserManager(width=1280, height=800)
    await browser.start("https://news.ycombinator.com")

    agent = ClaudeBrowserAgent(browser)
    result = await agent.run(
        "Find the top story on Hacker News and click on the comments link. "
        "Then tell me how many comments the story has."
    )
    print(result)
    await browser.close()

asyncio.run(main())

The agent will take a screenshot of the Hacker News homepage, identify the top story, locate the comments link, click it, take another screenshot of the comments page, and report the comment count back to you.

Optimizing Conversation History

A critical performance consideration is managing the message history. Each screenshot consumes a significant number of tokens. If your task requires 20 steps, you are sending 20 high-resolution images in the conversation. This gets expensive and eventually hits context limits.

A practical optimization is to maintain a sliding window of recent screenshots while summarizing older interactions as text:

def trim_history(messages: list, keep_last: int = 5) -> list:
    """Keep the original task plus only the last N screenshot exchanges.

    The cut must land just before an assistant message so that no tool_use
    block is separated from the tool_result that answers it; the API
    rejects histories containing unpaired tool calls.
    """
    # Each assistant message marks the start of one step.
    step_starts = [i for i, m in enumerate(messages) if m["role"] == "assistant"]
    if len(step_starts) <= keep_last:
        return messages

    cut = step_starts[-keep_last]
    dropped = len(step_starts) - keep_last
    return (
        [messages[0]]  # Keep original task
        + [{"role": "user",
            "content": f"[Previous {dropped} steps completed successfully]"}]
        + messages[cut:]
    )

FAQ

Can I use a headless browser with Claude Computer Use?

Yes, and it is recommended for server-side deployments. Playwright supports headless mode, and the screenshots are identical to what you would see in a headed browser. Set headless=True when launching the browser.

How do I handle pages that take time to load?

Add a short delay (1-2 seconds) after executing each action before capturing the next screenshot. For pages with dynamic content, you can also use Playwright's wait_for_load_state("networkidle") before taking the screenshot.
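The two techniques combine naturally. A sketch of a load-aware capture helper (the function name is ours; `wait_for_load_state` and its `timeout` argument are standard Playwright):

```python
async def safe_screenshot(page) -> bytes:
    """Wait briefly for the network to go idle, then capture the viewport."""
    try:
        # Some pages (ads, long-polling) never reach "networkidle",
        # so bound the wait and fall back to whatever has rendered.
        await page.wait_for_load_state("networkidle", timeout=5000)
    except Exception:
        pass
    return await page.screenshot(full_page=False)
```

Dropping this into `BrowserManager.screenshot` in place of the bare `page.screenshot` call trades up to five seconds of latency per step for noticeably fewer half-rendered screenshots.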

What is the cost per step of the agent loop?

Each step involves sending a screenshot image plus the conversation history to Claude. A 1280x800 screenshot typically costs around 1,000-1,500 input tokens. With the conversation context, expect roughly 2,000-5,000 tokens per step. At Claude Sonnet pricing, a 20-step task costs approximately $0.15-$0.40 depending on conversation length.
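Those figures can be sanity-checked with Anthropic's documented image-token approximation of roughly width × height / 750 tokens per image. The helper names, the assumed 1,500-token text context, and the assumed $3 per million input tokens for Sonnet are ours:

```python
def estimate_step_tokens(width: int = 1280, height: int = 800,
                         context_tokens: int = 1500) -> int:
    """Rough input tokens per step: one screenshot plus running text context.

    Image cost follows Anthropic's documented approximation of about
    (width * height) / 750 tokens per image.
    """
    image_tokens = (width * height) // 750
    return image_tokens + context_tokens

def estimate_task_cost(steps: int, input_price_per_mtok: float = 3.0) -> float:
    """Approximate input cost in USD for a task of `steps` loop iterations."""
    total_input = estimate_step_tokens() * steps
    return total_input / 1_000_000 * input_price_per_mtok
```

At these assumptions a 1280×800 screenshot comes to about 1,365 tokens, each step to roughly 2,900 input tokens, and a 20-step task to around $0.17 in input cost, consistent with the range above. Output tokens add a little on top but are small since each turn is mostly a single tool call.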


#ClaudeBrowserAgent #WebAutomation #AnthropicSDK #ComputerUse #AIBrowserAgent #PythonAutomation #AgenticAI


CallSphere Team

Expert insights on AI voice agents and customer communication automation.
