Building a Claude Browser Agent: Automated Web Navigation with Anthropic SDK
Step-by-step guide to building a browser automation agent with Claude Computer Use — from SDK setup and screenshot capture to executing click, type, and scroll actions for real web navigation tasks.
Setting Up the Environment
Building a Claude browser agent requires three components: the Anthropic Python SDK, a programmatically controllable browser for screenshot capture, and an input simulation layer. We will use Playwright to launch the browser, capture screenshots, and simulate mouse and keyboard input, while Claude drives all the navigation decisions.
Start by installing the dependencies:
# requirements.txt
anthropic>=0.39.0
playwright>=1.40.0
Pillow>=10.0.0
Initialize the project:
pip install -r requirements.txt
playwright install chromium
Architecture of the Browser Agent
The agent architecture has three layers:
- Browser Manager — Launches a headless or headed Chromium instance, navigates to a starting URL, captures screenshots, and executes low-level browser actions
- Action Executor — Translates Claude's computer use tool calls into Playwright mouse and keyboard commands
- Agent Loop — Orchestrates the screenshot-action cycle and manages the conversation history with Claude
Here is the complete browser manager:
import asyncio
import base64

from playwright.async_api import async_playwright, Browser, Page


class BrowserManager:
    def __init__(self, width: int = 1280, height: int = 800):
        self.width = width
        self.height = height
        self.pw = None
        self.browser: Browser | None = None
        self.page: Page | None = None

    async def start(self, url: str = "about:blank"):
        self.pw = await async_playwright().start()
        self.browser = await self.pw.chromium.launch(headless=False)
        context = await self.browser.new_context(
            viewport={"width": self.width, "height": self.height}
        )
        self.page = await context.new_page()
        await self.page.goto(url)

    async def screenshot(self) -> str:
        """Capture the current page as a base64-encoded PNG."""
        img_bytes = await self.page.screenshot(full_page=False)
        return base64.standard_b64encode(img_bytes).decode()

    async def click(self, x: int, y: int, button: str = "left"):
        await self.page.mouse.click(x, y, button=button)

    async def type_text(self, text: str):
        await self.page.keyboard.type(text, delay=50)

    async def press_key(self, key: str):
        await self.page.keyboard.press(key)

    async def scroll(self, x: int, y: int, direction: str):
        await self.page.mouse.move(x, y)
        delta = 300 if direction == "down" else -300
        await self.page.mouse.wheel(0, delta)

    async def close(self):
        # Shut down the browser and the Playwright driver so neither leaks.
        if self.browser:
            await self.browser.close()
        if self.pw:
            await self.pw.stop()
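One defensive detail worth adding: model-predicted click coordinates can occasionally land just outside the viewport, which makes Playwright's mouse calls fail. This helper is not part of the manager above; it is a small guard you could call from `click` and `scroll` before dispatching the action:

```python
def clamp_to_viewport(x: int, y: int, width: int, height: int) -> tuple[int, int]:
    """Clamp a model-supplied coordinate into the visible viewport."""
    return (
        min(max(x, 0), width - 1),
        min(max(y, 0), height - 1),
    )
```

Calling `clamp_to_viewport(1500, -20, 1280, 800)` yields `(1279, 0)`, so a slightly-off prediction still produces a valid click near the intended edge instead of an exception.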
The Agent Loop
The agent loop ties everything together. It sends screenshots to Claude, processes tool calls, executes actions, and repeats until the task is done:
import anthropic

COMPUTER_USE_BETA = "computer-use-2025-01-24"


class ClaudeBrowserAgent:
    def __init__(self, browser: BrowserManager):
        self.browser = browser
        self.client = anthropic.Anthropic()
        self.messages = []
        self.model = "claude-sonnet-4-20250514"

    async def run(self, task: str, max_steps: int = 30):
        # Start the conversation with the task and an initial screenshot.
        self.messages = [{
            "role": "user",
            "content": [
                {"type": "text", "text": task},
                self._image_block(await self.browser.screenshot()),
            ],
        }]
        for step in range(max_steps):
            response = self.client.beta.messages.create(
                model=self.model,
                max_tokens=1024,
                tools=[{
                    # Claude 4 models pair with the computer_20250124 tool
                    # version; the older claude-3-5-sonnet-20241022 uses
                    # computer_20241022 instead.
                    "type": "computer_20250124",
                    "name": "computer",
                    "display_width_px": self.browser.width,
                    "display_height_px": self.browser.height,
                }],
                betas=[COMPUTER_USE_BETA],
                messages=self.messages,
            )
            self.messages.append({"role": "assistant", "content": response.content})
            if response.stop_reason == "end_turn":
                final_text = next(
                    (b.text for b in response.content if hasattr(b, "text")),
                    "Task complete",
                )
                print(f"Done: {final_text}")
                return final_text
            # Execute every tool call, then answer them all in a single
            # user message: the API expects tool_result blocks to directly
            # follow the assistant turn that issued them, and roles must
            # alternate between user and assistant.
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    await self._execute(block.input)
                    await asyncio.sleep(1)  # Wait for the page to render
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": [self._image_block(await self.browser.screenshot())],
                    })
            if tool_results:
                self.messages.append({"role": "user", "content": tool_results})
        return "Max steps reached"

    @staticmethod
    def _image_block(screenshot_b64: str) -> dict:
        return {
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": screenshot_b64,
            },
        }

    async def _execute(self, action: dict):
        action_type = action.get("action")
        if action_type == "left_click":
            x, y = action["coordinate"]
            await self.browser.click(x, y)
        elif action_type == "type":
            await self.browser.type_text(action["text"])
        elif action_type == "key":
            await self.browser.press_key(action["text"])
        elif action_type == "scroll":
            x, y = action["coordinate"]
            await self.browser.scroll(x, y, action["scroll_direction"])
        # "screenshot" requests need no browser action: every step already
        # returns a fresh screenshot in its tool_result.
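One practical wrinkle with `key` actions: Claude emits xdotool-style key names such as `Return`, `ctrl+a`, or `Page_Down`, while Playwright's `keyboard.press` expects `Enter`, `Control+a`, and `PageDown`. A small translation table can sit in front of `press_key`; the coverage below is an assumption based on commonly seen keys, so extend it as you encounter new ones:

```python
# Maps xdotool-style key names emitted by Claude to Playwright key names.
# Coverage here is a best-effort assumption; extend as you hit new keys.
XDOTOOL_TO_PLAYWRIGHT = {
    "Return": "Enter",
    "BackSpace": "Backspace",
    "Page_Down": "PageDown",
    "Page_Up": "PageUp",
    "ctrl": "Control",
    "alt": "Alt",
    "shift": "Shift",
    "super": "Meta",
}


def to_playwright_key(key: str) -> str:
    """Translate a possibly chorded xdotool key ("ctrl+a") into
    Playwright syntax ("Control+a"), passing unknown names through."""
    return "+".join(XDOTOOL_TO_PLAYWRIGHT.get(part, part) for part in key.split("+"))
```

With this in place, the `key` branch of `_execute` becomes `await self.browser.press_key(to_playwright_key(action["text"]))`.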
Running the Agent
Here is how to use the agent for a real web navigation task:
async def main():
    browser = BrowserManager(width=1280, height=800)
    await browser.start("https://news.ycombinator.com")
    agent = ClaudeBrowserAgent(browser)
    result = await agent.run(
        "Find the top story on Hacker News and click on the comments link. "
        "Then tell me how many comments the story has."
    )
    print(result)
    await browser.close()


asyncio.run(main())
The agent will take a screenshot of the Hacker News homepage, identify the top story, locate the comments link, click it, take another screenshot of the comments page, and report the comment count back to you.
Optimizing Conversation History
A critical performance consideration is managing the message history. Each screenshot consumes a significant number of tokens. If your task requires 20 steps, you are sending up to 20 full-viewport screenshots in the conversation. This gets expensive and eventually hits context limits.
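To put numbers on this, Anthropic's published rule of thumb for image token cost is roughly (width × height) / 750, which a one-line helper makes concrete:

```python
def estimate_image_tokens(width: int, height: int) -> int:
    """Approximate token cost of one screenshot, using the published
    rule of thumb: tokens ~= (width * height) / 750."""
    return (width * height) // 750
```

At 1280×800 that is about 1,365 tokens per screenshot before any text, so 20 unpruned screenshots alone approach 30,000 input tokens.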
A practical optimization is to maintain a sliding window of recent screenshots while summarizing older interactions as text:
def _has_image(message: dict) -> bool:
    """True if a message carries a screenshot (an image block or a
    tool_result block wrapping one)."""
    content = message.get("content")
    return isinstance(content, list) and any(
        isinstance(b, dict) and b.get("type") in ("image", "tool_result")
        for b in content
    )


def trim_history(messages: list, keep_last: int = 5) -> list:
    """Keep the original task plus the last N screenshot exchanges,
    summarizing older ones as a single text message."""
    image_indices = [
        i for i, m in enumerate(messages) if i > 0 and _has_image(m)
    ]
    if len(image_indices) <= keep_last:
        return messages
    trimmed = [messages[0]]  # Keep the original task
    trimmed.append({
        "role": "user",
        "content": f"[Previous {len(image_indices) - keep_last} "
                   "steps completed successfully]",
    })
    # Keep the last N exchanges intact. In practice, cut at a point that
    # keeps tool_use/tool_result pairs together, or the API will reject
    # orphaned tool_result blocks.
    trimmed.extend(messages[image_indices[-keep_last]:])
    return trimmed
FAQ
Can I use a headless browser with Claude Computer Use?
Yes, and it is recommended for server-side deployments. Playwright supports headless mode, and the screenshots are identical to what you would see in a headed browser. Set headless=True when launching the browser.
How do I handle pages that take time to load?
Add a short delay (1-2 seconds) after executing each action before capturing the next screenshot. For pages with dynamic content, you can also use Playwright's wait_for_load_state("networkidle") before taking the screenshot.
What is the cost per step of the agent loop?
Each step involves sending a screenshot image plus the conversation history to Claude. A 1280x800 screenshot typically costs around 1,000-1,500 input tokens. With the conversation context, expect roughly 2,000-5,000 tokens per step. At Claude Sonnet pricing, a 20-step task costs approximately $0.15-$0.40 depending on conversation length.
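As a back-of-envelope check (assuming roughly $3 per million input tokens for Claude Sonnet; verify current pricing before relying on this):

```python
def estimate_task_cost_usd(steps: int, tokens_per_step: int,
                           usd_per_million_input: float = 3.0) -> float:
    """Rough input-token cost of a multi-step agent task, ignoring
    output tokens. Pricing default is an assumption to verify."""
    return steps * tokens_per_step * usd_per_million_input / 1_000_000
```

A 20-step task at 3,500 tokens per step works out to about $0.21 of input tokens, consistent with the range above; output tokens add a smaller amount on top.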
#ClaudeBrowserAgent #WebAutomation #AnthropicSDK #ComputerUse #AIBrowserAgent #PythonAutomation #AgenticAI
CallSphere Team
Expert insights on AI voice agents and customer communication automation.