Multi-Step Web Tasks with GPT Vision: Complex Workflows Across Multiple Pages
Build GPT Vision agents that handle complex multi-step web workflows spanning multiple pages. Learn task decomposition, state tracking, page transition handling, and verification at each step.
Why Single-Step Vision Is Not Enough
Browsing a single page is straightforward, but real web tasks span multiple pages. Booking a flight requires searching, filtering results, selecting a flight, entering passenger details, choosing seats, and confirming payment. Each page looks different, expects different inputs, and may fail in different ways.
A multi-step vision agent needs three capabilities beyond basic screenshot analysis: task decomposition to plan ahead, state tracking to remember what it has done, and verification to confirm each step succeeded before proceeding.
Task Decomposition
Start by having the model break a high-level task into discrete steps. The planning call is text-only, so no screenshot is needed at this stage.
from pydantic import BaseModel
from openai import OpenAI


class TaskStep(BaseModel):
    step_number: int
    description: str
    expected_page_type: str  # search, results, form, confirmation
    success_indicator: str  # what to look for to confirm step worked
    data_to_extract: list[str]  # info to capture for later steps


class TaskPlan(BaseModel):
    task_description: str
    steps: list[TaskStep]
    estimated_total_steps: int


client = OpenAI()


def decompose_task(task: str) -> TaskPlan:
    """Break a complex web task into steps."""
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a web task planner. Break complex web tasks "
                    "into discrete steps. Each step should represent one "
                    "page interaction or page transition. Include what "
                    "success looks like for each step and what data needs "
                    "to be extracted for subsequent steps."
                ),
            },
            {
                "role": "user",
                "content": f"Plan the steps for this task: {task}",
            },
        ],
        response_format=TaskPlan,
    )
    return response.choices[0].message.parsed
State Tracking Across Pages
The agent must maintain state as it moves through pages. This includes data extracted from earlier steps, which step it is on, and any errors encountered.
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum


class StepStatus(str, Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"
    FAILED = "failed"
    RETRYING = "retrying"


@dataclass
class WorkflowState:
    task: str
    plan: TaskPlan
    current_step: int = 0
    extracted_data: dict = field(default_factory=dict)
    step_statuses: dict[int, StepStatus] = field(default_factory=dict)
    screenshots: list[str] = field(default_factory=list)
    errors: list[str] = field(default_factory=list)
    started_at: datetime = field(default_factory=datetime.now)

    @property
    def current_task_step(self) -> TaskStep | None:
        if self.current_step < len(self.plan.steps):
            return self.plan.steps[self.current_step]
        return None

    def advance(self):
        """Move to the next step."""
        self.step_statuses[self.current_step] = StepStatus.COMPLETED
        self.current_step += 1
        if self.current_step < len(self.plan.steps):
            self.step_statuses[self.current_step] = StepStatus.IN_PROGRESS

    def record_error(self, error: str):
        """Record an error for the current step."""
        self.errors.append(
            f"Step {self.current_step}: {error}"
        )
        self.step_statuses[self.current_step] = StepStatus.FAILED

    def get_context_summary(self) -> str:
        """Summarize state for the GPT-4V prompt."""
        lines = [f"Task: {self.task}"]
        lines.append(f"Current step: {self.current_step + 1} "
                     f"of {len(self.plan.steps)}")
        if self.extracted_data:
            lines.append("Extracted data so far:")
            for k, v in self.extracted_data.items():
                lines.append(f"  - {k}: {v}")
        if self.errors:
            lines.append(f"Previous errors: {self.errors[-3:]}")
        return "\n".join(lines)
The Multi-Step Execution Engine
The engine ties together planning, execution, verification, and state management.
import asyncio
import base64

from openai import AsyncOpenAI
from playwright.async_api import Page, async_playwright
from pydantic import BaseModel

# decompose_task, WorkflowState, and StepStatus come from the earlier sections.


class ExtractedItem(BaseModel):
    key: str
    value: str


class StepResult(BaseModel):
    success: bool
    action_taken: str
    # Key/value pairs instead of a free-form dict: Structured Outputs
    # expects every object in the schema to have a fixed set of fields.
    extracted_data: list[ExtractedItem]
    error: str
    next_action: str  # what to do next: proceed, retry, escalate


class MultiStepAgent:
    def __init__(self, max_retries: int = 2):
        # Async client so the API call does not block the event loop
        self.client = AsyncOpenAI()
        self.max_retries = max_retries

    async def capture(self, page: Page) -> str:
        """Return the current viewport as a base64-encoded PNG."""
        screenshot = await page.screenshot(type="png")
        return base64.b64encode(screenshot).decode()

    async def execute_step(
        self, page: Page, state: WorkflowState
    ) -> StepResult:
        """Execute a single step with vision guidance.

        This sketch only asks the model to analyze the page; sending the
        resulting click or keystroke to Playwright is not shown here.
        """
        step = state.current_task_step
        screenshot = await self.capture(page)
        response = await self.client.beta.chat.completions.parse(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": (
                        "You are a web automation agent executing a "
                        "multi-step workflow. Analyze the current page "
                        "and determine the action needed for this step. "
                        "The viewport is 1280x720."
                    ),
                },
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": (
                                f"{state.get_context_summary()}\n\n"
                                f"Current step: {step.description}\n"
                                f"Success indicator: "
                                f"{step.success_indicator}\n"
                                f"Data to extract: "
                                f"{step.data_to_extract}\n\n"
                                "Analyze the page and report the result."
                            ),
                        },
                        {
                            "type": "image_url",
                            "image_url": {
                                "url": (
                                    "data:image/png;base64,"
                                    f"{screenshot}"
                                ),
                                "detail": "high",
                            },
                        },
                    ],
                },
            ],
            response_format=StepResult,
        )
        return response.choices[0].message.parsed

    async def run_workflow(self, url: str, task: str) -> WorkflowState:
        """Run a complete multi-step workflow."""
        plan = decompose_task(task)
        state = WorkflowState(task=task, plan=plan)
        state.step_statuses[0] = StepStatus.IN_PROGRESS
        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=True)
            try:
                page = await browser.new_page(
                    viewport={"width": 1280, "height": 720}
                )
                await page.goto(url, wait_until="networkidle")
                while state.current_step < len(plan.steps):
                    retries = 0
                    while retries <= self.max_retries:
                        result = await self.execute_step(page, state)
                        if result.success:
                            state.extracted_data.update(
                                {
                                    item.key: item.value
                                    for item in result.extracted_data
                                }
                            )
                            state.advance()
                            await asyncio.sleep(1)
                            break
                        retries += 1
                        if retries > self.max_retries:
                            # Out of retries: record the failure and stop
                            state.record_error(result.error)
                            return state
                        await asyncio.sleep(2)
            finally:
                # Close the browser on success, failure, or exception
                await browser.close()
        return state
Handling Page Transitions
Page transitions are the trickiest part of multi-step workflows. After clicking a link or submitting a form, the page URL may change, content may load asynchronously, or a modal may appear instead of a navigation.
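import asyncio

from playwright.async_api import Page, TimeoutError as PlaywrightTimeoutError


async def wait_for_page_change(
    page: Page, previous_url: str, timeout: int = 10000
) -> bool:
    """Wait for a page transition or significant content change.

    `timeout` is in milliseconds. Returns False if the URL never changes.
    """
    try:
        await page.wait_for_url(
            lambda url: url != previous_url, timeout=timeout
        )
        await page.wait_for_load_state("networkidle")
        return True
    except PlaywrightTimeoutError:
        # URL might not change (modal, SPA navigation); pause briefly so
        # in-place content can settle before the next screenshot
        await asyncio.sleep(1)
        return False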
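When the URL stays the same, one possible fallback is to compare screenshots taken before and after the action. The following is a minimal sketch, assuming that a byte-level difference is a good enough signal of change. `screenshot_fingerprint` and `page_changed` are hypothetical helper names and are not part of Playwright. Pages with animations, carousels, or clocks will report a change even when nothing meaningful happened, so treat the result as a hint.

```python
import hashlib


def screenshot_fingerprint(png_bytes: bytes) -> str:
    """Return a SHA-256 hex digest of the raw screenshot bytes."""
    return hashlib.sha256(png_bytes).hexdigest()


def page_changed(before: bytes, after: bytes) -> bool:
    """Report whether two screenshots differ at the byte level."""
    return screenshot_fingerprint(before) != screenshot_fingerprint(after)
```

In the agent, `before` and `after` would be the bytes returned by `await page.screenshot(type="png")` taken on either side of a click or form submission.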
FAQ
How do I handle workflows that require authentication?
Authenticate before starting the workflow. Use Playwright's storage_state to save and restore cookies and local storage. You can log in once manually, save the state with context.storage_state(path="auth.json"), then reuse it in subsequent runs with browser.new_context(storage_state="auth.json").
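The save-and-restore pattern above might be wrapped as follows. This is a sketch under stated assumptions: `auth_state_is_usable` and `open_authenticated_context` are hypothetical helpers, the one-hour freshness window is an arbitrary choice, and `browser` is expected to be a Playwright `Browser` object. Only `browser.new_context(storage_state=...)` is a Playwright call.

```python
import os
import time


def auth_state_is_usable(path: str, max_age_seconds: float = 3600.0) -> bool:
    """True if the saved state file exists and was written recently enough."""
    if not os.path.isfile(path):
        return False
    age = time.time() - os.path.getmtime(path)
    return age <= max_age_seconds


async def open_authenticated_context(browser, path: str = "auth.json"):
    """Create a context, restoring saved cookies and local storage if usable."""
    if auth_state_is_usable(path):
        return await browser.new_context(storage_state=path)
    # No usable state: return a clean context. The caller logs in and then
    # saves the session with `await context.storage_state(path=path)`.
    return await browser.new_context()
```

File age is only a rough proxy for session validity. A site can expire a session sooner, so the first step of the workflow should still verify that the page shows a logged-in state.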
What happens when a step fails partway through a multi-step workflow?
The state tracker records exactly which step failed and why. You have three recovery options: retry the failed step, restart from a known checkpoint (e.g., after login), or escalate to a human operator with the full state and screenshots for manual completion. The extracted_data dictionary preserves everything learned in previous steps.
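One way to choose among the three recovery options is a small policy function. The sketch below is illustrative only: `Recovery`, `choose_recovery`, and the `max_restarts` limit are hypothetical names and values, not part of the agent shown earlier. It assumes `failed_attempts` counts failures on the current step and `last_checkpoint` is the index of a step that is safe to resume from, or `None` if there is none.

```python
from enum import Enum
from typing import Optional


class Recovery(str, Enum):
    RETRY = "retry"
    RESTART_FROM_CHECKPOINT = "restart_from_checkpoint"
    ESCALATE = "escalate"


def choose_recovery(
    failed_attempts: int,
    max_retries: int,
    last_checkpoint: Optional[int],
    restarts_used: int,
    max_restarts: int = 1,
) -> Recovery:
    """Pick the cheapest recovery option that is still available."""
    # Retry while the step has retries left (initial attempt + max_retries)
    if failed_attempts <= max_retries:
        return Recovery.RETRY
    # Otherwise fall back to a known-good checkpoint, a limited number of times
    if last_checkpoint is not None and restarts_used < max_restarts:
        return Recovery.RESTART_FROM_CHECKPOINT
    # Nothing automatic left: hand the state and screenshots to a human
    return Recovery.ESCALATE
```

Ordering the options from cheapest to most expensive keeps human escalation as the last resort while still bounding how much automatic recovery is attempted.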
How do I prevent the agent from getting stuck in infinite loops?
Set hard limits at multiple levels: a maximum number of retries per step (2-3), a maximum total number of actions across the workflow (50), and a wall-clock timeout (5-10 minutes). If any limit is hit, the agent stops and returns the current state for debugging.
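These limits could be gathered into a single guard object that the main loop checks before every action. This is a minimal sketch: `RunLimits` is a hypothetical class, and its defaults (3 retries, 50 actions, 600 seconds) are simply taken from the ranges mentioned above.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RunLimits:
    max_retries_per_step: int = 3
    max_total_actions: int = 50
    max_seconds: float = 600.0
    actions: int = 0
    # Monotonic clock so system time changes do not affect the timeout
    started: float = field(default_factory=time.monotonic)

    def record_action(self) -> None:
        """Count one action against the workflow-wide budget."""
        self.actions += 1

    def exceeded(self, step_retries: int = 0) -> Optional[str]:
        """Return the name of the first limit hit, or None if within limits."""
        if step_retries > self.max_retries_per_step:
            return "step retry limit"
        if self.actions >= self.max_total_actions:
            return "total action limit"
        if time.monotonic() - self.started > self.max_seconds:
            return "wall-clock timeout"
        return None
```

Returning the reason as a string lets the caller record it in the workflow state's error list before stopping, which makes the stop condition visible during debugging.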
CallSphere Team