Multi-Step Web Tasks with GPT Vision: Complex Workflows Across Multiple Pages

Why Single-Step Vision Is Not Enough

Browsing a single page is straightforward, but real web tasks span multiple pages. Booking a flight requires searching, filtering results, selecting a flight, entering passenger details, choosing seats, and confirming payment. Each page looks different, expects different inputs, and may fail in different ways.

A multi-step vision agent needs three capabilities beyond basic screenshot analysis: task decomposition to plan ahead, state tracking to remember what it has done, and verification to confirm each step succeeded before proceeding.

Task Decomposition

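Start by having the model break a high-level task into discrete steps. The code below uses gpt-4o with structured outputs, so the plan comes back already parsed into a TaskPlan object.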

from pydantic import BaseModel
from openai import OpenAI

class TaskStep(BaseModel):
    step_number: int
    description: str
    expected_page_type: str  # search, results, form, confirmation
    success_indicator: str  # what to look for to confirm step worked
    data_to_extract: list[str]  # info to capture for later steps

class TaskPlan(BaseModel):
    task_description: str
    steps: list[TaskStep]
    estimated_total_steps: int

client = OpenAI()

def decompose_task(task: str) -> TaskPlan:
    """Break a complex web task into steps."""
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a web task planner. Break complex web tasks "
                    "into discrete steps. Each step should represent one "
                    "page interaction or page transition. Include what "
                    "success looks like for each step and what data needs "
                    "to be extracted for subsequent steps."
                ),
            },
            {
                "role": "user",
                "content": f"Plan the steps for this task: {task}",
            },
        ],
        response_format=TaskPlan,
    )
    return response.choices[0].message.parsed
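
A model-generated plan can be internally inconsistent, so a cheap sanity check before execution may save wasted browser time. The following is a minimal sketch, not part of the original code: validate_plan is a hypothetical helper, the Step and Plan dataclasses are stand-ins that mirror only the fields being checked (so the example runs without pydantic or an API key), and 1-based step numbering is an assumption.

```python
from dataclasses import dataclass, field

# Stand-ins mirroring the TaskStep/TaskPlan fields used by the checks below.
@dataclass
class Step:
    step_number: int
    description: str
    success_indicator: str = ""

@dataclass
class Plan:
    steps: list = field(default_factory=list)
    estimated_total_steps: int = 0

def validate_plan(plan: Plan, max_steps: int = 20) -> list:
    """Return a list of problems; an empty list means the plan looks usable."""
    problems = []
    if not plan.steps:
        problems.append("plan has no steps")
    if len(plan.steps) > max_steps:
        problems.append(f"plan has {len(plan.steps)} steps (limit {max_steps})")
    # Assumption: steps are numbered 1..N in execution order.
    expected = list(range(1, len(plan.steps) + 1))
    actual = [s.step_number for s in plan.steps]
    if actual != expected:
        problems.append(f"step numbers {actual} are not 1..{len(plan.steps)}")
    for s in plan.steps:
        if not s.success_indicator.strip():
            problems.append(f"step {s.step_number} has no success indicator")
    return problems
```

If the list is non-empty, one option is to send the problems back to the model and ask for a corrected plan rather than starting the browser.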

State Tracking Across Pages

The agent must maintain state as it moves through pages. This includes data extracted from earlier steps, which step it is on, and any errors encountered.

from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum

class StepStatus(str, Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"
    FAILED = "failed"
    RETRYING = "retrying"

@dataclass
class WorkflowState:
    task: str
    plan: TaskPlan
    current_step: int = 0
    extracted_data: dict = field(default_factory=dict)
    step_statuses: dict[int, StepStatus] = field(default_factory=dict)
    screenshots: list[str] = field(default_factory=list)
    errors: list[str] = field(default_factory=list)
    started_at: datetime = field(default_factory=datetime.now)

    @property
    def current_task_step(self) -> TaskStep | None:
        if self.current_step < len(self.plan.steps):
            return self.plan.steps[self.current_step]
        return None

    def advance(self):
        """Move to the next step."""
        self.step_statuses[self.current_step] = StepStatus.COMPLETED
        self.current_step += 1
        if self.current_step < len(self.plan.steps):
            self.step_statuses[self.current_step] = StepStatus.IN_PROGRESS

    def record_error(self, error: str):
        """Record an error for the current step."""
        # Use 1-based numbering to match get_context_summary().
        self.errors.append(
            f"Step {self.current_step + 1}: {error}"
        )
        self.step_statuses[self.current_step] = StepStatus.FAILED

    def get_context_summary(self) -> str:
        """Summarize state for the model prompt."""
        lines = [f"Task: {self.task}"]
        lines.append(f"Current step: {self.current_step + 1} "
                      f"of {len(self.plan.steps)}")
        if self.extracted_data:
            lines.append("Extracted data so far:")
            for k, v in self.extracted_data.items():
                lines.append(f"  - {k}: {v}")
        if self.errors:
            lines.append(f"Previous errors: {self.errors[-3:]}")
        return "\n".join(lines)

The Multi-Step Execution Engine

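The engine ties together planning, execution, verification, and state management. Note that execute_step, as listed, only asks the model to analyze the current screenshot and report a StepResult; the code that actually clicks, types, or navigates is not shown and has to be supplied separately.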

import asyncio
import base64
from playwright.async_api import async_playwright, Page

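class ExtractedField(BaseModel):
    name: str
    value: str

class StepResult(BaseModel):
    success: bool
    action_taken: str
    # A list of name/value pairs is used instead of dict[str, str] because
    # strict structured-output schemas do not allow objects with arbitrary keys.
    extracted_fields: list[ExtractedField]
    error: str  # empty string when the step succeeded
    next_action: str  # what to do next: proceed, retry, escalate

    @property
    def extracted_data(self) -> dict[str, str]:
        """Dict view of the extracted fields, for merging into WorkflowState."""
        return {f.name: f.value for f in self.extracted_fields}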

class MultiStepAgent:
    def __init__(self, max_retries: int = 2):
        self.client = OpenAI()
        self.max_retries = max_retries

    async def capture(self, page: Page) -> str:
        screenshot = await page.screenshot(type="png")
        return base64.b64encode(screenshot).decode()

    async def execute_step(
        self, page: Page, state: WorkflowState
    ) -> StepResult:
        """Execute a single step with vision guidance."""
        step = state.current_task_step
        screenshot = await self.capture(page)

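        # The OpenAI client used here is synchronous; run the call in a worker
        # thread so it does not block the event loop that drives Playwright.
        response = await asyncio.to_thread(
            self.client.beta.chat.completions.parse,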
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": (
                        "You are a web automation agent executing a "
                        "multi-step workflow. Analyze the current page "
                        "and determine the action needed for this step. "
                        "The viewport is 1280x720."
                    ),
                },
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": (
                                f"{state.get_context_summary()}\n\n"
                                f"Current step: {step.description}\n"
                                f"Success indicator: "
                                f"{step.success_indicator}\n"
                                f"Data to extract: "
                                f"{step.data_to_extract}\n\n"
                                "Analyze the page and report the result."
                            ),
                        },
                        {
                            "type": "image_url",
                            "image_url": {
                                "url": (
                                    "data:image/png;base64,"
                                    f"{screenshot}"
                                ),
                                "detail": "high",
                            },
                        },
                    ],
                },
            ],
            response_format=StepResult,
        )
        return response.choices[0].message.parsed

    async def run_workflow(self, url: str, task: str) -> WorkflowState:
        """Run a complete multi-step workflow."""
        plan = decompose_task(task)
        state = WorkflowState(task=task, plan=plan)
        state.step_statuses[0] = StepStatus.IN_PROGRESS

        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=True)
            page = await browser.new_page(
                viewport={"width": 1280, "height": 720}
            )
            await page.goto(url, wait_until="networkidle")

            while state.current_step < len(plan.steps):
                retries = 0
                while retries <= self.max_retries:
                    result = await self.execute_step(page, state)

                    if result.success:
                        state.extracted_data.update(
                            result.extracted_data
                        )
                        state.advance()
                        await asyncio.sleep(1)
                        break

                    retries += 1
                    if retries > self.max_retries:
                        state.record_error(result.error)
                        await browser.close()
                        return state

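                    # Mark the step as retrying, then back off before the next attempt.
                    state.step_statuses[state.current_step] = StepStatus.RETRYING
                    await asyncio.sleep(2)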

            await browser.close()

        return state
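
The engine needs some way to turn the model's decision into browser input. One possible approach is to have the model return a small structured action and dispatch it to Playwright. The sketch below is an assumption, not the original author's design: the Action schema and apply_action are hypothetical, and the page argument is expected to be a Playwright Page (only page.mouse.click, page.keyboard.type, page.keyboard.press, and page.goto are used).

```python
from dataclasses import dataclass

@dataclass
class Action:
    """Hypothetical action schema the model could be asked to return."""
    kind: str       # "click", "type", "press", "goto", or "none"
    x: float = 0    # viewport coordinates, used by "click"
    y: float = 0
    text: str = ""  # text to type, or key name for "press"
    url: str = ""   # target for "goto"

async def apply_action(page, action: Action, width: int = 1280, height: int = 720) -> None:
    """Dispatch one action to a Playwright Page (or an object with the same methods)."""
    if action.kind == "click":
        # Reject coordinates the model may have hallucinated outside the viewport.
        if not (0 <= action.x < width and 0 <= action.y < height):
            raise ValueError(f"click ({action.x}, {action.y}) is outside the viewport")
        await page.mouse.click(action.x, action.y)
    elif action.kind == "type":
        await page.keyboard.type(action.text)
    elif action.kind == "press":
        await page.keyboard.press(action.text)
    elif action.kind == "goto":
        await page.goto(action.url)
    elif action.kind == "none":
        return
    else:
        raise ValueError(f"unknown action kind: {action.kind!r}")
```

Keeping the dispatcher separate from the model call makes it possible to test the browser side with a fake page object and to log every action before it is executed.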

Handling Page Transitions

Page transitions are the trickiest part of multi-step workflows. After clicking a link or submitting a form, the page URL may change, content may load asynchronously, or a modal may appear instead of a navigation.

from playwright.async_api import TimeoutError as PlaywrightTimeoutError

async def wait_for_page_change(
    page: Page, previous_url: str, timeout: int = 10000
) -> bool:
    """Wait up to `timeout` ms for a navigation; return False if none is seen."""
    try:
        await page.wait_for_url(
            lambda url: url != previous_url, timeout=timeout
        )
        await page.wait_for_load_state("networkidle")
        return True
    except PlaywrightTimeoutError:
        # The URL did not change in time (modal, in-place SPA update).
        # Give dynamic content a moment to render before the next screenshot.
        await asyncio.sleep(1)
        return False
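
When the URL stays the same, the page may still have changed in place. One way to detect that is to fingerprint the page before the action and poll until the fingerprint differs. This is a minimal sketch under stated assumptions: fingerprint and wait_for_content_change are hypothetical helpers, and snapshot is any caller-supplied async function that returns the page HTML or screenshot bytes (with Playwright, page.content or page.screenshot would fit).

```python
import asyncio
import hashlib
import time
from typing import Awaitable, Callable, Union

Snapshot = Callable[[], Awaitable[Union[str, bytes]]]

def fingerprint(data: Union[str, bytes]) -> str:
    """Hash page HTML or screenshot bytes into a comparable digest."""
    if isinstance(data, str):
        data = data.encode("utf-8")
    return hashlib.sha256(data).hexdigest()

async def wait_for_content_change(
    snapshot: Snapshot,
    before: str,
    timeout_s: float = 5.0,
    poll_s: float = 0.25,
) -> bool:
    """Poll snapshot() until its fingerprint differs from `before` or time runs out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if fingerprint(await snapshot()) != before:
            return True
        await asyncio.sleep(poll_s)
    return False
```

A typical call would take before = fingerprint(await page.content()) ahead of the click and then await wait_for_content_change(page.content, before). Pages with timers, carousels, or animations change constantly, so an exact hash can report a change that has nothing to do with the action.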

FAQ

How do I handle workflows that require authentication?

Authenticate before starting the workflow. Use Playwright's storage_state to save and restore cookies and local storage. You can log in once manually, save the state with context.storage_state(path="auth.json"), then reuse it in subsequent runs with browser.new_context(storage_state="auth.json").
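
The save-and-reuse flow described above might look like the sketch below. The storage_state calls are the ones named in the answer; everything else is an assumption for illustration: the helper names, the 12-hour freshness window, and waiting for a person to press Enter after logging in by hand.

```python
import asyncio
import time
from pathlib import Path

AUTH_FILE = Path("auth.json")  # same file name as in the answer above

def saved_state_is_fresh(path: Path, max_age_s: float = 12 * 3600) -> bool:
    """Assumption: treat a saved session older than max_age_s as expired."""
    return path.exists() and (time.time() - path.stat().st_mtime) < max_age_s

async def open_authenticated_page(login_url: str, start_url: str):
    """Return (playwright, browser, page) with a logged-in session."""
    # Imported lazily so the freshness helper works without Playwright installed.
    from playwright.async_api import async_playwright

    p = await async_playwright().start()
    if not saved_state_is_fresh(AUTH_FILE):
        # Headed browser so a person can log in once by hand.
        browser = await p.chromium.launch(headless=False)
        context = await browser.new_context()
        page = await context.new_page()
        await page.goto(login_url)
        await asyncio.to_thread(input, "Log in, then press Enter here... ")
        await context.storage_state(path=str(AUTH_FILE))
        await browser.close()

    # Reuse the saved cookies and local storage in a headless run.
    browser = await p.chromium.launch(headless=True)
    context = await browser.new_context(storage_state=str(AUTH_FILE))
    page = await context.new_page()
    await page.goto(start_url)
    return p, browser, page
```

The caller is responsible for closing the browser and calling stop() on the returned playwright object when the workflow ends. The auth file contains live session credentials, so it should be kept out of version control.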

What happens when a step fails partway through a multi-step workflow?

The state tracker records exactly which step failed and why. You have three recovery options: retry the failed step, restart from a known checkpoint (e.g., after login), or escalate to a human operator with the full state and screenshots for manual completion. The extracted_data dictionary preserves everything learned in previous steps.
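
Restarting from a checkpoint requires the state to survive the process. A minimal sketch, assuming a JSON file on local disk: Checkpoint is a hypothetical, JSON-friendly subset of WorkflowState, and save_checkpoint and load_checkpoint are illustrative helpers rather than part of the original code.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Optional

@dataclass
class Checkpoint:
    """Hypothetical, JSON-friendly subset of WorkflowState."""
    task: str
    current_step: int = 0
    extracted_data: dict = field(default_factory=dict)
    errors: list = field(default_factory=list)

def save_checkpoint(cp: Checkpoint, path: Path) -> None:
    # Write to a temp file first so a crash cannot leave a half-written checkpoint.
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(asdict(cp), indent=2))
    tmp.replace(path)

def load_checkpoint(path: Path) -> Optional[Checkpoint]:
    """Return the saved checkpoint, or None if there is nothing to resume."""
    if not path.exists():
        return None
    return Checkpoint(**json.loads(path.read_text()))
```

Saving after every successful step keeps the file current. On restart, the browser still has to be brought back to the page that matches current_step, for example by replaying navigation or reusing a saved login.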

How do I prevent the agent from getting stuck in infinite loops?

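Set hard limits at multiple levels: a maximum number of retries per step (2-3), a maximum total number of actions across the workflow (for example 50), and a wall-clock timeout (5-10 minutes). If any limit is hit, the agent should stop and return the current state for debugging. The engine shown earlier enforces only the per-step retry limit through max_retries; the action cap and the timeout have to be added.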
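
A workflow-wide action cap and wall-clock timeout can be sketched as a small guard object that is charged once per action. RunBudget and BudgetExceeded are hypothetical names, and the defaults (50 actions, 600 seconds) are example values only.

```python
import time

class BudgetExceeded(RuntimeError):
    """Raised when the workflow runs out of actions or time."""

class RunBudget:
    """Hypothetical guard that caps total actions and wall-clock time for one run."""

    def __init__(self, max_actions: int = 50, max_seconds: float = 600.0,
                 clock=time.monotonic):
        self.max_actions = max_actions
        self.max_seconds = max_seconds
        self._clock = clock          # injectable so tests can fake time
        self._started = clock()
        self.actions = 0

    def spend(self) -> None:
        """Call once before every model call or browser action."""
        elapsed = self._clock() - self._started
        if elapsed > self.max_seconds:
            raise BudgetExceeded(f"wall-clock limit of {self.max_seconds}s exceeded")
        if self.actions >= self.max_actions:
            raise BudgetExceeded(f"action limit of {self.max_actions} reached")
        self.actions += 1
```

In a loop like run_workflow, calling budget.spend() at the top of each attempt and catching BudgetExceeded around the loop would let the agent record the error and return the current state instead of running indefinitely.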
