
UFO's Dual-Agent Architecture: How HostAgent and AppAgent Coordinate Tasks

Deep dive into Microsoft UFO's dual-agent system where HostAgent orchestrates application selection and AppAgent executes in-app UI actions, with detailed coordination flow and plan execution examples.

Why Two Agents Instead of One?

Windows automation is fundamentally different from web automation. A browser has one viewport, one DOM, and a uniform API. A Windows desktop has dozens of applications running simultaneously, each with its own window, menu system, and control hierarchy. A single agent trying to manage both "which app should I use?" and "which button should I click?" would face an overwhelming observation space.

UFO solves this by splitting responsibilities. The HostAgent operates at the desktop level — it sees all open windows, understands which applications are available, and decides where to route each sub-task. The AppAgent operates within a single application — it sees the controls, menus, and content of one window and executes precise UI actions.

This separation of concerns mirrors how humans work. You first decide "I need Excel for this" (HostAgent thinking), then you interact with Excel's ribbons, cells, and menus (AppAgent thinking).

HostAgent: The Orchestrator

The HostAgent is responsible for:

  1. Task decomposition — breaking a complex user request into sub-tasks
  2. Application selection — identifying which Windows application handles each sub-task
  3. Window management — activating, minimizing, and arranging application windows
  4. Handoff coordination — passing control to the AppAgent with clear instructions
# Simplified HostAgent logic
class HostAgent:
    def __init__(self, config: dict):
        self.model = config["HOST_AGENT"]["API_MODEL"]
        self.active_apps = self.detect_open_applications()

    def detect_open_applications(self) -> list[dict]:
        """Use Windows API to enumerate all visible windows."""
        import pywinauto
        desktop = pywinauto.Desktop(backend="uia")
        windows = desktop.windows()
        return [
            {
                "title": w.window_text(),
                "process": w.process_id(),
                "rect": w.rectangle(),
            }
            for w in windows if w.is_visible()
        ]

    def plan_task(self, user_request: str) -> list[dict]:
        """Ask GPT-4V to decompose the task into sub-tasks."""
        screenshot = self.capture_desktop_screenshot()
        prompt = f"""You are a Windows desktop automation planner.

User request: {user_request}

Open applications: {self.active_apps}

Break this into ordered sub-tasks. For each sub-task, specify:
1. The target application
2. The action to perform within that application
3. Any data to transfer between applications"""

        # call_vision_model is a placeholder for whichever vision-model
        # client you use; parse_subtasks extracts the structured sub-task
        # list from the model's text reply (both elided for brevity).
        response = call_vision_model(
            model=self.model,
            prompt=prompt,
            image=screenshot
        )
        return parse_subtasks(response)

AppAgent: The Executor

Once the HostAgent selects an application and brings it to the foreground, the AppAgent takes over. It operates in a tight observe-plan-act loop:

class AppAgent:
    def __init__(self, app_window, config: dict):
        self.window = app_window
        self.model = config["APP_AGENT"]["API_MODEL"]
        self.action_history = []
        self.max_steps = config.get("MAX_STEP", 50)

    def execute_task(self, instruction: str) -> bool:
        """Run the observation-action loop until task completes."""
        for step in range(self.max_steps):
            # Observe: capture and annotate current state
            screenshot = self.capture_app_screenshot()
            controls = self.enumerate_controls()
            annotated = self.annotate_screenshot(screenshot, controls)

            # Plan: ask the model what to do next
            action = self.get_next_action(
                annotated_screenshot=annotated,
                instruction=instruction,
                history=self.action_history,
                available_controls=controls
            )

            # Check for completion
            if action["status"] == "FINISH":
                return True

            # Act: execute the planned action
            self.execute_action(action, controls)
            self.action_history.append(action)

        return False  # Max steps exceeded

    def enumerate_controls(self) -> list[dict]:
        """List all interactive UI elements in the window."""
        controls = []
        for element in self.window.descendants():
            if element.is_enabled():
                controls.append({
                    "id": len(controls),
                    # With the UIA backend, control type and automation id
                    # live on the wrapper's element_info.
                    "type": element.element_info.control_type,
                    "name": element.window_text(),
                    "rect": element.rectangle(),
                    "automationId": element.element_info.automation_id,
                })
        return controls
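The annotate_screenshot step called in the loop above paints a numeric label next to each control so the model can refer to elements by ID. The drawing half needs a real screenshot, but the label bookkeeping is plain Python. A minimal sketch of that half — label_controls and resolve_action are illustrative names, not UFO's actual API:

```python
def label_controls(controls: list[dict]) -> dict[int, dict]:
    """Map the numeric label painted on the screenshot to its control record."""
    return {c["id"]: c for c in controls}

def resolve_action(action: dict, labels: dict[int, dict]) -> dict:
    """Translate the model's chosen label back into a concrete control."""
    control = labels.get(action["control_id"])
    if control is None:
        raise KeyError(f"model referenced unknown control {action['control_id']}")
    return {**action, "target": control}

controls = [
    {"id": 0, "type": "Button", "name": "Save"},
    {"id": 1, "type": "Edit", "name": "To"},
]
labels = label_controls(controls)
resolved = resolve_action({"control_id": 1, "action": "click"}, labels)
print(resolved["target"]["name"])  # → To
```

Keeping this mapping explicit is what lets the model answer with a bare number ("click control 7") instead of a fragile textual description of the element.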

The Coordination Flow

A complete task flows through these stages:


  1. User submits request — "Copy the sales totals from the Excel spreadsheet and paste them into a new Outlook email to the finance team"
  2. HostAgent captures desktop — takes a screenshot showing all open windows
  3. HostAgent decomposes — identifies two sub-tasks: (a) extract data from Excel, (b) compose email in Outlook
  4. HostAgent activates Excel — brings the Excel window to the foreground
  5. AppAgent executes in Excel — navigates to the sales totals, selects and copies the data
  6. HostAgent receives completion signal — AppAgent reports sub-task (a) is done
  7. HostAgent activates Outlook — switches focus to Outlook
  8. AppAgent executes in Outlook — creates new email, sets recipients, pastes data, sends
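The eight stages above reduce to a simple dispatch loop: the HostAgent walks its sub-task list, hands each one to an AppAgent bound to the target window, and stops to replan on failure. A minimal sketch with the agents stubbed out as callables (the names here are illustrative, not UFO's actual API):

```python
def coordinate(subtasks: list[dict], make_app_agent) -> list[dict]:
    """HostAgent-style dispatch: activate the app, delegate, collect results."""
    results = []
    for sub in subtasks:
        agent = make_app_agent(sub["application"])  # stands in for window activation
        ok = agent(sub["action"])                   # AppAgent's observe-plan-act loop
        results.append({"application": sub["application"], "done": ok})
        if not ok:  # the HostAgent sees the failure and can replan from here
            break
    return results

plan = [
    {"application": "Excel", "action": "copy sales totals"},
    {"application": "Outlook", "action": "compose email with totals"},
]
results = coordinate(plan, lambda app: (lambda action: True))
```

The important property is that the inner callable never sees the whole plan — each AppAgent receives exactly one sub-task, which is what keeps its observation space small.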

Plan Representation

UFO internally represents plans as structured action sequences. Each action has a type, target control, and parameters:

{
  "plan": [
    {
      "step": 1,
      "application": "Microsoft Excel",
      "action": "click",
      "target": "Cell A1",
      "description": "Click on cell A1 to start selection"
    },
    {
      "step": 2,
      "application": "Microsoft Excel",
      "action": "keyboard",
      "keys": "Ctrl+Shift+End",
      "description": "Select all data from A1 to the last used cell"
    }
  ]
}
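Because the model returns this plan as free text, UFO-style code has to parse and sanity-check it before handing any step to the AppAgent. A minimal validator sketch, using the field names from the example above:

```python
import json

REQUIRED = {"step", "application", "action", "description"}

def parse_plan(raw: str) -> list[dict]:
    """Parse a model-produced plan and reject steps missing required fields."""
    plan = json.loads(raw)["plan"]
    for entry in plan:
        missing = REQUIRED - entry.keys()
        if missing:
            raise ValueError(f"step {entry.get('step')} missing {sorted(missing)}")
    return sorted(plan, key=lambda e: e["step"])

raw = '''{"plan": [
  {"step": 2, "application": "Microsoft Excel", "action": "keyboard",
   "keys": "Ctrl+Shift+End", "description": "Select all data"},
  {"step": 1, "application": "Microsoft Excel", "action": "click",
   "target": "Cell A1", "description": "Click on cell A1"}
]}'''
steps = parse_plan(raw)
print([s["step"] for s in steps])  # → [1, 2]
```

Sorting by the step field also guards against models that emit sub-tasks out of order.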

Error Recovery Between Agents

When the AppAgent encounters an error — for example, a dialog box appears unexpectedly — it reports the failure back to the HostAgent. The HostAgent can then decide to retry the sub-task, modify the plan, or skip to an alternative approach.

This error recovery is one of the key advantages of the dual-agent design. A monolithic agent would need to handle both application-level and desktop-level recovery in a single decision space. By separating them, each agent can focus on errors within its domain.
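That recovery decision can be expressed as a small policy in the HostAgent: retry a bounded number of times, then fall back to an alternative sub-task if one exists. A hedged sketch of the policy (not UFO's actual implementation):

```python
def run_with_recovery(subtask, execute, max_retries=2, fallback=None):
    """Retry a failing sub-task, then try a fallback plan before giving up."""
    for attempt in range(max_retries + 1):
        if execute(subtask):
            return True
    if fallback is not None:  # e.g. a different application for the same goal
        return execute(fallback)
    return False

calls = []
def flaky(task):
    calls.append(task)
    return len(calls) >= 3  # fails twice, succeeds on the third call

ok = run_with_recovery({"action": "copy sales totals"}, flaky)
```

Bounding the retries matters: without a cap, a stuck dialog box would trap the system in an infinite observe-fail loop.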

FAQ

Can I add custom agents beyond HostAgent and AppAgent?

UFO's architecture is designed around the two-agent pattern. However, you can extend the AppAgent with custom action handlers or wrap UFO in a higher-level orchestration framework that manages multiple UFO instances for truly complex multi-desktop workflows.

What happens if the HostAgent picks the wrong application?

The AppAgent will fail to find the expected UI elements and report a failure. The HostAgent can then re-evaluate the desktop screenshot and try a different application. In practice, GPT-4o is quite accurate at application identification from window titles and visual appearance.

How does data transfer between applications work?

UFO primarily uses the Windows clipboard for cross-application data transfer — the same mechanism humans use (Ctrl+C, Ctrl+V). For structured data, the AppAgent can also read values from UI elements and pass them as text context to the next sub-task.
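The second path — passing extracted values as text context — amounts to folding the previous sub-task's output into the next instruction string. A minimal sketch (the helper name is hypothetical):

```python
def with_context(instruction: str, extracted: dict[str, str]) -> str:
    """Append values read from the previous app's UI to the next instruction."""
    if not extracted:
        return instruction
    lines = "\n".join(f"- {k}: {v}" for k, v in extracted.items())
    return f"{instruction}\n\nData from the previous step:\n{lines}"

totals = {"Q1": "$1.2M", "Q2": "$1.4M"}
next_instruction = with_context(
    "Paste the sales totals into the email body", totals
)
```

This keeps the two sub-tasks decoupled: the Outlook AppAgent never needs access to the Excel window, only to the text the HostAgent carried over.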


#MicrosoftUFO #DualAgent #HostAgent #AppAgent #AgenticArchitecture #WindowsAutomation #MultiAgent #Orchestration

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
