
UFO's Dual-Agent Architecture: How HostAgent and AppAgent Coordinate Tasks

Deep dive into Microsoft UFO's dual-agent system where HostAgent orchestrates application selection and AppAgent executes in-app UI actions, with detailed coordination flow and plan execution examples.

Why Two Agents Instead of One?

Windows automation is fundamentally different from web automation. A browser has one viewport, one DOM, and a uniform API. A Windows desktop has dozens of applications running simultaneously, each with its own window, menu system, and control hierarchy. A single agent trying to manage both "which app should I use?" and "which button should I click?" would face an overwhelming observation space.

UFO solves this by splitting responsibilities. The HostAgent operates at the desktop level — it sees all open windows, understands which applications are available, and decides where to route each sub-task. The AppAgent operates within a single application — it sees the controls, menus, and content of one window and executes precise UI actions.

This separation of concerns mirrors how humans work. You first decide "I need Excel for this" (HostAgent thinking), then you interact with Excel's ribbons, cells, and menus (AppAgent thinking).

HostAgent: The Orchestrator

The HostAgent is responsible for:

  1. Task decomposition — breaking a complex user request into sub-tasks
  2. Application selection — identifying which Windows application handles each sub-task
  3. Window management — activating, minimizing, and arranging application windows
  4. Handoff coordination — passing control to the AppAgent with clear instructions
# Simplified HostAgent logic
class HostAgent:
    def __init__(self, config: dict):
        self.model = config["HOST_AGENT"]["API_MODEL"]
        self.active_apps = self.detect_open_applications()

    def detect_open_applications(self) -> list[dict]:
        """Use Windows API to enumerate all visible windows."""
        import pywinauto
        desktop = pywinauto.Desktop(backend="uia")
        windows = desktop.windows()
        return [
            {
                "title": w.window_text(),
                "process": w.process_id(),
                "rect": w.rectangle(),
            }
            for w in windows if w.is_visible()
        ]

    def plan_task(self, user_request: str) -> list[dict]:
        """Ask GPT-4V to decompose the task into sub-tasks."""
        screenshot = self.capture_desktop_screenshot()
        prompt = f"""You are a Windows desktop automation planner.

User request: {user_request}

Open applications: {self.active_apps}

Break this into ordered sub-tasks. For each sub-task, specify:
1. The target application
2. The action to perform within that application
3. Any data to transfer between applications"""

        # call_vision_model is a placeholder for whichever vision-model
        # client you use; parse_subtasks extracts the structured sub-task
        # list from the model's text reply (both elided for brevity).
        response = call_vision_model(
            model=self.model,
            prompt=prompt,
            image=screenshot
        )
        return parse_subtasks(response)

AppAgent: The Executor

Once the HostAgent selects an application and brings it to the foreground, the AppAgent takes over. It operates in a tight observe-plan-act loop:

class AppAgent:
    def __init__(self, app_window, config: dict):
        self.window = app_window
        self.model = config["APP_AGENT"]["API_MODEL"]
        self.action_history = []
        self.max_steps = config.get("MAX_STEP", 50)

    def execute_task(self, instruction: str) -> bool:
        """Run the observation-action loop until task completes."""
        for step in range(self.max_steps):
            # Observe: capture and annotate current state
            screenshot = self.capture_app_screenshot()
            controls = self.enumerate_controls()
            annotated = self.annotate_screenshot(screenshot, controls)

            # Plan: ask the model what to do next
            action = self.get_next_action(
                annotated_screenshot=annotated,
                instruction=instruction,
                history=self.action_history,
                available_controls=controls
            )

            # Check for completion
            if action["status"] == "FINISH":
                return True

            # Act: execute the planned action
            self.execute_action(action, controls)
            self.action_history.append(action)

        return False  # Max steps exceeded

    def enumerate_controls(self) -> list[dict]:
        """List all interactive UI elements in the window."""
        controls = []
        for element in self.window.descendants():
            if element.is_enabled():
                controls.append({
                    "id": len(controls),
                    # With the UIA backend, control type and automation id
                    # live on the wrapper's element_info.
                    "type": element.element_info.control_type,
                    "name": element.window_text(),
                    "rect": element.rectangle(),
                    "automationId": element.element_info.automation_id,
                })
        return controls
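The annotate_screenshot step called in the loop above paints a numeric label next to each control so the model can refer to elements by ID. The drawing half needs a real screenshot, but the label bookkeeping is plain Python. A minimal sketch of that half — label_controls and resolve_action are illustrative names, not UFO's actual API:

```python
def label_controls(controls: list[dict]) -> dict[int, dict]:
    """Map the numeric label painted on the screenshot to its control record."""
    return {c["id"]: c for c in controls}

def resolve_action(action: dict, labels: dict[int, dict]) -> dict:
    """Translate the model's chosen label back into a concrete control."""
    control = labels.get(action["control_id"])
    if control is None:
        raise KeyError(f"model referenced unknown control {action['control_id']}")
    return {**action, "target": control}

controls = [
    {"id": 0, "type": "Button", "name": "Save"},
    {"id": 1, "type": "Edit", "name": "To"},
]
labels = label_controls(controls)
resolved = resolve_action({"control_id": 1, "action": "click"}, labels)
print(resolved["target"]["name"])  # → To
```

Keeping this mapping explicit is what lets the model answer with a bare number ("click control 7") instead of a fragile textual description of the element.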

The Coordination Flow

A complete task flows through these stages:


  1. User submits request — "Copy the sales totals from the Excel spreadsheet and paste them into a new Outlook email to the finance team"
  2. HostAgent captures desktop — takes a screenshot showing all open windows
  3. HostAgent decomposes — identifies two sub-tasks: (a) extract data from Excel, (b) compose email in Outlook
  4. HostAgent activates Excel — brings the Excel window to the foreground
  5. AppAgent executes in Excel — navigates to the sales totals, selects and copies the data
  6. HostAgent receives completion signal — AppAgent reports sub-task (a) is done
  7. HostAgent activates Outlook — switches focus to Outlook
  8. AppAgent executes in Outlook — creates new email, sets recipients, pastes data, sends
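The eight stages above reduce to a simple dispatch loop: the HostAgent walks its sub-task list, hands each one to an AppAgent bound to the target window, and stops to replan on failure. A minimal sketch with the agents stubbed out as callables (the names here are illustrative, not UFO's actual API):

```python
def coordinate(subtasks: list[dict], make_app_agent) -> list[dict]:
    """HostAgent-style dispatch: activate the app, delegate, collect results."""
    results = []
    for sub in subtasks:
        agent = make_app_agent(sub["application"])  # stands in for window activation
        ok = agent(sub["action"])                   # AppAgent's observe-plan-act loop
        results.append({"application": sub["application"], "done": ok})
        if not ok:  # the HostAgent sees the failure and can replan from here
            break
    return results

plan = [
    {"application": "Excel", "action": "copy sales totals"},
    {"application": "Outlook", "action": "compose email with totals"},
]
results = coordinate(plan, lambda app: (lambda action: True))
```

The important property is that the inner callable never sees the whole plan — each AppAgent receives exactly one sub-task, which is what keeps its observation space small.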

Plan Representation

UFO internally represents plans as structured action sequences. Each action has a type, target control, and parameters:

{
  "plan": [
    {
      "step": 1,
      "application": "Microsoft Excel",
      "action": "click",
      "target": "Cell A1",
      "description": "Click on cell A1 to start selection"
    },
    {
      "step": 2,
      "application": "Microsoft Excel",
      "action": "keyboard",
      "keys": "Ctrl+Shift+End",
      "description": "Select all data from A1 to the last used cell"
    }
  ]
}
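Because the model returns this plan as free text, UFO-style code has to parse and sanity-check it before handing any step to the AppAgent. A minimal validator sketch, using the field names from the example above:

```python
import json

REQUIRED = {"step", "application", "action", "description"}

def parse_plan(raw: str) -> list[dict]:
    """Parse a model-produced plan and reject steps missing required fields."""
    plan = json.loads(raw)["plan"]
    for entry in plan:
        missing = REQUIRED - entry.keys()
        if missing:
            raise ValueError(f"step {entry.get('step')} missing {sorted(missing)}")
    return sorted(plan, key=lambda e: e["step"])

raw = '''{"plan": [
  {"step": 2, "application": "Microsoft Excel", "action": "keyboard",
   "keys": "Ctrl+Shift+End", "description": "Select all data"},
  {"step": 1, "application": "Microsoft Excel", "action": "click",
   "target": "Cell A1", "description": "Click on cell A1"}
]}'''
steps = parse_plan(raw)
print([s["step"] for s in steps])  # → [1, 2]
```

Sorting by the step field also guards against models that emit sub-tasks out of order.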

Error Recovery Between Agents

When the AppAgent encounters an error — for example, a dialog box appears unexpectedly — it reports the failure back to the HostAgent. The HostAgent can then decide to retry the sub-task, modify the plan, or skip to an alternative approach.

This error recovery is one of the key advantages of the dual-agent design. A monolithic agent would need to handle both application-level and desktop-level recovery in a single decision space. By separating them, each agent can focus on errors within its domain.
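That recovery decision can be expressed as a small policy in the HostAgent: retry a bounded number of times, then fall back to an alternative sub-task if one exists. A hedged sketch of the policy (not UFO's actual implementation):

```python
def run_with_recovery(subtask, execute, max_retries=2, fallback=None):
    """Retry a failing sub-task, then try a fallback plan before giving up."""
    for attempt in range(max_retries + 1):
        if execute(subtask):
            return True
    if fallback is not None:  # e.g. a different application for the same goal
        return execute(fallback)
    return False

calls = []
def flaky(task):
    calls.append(task)
    return len(calls) >= 3  # fails twice, succeeds on the third call

ok = run_with_recovery({"action": "copy sales totals"}, flaky)
```

Bounding the retries matters: without a cap, a stuck dialog box would trap the system in an infinite observe-fail loop.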

FAQ

Can I add custom agents beyond HostAgent and AppAgent?

UFO's architecture is designed around the two-agent pattern. However, you can extend the AppAgent with custom action handlers or wrap UFO in a higher-level orchestration framework that manages multiple UFO instances for truly complex multi-desktop workflows.

What happens if the HostAgent picks the wrong application?

The AppAgent will fail to find the expected UI elements and report a failure. The HostAgent can then re-evaluate the desktop screenshot and try a different application. In practice, GPT-4o is quite accurate at application identification from window titles and visual appearance.

How does data transfer between applications work?

UFO primarily uses the Windows clipboard for cross-application data transfer — the same mechanism humans use (Ctrl+C, Ctrl+V). For structured data, the AppAgent can also read values from UI elements and pass them as text context to the next sub-task.
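The second path — passing extracted values as text context — amounts to folding the previous sub-task's output into the next instruction string. A minimal sketch (the helper name is hypothetical):

```python
def with_context(instruction: str, extracted: dict[str, str]) -> str:
    """Append values read from the previous app's UI to the next instruction."""
    if not extracted:
        return instruction
    lines = "\n".join(f"- {k}: {v}" for k, v in extracted.items())
    return f"{instruction}\n\nData from the previous step:\n{lines}"

totals = {"Q1": "$1.2M", "Q2": "$1.4M"}
next_instruction = with_context(
    "Paste the sales totals into the email body", totals
)
```

This keeps the two sub-tasks decoupled: the Outlook AppAgent never needs access to the Excel window, only to the text the HostAgent carried over.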


#MicrosoftUFO #DualAgent #HostAgent #AppAgent #AgenticArchitecture #WindowsAutomation #MultiAgent #Orchestration

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
