Installing and Configuring Microsoft UFO: Getting Started with Windows Automation
Step-by-step guide to installing Microsoft UFO, configuring API keys, setting up the configuration files, and running your first automated Windows task with natural language.
Prerequisites
Before installing UFO, ensure your system meets the following requirements:
- Windows 10 or 11 (UFO uses Windows UI Automation APIs that are not available on macOS or Linux)
- Python 3.10 or later installed and added to PATH
- An OpenAI API key with access to GPT-4V or GPT-4o (vision-capable models)
- Git for cloning the repository
- At least 8 GB of RAM (screenshots and vision model calls are memory-intensive)
UFO depends on the Windows UI Automation COM interfaces, so it must run on a Windows machine — not WSL, not a Linux VM. If you are developing on macOS or Linux, you will need a Windows machine or a cloud Windows instance.
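Before cloning, you can sanity-check the environment with a few lines of Python. This helper is a hypothetical convenience script, not part of UFO itself; it only checks the OS and Python version from the list above.

```python
import platform
import sys

def unmet_prerequisites(system: str, python_version: tuple) -> list:
    """Return a list of human-readable problems; empty means all checks pass."""
    problems = []
    if system != "Windows":
        # UFO calls Windows UI Automation COM interfaces, so WSL/Linux/macOS won't work
        problems.append("UFO requires native Windows (UI Automation APIs are Windows-only)")
    if python_version < (3, 10):
        problems.append("Python 3.10 or later is required")
    return problems

if __name__ == "__main__":
    for problem in unmet_prerequisites(platform.system(), sys.version_info[:2]):
        print("MISSING:", problem)
```

RAM and API-key availability cannot be reliably checked from a script like this, so verify those manually.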
Step 1: Clone the Repository
UFO is distributed as a GitHub repository, not a PyPI package. Clone it and enter the project directory:
git clone https://github.com/microsoft/UFO.git
cd UFO
Step 2: Create a Virtual Environment and Install Dependencies
Set up an isolated Python environment:
python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt
The requirements include openai, Pillow for screenshot handling, pywinauto for Windows UI Automation, and several other dependencies for image processing and control interaction.
Step 3: Configure API Keys
UFO reads its configuration from YAML files in the ufo/config/ directory. The primary file you need to edit is config.yaml. Create it from the template:
copy ufo\config\config.yaml.template ufo\config\config.yaml
Open the file and set your API credentials:
# ufo/config/config.yaml
# OpenAI API configuration
OPENAI_API_TYPE: "openai"
OPENAI_API_KEY: "sk-proj-your-api-key-here"
OPENAI_API_BASE: "https://api.openai.com/v1"
OPENAI_API_VERSION: "2024-02-15-preview"
# Model selection
HOST_AGENT:
  API_MODEL: "gpt-4o"
APP_AGENT:
  API_MODEL: "gpt-4o"
# Screenshot settings
SCREENSHOT_BACKEND: "uia" # Options: uia, win32
ANNOTATION_COLORS:
- "#FF0000"
- "#00FF00"
- "#0000FF"
The configuration separates model settings for the HostAgent and AppAgent. You can use different models for each — for example, a cheaper model for host-level routing and a more capable model for in-app actions.
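For example, a mixed setup might look like the fragment below. The model names are illustrative; check that your API key actually has access to whichever models you specify.

```yaml
# Cheaper model for routing tasks to the right application,
# stronger vision model for in-app control interaction
HOST_AGENT:
  API_MODEL: "gpt-4o-mini"
APP_AGENT:
  API_MODEL: "gpt-4o"
```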
Step 4: Configure Azure OpenAI (Optional)
If your organization uses Azure OpenAI Service instead of the public OpenAI API, update the configuration accordingly:
# Azure OpenAI configuration
OPENAI_API_TYPE: "azure"
OPENAI_API_KEY: "your-azure-api-key"
OPENAI_API_BASE: "https://your-resource.openai.azure.com/"
OPENAI_API_VERSION: "2024-02-15-preview"
HOST_AGENT:
  API_MODEL: "your-gpt4o-deployment-name"
APP_AGENT:
  API_MODEL: "your-gpt4o-deployment-name"
Note that you provide the deployment name, not the model name, when using Azure.
Step 5: Run Your First Task
With everything configured, launch UFO:
python -m ufo --task "Open Notepad and type Hello World"
UFO will:
- Launch or find the Notepad application
- Capture a screenshot and annotate UI elements
- Send the annotated screenshot to the configured vision model (GPT-4o or GPT-4V)
- Execute the returned actions (click in the text area, type the text)
- Repeat until the task is complete
You will see step-by-step output in the console showing what the agent observes and what actions it takes.
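The loop above can be sketched roughly as follows. This is a simplified stand-in for illustration, not UFO's actual code; `observe`, `plan`, and `act` are placeholders for screenshot capture, the vision-model call, and UI execution.

```python
def run_task(task, observe, plan, act, max_steps=50):
    """Simplified observe-plan-act loop in the spirit of UFO's step cycle.

    observe() captures and annotates a screenshot, plan() asks the vision
    model for the next action, act() executes it against the UI.
    """
    for step in range(max_steps):
        screenshot = observe()
        action = plan(task, screenshot)
        if action == "DONE":
            return step  # number of actions executed before completion
        act(action)
    raise RuntimeError("MAX_STEP reached before task completion")

# Stubbed usage: the "model" returns two actions, then signals completion
actions = iter(["click_text_area", "type_hello_world", "DONE"])
steps = run_task(
    "Open Notepad and type Hello World",
    observe=lambda: "annotated-screenshot",
    plan=lambda task, shot: next(actions),
    act=lambda action: None,
)
```

The `max_steps` cutoff mirrors the `MAX_STEP` setting described in the configuration reference below; without it, a confused agent could loop indefinitely.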
Understanding the Configuration File
Here is a more complete configuration with explanations:
# ufo/config/config.yaml - Full reference
# API Provider: "openai" or "azure"
OPENAI_API_TYPE: "openai"
OPENAI_API_KEY: "sk-proj-..."
OPENAI_API_BASE: "https://api.openai.com/v1"
# Agent model configuration
HOST_AGENT:
  API_MODEL: "gpt-4o"
  MAX_TOKENS: 2048
  TEMPERATURE: 0.1  # Low temperature for deterministic actions
APP_AGENT:
  API_MODEL: "gpt-4o"
  MAX_TOKENS: 4096  # Higher token limit for complex UI analysis
  TEMPERATURE: 0.1
# Execution settings
MAX_STEP: 50 # Maximum steps before aborting a task
SLEEP_TIME: 2 # Seconds to wait between actions (UI settling)
SAFE_GUARD: true # Require confirmation before destructive actions
# Screenshot configuration
SCREENSHOT_BACKEND: "uia"
INCLUDE_LAST_SCREENSHOTS: 3 # Number of previous screenshots for context
CONCAT_SCREENSHOTS: false # Whether to tile screenshots side by side
# Logging
LOG_LEVEL: "INFO"
SAVE_SCREENSHOTS: true # Save annotated screenshots for debugging
LOG_DIR: "logs/"
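A typo in config.yaml typically only surfaces mid-run. A small pre-flight check like the one below can catch the obvious cases; this validator is a hypothetical helper operating on the parsed config as a dict, not something shipped with UFO.

```python
def validate_config(cfg: dict) -> list:
    """Return a list of problems found in a UFO-style config dict."""
    problems = []
    for key in ("OPENAI_API_TYPE", "OPENAI_API_KEY", "HOST_AGENT", "APP_AGENT"):
        if key not in cfg:
            problems.append(f"missing key: {key}")
    if cfg.get("OPENAI_API_TYPE") not in ("openai", "azure", None):
        problems.append("OPENAI_API_TYPE must be 'openai' or 'azure'")
    if cfg.get("MAX_STEP", 50) <= 0:
        problems.append("MAX_STEP must be positive")
    return problems
```

Load the YAML with a library such as PyYAML and pass the resulting dict in; an empty list means the basic checks passed.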
Step 6: Verify With a Multi-Step Task
Test a more complex workflow to confirm everything works end to end:
python -m ufo --task "Open File Explorer, navigate to Documents, and create a new folder called TestUFO"
Watch the console output as the HostAgent identifies File Explorer as the target application, the AppAgent navigates the folder tree, and the folder creation sequence executes.
Environment Variables as an Alternative
Instead of editing the YAML file directly, you can set configuration values via environment variables. This is useful for CI/CD or containerized setups:
set OPENAI_API_KEY=sk-proj-your-key
set UFO_HOST_MODEL=gpt-4o
set UFO_APP_MODEL=gpt-4o
set UFO_MAX_STEP=30
python -m ufo --task "Your task here"
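If you wrap UFO in your own launcher script, the same precedence (environment variable wins over the YAML value) is easy to implement. The variable names follow the ones above; this helper is an assumption about your own wrapper code, not UFO's internals.

```python
import os

def config_value(name, yaml_value, env=os.environ):
    """Prefer an environment variable over the value read from config.yaml."""
    return env.get(name, yaml_value)

# Falls back to the YAML setting when UFO_HOST_MODEL is not set
host_model = config_value("UFO_HOST_MODEL", "gpt-4o")
```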
Troubleshooting Common Setup Issues
"No module named pywinauto": Make sure you activated the virtual environment before running pip install. Run .venv\Scripts\activate again and reinstall.
"Access denied" on screenshot capture: Run your terminal as Administrator. UFO needs elevated permissions to capture screenshots of some applications.
"Model not found" errors: Verify your API key has access to the vision model specified in config. Try gpt-4o as a fallback.
Slow execution: Increase SLEEP_TIME if actions are executing before the UI finishes rendering. Windows animations can cause the agent to see transitional states.
FAQ
Can I use UFO without an OpenAI API key?
UFO requires a vision-capable LLM to interpret screenshots. You can use Azure OpenAI as an alternative, or configure a local model endpoint that supports the OpenAI vision API format, but some form of multimodal model access is required.
Does UFO support multiple monitors?
UFO captures the screen where the target application window is located. Multi-monitor setups work as long as the target application is fully visible on one screen; a window spanning two monitors may be captured only partially.
How much does it cost to run UFO tasks?
Each step involves sending an annotated screenshot (roughly 1000-2000 tokens for the image) plus prompt tokens to GPT-4o. A simple 5-step task costs approximately $0.05-0.15 USD. Complex multi-application tasks with 30+ steps can cost $0.50-1.00 USD.
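The arithmetic behind that estimate can be sketched as below. The per-token prices are illustrative assumptions; check OpenAI's current pricing page before relying on these numbers.

```python
def estimate_cost(steps, image_tokens=1500, prompt_tokens=1000,
                  output_tokens=300, input_price=2.50e-6, output_price=10.00e-6):
    """Rough per-task cost: each step sends one annotated screenshot plus
    prompt text, and receives a short action plan back from the model."""
    input_cost = steps * (image_tokens + prompt_tokens) * input_price
    output_cost = steps * output_tokens * output_price
    return input_cost + output_cost

simple_task = estimate_cost(5)    # lands near the low end of $0.05-0.15
complex_task = estimate_cost(30)  # multi-application workflow
```

Real costs vary with screenshot resolution (which drives image-token count) and how verbose the agent's reasoning output is.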
CallSphere Team