Computer Using Agent Sample App
This document explains how to use the OpenAI API to build a sample application called "Computer Using Agent (CUA)". CUA is an agent that interprets screenshots of a computer and proposes the actions to take next, such as clicking and typing text.
Main contents include:
- Basic concepts: Introduces how CUA works: by observing screenshots, the model suggests corresponding actions (such as `click`, `type`). You need to execute these actions in the environment and provide new screenshots for the model to continue making decisions.
- Code structure: Introduces two main abstract classes, `Computer` and `Agent`. `Computer` is responsible for executing operations issued by CUA (e.g., clicking on the screen), while `Agent` is responsible for repeatedly invoking the model until all computer operations and function calls are processed.
- Execution method: Provides ways to run CUA via the command-line interface (CLI). It can use local browsers (via Playwright), Docker containers, or remote browser services (Browserbase, Scrapybara) as different "computer" environments.
- Computer environments: Details the configuration and operation methods of various "computer" environments, including required dependencies and API keys.
- Function calls: The CUA Agent can call functions. If the function is defined in the `Computer` class, the call will be routed to `Computer` for execution. This allows you to extend the functionality of CUA, such as providing `back()` or `goto(url)` functions to help CUA navigate.
- Security risks: Emphasizes the risks of using CUA and recommends referring to the official documentation for related safety measures.
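
The observe, act, and re-observe loop and the function-call routing described above can be sketched roughly as follows. This is a minimal illustration, not the sample app's actual code: `FakeBrowser`, `scripted_model`, the step dictionaries, and the `"computer_call"` / `"function_call"` labels are all assumptions made for this example, and the model is replaced by a scripted stand-in so that nothing calls the OpenAI API.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional, Protocol


class Computer(Protocol):
    """Hypothetical minimal interface for a "computer" environment."""

    def screenshot(self) -> str: ...
    def click(self, x: int, y: int) -> None: ...
    def type(self, text: str) -> None: ...


@dataclass
class FakeBrowser:
    """In-memory stand-in for a real environment; it only records actions."""

    log: list[str] = field(default_factory=list)

    def screenshot(self) -> str:
        # A real environment would return image data; this is a text summary.
        return f"screen after {len(self.log)} action(s)"

    def click(self, x: int, y: int) -> None:
        self.log.append(f"click({x}, {y})")

    def type(self, text: str) -> None:
        self.log.append(f"type({text!r})")

    def goto(self, url: str) -> None:
        # Extra navigation helper, exposed to the model as a function.
        self.log.append(f"goto({url!r})")


# A model maps the latest screenshot to the next step, or None when finished.
Model = Callable[[str], Optional[dict[str, Any]]]


def scripted_model(steps: list[dict[str, Any]]) -> Model:
    """Return a fake model that replays a fixed list of steps."""
    remaining = iter(steps)
    return lambda _screenshot: next(remaining, None)


class Agent:
    """Calls the model repeatedly until it stops requesting actions."""

    def __init__(self, model: Model, computer: Computer, max_turns: int = 20):
        self.model = model
        self.computer = computer
        self.max_turns = max_turns

    def dispatch(self, step: dict[str, Any]) -> None:
        name, args = step["name"], step.get("args", {})
        handler = getattr(self.computer, name, None)
        if handler is None:
            # Neither a computer action nor a function defined on the computer.
            raise ValueError(f"unknown {step['kind']}: {name}")
        # Computer actions and functions defined on the computer both run there.
        handler(**args)

    def run(self) -> int:
        """Run the loop and return the number of steps executed."""
        screenshot = self.computer.screenshot()
        for turn in range(self.max_turns):
            step = self.model(screenshot)
            if step is None:
                return turn
            self.dispatch(step)
            # Feed a fresh screenshot back so the model can decide what is next.
            screenshot = self.computer.screenshot()
        raise RuntimeError("model did not finish within max_turns")


browser = FakeBrowser()
model = scripted_model([
    {"kind": "function_call", "name": "goto", "args": {"url": "https://example.com"}},
    {"kind": "computer_call", "name": "click", "args": {"x": 10, "y": 20}},
    {"kind": "computer_call", "name": "type", "args": {"text": "hello"}},
])
turns = Agent(model, browser).run()
print(turns, browser.log)
```

In this sketch, swapping `FakeBrowser` for a class that drives a real browser or container would leave the `Agent` loop unchanged, which is the point of keeping the two abstractions separate.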
Use cases:
CUA can be applied to automate computer tasks, such as:
- Web browsing automation: Automatically search for information, fill out forms, and shop online.
- Software operation automation: Automatically execute specific processes in software, such as data entry and file management.
- Assisting people with disabilities: Helping people with mobility issues use computers.
- Process automation and RPA (Robotic Process Automation): Replacing manual repetitive computer operations to improve efficiency.
- Automated testing: Simulating user behavior to perform automated software testing.
In short, this sample application gives developers a starting point for building an agent that can operate a computer much as a person would. The technology is still in preview and carries potential security risks, so use it with caution.