Earlier this year, Google said it would bring computer use capabilities to developers through the Gemini API.
The company is now introducing the Gemini 2.5 Computer Use model, its latest specialized model built on the visual understanding and reasoning capabilities of Gemini 2.5 Pro, which lets agents interact with user interfaces (UIs).
According to Google, the model outperforms leading alternatives on several web and mobile control benchmarks while running at lower latency. Developers can access these capabilities through the Gemini API in Google AI Studio and Vertex AI.
How AI Models Work
While AI models can interact with software through structured APIs, numerous digital tasks still necessitate direct engagement with graphical user interfaces, such as filling out and submitting forms.
To accomplish these tasks, agents need to navigate web pages and applications in a manner akin to human interaction: through clicking, typing, and scrolling.
The ability to natively fill out forms, manipulate interactive elements such as dropdowns and filters, and operate behind logins is a significant step toward building capable, general-purpose agents.
Operational Overview
The model’s core capabilities are exposed through the new `computer_use` tool in the Gemini API, which is meant to be run inside a loop.
Inputs to the tool consist of the user request, a screenshot of the environment, and a history of recent actions.
The input can also specify functions to exclude from the full list of supported UI actions, or additional custom functions to include.
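As a rough illustration of how those inputs fit together, the sketch below bundles them into one structure and derives the set of actions the model would be allowed to call. It does not use the Gemini SDK: `TurnInput`, `SUPPORTED_ACTIONS`, and the action names are hypothetical placeholders chosen for this example, not the API's actual identifiers.

```python
from dataclasses import dataclass, field

# Placeholder action names for illustration only; the real API defines its own list.
SUPPORTED_ACTIONS = ("click", "type_text", "scroll", "navigate")


@dataclass
class TurnInput:
    """Hypothetical container for what is sent to the model on one turn."""
    user_request: str                      # the task in natural language
    screenshot_png: bytes                  # current view of the environment
    recent_actions: list[str] = field(default_factory=list)
    excluded_actions: set[str] = field(default_factory=set)
    custom_functions: list[str] = field(default_factory=list)

    def allowed_actions(self) -> list[str]:
        # Start from the built-in actions, drop the excluded ones,
        # then append any custom functions supplied by the developer.
        built_in = [a for a in SUPPORTED_ACTIONS if a not in self.excluded_actions]
        return built_in + list(self.custom_functions)


turn = TurnInput(
    user_request="Fill out the contact form",
    screenshot_png=b"",                    # a real client would pass image bytes
    recent_actions=["navigate"],
    excluded_actions={"scroll"},
    custom_functions=["open_app"],
)
print(turn.allowed_actions())  # ['click', 'type_text', 'navigate', 'open_app']
```

Keeping exclusions and custom functions as separate fields mirrors the description above: one narrows the built-in action set, the other extends it.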
[Figure: Flow of the Gemini 2.5 Computer Use model. An initial task yields a screenshot and context, which are sent to the model; the model returns a response to the computer environment, which executes the action.]
The model then examines these inputs and generates a response, generally a function call corresponding to one of the UI actions, such as clicking or typing.
The response may also include a request for end-user confirmation, which is required for certain actions, such as making a purchase. The client-side code then executes the received action.
After the action is executed, a new screenshot of the GUI and the current URL are sent back to the Computer Use model as a function response, restarting the loop.
This iterative process persists until the task reaches completion, an error arises, or the interaction is halted by a safety response or user decision.
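The loop described above can be sketched in a few lines of client-side code. This is a minimal, self-contained simulation under stated assumptions, not the Gemini SDK: `ModelResponse`, `Observation`, `run_agent_loop`, and the callbacks are hypothetical names, the model is replaced by a scripted stub, and the `max_turns` cap is an extra safeguard added for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelResponse:
    """Hypothetical shape of one model reply."""
    action: Optional[str]                 # None means the model considers the task done
    args: dict
    needs_confirmation: bool = False      # e.g. before a purchase
    blocked_by_safety: bool = False


@dataclass
class Observation:
    screenshot: bytes
    url: str


def run_agent_loop(task, model_step, execute, confirm, first_obs, max_turns=20):
    """Ask the model for an action, run it, feed back the new state, repeat."""
    obs, history = first_obs, []
    for _ in range(max_turns):
        response = model_step(task, obs, history)
        if response.blocked_by_safety:
            return "stopped_by_safety"
        if response.action is None:
            return "completed"
        if response.needs_confirmation and not confirm(response):
            return "stopped_by_user"
        try:
            obs = execute(response)       # returns the new screenshot and URL
        except RuntimeError:
            return "error"
        history.append(response.action)
    return "max_turns_reached"


def demo(confirm_answer):
    """Drive the loop with a scripted stand-in for the model."""
    script = iter([
        ModelResponse("click", {"x": 10, "y": 20}),
        ModelResponse("type_text", {"text": "hello"}),
        ModelResponse("submit_purchase", {}, needs_confirmation=True),
        ModelResponse(None, {}),
    ])
    executed = []

    def model_step(task, obs, history):
        return next(script)

    def execute(response):
        executed.append(response.action)
        return Observation(b"", f"https://example.com/step{len(executed)}")

    start = Observation(b"", "https://example.com")
    result = run_agent_loop("Buy item", model_step, execute,
                            lambda r: confirm_answer, start)
    return result, executed


print(demo(True))   # ('completed', ['click', 'type_text', 'submit_purchase'])
print(demo(False))  # ('stopped_by_user', ['click', 'type_text'])
```

Each exit path corresponds to one of the stopping conditions named above: completion, an error, a safety response, or a user decision.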
The Gemini 2.5 Computer Use model is primarily optimized for web browsers but also shows strong promise for mobile UI control tasks. It is not yet optimized for desktop OS-level control.
Performance Overview
The Gemini 2.5 Computer Use model performs strongly on several web and mobile control benchmarks.
The table below presents results drawn from self-reported numbers, evaluations run by Browserbase, and evaluations Google ran itself.
Detailed evaluation information can be found in the Gemini 2.5 Computer Use evaluation documentation and in Browserbase's blog post. Unless stated otherwise, the scores shown are for computer use tools exposed through an API.
Benchmark performance overview: Gemini 2.5 Computer Use excels in Online-Mind2Web, WebVoyager, and AndroidWorld benchmarks.