OmniParser V2
OmniParser V2 enhances GUI automation by converting UI screenshots into structured, interpretable elements for large language models (LLMs). This tokenization step lets an LLM predict the next action from the parsed interactable elements in a screenshot. OmniParser V2 improves on its predecessor with higher accuracy in detecting smaller interactable elements and faster inference, thanks to a larger training dataset and lower-latency icon-caption models.
Specifically, OmniParser V2 paired with GPT-4o achieves a state-of-the-art average accuracy of 39.6 on the ScreenSpot Pro grounding benchmark, a significant leap from GPT-4o's original score of 0.8. This version supports a range of advanced LLMs, including models from OpenAI, DeepSeek, Qwen, and Anthropic, enabling enhanced screen understanding, grounding, action planning, and execution.
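The parse-then-prompt loop described above can be sketched in a few lines. This is an illustrative mock-up, not the actual OmniParser V2 API: the `UIElement` type, the element fields, and the prompt layout are all assumptions about how parsed elements might be serialized for an LLM.

```python
from dataclasses import dataclass


@dataclass
class UIElement:
    """One parsed screen element (hypothetical schema)."""
    bbox: tuple          # (x1, y1, x2, y2) pixel coordinates from the detector
    caption: str         # description produced by the icon-caption model
    interactable: bool   # whether the detector flagged it as clickable


def build_prompt(elements: list, goal: str) -> str:
    """Serialize only the interactable elements into an LLM action prompt."""
    lines = [f"[{i}] {e.caption} @ {e.bbox}"
             for i, e in enumerate(elements) if e.interactable]
    return ("Goal: " + goal + "\nElements:\n"
            + "\n".join(lines) + "\nNext action?")


# Example with two mock elements; only the interactable one is kept.
elements = [
    UIElement((0, 0, 24, 24), "close button", True),
    UIElement((5, 40, 80, 60), "company logo", False),
]
prompt = build_prompt(elements, "close the window")
```

The LLM then answers with an element index plus an action, which a driver translates back into a click at the element's bounding box.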
To promote responsible AI use, OmniParser V2 is trained with Responsible AI data so that it avoids inferring sensitive attributes, and it is recommended for use on non-harmful content only. The accompanying OmniTool, a dockerized Windows system, bundles essential tools and safety measures, such as threat-model analysis and sandboxed environments, to support secure and ethical deployment.
Pricing
The pricing model maximizes the long-run earning rate by combining state-dependent pricing with an admission threshold: customers, all sharing the same service valuation (V) and waiting cost (v), do not join the queue when the number of customers in the system reaches the threshold. The optimal threshold is derived and shown to grow with V.
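The section does not spell out the underlying queueing model, but the setup reads like Naor's observable M/M/1 model. Under that assumption, with service rate mu (an assumed parameter, not named in the text), a customer who would become the n-th in the system expects to wait n/mu, so joining is worthwhile only while V >= v * n / mu; the resulting threshold is floor(V * mu / v), which indeed grows with V:

```python
import math


def join_threshold(V: float, v: float, mu: float) -> int:
    """Naor-style joining threshold (sketch under assumed M/M/1 dynamics).

    V  : customer's valuation of service
    v  : waiting cost per unit time
    mu : service rate (assumed parameter)

    The n-th customer in the system expects to spend n/mu units of time,
    so they join only while V >= v * n / mu, i.e. n <= V * mu / v.
    """
    return math.floor(V * mu / v)
```

A revenue-maximizing operator would typically choose a (lower) threshold and set state-dependent prices to extract the surplus, but that optimization is beyond this sketch.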