Ferret-UI: Apple's new LLM for UI manipulation
Ferret-UI offers a convincing vision for the future of mobile UI interaction with a focus on enhanced visual processing, referring, grounding, and reasoning.
Large language models (LLMs) have taken the world by storm, demonstrating remarkable abilities in language comprehension and response. However, a significant hurdle remains: they haven't really been able to understand and interact with mobile user interfaces (UIs). Traditional multimodal LLMs, trained on natural images, falter when presented with the condensed, element-rich world of mobile UI screens. This is where Apple's Ferret-UI comes in, a new multimodal LLM designed specifically to overcome these limitations and achieve a deeper understanding of mobile UI interactions.
The challenges faced by existing LLMs stem from fundamental differences between natural images and mobile UIs. Unlike broad landscapes or detailed portraits, mobile UI screens cram many small elements into elongated aspect ratios. This trips up standard LLMs, limiting their ability to accurately recognize and interpret UI components. Furthermore, even when individual elements are recognized, current LLMs struggle to understand the relationships between them, a crucial aspect of understanding the UI's overall functionality.
Ferret-UI addresses these issues by building on the capabilities of existing multimodal LLMs while adding features tailored to UI screens. Its "any resolution" capability lets it cope with the condensed nature of mobile UI elements: Ferret-UI effectively zooms in by splitting the screen into sub-images based on its aspect ratio and encoding each one, so the enhanced visual features of small building blocks are preserved. For instance, a user could ask Ferret-UI to scroll through an Instagram feed and read aloud only the posts from close friends, saving time otherwise spent scanning social media.
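To make the "any resolution" idea concrete, here is a minimal sketch of aspect-ratio-based screen splitting. It assumes a plain Python environment with Pillow; the function names and the simple two-way split are illustrative, not Apple's actual implementation.

```python
# Minimal sketch of aspect-ratio-based screen splitting, in the spirit of
# Ferret-UI's "any resolution" approach. Names are illustrative only.
from PIL import Image


def split_screen(screenshot: Image.Image) -> list[Image.Image]:
    """Split a UI screenshot into two sub-images along its longer axis,
    so small elements stay legible when each crop is resized for the
    image encoder."""
    w, h = screenshot.size
    if h >= w:  # portrait screen: split horizontally into top/bottom halves
        crops = [screenshot.crop((0, 0, w, h // 2)),
                 screenshot.crop((0, h // 2, w, h))]
    else:       # landscape screen: split vertically into left/right halves
        crops = [screenshot.crop((0, 0, w // 2, h)),
                 screenshot.crop((w // 2, 0, w, h))]
    # The full image plus each crop would then be encoded separately and
    # their features handed to the language model.
    return [screenshot] + crops


if __name__ == "__main__":
    screen = Image.open("home_screen.png")  # hypothetical screenshot file
    for i, sub in enumerate(split_screen(screen)):
        print(i, sub.size)
```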
Beyond enhanced visual processing, Ferret-UI focuses on three key areas to achieve superior mobile UI comprehension: referring, grounding, and reasoning. Referring lets Ferret-UI reason about a specific element that a query points to: if a user selects a region and asks what it is, the model can read its text, classify the widget, or describe what it does. Grounding works in the other direction, bridging the gap between language and visual elements. When a user says "Open the profile menu" or "Click the 'Settings' button," Ferret-UI not only understands the words but also pinpoints the corresponding element on the screen so the action can be carried out. Finally, reasoning takes Ferret-UI beyond basic recognition: it can analyze the relationships between UI elements, inferring the purpose of a button from its position in the layout or the actions it triggers.
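To illustrate the difference, here is a hypothetical sketch of how referring and grounding queries might be phrased. The `query_model` function, the prompt wording, and the coordinate format are assumptions for illustration, not Ferret-UI's actual interface.

```python
# Hypothetical framing of referring vs. grounding queries.
# `query_model` and the prompt formats are illustrative assumptions.

def query_model(image_path: str, prompt: str) -> str:
    """Placeholder for a multimodal model call (not a real API)."""
    raise NotImplementedError


# Referring: a region is part of the input, the answer is text.
referring_prompt = (
    "For the widget in the region [120, 840, 360, 910] (x1, y1, x2, y2), "
    "describe its type and what tapping it would do."
)

# Grounding: a description is the input, the answer is a region.
grounding_prompt = (
    "Find the 'Settings' button on this screen and return its bounding box "
    "as [x1, y1, x2, y2]."
)

# answer = query_model("home_screen.png", grounding_prompt)
# e.g. "[48, 1020, 312, 1090]" -> coordinates a UI driver could tap
```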
The potential applications of Ferret-UI are vast and transformative. Apple appears interested in letting natural language commands dictate mobile app usage: with Ferret-UI, users could navigate apps through voice or text, making interaction more intuitive and hands-free. Ferret-UI could also simulate user interactions with the UI, streamlining the testing process and flagging potential issues efficiently, as sketched below. Additionally, it holds real promise for accessibility: by acting as an interface between voice commands and mobile UI elements, it could help visually impaired users interact with mobile apps far more effectively.
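As a sketch of the UI-testing idea, the snippet below shows how a grounding model could drive a scripted test: each step is written in plain language, grounded to a bounding box, and executed as a tap. The helpers `locate_element` and `tap` are hypothetical placeholders, not part of Ferret-UI or any Apple tooling.

```python
# Minimal sketch of driving automated UI testing with a grounding model.
# `locate_element` and `tap` are hypothetical stand-ins for a grounding
# query and a device driver.
from dataclasses import dataclass


@dataclass
class Box:
    x1: int
    y1: int
    x2: int
    y2: int

    @property
    def center(self) -> tuple[int, int]:
        return (self.x1 + self.x2) // 2, (self.y1 + self.y2) // 2


def locate_element(screenshot: str, description: str) -> Box:
    """Placeholder: a grounding query returning the element's bounding box."""
    raise NotImplementedError


def tap(x: int, y: int) -> None:
    """Placeholder: send a tap to the device or simulator at (x, y)."""
    raise NotImplementedError


def run_step(screenshot: str, instruction: str) -> None:
    """One test step: ground the instruction to a box, then tap its center."""
    box = locate_element(screenshot, instruction)
    tap(*box.center)


# A tester's script could then read like plain language:
# run_step("checkout.png", "the 'Place Order' button")
# run_step("confirmation.png", "the 'Back to Home' link")
```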
In conclusion, Ferret-UI offers a convincing vision for the future of mobile UI interaction. With its focus on enhanced visual processing, referring, grounding, and reasoning, it lays the foundation for a more intuitive and user-friendly mobile experience. As LLM technology continues to evolve, Ferret-UI and similar advances have the potential to fundamentally change how we interact with the mobile devices we have already come to depend on in our daily lives.