Dataconomy

This AI learns to click better than you

The model behind UI-R1 is called Qwen2.5-VL-3B—a 3 billion parameter multimodal model, much smaller than the 7B and 18B giants in the game.

by Kerem Gülen
March 28, 2025
in Research

Artificial intelligence is finally learning how to navigate your phone screen like a human—except faster, smarter, and with shockingly little practice. A new research project from vivo AI Lab and MMLab at the Chinese University of Hong Kong introduces a model called UI-R1, which rethinks how AI agents are trained to understand and interact with graphical user interfaces (GUIs). And here’s the twist: it doesn’t rely on massive datasets or thousands of GPU hours.

Instead, UI-R1 does something refreshingly clever. It learns through reinforcement learning (RL)—not supervised fine-tuning (SFT), the standard method that requires manually labeled data and expensive training cycles. That means no need to feed it tens of thousands of examples of buttons, scroll bars, or text boxes. Just a carefully selected batch of 136 mobile tasks was enough to build a model that performs better than many larger, heavily trained models on real-world screen tasks.

Let’s unpack why this matters and how it works.

So what does UI-R1 actually do?

Picture this: you’re looking at a screenshot of a phone screen and someone tells you to “tap the back button.” You look at the layout, figure out where the back button is, and tap it. Seems easy for a human.

Now imagine training an AI to do that. For years, this has meant training huge multimodal models (models that can understand images and text together) to associate commands like “tap back” with the right spot on the screen. That’s what GUI agents like CogAgent, Aria-GUI, and OS-Atlas do: they learn from large datasets of labeled examples of actions and elements.

But this process is slow, expensive, and doesn’t generalize well. When you move the AI from a phone screen to a desktop interface or a web browser, its performance often tanks. It’s like training a dog to fetch a ball but only in one room of your house—take it outside, and the dog forgets what to do.

UI-R1 changes this. Instead of trying to “memorize” thousands of interface layouts, it learns how to reason about them using reinforcement learning and a clever rule-based reward system.

A smarter reward system, not a bigger model

UI-R1 is built on Qwen2.5-VL-3B, a 3-billion-parameter multimodal model that is much smaller than the 7B and 18B models it competes against. UI-R1 fine-tunes it using RL with a rule-based reward system that doesn’t require human feedback.

This reward function judges the model on three things:

  1. Did it choose the right action type? (Click, scroll, go back, open app, input text)
  2. Did it select the right spot to click? (Coordinates must fall within the correct box)
  3. Did it explain its reasoning clearly and provide a valid final answer? (Using a structured format)

This structured feedback loop helps the model learn to make better predictions over time. Think of it like a game: each time the AI gets closer to the right answer, it scores points based on these rules, and gradually figures out how to win more often.
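The three checks listed above can be sketched in a few lines of Python. This is a minimal illustration rather than the researchers' implementation: the names `Prediction`, `Target`, and `total_reward`, the equal weighting of the three scores, and the exact tag pattern are all assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumed output shape: reasoning inside <think>, final answer inside <answer>.
FORMAT_RE = re.compile(
    r"\s*<think>.+?</think>\s*<answer>.+?</answer>\s*", re.DOTALL
)


@dataclass
class Prediction:
    action: str                            # e.g. "click", "scroll", "back"
    point: Optional[Tuple[float, float]]   # (x, y) for a click, else None
    raw_output: str                        # full text the model emitted


@dataclass
class Target:
    action: str
    box: Optional[Tuple[float, float, float, float]]  # (x1, y1, x2, y2)


def action_reward(pred: Prediction, target: Target) -> float:
    """1.0 if the predicted action type matches the expected one."""
    return 1.0 if pred.action == target.action else 0.0


def coordinate_reward(pred: Prediction, target: Target) -> float:
    """1.0 if the predicted point falls inside the target box."""
    if pred.point is None or target.box is None:
        return 0.0
    x, y = pred.point
    x1, y1, x2, y2 = target.box
    return 1.0 if x1 <= x <= x2 and y1 <= y <= y2 else 0.0


def format_reward(pred: Prediction) -> float:
    """1.0 if the raw output follows the <think>/<answer> structure."""
    return 1.0 if FORMAT_RE.fullmatch(pred.raw_output) else 0.0


def total_reward(pred: Prediction, target: Target) -> float:
    # Equal weighting is an assumption; the article does not give weights.
    return (
        action_reward(pred, target)
        + coordinate_reward(pred, target)
        + format_reward(pred)
    )
```

Because every rule is a plain comparison, the score can be computed automatically for each attempt, which is what lets the training loop run without a human grader.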

Importantly, it’s not just learning to guess—it’s learning to explain why it thinks a certain button is the right one to tap. That’s key for building agents you can trust to operate software, apps, and devices.


Small data, big gains

Here’s where things get wild. UI-R1 was trained on just 136 examples—and it still outperformed many supervised models trained on thousands.

On benchmarks like ScreenSpot and ScreenSpot-Pro, which test how well a model can identify UI elements across platforms (mobile, desktop, and web), UI-R1 delivered grounding accuracies up to 78.6%, beating models like SeeClick (trained on 1 million examples!) and even matching the performance of larger 7B models.

It also aced another benchmark called ANDROIDCONTROL, where it needed to predict both the correct action type and where to apply it. UI-R1 clocked in with an 88.5% average accuracy, outperforming models trained on 76,000 examples—an absurd level of efficiency for just 136 training tasks.
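To make the two kinds of score concrete, here is a rough sketch of how action-type accuracy and grounding accuracy could be computed over a test set. It assumes the common "predicted point lies inside the target box" criterion for grounding; the benchmarks' exact scoring rules are not described in this article, and the function names are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def point_in_box(point: Point, box: Box) -> bool:
    """True if the point lies inside the box (edges included)."""
    x, y = point
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2


def grounding_accuracy(points: List[Point], boxes: List[Box]) -> float:
    """Fraction of predicted points that land inside their target boxes."""
    if len(points) != len(boxes):
        raise ValueError("points and boxes must have the same length")
    if not points:
        return 0.0
    hits = sum(point_in_box(p, b) for p, b in zip(points, boxes))
    return hits / len(points)


def action_type_accuracy(predicted: List[str], expected: List[str]) -> float:
    """Fraction of examples where the predicted action type is correct."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not predicted:
        return 0.0
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(predicted)


# Illustrative data only: three of the four points fall inside their boxes.
points = [(10, 10), (50, 50), (200, 200), (5, 5)]
boxes = [(0, 0, 20, 20), (40, 40, 60, 60), (0, 0, 100, 100), (0, 0, 10, 10)]
print(grounding_accuracy(points, boxes))  # 0.75
```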

That’s like teaching someone chess by showing them just 10 games—and watching them beat the club champion.

Why does this work so well?

A few things set UI-R1 apart:

  • Rule-based rewards: No human reviewers or learned reward model needed. Each output is scored automatically against simple, structured rules.
  • Reinforcement over repetition: Instead of memorizing answers (as in supervised training), UI-R1 learns strategies that generalize.
  • Carefully selected data: The team didn’t just throw in any training examples. They picked tasks that were hard, diverse, and high-quality. No filler.

And perhaps most importantly, the model isn’t just guessing blindly. Thanks to its “reasoning tokens” and structured output format (<think> and <answer> tags), UI-R1 learns to think through each task. That’s what makes it generalize so well to new environments—even with unfamiliar layouts.
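As a small illustration of that structured format, the sketch below splits a response into its reasoning and its final answer. Only the `<think>` and `<answer>` tags come from the description above; the `parse_output` helper and the sample answer text are hypothetical, and the real answer payload may look different.

```python
import re
from typing import NamedTuple, Optional


class ParsedOutput(NamedTuple):
    reasoning: str
    answer: str


_THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)
_ANSWER = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def parse_output(text: str) -> Optional[ParsedOutput]:
    """Return the reasoning and answer, or None if either tag is missing."""
    think = _THINK.search(text)
    answer = _ANSWER.search(text)
    if think is None or answer is None:
        return None
    return ParsedOutput(think.group(1).strip(), answer.group(1).strip())


# The answer text here is made up for illustration.
sample = (
    "<think>The back arrow is at the top left.</think>\n"
    "<answer>click (42, 88)</answer>"
)
parsed = parse_output(sample)
```

A response that skips either tag returns `None`, so a malformed answer can be rejected before anything is clicked.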

What does this mean for AI interfaces?

This could be the beginning of a new wave of generalist GUI agents. Instead of training bespoke models for each app, platform, or task, we might be able to build compact, adaptable models like UI-R1 that can reason through any screen, any device, any instruction.

  • For developers, it means lower costs, less data, and faster iteration.
  • For users, it could mean smarter virtual assistants that actually understand what you want to do on your screen.
  • For researchers, it’s proof that reinforcement learning with simple rule-based rewards isn’t just for games and math problems. It’s a real alternative to SFT for interface tasks.

It’s still early

While UI-R1’s results are impressive, there’s more to be done. For example, it still requires clean input formats and carefully written prompts. It also assumes that the device screenshots and instructions are reasonably aligned—a safe assumption in a benchmark setting, but trickier in the messy real world.

Still, it’s a major step forward.

And perhaps most excitingly, it shows that smarter training beats bigger models—at least when it comes to understanding what’s on your screen and figuring out how to act.

In a world where we’re surrounded by increasingly complex software, AI like UI-R1 might soon be the one clicking, scrolling, and tapping on our behalf—with precision, reason, and barely any training at all.

