This AI learns to click better than you

The model behind UI-R1 is called Qwen2.5-VL-3B—a 3-billion-parameter multimodal model, much smaller than the 7B and 18B giants in the game.

by Kerem Gülen
March 28, 2025
in Research

Artificial intelligence is finally learning how to navigate your phone screen like a human—except faster, smarter, and with shockingly little practice. A new research project from vivo AI Lab and MMLab at the Chinese University of Hong Kong introduces a model called UI-R1, which rethinks how AI agents are trained to understand and interact with graphical user interfaces (GUIs). And here’s the twist: it doesn’t rely on massive datasets or thousands of GPU hours.

Instead, UI-R1 does something refreshingly clever. It learns through reinforcement learning (RL)—not supervised fine-tuning (SFT), the standard method that requires manually labeled data and expensive training cycles. That means no need to feed it tens of thousands of examples of buttons, scroll bars, or text boxes. Just a carefully selected batch of 136 mobile tasks was enough to build a model that performs better than many larger, heavily trained models on real-world screen tasks.

Let’s unpack why this matters and how it works.


So what does UI-R1 actually do?

Picture this: you’re looking at a screenshot of a phone screen and someone tells you to “tap the back button.” You look at the layout, figure out where the back button is, and tap it. Seems easy for a human.

Now imagine training an AI to do that. For years, this has meant training huge multimodal models (models that can understand images and text together) to associate commands like “tap back” with the right spot on the screen. That’s what GUI agents like CogAgent, Aria-GUI, and OS-Atlas do—they learn from huge datasets with labeled examples of actions and elements.

But this process is slow, expensive, and doesn’t generalize well. When you move the AI from a phone screen to a desktop interface or a web browser, its performance often tanks. It’s like training a dog to fetch a ball but only in one room of your house—take it outside, and the dog forgets what to do.

UI-R1 changes this. Instead of trying to “memorize” thousands of interface layouts, it learns how to reason about them using reinforcement learning and a clever rule-based reward system.

A smarter reward system, not a bigger model

The model behind UI-R1 is called Qwen2.5-VL-3B—a 3-billion-parameter multimodal model, much smaller than the 7B and 18B giants in the game. But the team fine-tunes it using RL with a rule-based reward system that requires no human feedback.

This reward function judges the model on three things (a code sketch follows the list):

  1. Did it choose the right action type? (Click, scroll, go back, open app, input text)
  2. Did it select the right spot to click? (Coordinates must fall within the correct box)
  3. Did it explain its reasoning clearly and provide a valid final answer? (Using a structured format)
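
To make this concrete, here is a minimal Python sketch of a reward in that style. The <think>/<answer> tag format follows the article's description; the action names, the answer schema, and the one-point-per-rule weighting are illustrative assumptions, not the paper's exact implementation.

```python
import re

# Action vocabulary from the article; the exact strings are assumptions.
VALID_ACTIONS = {"click", "scroll", "back", "open_app", "input_text"}

def parse_answer(answer):
    """Pull an action name and an optional (x, y) point out of the answer text."""
    action_m = re.search(r"action:\s*(\w+)", answer)
    point_m = re.search(r"point:\s*\(([\d.]+),\s*([\d.]+)\)", answer)
    action = action_m.group(1) if action_m else None
    point = (float(point_m.group(1)), float(point_m.group(2))) if point_m else None
    return action, point

def rule_based_reward(response, gt_action, gt_box):
    """Score one response: +1 format, +1 correct action type, +1 click in the box."""
    score = 0.0
    # Rule 3: reasoning must sit in <think> tags, the final answer in <answer> tags.
    m = re.search(r"<think>.+?</think>\s*<answer>(.+?)</answer>", response, re.DOTALL)
    if m is None:
        return score
    score += 1.0
    action, point = parse_answer(m.group(1))
    # Rule 1: the predicted action type must match the ground truth.
    if action in VALID_ACTIONS and action == gt_action:
        score += 1.0
    # Rule 2: for clicks, the predicted point must land inside the target box.
    if gt_action == "click" and point is not None:
        x, y = point
        x1, y1, x2, y2 = gt_box
        if x1 <= x <= x2 and y1 <= y <= y2:
            score += 1.0
    return score

# A well-formed response that clicks inside the target box earns the full score.
resp = ("<think>The back arrow sits in the top-left corner.</think>"
        "<answer>action: click, point: (34, 56)</answer>")
print(rule_based_reward(resp, "click", (20, 40, 60, 80)))  # -> 3.0
```

Every signal here is computable from a rule, with no human rater in the loop; the paper's real reward may weight or combine these terms differently.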

This structured feedback loop helps the model learn to make better predictions over time. Think of it like a game: each time the AI gets closer to the right answer, it scores points based on these rules, and gradually figures out how to win more often.

Importantly, it’s not just learning to guess—it’s learning to explain why it thinks a certain button is the right one to tap. That’s key for building agents you can trust to operate software, apps, and devices.


Small data, big gains

Here’s where things get wild. UI-R1 was trained on just 136 examples—and it still outperformed many supervised models trained on thousands.

On benchmarks like ScreenSpot and ScreenSpot-Pro, which test how well a model can identify UI elements across platforms (mobile, desktop, and web), UI-R1 delivered grounding accuracies up to 78.6%, beating models like SeeClick (trained on 1 million examples!) and even matching the performance of larger 7B models.

It also aced another benchmark called AndroidControl, where it needed to predict both the correct action type and where to apply it. UI-R1 clocked in with an 88.5% average accuracy, outperforming models trained on 76,000 examples—an absurd level of efficiency for just 136 training tasks.

That’s like teaching someone chess by showing them just 10 games—and watching them beat the club champion.

Why does this work so well?

A few things set UI-R1 apart:

  • Rule-based rewards: No need for labeled data or human reviewers. The model scores itself based on simple, structured rules.
  • Reinforcement over repetition: Instead of memorizing answers (as in supervised training), UI-R1 learns strategies that generalize (see the toy sketch after this list).
  • Carefully selected data: The team didn’t just throw in any training examples. They picked tasks that were hard, diverse, and high-quality. No filler.
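
To see the difference between reward-driven learning and memorization, here is a deliberately tiny, self-contained toy (not the paper's code): a softmax policy over a handful of candidate screen elements is trained with a REINFORCE-style update from a 0/1 rule-based reward, without ever being shown a labeled answer.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_toy_policy(n_elements=5, target=2, steps=500, lr=0.5, seed=0):
    """Learn which of n_elements to click, purely from a 0/1 rule-based reward."""
    random.seed(seed)
    logits = [0.0] * n_elements  # one logit per candidate element on screen
    for _ in range(steps):
        probs = softmax(logits)
        choice = random.choices(range(n_elements), weights=probs)[0]
        reward = 1.0 if choice == target else 0.0  # "did the click land in the box?"
        # REINFORCE: raise the log-probability of the sampled choice when rewarded.
        for i in range(n_elements):
            grad = (1.0 if i == choice else 0.0) - probs[i]
            logits[i] += lr * reward * grad
    return softmax(logits)

print(train_toy_policy())  # probability mass concentrates on element 2
```

UI-R1 operates at a vastly larger scale, with a 3B-parameter policy and richer rewards, but the principle is the same: the learning signal is a score, not a copied answer.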

And perhaps most importantly, the model isn’t just guessing blindly. Thanks to its “reasoning tokens” and structured output format (<think> and <answer> tags), UI-R1 learns to think through each task. That’s what makes it generalize so well to new environments—even with unfamiliar layouts.
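
For concreteness, a structured response in this style might look like the following. The tag names come from the article; the wording and the answer schema are illustrative:

```python
# Illustrative example only; the answer schema is an assumption, not the paper's.
example_response = (
    "<think>The instruction is 'tap the back button'. The chevron icon in the "
    "top-left corner of the screenshot is the back control, so the action is a "
    "click at its center, around (34, 56).</think>"
    "<answer>action: click, point: (34, 56)</answer>"
)
```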

What does this mean for AI interfaces?

This could be the beginning of a new wave of generalist GUI agents. Instead of training bespoke models for each app, platform, or task, we might be able to build compact, adaptable models like UI-R1 that can reason through any screen, any device, any instruction.

  • For developers, it means lower costs, less data, and faster iteration.
  • For users, it could mean smarter virtual assistants that actually understand what you want to do on your screen.
  • For researchers, it’s proof that reinforcement learning with simple rule-based rewards isn’t just for games and math problems—it’s a real alternative to SFT for interface tasks.

It’s still early

While UI-R1’s results are impressive, there’s more to be done. For example, it still requires clean input formats and carefully written prompts. It also assumes that the device screenshots and instructions are reasonably aligned—a safe assumption in a benchmark setting, but trickier in the messy real world.

Still, it’s a major step forward.

And perhaps most excitingly, it shows that smarter training beats bigger models—at least when it comes to understanding what’s on your screen and figuring out how to act.

In a world where we’re surrounded by increasingly complex software, AI like UI-R1 might soon be the one clicking, scrolling, and tapping on our behalf—with precision, reason, and barely any training at all.

