Microsoft Research released Fara-7B, a 7-billion-parameter agentic small language model for computer use, capable of executing tasks locally from screenshots.
Fara-7B functions as an open-weight Computer Use Agent, predicting mouse and keyboard actions directly from screenshots. Its compact size allows execution on a single user device, which reduces latency and keeps browsing data local. Unlike conventional chat-oriented Large Language Models (LLMs) that only generate text, Computer Use Agents like Fara-7B control browser or desktop interfaces to complete tasks such as form filling, travel booking, or price comparison. They interpret the screen, analyze page layouts, and then produce low-level actions such as clicks, scrolls, typed text, web searches, and URL visits.
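To make that loop concrete, here is a minimal sketch of the perceive-think-act cycle such an agent runs. The three callables are placeholders for the capture, model, and browser layers a deployment would supply; they are not Fara-7B's actual API.

```python
from typing import Callable

# Sketch of a Computer Use Agent loop: screenshot in, grounded action out.
def run_agent(
    goal: str,
    take_screenshot: Callable[[], bytes],                          # capture layer
    predict_step: Callable[[str, bytes, list], tuple[str, dict]],  # the model
    execute: Callable[[dict], None],                               # browser layer
    max_steps: int = 30,
) -> list:
    history: list[tuple[str, dict]] = []  # prior thoughts/actions, replayed as context
    for _ in range(max_steps):
        screenshot = take_screenshot()               # raw pixels, no accessibility tree
        thought, action = predict_step(goal, screenshot, history)
        history.append((thought, action))
        if action.get("name") == "terminate":        # the model signals completion
            break
        execute(action)                              # click, type, scroll, visit_url, ...
    return history
```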
Many current systems utilize large multimodal models integrated with complex scaffolding that analyzes accessibility trees and coordinates various tools. This increases latency and often necessitates server-side deployment. Fara-7B condenses the functionality of such multi-agent systems into a single multimodal decoder-only model, built upon Qwen2.5-VL-7B. It processes browser screenshots and text context, then generates thought text followed by a tool call with grounded arguments, such as coordinates, text, or URLs.
The primary constraint for Computer Use Agents is data: high-quality logs of multi-step human web interactions are scarce and expensive to acquire. The Fara project addresses this with FaraGen, a synthetic data engine that generates and filters web trajectories on live sites.
FaraGen employs a three-stage pipeline. Task Proposal begins with seed URLs from public corpora like ClueWeb22 and Tranco, categorized into domains such as e-commerce, travel, entertainment, or forums. Large language models convert each URL into realistic user tasks, for example, booking specific movie tickets or creating a shopping list with review and material constraints. Tasks must be achievable without login or paywall, fully specified, useful, and automatically verifiable.
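A hedged sketch of the proposal stage is shown below: a seed URL becomes candidate tasks, which are kept only if they satisfy the stated criteria. The prompt wording and the generic `llm` callable are illustrative, not the actual FaraGen implementation.

```python
import json
from typing import Callable

# Illustrative task-proposal step: the prompt encodes FaraGen's stated
# constraints, but is our paraphrase, not the real FaraGen prompt.
PROPOSAL_PROMPT = """You are given this web page URL: {url}
Propose three realistic user tasks doable on this site. Each task must be:
- achievable without login or paywall
- fully specified, with no ambiguity about success
- useful to a real user
- automatically verifiable from the final page state
Return a JSON list of task strings."""

def propose_tasks(url: str, llm: Callable[[str], str]) -> list[str]:
    raw = llm(PROPOSAL_PROMPT.format(url=url))
    tasks = json.loads(raw)  # assumes the model returns valid JSON
    return [t for t in tasks if isinstance(t, str) and t.strip()]
```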
Task Solving uses a multi-agent system based on Magentic-One and Magentic-UI. An Orchestrator agent plans high-level strategy and maintains task state. A WebSurfer agent receives accessibility trees and Set-of-Marks screenshots, then issues browser actions via Playwright, including click, type, scroll, visit_url, or web_search. A UserSimulator agent provides follow-up instructions for tasks requiring clarification.
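The sketch below shows how a WebSurfer-style executor could map named actions onto Playwright calls. The dispatch code is illustrative rather than Magentic-One's implementation, though the Playwright APIs themselves are real.

```python
from urllib.parse import quote_plus
from playwright.sync_api import sync_playwright

def execute_action(page, action: dict) -> None:
    """Dispatch one named browser action onto Playwright primitives."""
    name, args = action["name"], action.get("args", {})
    if name == "left_click":
        page.mouse.click(args["x"], args["y"])         # pixel coordinates on the page
    elif name == "type":
        page.keyboard.type(args["text"])
    elif name == "scroll":
        page.mouse.wheel(0, args.get("delta_y", 400))  # positive delta scrolls down
    elif name == "visit_url":
        page.goto(args["url"])
    elif name == "web_search":
        page.goto("https://www.bing.com/search?q=" + quote_plus(args["query"]))
    elif name == "history_back":
        page.go_back()

with sync_playwright() as p:
    page = p.chromium.launch().new_page()
    execute_action(page, {"name": "visit_url", "args": {"url": "https://example.com"}})
```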
Trajectory Verification uses three LLM-based verifiers. An Alignment Verifier checks that actions and final answers match the task intent. A Rubric Verifier generates a rubric of subgoals and scores partial completion. A Multimodal Verifier inspects screenshots and the final answer to detect hallucinations and confirm that visible evidence supports success. The verifiers agree with human labels in 83.3 percent of cases, with reported false positive and false negative rates of roughly 17 to 18 percent.
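Assuming a conjunctive decision rule, which the article does not spell out, the filter could look like this sketch: a trajectory survives only if the alignment and multimodal checks pass and the rubric score clears a threshold.

```python
from dataclasses import dataclass

@dataclass
class Verdicts:
    aligned: bool        # Alignment Verifier: actions/answer match task intent
    rubric_score: float  # Rubric Verifier: fraction of subgoals completed
    grounded: bool       # Multimodal Verifier: screenshots support the claimed success

# The conjunctive rule and the threshold of 1.0 (all subgoals met) are
# assumptions for illustration, not the published filtering criteria.
def keep_trajectory(v: Verdicts, rubric_threshold: float = 1.0) -> bool:
    return v.aligned and v.grounded and v.rubric_score >= rubric_threshold
```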
After filtering, FaraGen produces 145,603 trajectories comprising 1,010,797 steps across 70,117 unique domains. Trajectories range from 3 to 84 steps, averaging 6.9 steps, and the ratio of roughly 0.48 unique domains per trajectory indicates that many tasks involve sites appearing nowhere else in the dataset. Generating data with premium models like GPT-5 and o3 costs approximately $1 per verified trajectory.
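The headline averages follow directly from these counts:

```python
# Arithmetic behind the dataset statistics quoted above.
trajectories, steps, domains = 145_603, 1_010_797, 70_117
print(steps / trajectories)    # ≈ 6.94 steps per trajectory
print(domains / trajectories)  # ≈ 0.48 unique domains per trajectory
print(trajectories * 1.0)      # ≈ $145,600 total at ~$1 per verified trajectory
```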
Fara-7B is a multimodal decoder-only model utilizing Qwen2.5-VL-7B as its base. It processes a user goal, current browser screenshots, and the full history of prior thoughts and actions. The context window supports 128,000 tokens. At each step, the model first generates a chain of thought detailing the current state and plan, then outputs a tool call specifying the next action and its arguments.
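Qwen-family models conventionally wrap tool calls in `<tool_call>` tags around a JSON payload; assuming Fara-7B follows that convention, a step output could be parsed as in this sketch. The exact serialization is an assumption.

```python
import json
import re

def parse_step(output: str) -> tuple[str, dict]:
    """Split a step output into chain-of-thought text and one tool call."""
    # Greedy match assumes a single tool call per step output.
    match = re.search(r"<tool_call>\s*(\{.*\})\s*</tool_call>", output, re.DOTALL)
    if match is None:
        raise ValueError("no tool call found in model output")
    thought = output[: match.start()].strip()  # the reasoning precedes the call
    action = json.loads(match.group(1))
    return thought, action

thought, action = parse_step(
    'The search box is near the top of the page. I will click it.\n'
    '<tool_call>{"name": "left_click", "args": {"x": 312, "y": 88}}</tool_call>'
)
```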
The tool space aligns with the Magentic-UI computer_use interface, encompassing key, type, mouse_move, left_click, scroll, visit_url, web_search, history_back, pause_and_memorize_fact, wait, and terminate. Coordinates are predicted directly as pixel positions on the screenshot, enabling the model to operate without accessibility tree access during inference.
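For illustration, two of these tools might be declared in the JSON-schema style commonly used for function calling; the field layout below is an assumption, not the literal Magentic-UI computer_use definitions.

```python
# Hypothetical schemas for two tools from the action space above.
TOOLS = [
    {
        "name": "left_click",
        "description": "Click at a pixel coordinate on the current screenshot.",
        "parameters": {
            "type": "object",
            "properties": {"x": {"type": "integer"}, "y": {"type": "integer"}},
            "required": ["x", "y"],
        },
    },
    {
        "name": "pause_and_memorize_fact",
        "description": "Store a fact needed later in the task.",
        "parameters": {
            "type": "object",
            "properties": {"fact": {"type": "string"}},
            "required": ["fact"],
        },
    },
]
```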
Training involved supervised finetuning over approximately 1.8 million samples, mixing multiple data sources. These include FaraGen trajectories broken into observe-think-act steps, grounding and UI localization tasks, screenshot-based visual question answering and captioning, and safety and refusal datasets.
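One way to picture an observe-think-act sample is the chat-message layout below. The format is a guess at how such data could be structured, since the article does not publish Fara's training schema.

```python
import json

def to_sft_sample(goal: str, screenshot_path: str, history: list,
                  thought: str, action: dict) -> dict:
    """Assemble one hypothetical observe-think-act training example."""
    return {
        "messages": [
            {"role": "system", "content": "You are a computer-use agent."},
            {"role": "user", "content": [
                {"type": "text", "text": f"Goal: {goal}\nHistory: {history}"},
                {"type": "image", "path": screenshot_path},  # current screenshot
            ]},
            # Target: reasoning first, then the grounded tool call.
            {"role": "assistant",
             "content": f"{thought}\n<tool_call>{json.dumps(action)}</tool_call>"},
        ]
    }
```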
Microsoft conducted evaluations of Fara-7B across four live web benchmarks: WebVoyager, Online-Mind2Web, DeepShop, and the new WebTailBench. WebTailBench focuses on underrepresented segments like restaurant reservations, job applications, real estate search, comparison shopping, and multi-site compositional tasks.
On these benchmarks, Fara-7B achieved 73.5 percent success on WebVoyager, 34.1 percent on Online-Mind2Web, 26.2 percent on DeepShop, and 38.4 percent on WebTailBench. This exceeds the 7B Computer Use Agent baseline UI-TARS-1.5-7B, which scored 66.4, 31.3, 11.6, and 19.5 respectively, and compares favorably to larger systems such as OpenAI computer-use-preview and SoM Agent configurations built on GPT-4o.
On WebVoyager, Fara-7B uses an average of 124,000 input tokens and 1,100 output tokens per task, with approximately 16.5 actions. Using market token prices, the research team estimates an average cost of $0.025 per task, compared to around $0.30 for SoM agents backed by proprietary reasoning models like GPT-5 and o3. Fara-7B consumes a similar number of input tokens but roughly one-tenth the output tokens of these SoM agents.
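A back-of-envelope check reproduces that figure. The per-million-token prices below are illustrative placeholders; only the token counts and the roughly $0.025 result come from the article.

```python
# Cost per WebVoyager task under assumed hosted-7B token prices.
input_tokens, output_tokens = 124_000, 1_100
price_in, price_out = 0.18, 1.50  # assumed $ per 1M tokens, input / output
cost = (input_tokens * price_in + output_tokens * price_out) / 1e6
print(f"${cost:.3f} per task")    # ≈ $0.024, close to the reported $0.025
```

Key figures at a glance: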
- Fara-7B: A 7B parameter, open-weight Computer Use Agent built on Qwen2.5-VL-7B.
- Operation: Operates directly from screenshots and text, outputs grounded actions without accessibility trees at inference time.
- Training Data: 145,603 verified browser trajectories and 1,010,797 steps generated by the FaraGen pipeline across 70,117 domains.
- Benchmark Success (WebVoyager): 73.5 percent.
- Benchmark Success (Online-Mind2Web): 34.1 percent.
- Benchmark Success (DeepShop): 26.2 percent.
- Benchmark Success (WebTailBench): 38.4 percent.
- Cost on WebVoyager: Approximately $0.025 per task, using 124,000 input tokens and 1,100 output tokens.
- Output Token Efficiency: Around an order of magnitude cheaper in output token usage than SoM agents backed by GPT-5 class models.
Fara-7B represents a step toward practical Computer Use Agents that run on local hardware at reduced inference cost while preserving privacy. The combination of the Qwen2.5-VL-7B base, FaraGen synthetic trajectories, and WebTailBench provides a pathway from multi-agent data generation to a single, compact model that matches or surpasses larger systems on key benchmarks, while incorporating Critical Point safeguards, which pause for user confirmation before consequential actions such as purchases, alongside refusal training.





