Ai2 Drops MolmoWeb - Open-Source Web Agent Beats GPT-4o

Ai2's MolmoWeb is a fully open-source web agent that navigates browsers by screenshot alone, beating GPT-4o-based agents at the 8B scale with weights, training data, and code all released under Apache 2.0.


The Allen Institute for AI shipped MolmoWeb today - an open-source web browsing agent that controls a browser the same way a human does: by looking at screenshots, reasoning about what's on screen, and clicking. No HTML parsing. No accessibility trees. Just pixels and a model smart enough to navigate them.

The release covers everything: model weights, training code, the full dataset pipeline, and a live demo. It's Apache 2.0 across the board. That puts it in a different category from OpenAI Operator, Google Project Mariner, and Anthropic's computer use - all of which remain closed or API-only.

Key Specs

| Spec | Value |
|---|---|
| Models | MolmoWeb-4B, MolmoWeb-8B |
| Language backbone | Qwen3-8B (8B variant) |
| Vision encoder | SigLIP 2 (google/siglip-so400m-patch14-384) |
| WebVoyager (8B) | 78.2% pass@1, 94.7% pass@4 |
| License | Apache 2.0 |
| Code | github.com/allenai/molmoweb |
| Weights | huggingface.co/allenai/MolmoWeb-8B |

How It Works

MolmoWeb runs a tight loop. It receives a natural-language task, takes a screenshot of the current browser state, generates a reasoning chain about what to do next, then executes one action using normalized screen coordinates. Repeat until done.

Supported actions: click, type, scroll, navigate to URL, open new tab. The model maintains an action history across steps, so it can recover when pages redirect unexpectedly or when a click lands on the wrong element.
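The loop described above can be sketched in a few lines. This is an illustrative reconstruction, not Ai2's actual API: the class and function names (`Action`, `AgentState`, `run_episode`, `to_pixels`) are assumptions, and the model is stubbed out. It shows the shape of the screenshot-in, one-action-out cycle, the running action history, and how normalized coordinates map back to a concrete viewport.

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    kind: str                # "click", "type", "scroll", "navigate", "new_tab", "done"
    x: float = 0.0           # normalized [0, 1] screen coordinates
    y: float = 0.0
    text: str = ""

@dataclass
class AgentState:
    task: str
    history: list = field(default_factory=list)  # prior actions, kept for recovery

def to_pixels(action: Action, width: int, height: int) -> tuple[int, int]:
    """Map normalized coordinates onto the actual viewport size."""
    return round(action.x * width), round(action.y * height)

def run_episode(task, take_screenshot, model, execute, max_steps=30):
    """One task: screenshot -> reason -> act, until done or step budget spent."""
    state = AgentState(task=task)
    for _ in range(max_steps):
        shot = take_screenshot()                   # pixels only: no HTML, no a11y tree
        action = model(task, shot, state.history)  # reasoning chain -> one action
        state.history.append(action)
        if action.kind == "done":
            break
        execute(action)
    return state.history
```

Because the model sees the full action history each step, a redirect or a misfired click shows up as a mismatch between the last action and the new screenshot, which is what makes recovery possible.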

No HTML, No Accessibility Tree

The decision to operate purely from screenshots is deliberate. Accessibility trees are brittle - single-page apps often break them, and many dynamic elements aren't exposed at all. A screenshot-based agent sees exactly what the user sees, including canvas elements, custom widgets, and anything that doesn't have proper ARIA markup.

The tradeoff is harder text reading. OCR-like errors from compressed screenshots are a documented limitation, and the agent can misread small type or compressed images.

Training Without Distillation

MolmoWeb wasn't trained by copying traces from proprietary vision models. Ai2 generated synthetic trajectories from text-only accessibility-tree agents, augmented with human demonstrations. That matters for licensing: there's no question of terms-of-service violation or inherited restrictions from GPT-4o or Claude.

The resulting dataset, MolmoWebMix, is the largest public human web interaction dataset released to date:

  • MolmoWeb-HumanTrajs: 36,000 human task trajectories across 1,100+ websites, covering 590,000+ individual subtask demonstrations
  • MolmoWeb-SyntheticTrajs: 108,000 synthetically generated trajectories from accessibility-tree agents and multi-agent pipelines
  • MolmoWeb-SyntheticGround: 362,000 screenshot QA pairs (~2.2 million total examples) for GUI perception training

All three datasets are on Hugging Face under the allenai/molmoweb-data collection.

[Image] The MolmoWeb demo interface, running on molmoweb.allen.ai. The agent reads the screen as a human would and outputs normalized click coordinates. Source: allenai.org

Benchmark Results

The 8B model sits above every publicly available open model at comparable scale and lands within range of agents built on frontier closed models.

| Agent | WebVoyager | Notes |
|---|---|---|
| OpenAI CUA | 87.0% | Closed, GPT-4o based |
| Surfer-H + Holo1-7B | 92.2% | Closed system |
| MolmoWeb-8B | 78.2% | Open weights + data |
| Microsoft Fara-7B | 73.5% | Open weights, less training data |
| UI-TARS-1.5-7B | ~72% | Open weights |
| Holo1-7B (standalone) | ~70% | Open weights |
| MolmoWeb-4B | Beats Fara-7B | At matching step budgets |

On Online-Mind2Web, the 8B model reaches 35.3% pass@1 - rising to 60.5% with pass@4 test-time scaling. On DeepShop, it hits 42.3%. On WebTailBench, 49.5%. The ScreenSpot v2 visual grounding numbers exceed both Claude 3.7 Computer Use and OpenAI CUA.

The pass@4 jumps are the more interesting signal. WebVoyager goes from 78.2% to 94.7% with four attempts per task. That's test-time compute scaling working on an 8B model, which means the failure modes are often recoverable, not fundamental.
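As a back-of-envelope check on what those numbers imply: if the four attempts were fully independent, a 78.2% per-attempt success rate would predict a much higher pass@4 than the 94.7% reported, so some failures evidently repeat across retries. The arithmetic (my own sketch, not from the paper):

```python
# Back-of-envelope: pass@k under an (optimistic) independence assumption,
# using the WebVoyager numbers quoted above.

def pass_at_k(p1: float, k: int) -> float:
    """Probability of at least one success in k independent attempts."""
    return 1.0 - (1.0 - p1) ** k

predicted = pass_at_k(0.782, 4)   # ~0.998 if attempts were independent
observed = 0.947                  # reported pass@4

# The gap between predicted and observed is a rough measure of how
# correlated the agent's failures are across retries.
print(f"independent-attempt prediction: {predicted:.3f}, observed: {observed}")
```

The gap says the retries aren't free rerolls, but the large jump from 78.2% still means many first-attempt failures are flukes rather than tasks the model fundamentally can't do.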

[Image] Benchmark comparison from the Ai2 blog post. MolmoWeb-8B beats open-weight alternatives at the same scale across multiple web navigation tasks. Source: allenai.org

The Ai2 Context

This release lands during one of the messier weeks in Ai2's recent history. CEO Ali Farhadi stepped down on March 12, replaced on an interim basis by Peter Clark - the same researcher who held the role between Oren Etzioni and Farhadi. Today, Microsoft announced it had hired Farhadi, along with Hanna Hajishirzi (who leads OLMo development at Ai2), Ranjay Krishna, and COO Sophie Lebrecht, all for Mustafa Suleyman's superintelligence team at Microsoft AI.

The departure has a structural cause: the Fund for Science and Technology (FFST), the $3.1 billion foundation that funds Ai2 per Paul Allen's instructions, is shifting from guaranteed annual grants to proposal-based processes that favor applied work over frontier model research. The board acknowledged publicly that "the cost to do extreme-scale open model research is extraordinary" and "really hard to do inside a nonprofit."

Ai2 isn't shutting down. All 2026 programs remain funded, and OLMo, Molmo, and the agent work continue. But the MolmoWeb release reads partly as a capstone - a statement from a team doing serious work while it still can.

Peter Clark's statement on the direction: "Our mission remains unchanged: advancing AI research and engineering for the common good, and turning our open breakthroughs into lasting, real-world impact."

[Image] The Allen Institute for AI is a Seattle-based nonprofit research institute founded in 2014 by Paul Allen. Source: Wikimedia Commons

Ai2's decision to release MolmoWeb as fully open - including training data - is a deliberate contrast to the approach taken by most of the field. Anthropic's computer use remains API-gated. OpenAI Operator is a product, not a research artifact. Anthropic's acquisition of Vercept last month brought in Allen Institute alumni specifically to improve closed computer use capabilities.

Ai2 is releasing the recipe, not just the dish.

What To Watch

OCR and Timing Sensitivity

Screenshot-based navigation has hard limits. Compressed images, small fonts, and high-DPI displays all introduce reading errors. Pages that load content progressively can catch the agent mid-render. Ai2 acknowledges both in the paper, and the live demo is limited to a whitelist of supported sites for exactly this reason.

No Login Flows or Financial Transactions

The training deliberately excluded authentication flows and financial transactions. That's a safety call, but it's also a significant scope limit. Most real enterprise workflows touch credentials or payments at some point.

The Leadership Gap

With Farhadi and Hajishirzi both moving to Microsoft, the researchers most associated with the OLMo and Molmo open model lines are now building proprietary systems. Whether Ai2 can sustain the same pace of releases under new leadership - and a funding model that rewards applied work - is the open question.


The models are on Hugging Face at allenai/MolmoWeb-4B and allenai/MolmoWeb-8B. The code is at github.com/allenai/molmoweb. The paper is arXiv 2601.10611.

About the author

Sophie, AI Infrastructure & Open Source Reporter, is a journalist and former systems engineer who covers AI infrastructure, open-source models, and the developer tooling ecosystem.