Name: Qwen3.7-Plus
Author: Alibaba

Overview

Qwen3.7-Plus reached general availability on June 1, 2026, and represents a meaningful shift in what Alibaba is building with the Qwen family. Where Qwen3.7-Max is a text-only reasoning engine optimized for coding and agentic tasks, Plus is designed around visual perception - it accepts screenshots, document images, and video frames as input and can ground its outputs in what it actually sees on screen. The practical payoff: it can plan and execute multi-step GUI workflows with a precision that text-only models can't match.

TL;DR

Best-in-class GUI grounding at launch: 79.0 on ScreenSpot Pro, roughly 12 points ahead of GPT-5.4's 67.4
$0.40/M input, $1.60/M output - roughly six times cheaper on input than Qwen3.7-Max; proprietary API-only, no open weights
Weaker on pure-text software engineering (SWE-Bench Pro ~57-60%) than Max; strongest value proposition is visual automation

The architecture extends the Qwen3.7-Max language backbone with a multimodal perception layer that handles text, images, and video as a unified input stream. Outputs are text-only - this isn't a generative image model. Think of it as adding eyes to a system that already knew how to reason. The model can identify pixel-level coordinates for UI interactions ("click at x=487, y=232") and create structured action plans from screenshots, making it a direct competitor to Claude's Computer Use and GPT-5.4's GUI agent capabilities.
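A minimal sketch of how a client might consume a grounded action like the one quoted above. It assumes the model emits one action per line in exactly the "click at x=487, y=232" form; the real output schema is not documented in this profile, so `ACTION_PATTERN`, `GuiAction`, and `parse_action` are hypothetical names used only for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical action format, modeled on the "click at x=487, y=232" example.
ACTION_PATTERN = re.compile(r"^(?P<verb>[a-z_]+) at x=(?P<x>\d+), y=(?P<y>\d+)$")


@dataclass(frozen=True)
class GuiAction:
    verb: str
    x: int
    y: int


def parse_action(text: str, width: int, height: int) -> GuiAction:
    """Parse one action line and reject coordinates outside the screenshot."""
    match = ACTION_PATTERN.match(text.strip().lower())
    if match is None:
        raise ValueError(f"unrecognized action: {text!r}")
    action = GuiAction(match["verb"], int(match["x"]), int(match["y"]))
    # Coordinates are assumed to be absolute pixels in the submitted screenshot.
    if not (0 <= action.x < width and 0 <= action.y < height):
        raise ValueError(
            f"coordinates {action.x},{action.y} fall outside {width}x{height}"
        )
    return action
```

Validating coordinates against the screenshot size before acting is a cheap guard: a grounding error that lands off-screen is caught before any click is issued.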

Notably, Plus ships as a proprietary, API-only model, with no open weights available at launch. That's a break from Alibaba's pattern with the Qwen series, which had built a strong reputation for releasing model checkpoints under permissive licenses. Whether open-weight variants follow in Q3 2026 remains unconfirmed. For teams building on sovereign infrastructure or needing self-hosted deployment, this is a hard blocker.

Key Specifications

Specification	Details
Provider	Alibaba
Model Family	Qwen
Parameters	Not disclosed
Context Window	1M tokens
Max Output	65,536 tokens
Max Thinking Tokens	256K tokens
Input Price	$0.40/M tokens
Output Price	$1.60/M tokens
Cached Input	$0.04-$0.08/M tokens
Release Date	June 1, 2026
License	Proprietary (API-only)
Modalities	Text + image + video input; text output
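The limits in the table can be turned into a simple pre-flight check. This is a sketch under stated assumptions: "1M" and "256K" are read as decimal round numbers, each limit is checked independently (the profile does not say whether thinking or output tokens count against the context window), and `ModelLimits` and `check_request` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelLimits:
    context_window: int = 1_000_000  # "1M tokens", read as 10**6 (assumption)
    max_output: int = 65_536         # as listed in the table
    max_thinking: int = 256_000      # "256K tokens", read as 256 * 1000 (assumption)


def check_request(input_tokens: int, thinking_budget: int, max_output: int,
                  limits: ModelLimits = ModelLimits()) -> list[str]:
    """Return a list of limit violations; an empty list means the request fits."""
    problems = []
    if input_tokens > limits.context_window:
        problems.append(f"input {input_tokens} exceeds context window {limits.context_window}")
    if thinking_budget > limits.max_thinking:
        problems.append(f"thinking budget {thinking_budget} exceeds {limits.max_thinking}")
    if max_output > limits.max_output:
        problems.append(f"max output {max_output} exceeds {limits.max_output}")
    return problems
```

Returning every violation at once, rather than raising on the first, makes it easier to fix a request in a single pass.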

[Image: a computer workstation showing an automated GUI workflow with multiple application windows. Source: unsplash.com]
GUI automation is the core use case for Qwen3.7-Plus, which can identify specific pixel coordinates and plan multi-step UI interactions from screenshots.

Benchmark Performance

The benchmark that matters most for this model's intended use case is ScreenSpot Pro, which measures GUI grounding - the ability to look at a screenshot and identify the correct target coordinates. Qwen3.7-Plus scored 79.0 at launch, 11.6 points ahead of GPT-5.4 (67.4). For rough context, Claude Opus 4.6 was posting 72.7% on OSWorld-Verified, but that is a different benchmark measuring end-to-end task completion, so the two figures are not directly comparable. The ScreenSpot Pro gap is not a narrow lead - on GUI grounding, Plus is in a different bracket from current text-plus-tool-call approaches.

Benchmark	Qwen3.7-Plus	Qwen3.7-Max	GPT-5.4
ScreenSpot Pro	79.0	N/A (text-only)	67.4
AndroidWorld	81.0	N/A	Not disclosed
Terminal-Bench	70.3	69.7	Not disclosed
MCP-Atlas	76.4	Not disclosed	Not disclosed
GPQA Diamond	90.3	Not disclosed	Not disclosed
MMLU-Pro	89.6	~90+	89.7
MRCR-v2 128k	91.7	Not disclosed	Not disclosed
SWE-Bench Pro	~57-60%	60.6%	Not disclosed
AA Intelligence Index	53.3	~57	Not disclosed

A few caveats on these numbers: all benchmark scores above are vendor-reported or sourced from early third-party assessments, and independent reproduction of ScreenSpot Pro at this scale is still limited. The MMLU-Pro result (89.6) is within 0.1 points of the 89.7 listed for GPT-5.4 in the table above, suggesting Plus competes with frontier text models on knowledge tasks even though it trades some pure-text coding depth for vision capabilities.

The weak spot is SWE-Bench Pro, where Plus scores around 57-60% against Max's 60.6%. If your main workload is software engineering without GUI components, Max is the better choice and Plus doesn't add value.

[Image: abstract visualization of multimodal AI combining visual and language understanding. Source: unsplash.com]
Qwen3.7-Plus combines vision and language in a single model, enabling tasks that require understanding both screen content and natural language instructions.

Key Capabilities

The model's strongest use case is GUI automation - browser tasks, desktop application control, mobile navigation, and anything that requires reading a screen and deciding what to interact with. The 81.0 score on AndroidWorld is especially noteworthy: it measures end-to-end mobile task completion, not just localization accuracy. The jump from identifying a UI element to actually completing a goal across multiple steps is where many GUI agents fall apart, and Plus holds up here.

On ScreenSpot Pro, Plus scores 79.0. The closest competitor sits at 67.4. That gap of roughly 12 points represents a meaningful capability difference for teams building automation pipelines.

Document understanding is the second strong suit. The 91.7 on MRCR-v2 128k shows that retrieval accuracy holds at long context - relevant for PDF-heavy workflows, contract review, or research ingestion where you're pulling facts from documents that span hundreds of pages. The 256K thinking token budget also means the model can reason deeply on complex multi-turn problems before producing an answer.

Hybrid surface support - handling both GUI interactions and terminal/CLI commands from a single model - is a practical feature that saves architectural complexity. Rather than routing visual tasks to one model and command-line tasks to another, Plus handles both. It integrates with OpenClaw, Claude Code, and Qwen Code scaffolds, which means dropping it into existing agentic workflows requires minimal plumbing changes.
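One way to picture the single-model, two-surface design is a dispatcher that routes each planned step to the right executor. This is only a sketch: the step schema (a `surface` field plus `x`/`y` or `command`) and the `run_plan` function are hypothetical, and the two handlers are supplied by the caller rather than taken from any real scaffold.

```python
from typing import Any, Callable


def run_plan(steps: list[dict[str, Any]],
             click: Callable[[int, int], Any],
             run_command: Callable[[str], Any]) -> list[Any]:
    """Dispatch each step to the GUI or CLI handler based on its 'surface' field."""
    results = []
    for step in steps:
        surface = step["surface"]
        if surface == "gui":
            # GUI steps carry pixel coordinates grounded in the latest screenshot.
            results.append(click(step["x"], step["y"]))
        elif surface == "cli":
            # CLI steps carry a shell command string.
            results.append(run_command(step["command"]))
        else:
            raise ValueError(f"unknown surface: {surface!r}")
    return results
```

Because both kinds of step come from one plan, the routing logic lives in the executor instead of in a model-selection layer.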

Pricing and Availability

At $0.40/M input and $1.60/M output, Plus is roughly six times cheaper on input and nearly five times cheaper on output than Qwen3.7-Max ($2.50/$7.50). Cached input drops further to $0.04-$0.08/M tokens, which matters for agentic runs where context is preserved across turns. OpenRouter was showing a 20% promotional discount ($0.32/$1.28) at launch, though that rate should be confirmed before budgeting.
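The pricing figures above can be checked with a few lines of arithmetic. The prices come from this profile; the `run_cost` helper and the example token counts are illustrative assumptions, and the upper cached rate is used by default as the conservative choice.

```python
# Per-million-token list prices quoted in this profile (USD).
PLUS_INPUT, PLUS_OUTPUT = 0.40, 1.60
MAX_INPUT, MAX_OUTPUT = 2.50, 7.50
CACHED_LOW, CACHED_HIGH = 0.04, 0.08

# How many times cheaper Plus is than Max on each side.
input_ratio = MAX_INPUT / PLUS_INPUT
output_ratio = MAX_OUTPUT / PLUS_OUTPUT


def run_cost(fresh_input: int, cached_input: int, output: int,
             cached_rate: float = CACHED_HIGH) -> float:
    """USD cost of one run on Plus; all arguments are raw token counts."""
    per_million = 1_000_000
    total = (fresh_input * PLUS_INPUT
             + cached_input * cached_rate
             + output * PLUS_OUTPUT)
    return total / per_million
```

The input ratio works out to 6.25 and the output ratio to about 4.69, so "six times cheaper" holds for input only. Under these assumptions, a run with 200K fresh input tokens, 800K cached input tokens, and 50K output tokens comes to about $0.22.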

For comparison, DeepSeek V4 Flash still undercuts Plus markedly at $0.14/M input - but it's text-only. Among multimodal agents, Plus is competitive on cost. Claude Opus 4.8's computer use pricing is higher for comparable tasks.

Access points at launch:

Alibaba Cloud Model Studio - primary API endpoint, OpenAI-compatible, available in Beijing, Singapore, and US-Virginia regions
chat.qwen.ai - browser interface for direct testing
Together AI and OpenRouter - third-party API access

The model isn't available for self-hosting and there are no open weights.

Strengths and Weaknesses

Strengths

Best-in-class GUI grounding at launch (ScreenSpot Pro 79.0, AndroidWorld 81.0)
Strong long-context retrieval accuracy (MRCR-v2 128k: 91.7)
Competitive pricing vs. other multimodal agents at $0.40/M input
Hybrid GUI + CLI support from a single model
1M-token context window that multimodal inputs share with text

Weaknesses

No open weights - proprietary API-only, incompatible with sovereign or air-gapped deployments
Pure-text SWE-Bench Pro (~57-60%) is weaker than Qwen3.7-Max (60.6%)
Benchmark scores are largely vendor-reported for now; independent audits are limited
Performance degrades on heavily animated or custom UI interfaces
High verbosity noted in early testing (~110M output tokens vs. 29M median across comparable models)
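The verbosity figure matters for cost because output tokens are the more expensive side. A rough illustration, assuming both token volumes are billed at Plus's own $1.60/M output price (the comparable models' actual prices differ and are not given here); `output_cost` is a hypothetical helper.

```python
OUTPUT_PRICE_PER_M = 1.60  # Plus output price in USD per million tokens


def output_cost(millions_of_tokens: float) -> float:
    """USD cost of the given output volume at Plus's list price."""
    return round(millions_of_tokens * OUTPUT_PRICE_PER_M, 2)


plus_cost = output_cost(110)   # ~110M output tokens reported for Plus
median_cost = output_cost(29)  # 29M median across comparable models
verbosity_ratio = 110 / 29     # how many times more output Plus produced
```

At a fixed price, that is $176.00 against $46.40, a factor of roughly 3.8, so a low per-token price does not by itself guarantee a low per-task bill.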
