Qwen3.7-Plus
Alibaba's first multimodal agent model, combining GUI grounding (ScreenSpot Pro 79.0), 1M-token context, and text-plus-vision input at $0.40/M tokens.

Overview
Qwen3.7-Plus reached general availability on June 1, 2026, and represents a meaningful shift in what Alibaba is building with the Qwen family. Where Qwen3.7-Max is a text-only reasoning engine optimized for coding and agentic tasks, Plus is designed around visual perception - it accepts screenshots, document images, and video frames as input and can ground its outputs in what it actually sees on screen. The practical payoff: it can plan and execute multi-step GUI workflows with a precision that text-only models can't match.
TL;DR
- Best-in-class GUI grounding at launch: 79.0 on ScreenSpot Pro, roughly 12 points ahead of GPT-5.4's 67.4
- $0.40/M input, $1.60/M output - six times cheaper than Qwen3.7-Max; proprietary API-only, no open weights
- Weaker on pure-text software engineering (SWE-Bench Pro ~57-60%) than Max; strongest value proposition is visual automation
The architecture extends the Qwen3.7-Max language backbone with a multimodal perception layer that handles text, images, and video as a unified input stream. Outputs are text-only - this isn't a generative image model. Think of it as adding eyes to a system that already knew how to reason. The model can identify pixel-level coordinates for UI interactions ("click at x=487, y=232") and create structured action plans from screenshots, making it a direct competitor to Claude's Computer Use and GPT-5.4's GUI agent capabilities.
What's worth noting is that Plus ships proprietary and API-only, with no open weights available at launch. That's a break from Alibaba's pattern with the Qwen series, which had built a strong reputation for releasing model checkpoints under permissive licenses. Whether open-weight variants follow in Q3 2026 remains unconfirmed. For teams building on sovereign infrastructure or needing self-hosted deployment, this is a hard blocker.
Key Specifications
| Specification | Details |
|---|---|
| Provider | Alibaba |
| Model Family | Qwen |
| Parameters | Not disclosed |
| Context Window | 1M tokens |
| Max Output | 65,536 tokens |
| Max Thinking Tokens | 256K tokens |
| Input Price | $0.40/M tokens |
| Output Price | $1.60/M tokens |
| Cached Input | $0.04-$0.08/M tokens |
| Release Date | June 1, 2026 |
| License | Proprietary (API-only) |
| Modalities | Text + image + video input; text output |
GUI automation is the core use case for Qwen3.7-Plus, which can identify specific pixel coordinates and plan multi-step UI interactions from screenshots.
Source: unsplash.com
Benchmark Performance
The benchmark that matters most for this model's intended use case is ScreenSpot Pro, which measures GUI grounding - the ability to look at a screenshot and identify the correct target coordinates. Qwen3.7-Plus scored 79.0 at launch, which is 12 points ahead of GPT-5.4 (67.4) and well above what Claude Opus 4.6 was putting up on OSWorld-Verified (72.7%). That's not a narrow lead - on GUI grounding, Plus is in a different bracket from current text-plus-tool-call approaches.
| Benchmark | Qwen3.7-Plus | Qwen3.7-Max | GPT-5.4 |
|---|---|---|---|
| ScreenSpot Pro | 79.0 | N/A (text-only) | 67.4 |
| AndroidWorld | 81.0 | N/A | Not disclosed |
| Terminal-Bench | 70.3 | 69.7 | Not disclosed |
| MCP-Atlas | 76.4 | Not disclosed | Not disclosed |
| GPQA Diamond | 90.3 | Not disclosed | Not disclosed |
| MMLU-Pro | 89.6 | ~90+ | 89.7 |
| MRCR-v2 128k | 91.7 | Not disclosed | Not disclosed |
| SWE-Bench Pro | ~57-60% | 60.6 | Not disclosed |
| AA Intelligence Index | 53.3 | ~57 | Not disclosed |
A few caveats on these numbers: all benchmark scores above are vendor-reported or sourced from early third-party assessments. Independent reproduction of ScreenSpot Pro at this scale is still limited. The MMLU-Pro result (89.6) is notably close to Claude Opus 4.6's 89.7, suggesting Plus competes with frontier text models on knowledge tasks even though it trades some pure-text coding depth to add vision capabilities.
The weak spot is SWE-Bench Pro, where Plus scores around 57-60% against Max's 60.6%. If your main workload is software engineering without GUI components, Max is the better choice and Plus doesn't add value.
Qwen3.7-Plus combines vision and language in a single model, enabling tasks that require understanding both screen content and natural language instructions.
Source: unsplash.com
Key Capabilities
The model's strongest use case is GUI automation - browser tasks, desktop application control, mobile navigation, and anything that requires reading a screen and deciding what to interact with. The 81.0 score on AndroidWorld is especially worth attention: it measures end-to-end mobile task completion, not just localization accuracy. The jump from identifying a UI element to actually completing a goal across multiple steps is where many GUI agents fall apart, and Plus holds up here.
On ScreenSpot Pro, Plus scores 79.0. The closest competitor sits at 67.4. That 12-point gap represents a meaningful capability difference for teams building automation pipelines.
Document understanding is the second strong suit. The 91.7 on MRCR-v2 128k shows that retrieval accuracy holds at long context - relevant for PDF-heavy workflows, contract review, or research ingestion where you're pulling facts from documents that span hundreds of pages. The 256K thinking token budget also means the model can reason deeply on complex multi-turn problems before producing an answer.
Hybrid surface support - handling both GUI interactions and terminal/CLI commands from a single model - is a practical feature that saves architectural complexity. Rather than routing visual tasks to one model and command-line tasks to another, Plus handles both. It integrates with OpenClaw, Claude Code, and Qwen Code scaffolds, which means dropping it into existing agentic workflows requires minimal plumbing changes.
Pricing and Availability
At $0.40/M input and $1.60/M output, Plus is six times cheaper on input than Qwen3.7-Max ($2.50/$7.50). Cached input drops further to $0.04-$0.08/M tokens, which matters for agentic runs where context is preserved across turns. OpenRouter was showing a 20% promotional discount ($0.32/$1.28) at launch, though that rate should be confirmed before budgeting.
For comparison, DeepSeek V4 Flash still undercuts Plus markedly at $0.14/M input - but it's text-only. Among multimodal agents, Plus is competitive on cost. Claude Opus 4.8's computer use pricing is higher for comparable tasks.
Access points at launch:
- Alibaba Cloud Model Studio - primary API endpoint, OpenAI-compatible, available in Beijing, Singapore, and US-Virginia regions
- chat.qwen.ai - browser interface for direct testing
- Together AI and OpenRouter - third-party API access
The model isn't available for self-hosting and there are no open weights.
Strengths and Weaknesses
Strengths
- Best-in-class GUI grounding at launch (ScreenSpot Pro 79.0, AndroidWorld 81.0)
- Strong long-context retrieval accuracy (MRCR-v2 128k: 91.7)
- Competitive pricing vs. other multimodal agents at $0.40/M input
- Hybrid GUI + CLI support from a single model
- 1M-token context window that multimodal inputs share with text
Weaknesses
- No open weights - proprietary API-only, incompatible with sovereign or air-gapped deployments
- Pure-text SWE-Bench Pro (~57-60%) is weaker than Qwen3.7-Max (60.6%)
- Benchmark scores are largely vendor-reported now; independent audits are limited
- Performance degrades on heavily animated or custom UI interfaces
- High verbosity noted in early testing (~110M output tokens vs. 29M median across comparable models)
Related Coverage
- Qwen3.7-Max model profile - the text-only flagship Plus is built on
- Qwen3.6-Max model profile - prior generation
- Our Qwen3.6-Max review - hands-on benchmarking of the predecessor
- Computer Use Leaderboard - where GUI agent models stack up
- Our Qwen 3 review - earlier coverage of the Qwen 3 series launch
Sources
- Qwen 3.7 Plus is now live on Fireworks - Fireworks AI announcement with specs
- Qwen3.7-Plus GA Release - Digital Applied
- Qwen3.7-Plus on OpenRouter - Pricing & Specs
- Build Fast With AI: Qwen3.7-Plus Review
- APIdog: Qwen 3.7 Plus - Alibaba's multimodal agent model
- Artificial Analysis: DeepSeek V4 Flash comparison
✓ Last verified June 16, 2026
