Qwen3.7-Max
Alibaba's agent-first flagship model with a 1M-token context window, topping Terminal-Bench 2.0 and SWE-Bench Pro at roughly one-sixth the cost of Claude Opus 4.7.

Overview
Qwen3.7-Max is Alibaba's current flagship language model and the successor to Qwen3.6-Max-Preview. Released on May 19, 2026, and formally announced at the Alibaba Cloud Summit in Hangzhou the following day, it's the most capable model in the Qwen family - and the most capable Chinese model by several independent measures at time of launch.
TL;DR
- Agent-first design built for multi-step autonomous coding and workflow tasks; scored 69.7 on Terminal-Bench 2.0 (ranked first at launch) and 60.6 on SWE-Bench Pro
- 1M-token context window at $2.50/M input tokens - roughly 6x cheaper than Claude Opus 4.7 at equivalent context
- Text-only input; high verbosity means real per-task costs are notably higher than the rate card suggests
The model's defining bet is endurance. Alibaba's internal benchmark had it running for 35 consecutive hours on a kernel optimization task, executing 1,158 tool calls against a Zhenwu M890 chip it had never seen in training, and delivering a production-grade kernel with a 10x speed improvement over the Triton reference. That claim hasn't been independently reproduced, and Alibaba ran it on its own hardware with a known target, so treat it as directional rather than authoritative. Still, the agentic benchmark scores from third parties hold up: Qwen3.7-Max ranks first on Terminal-Bench 2.0 and SWE-Bench Pro against the field of frontier models.
The jump from the predecessor is major. Qwen3.6-Max-Preview shipped with a 256K context window and no public pricing. Qwen3.7-Max quadruples that to 1M tokens, adds official pricing, and pushes benchmark scores up across every tracked agentic task. Alibaba also dropped the "Preview" label, signaling this as the production version.
Qwen3.7-Max was formally announced at the 2026 Alibaba Cloud Summit alongside the Zhenwu M890 chip.
Source: scmp.com
Key Specifications
| Specification | Details |
|---|---|
| Provider | Alibaba (Qwen team) |
| Model Family | Qwen |
| Parameters | Not disclosed |
| Context Window | 1,000,000 tokens |
| Max Output Tokens | 65,536 per request |
| Input Price (Alibaba Cloud) | $2.50/M tokens |
| Output Price (Alibaba Cloud) | $7.50/M tokens |
| Cached Input Price | $0.25/M tokens |
| Release Date | 2026-05-19 |
| License | Proprietary (closed-weight) |
| Modalities | Text input, text output |
| API Protocols | OpenAI and Anthropic compatible |
Benchmark Performance
The table below compares Qwen3.7-Max against its three main competitors on the benchmarks where agentic models are most meaningfully differentiated. Numbers are from Artificial Analysis, Together AI, and published benchmark papers. The full agentic AI benchmarks leaderboard tracks these in real time.
| Benchmark | Qwen3.7-Max | Claude Opus 4.7 | GPT-5.5 | DeepSeek V4 Pro |
|---|---|---|---|---|
| AA Intelligence Index | 56.6 | 57.3 | 60.0 | ~55.0 |
| Terminal-Bench 2.0 | 69.7 | 65.4 | ~64.0 | 67.9 |
| SWE-Bench Pro | 60.6 | ~59.0 | ~58.0 | 59.0 |
| SWE-Bench Verified | 80.4 | 80.8 | ~79.0 | 80.6 |
| GPQA Diamond | 92.4 | ~92.0 | 93.6 | ~91.0 |
| HMMT 2026 Feb | 97.1 | 96.2 | ~95.0 | 95.2 |
| MCP-Atlas | 76.4 | 75.8 (Opus 4.6) | - | - |
| MMLU | 92.8 | ~91.0 | ~93.0 | ~91.0 |
| MMLU-Pro | 82.0 | ~82.0 | ~84.0 | ~80.0 |
| HumanEval | 94.5 | ~94.0 | ~95.0 | ~93.0 |
Scores marked ~ are estimates from partial data; treat them as rough positioning, not exact comparisons.
The overall intelligence index puts Qwen3.7-Max at fifth on the Artificial Analysis leaderboard, just behind Opus 4.7 and GPT-5.5. That's a slight overall deficit. On the SWE-bench coding leaderboard and Terminal-Bench specifically, it leads - which matters if agentic coding is your primary workload.
One caveat on interpreting these numbers: Qwen3.7-Max generated approximately 97 million tokens during the Artificial Analysis evaluation, versus a median of 24 million for comparable models. Scoring higher while creating 4x the output isn't necessarily efficient. Real-world cost per resolved task will run higher than the rate card implies.
The model's abstention behavior is also a known issue. On the AA-Omniscience knowledge benchmark, its attempt rate is 48%, meaning it refused to answer about half of knowledge-recall questions. Raw accuracy hits 30.1%, with a hallucination rate of 22.9%. On pure reasoning benchmarks it's strong; on knowledge retrieval it declines to answer more than it guesses.
Key Capabilities
Autonomous agent execution
Qwen3.7-Max was trained with a decoupled Task-Harness-Verifier architecture, which means it performs consistently across agent frameworks without needing framework-specific tuning. It runs inside Claude Code, OpenClaw, Qwen Code, and Qwen-RobotClaw without behavioral drift between harnesses. Most frontier models degrade when moved between scaffolds; this one doesn't, at least on published evaluations.
Native support for Model Context Protocol (MCP) is built in, not bolted on. The preserve_thinking feature lets the model carry its internal chain-of-thought state across multi-turn conversations, which matters for tasks that span dozens of steps. It also ships with a self-monitoring reward hacking detection system that flags when the model's tool calls start tuning for metrics rather than task completion - a real problem in long-horizon runs.
Qwen3.7-Max is available through multiple inference providers including Together AI, OpenRouter, and Alibaba Cloud directly.
Source: together.ai
Long-context retrieval
The 1M token context window is the largest in Qwen's history, doubled from the 256K that Qwen3.6-Max-Preview shipped with. Third-party testing on the MRCR-v2 128K retrieval benchmark puts it at 90.4, which verifies the claim past 128K. Whether it holds at 500K or 1M isn't yet independently confirmed. This is a significant caveat: most "1M context" models show degradation in retrieval accuracy past their verified range. See our long-context benchmarks leaderboard for the current state of play.
Protocol-level dual compatibility
Qwen3.7-Max supports both OpenAI and Anthropic Messages API formats natively at the endpoint level, not through a translation shim. That means existing Claude Code or OpenAI SDK integrations can switch to Qwen3.7-Max with a base URL change and no prompt rewriting. This is a practical advantage for developers who want to test cost reduction without a migration.
Pricing and Availability
Alibaba Cloud Model Studio charges $2.50 per million input tokens and $7.50 per million output tokens, with a 90% discount on cached input at $0.25/M. This puts it significantly below the main western frontier models:
| Model | Input (per 1M) | Output (per 1M) | Context |
|---|---|---|---|
| Qwen3.7-Max | $2.50 | $7.50 | 1M |
| Claude Opus 4.7 | $15.00 | $75.00 | 1M |
| GPT-5.5 | $10.00 | $30.00 | 256K |
| DeepSeek V4 Pro | ~$2.80 | ~$8.40 | 1M |
The price advantage is real on rate cards. On actual agent runs, the verbosity issue closes the gap. If you're running tasks where the model creates 4x the output tokens of the alternative, the effective cost per task moves toward parity. Use max_tokens limits in your implementation to keep this under control.
Third-party providers offer lower rates: Together AI lists it at $1.25/M input and $3.75/M output. OpenRouter also carries it. For low-volume experimentation, these are cheaper entry points.
The model is closed-weight, with no Hugging Face upload or self-hosting path. API access requires an Alibaba Cloud or third-party account. The model ID is qwen3.7-max on Alibaba Cloud Model Studio.
Strengths and Weaknesses
Strengths
- Leads on Terminal-Bench 2.0 and SWE-Bench Pro against the full frontier field at time of launch
- 1M token context at $2.50/M input, the best value long-context option among frontier models
- Cross-harness consistency: runs in Claude Code, OpenClaw, and Qwen Code without behavioral drift
- Native MCP support and Anthropic API protocol compatibility, making migration from Claude straightforward
preserve_thinkingfeature enables coherent chain-of-thought across multi-turn agent sessions
Weaknesses
- Text-only: no vision or audio inputs, falling behind multimodal competitors like GPT-5.5 and Gemini
- Extremely high token verbosity inflates real-world costs beyond what the rate card shows
- 48% attempt rate on knowledge benchmarks - abstains on roughly half of knowledge-recall questions
- 35-hour autonomous demo was Alibaba's own internal benchmark on Alibaba's own hardware; not independently reproduced
- Closed-weight with no self-hosting option, unlike DeepSeek V4 which ships open-source weights
- Preview stability issues reported with documents approaching the full 1M token limit
Related Coverage
- Qwen3.6-Max-Preview - predecessor model profile
- DeepSeek V4 - closest open-source competitor
- Claude Opus 4.7 - western frontier comparison
- Kimi K2.6 - strong competitor on SWE-Bench Pro
- Agentic AI Benchmarks Leaderboard
- SWE-Bench Coding Agent Leaderboard
- Long-Context Benchmarks Leaderboard
- Alibaba shifts Qwen to closed weights with Qwen3.6-Max
FAQ
How does Qwen3.7-Max compare to Claude Opus 4.7?
Qwen3.7-Max scores slightly lower on the overall Artificial Analysis Intelligence Index (56.6 vs 57.3) but leads on Terminal-Bench 2.0 and SWE-Bench Pro. It costs roughly 6x less per million input tokens and shares the same 1M context window.
Is Qwen3.7-Max open source?
No. Alibaba released it as a closed-weight proprietary model with no public weights. API access only, through Alibaba Cloud Model Studio, Together AI, and OpenRouter.
Can I use Qwen3.7-Max with the Anthropic SDK?
Yes. The model natively supports the Anthropic Messages API format at the endpoint level. Switching from Claude requires changing the base URL and model ID; no prompt rewriting needed.
What is the actual cost of running Qwen3.7-Max on long agent tasks?
The official rate is $2.50/M input and $7.50/M output on Alibaba Cloud. But the model produces roughly 4x more output tokens than comparable models on agent evaluations, so effective cost per task is closer to 2-3x the rate card rather than 6x cheaper than Opus 4.7.
What agent frameworks does Qwen3.7-Max support?
It runs in Claude Code, OpenClaw, Qwen Code, and Qwen-RobotClaw. The training architecture decouples Task, Harness, and Verifier components, giving it consistent behavior across scaffolds rather than degrading when moved between frameworks.
Does Qwen3.7-Max support vision or multimodal inputs?
No. As of May 2026, it's text-only. Alibaba's multimodal offerings remain separate from the Qwen Max line.
Sources:
- Alibaba unveils Qwen 3.7 Max - South China Morning Post
- Qwen 3.7 Max: Alibaba's New Flagship AI Model - Digital Applied
- Alibaba introduces Qwen3.7-Max as next-gen AI agent model - TechNode
- Qwen Introduces Qwen3.7-Max - MarkTechPost
- Qwen 3.7 Max preview review - Decrypt
- Qwen3.7-Max API - Together AI
- Alibaba Unveils New AI Chip, Flagship Model - Alibaba Cloud Blog
- Qwen3.7-Max Review - Build Fast with AI
- Qwen 3.7 Max vs Claude Opus 4.7 - AIMadeTools
- Qwen3.7-Max Developer Guide - OFox AI
✓ Last verified May 28, 2026
