Name: Qwen3.7-Max
Author: Alibaba

Overview

Qwen3.7-Max is Alibaba's current flagship language model and the successor to Qwen3.6-Max-Preview. Released on May 19, 2026, and formally announced at the Alibaba Cloud Summit in Hangzhou the following day, it's the most capable model in the Qwen family - and the most capable Chinese model by several independent measures at time of launch.

TL;DR

Agent-first design built for multi-step autonomous coding and workflow tasks; scored 69.7 on Terminal-Bench 2.0 (ranked first at launch) and 60.6 on SWE-Bench Pro
1M-token context window at $2.50/M input tokens - roughly 6x cheaper than Claude Opus 4.7 at equivalent context
Text-only input; high verbosity means real per-task costs are notably higher than the rate card suggests

The model's defining bet is endurance. Alibaba's internal benchmark had it running for 35 consecutive hours on a kernel optimization task, executing 1,158 tool calls against a Zhenwu M890 chip it had never seen in training, and delivering a production-grade kernel with a 10x speed improvement over the Triton reference. That claim hasn't been independently reproduced, and Alibaba ran it on its own hardware with a known target, so treat it as directional rather than authoritative. Still, the agentic benchmark scores from third parties hold up: Qwen3.7-Max ranks first on Terminal-Bench 2.0 and SWE-Bench Pro against the field of frontier models.

The jump from the predecessor is major. Qwen3.6-Max-Preview shipped with a 256K context window and no public pricing. Qwen3.7-Max quadruples that to 1M tokens, adds official pricing, and pushes benchmark scores up across every tracked agentic task. Alibaba also dropped the "Preview" label, signaling this as the production version.

Official Qwen3.7-Max promotional graphic showing key capabilities and agent harness integrations Qwen3.7-Max was formally announced at the 2026 Alibaba Cloud Summit alongside the Zhenwu M890 chip. Source: scmp.com

Key Specifications

Specification	Details
Provider	Alibaba (Qwen team)
Model Family	Qwen
Parameters	Not disclosed
Context Window	1,000,000 tokens
Max Output Tokens	65,536 per request
Input Price (Alibaba Cloud)	$2.50/M tokens
Output Price (Alibaba Cloud)	$7.50/M tokens
Cached Input Price	$0.25/M tokens
Release Date	2026-05-19
License	Proprietary (closed-weight)
Modalities	Text input, text output
API Protocols	OpenAI and Anthropic compatible

Benchmark Performance

The table below compares Qwen3.7-Max against its three main competitors on the benchmarks where agentic models are most meaningfully differentiated. Numbers are from Artificial Analysis, Together AI, and published benchmark papers. The full agentic AI benchmarks leaderboard tracks these in real time.

Benchmark	Qwen3.7-Max	Claude Opus 4.7	GPT-5.5	DeepSeek V4 Pro
AA Intelligence Index	56.6	57.3	60.0	~55.0
Terminal-Bench 2.0	69.7	65.4	~64.0	67.9
SWE-Bench Pro	60.6	~59.0	~58.0	59.0
SWE-Bench Verified	80.4	80.8	~79.0	80.6
GPQA Diamond	92.4	~92.0	93.6	~91.0
HMMT 2026 Feb	97.1	96.2	~95.0	95.2
MCP-Atlas	76.4	75.8 (Opus 4.6)	-	-
MMLU	92.8	~91.0	~93.0	~91.0
MMLU-Pro	82.0	~82.0	~84.0	~80.0
HumanEval	94.5	~94.0	~95.0	~93.0

Scores marked ~ are estimates from partial data; treat them as rough positioning, not exact comparisons.

The overall intelligence index puts Qwen3.7-Max at fifth on the Artificial Analysis leaderboard, just behind Opus 4.7 and GPT-5.5. That's a slight overall deficit. On the SWE-bench coding leaderboard and Terminal-Bench specifically, it leads - which matters if agentic coding is your primary workload.

One caveat on interpreting these numbers: Qwen3.7-Max generated approximately 97 million tokens during the Artificial Analysis evaluation, versus a median of 24 million for comparable models. Scoring higher while creating 4x the output isn't necessarily efficient. Real-world cost per resolved task will run higher than the rate card implies.

The model's abstention behavior is also a known issue. On the AA-Omniscience knowledge benchmark, its attempt rate is 48%, meaning it refused to answer about half of knowledge-recall questions. Raw accuracy hits 30.1%, with a hallucination rate of 22.9%. On pure reasoning benchmarks it's strong; on knowledge retrieval it declines to answer more than it guesses.

Key Capabilities

Autonomous agent execution

Qwen3.7-Max was trained with a decoupled Task-Harness-Verifier architecture, which means it performs consistently across agent frameworks without needing framework-specific tuning. It runs inside Claude Code, OpenClaw, Qwen Code, and Qwen-RobotClaw without behavioral drift between harnesses. Most frontier models degrade when moved between scaffolds; this one doesn't, at least on published evaluations.

Native support for Model Context Protocol (MCP) is built in, not bolted on. The preserve_thinking feature lets the model carry its internal chain-of-thought state across multi-turn conversations, which matters for tasks that span dozens of steps. It also ships with a self-monitoring reward hacking detection system that flags when the model's tool calls start tuning for metrics rather than task completion - a real problem in long-horizon runs.

Qwen3.7-Max listed on Together AI's model catalog, showing API availability and context window specs Qwen3.7-Max is available through multiple inference providers including Together AI, OpenRouter, and Alibaba Cloud directly. Source: together.ai

Long-context retrieval

The 1M token context window is the largest in Qwen's history, doubled from the 256K that Qwen3.6-Max-Preview shipped with. Third-party testing on the MRCR-v2 128K retrieval benchmark puts it at 90.4, which verifies the claim past 128K. Whether it holds at 500K or 1M isn't yet independently confirmed. This is a significant caveat: most "1M context" models show degradation in retrieval accuracy past their verified range. See our long-context benchmarks leaderboard for the current state of play.

Protocol-level dual compatibility

Qwen3.7-Max supports both OpenAI and Anthropic Messages API formats natively at the endpoint level, not through a translation shim. That means existing Claude Code or OpenAI SDK integrations can switch to Qwen3.7-Max with a base URL change and no prompt rewriting. This is a practical advantage for developers who want to test cost reduction without a migration.

Pricing and Availability

Alibaba Cloud Model Studio charges $2.50 per million input tokens and $7.50 per million output tokens, with a 90% discount on cached input at $0.25/M. This puts it significantly below the main western frontier models:

Model	Input (per 1M)	Output (per 1M)	Context
Qwen3.7-Max	$2.50	$7.50	1M
Claude Opus 4.7	$15.00	$75.00	1M
GPT-5.5	$10.00	$30.00	256K
DeepSeek V4 Pro	~$2.80	~$8.40	1M

The price advantage is real on rate cards. On actual agent runs, the verbosity issue closes the gap. If you're running tasks where the model creates 4x the output tokens of the alternative, the effective cost per task moves toward parity. Use max_tokens limits in your implementation to keep this under control.

Third-party providers offer lower rates: Together AI lists it at $1.25/M input and $3.75/M output. OpenRouter also carries it. For low-volume experimentation, these are cheaper entry points.

The model is closed-weight, with no Hugging Face upload or self-hosting path. API access requires an Alibaba Cloud or third-party account. The model ID is qwen3.7-max on Alibaba Cloud Model Studio.

Strengths and Weaknesses

Strengths

Leads on Terminal-Bench 2.0 and SWE-Bench Pro against the full frontier field at time of launch
1M token context at $2.50/M input, the best value long-context option among frontier models
Cross-harness consistency: runs in Claude Code, OpenClaw, and Qwen Code without behavioral drift
Native MCP support and Anthropic API protocol compatibility, making migration from Claude straightforward
preserve_thinking feature enables coherent chain-of-thought across multi-turn agent sessions

Weaknesses

Text-only: no vision or audio inputs, falling behind multimodal competitors like GPT-5.5 and Gemini
Extremely high token verbosity inflates real-world costs beyond what the rate card shows
48% attempt rate on knowledge benchmarks - abstains on roughly half of knowledge-recall questions
35-hour autonomous demo was Alibaba's own internal benchmark on Alibaba's own hardware; not independently reproduced
Closed-weight with no self-hosting option, unlike DeepSeek V4 which ships open-source weights
Preview stability issues reported with documents approaching the full 1M token limit

FAQ

How does Qwen3.7-Max compare to Claude Opus 4.7?

Qwen3.7-Max scores slightly lower on the overall Artificial Analysis Intelligence Index (56.6 vs 57.3) but leads on Terminal-Bench 2.0 and SWE-Bench Pro. It costs roughly 6x less per million input tokens and shares the same 1M context window.

Is Qwen3.7-Max open source?

No. Alibaba released it as a closed-weight proprietary model with no public weights. API access only, through Alibaba Cloud Model Studio, Together AI, and OpenRouter.

Can I use Qwen3.7-Max with the Anthropic SDK?

Yes. The model natively supports the Anthropic Messages API format at the endpoint level. Switching from Claude requires changing the base URL and model ID; no prompt rewriting needed.

What is the actual cost of running Qwen3.7-Max on long agent tasks?

The official rate is $2.50/M input and $7.50/M output on Alibaba Cloud. But the model produces roughly 4x more output tokens than comparable models on agent evaluations, so effective cost per task is closer to 2-3x the rate card rather than 6x cheaper than Opus 4.7.

What agent frameworks does Qwen3.7-Max support?

It runs in Claude Code, OpenClaw, Qwen Code, and Qwen-RobotClaw. The training architecture decouples Task, Harness, and Verifier components, giving it consistent behavior across scaffolds rather than degrading when moved between frameworks.

Does Qwen3.7-Max support vision or multimodal inputs?

No. As of May 2026, it's text-only. Alibaba's multimodal offerings remain separate from the Qwen Max line.

Sources: