Qwen3.7-Max

Alibaba's agent-first flagship model with a 1M-token context window, topping Terminal-Bench 2.0 and SWE-Bench Pro at roughly one-sixth the cost of Claude Opus 4.7.

Qwen3.7-Max

Overview

Qwen3.7-Max is Alibaba's current flagship language model and the successor to Qwen3.6-Max-Preview. Released on May 19, 2026, and formally announced at the Alibaba Cloud Summit in Hangzhou the following day, it's the most capable model in the Qwen family - and the most capable Chinese model by several independent measures at time of launch.

TL;DR

  • Agent-first design built for multi-step autonomous coding and workflow tasks; scored 69.7 on Terminal-Bench 2.0 (ranked first at launch) and 60.6 on SWE-Bench Pro
  • 1M-token context window at $2.50/M input tokens - roughly 6x cheaper than Claude Opus 4.7 at equivalent context
  • Text-only input; high verbosity means real per-task costs are notably higher than the rate card suggests

The model's defining bet is endurance. Alibaba's internal benchmark had it running for 35 consecutive hours on a kernel optimization task, executing 1,158 tool calls against a Zhenwu M890 chip it had never seen in training, and delivering a production-grade kernel with a 10x speed improvement over the Triton reference. That claim hasn't been independently reproduced, and Alibaba ran it on its own hardware with a known target, so treat it as directional rather than authoritative. Still, the agentic benchmark scores from third parties hold up: Qwen3.7-Max ranks first on Terminal-Bench 2.0 and SWE-Bench Pro against the field of frontier models.

The jump from the predecessor is major. Qwen3.6-Max-Preview shipped with a 256K context window and no public pricing. Qwen3.7-Max quadruples that to 1M tokens, adds official pricing, and pushes benchmark scores up across every tracked agentic task. Alibaba also dropped the "Preview" label, signaling this as the production version.

Official Qwen3.7-Max promotional graphic showing key capabilities and agent harness integrations Qwen3.7-Max was formally announced at the 2026 Alibaba Cloud Summit alongside the Zhenwu M890 chip. Source: scmp.com


Key Specifications

SpecificationDetails
ProviderAlibaba (Qwen team)
Model FamilyQwen
ParametersNot disclosed
Context Window1,000,000 tokens
Max Output Tokens65,536 per request
Input Price (Alibaba Cloud)$2.50/M tokens
Output Price (Alibaba Cloud)$7.50/M tokens
Cached Input Price$0.25/M tokens
Release Date2026-05-19
LicenseProprietary (closed-weight)
ModalitiesText input, text output
API ProtocolsOpenAI and Anthropic compatible

Benchmark Performance

The table below compares Qwen3.7-Max against its three main competitors on the benchmarks where agentic models are most meaningfully differentiated. Numbers are from Artificial Analysis, Together AI, and published benchmark papers. The full agentic AI benchmarks leaderboard tracks these in real time.

BenchmarkQwen3.7-MaxClaude Opus 4.7GPT-5.5DeepSeek V4 Pro
AA Intelligence Index56.657.360.0~55.0
Terminal-Bench 2.069.765.4~64.067.9
SWE-Bench Pro60.6~59.0~58.059.0
SWE-Bench Verified80.480.8~79.080.6
GPQA Diamond92.4~92.093.6~91.0
HMMT 2026 Feb97.196.2~95.095.2
MCP-Atlas76.475.8 (Opus 4.6)--
MMLU92.8~91.0~93.0~91.0
MMLU-Pro82.0~82.0~84.0~80.0
HumanEval94.5~94.0~95.0~93.0

Scores marked ~ are estimates from partial data; treat them as rough positioning, not exact comparisons.

The overall intelligence index puts Qwen3.7-Max at fifth on the Artificial Analysis leaderboard, just behind Opus 4.7 and GPT-5.5. That's a slight overall deficit. On the SWE-bench coding leaderboard and Terminal-Bench specifically, it leads - which matters if agentic coding is your primary workload.

One caveat on interpreting these numbers: Qwen3.7-Max generated approximately 97 million tokens during the Artificial Analysis evaluation, versus a median of 24 million for comparable models. Scoring higher while creating 4x the output isn't necessarily efficient. Real-world cost per resolved task will run higher than the rate card implies.

The model's abstention behavior is also a known issue. On the AA-Omniscience knowledge benchmark, its attempt rate is 48%, meaning it refused to answer about half of knowledge-recall questions. Raw accuracy hits 30.1%, with a hallucination rate of 22.9%. On pure reasoning benchmarks it's strong; on knowledge retrieval it declines to answer more than it guesses.


Key Capabilities

Autonomous agent execution

Qwen3.7-Max was trained with a decoupled Task-Harness-Verifier architecture, which means it performs consistently across agent frameworks without needing framework-specific tuning. It runs inside Claude Code, OpenClaw, Qwen Code, and Qwen-RobotClaw without behavioral drift between harnesses. Most frontier models degrade when moved between scaffolds; this one doesn't, at least on published evaluations.

Native support for Model Context Protocol (MCP) is built in, not bolted on. The preserve_thinking feature lets the model carry its internal chain-of-thought state across multi-turn conversations, which matters for tasks that span dozens of steps. It also ships with a self-monitoring reward hacking detection system that flags when the model's tool calls start tuning for metrics rather than task completion - a real problem in long-horizon runs.

Qwen3.7-Max listed on Together AI's model catalog, showing API availability and context window specs Qwen3.7-Max is available through multiple inference providers including Together AI, OpenRouter, and Alibaba Cloud directly. Source: together.ai

Long-context retrieval

The 1M token context window is the largest in Qwen's history, doubled from the 256K that Qwen3.6-Max-Preview shipped with. Third-party testing on the MRCR-v2 128K retrieval benchmark puts it at 90.4, which verifies the claim past 128K. Whether it holds at 500K or 1M isn't yet independently confirmed. This is a significant caveat: most "1M context" models show degradation in retrieval accuracy past their verified range. See our long-context benchmarks leaderboard for the current state of play.

Protocol-level dual compatibility

Qwen3.7-Max supports both OpenAI and Anthropic Messages API formats natively at the endpoint level, not through a translation shim. That means existing Claude Code or OpenAI SDK integrations can switch to Qwen3.7-Max with a base URL change and no prompt rewriting. This is a practical advantage for developers who want to test cost reduction without a migration.


Pricing and Availability

Alibaba Cloud Model Studio charges $2.50 per million input tokens and $7.50 per million output tokens, with a 90% discount on cached input at $0.25/M. This puts it significantly below the main western frontier models:

ModelInput (per 1M)Output (per 1M)Context
Qwen3.7-Max$2.50$7.501M
Claude Opus 4.7$15.00$75.001M
GPT-5.5$10.00$30.00256K
DeepSeek V4 Pro~$2.80~$8.401M

The price advantage is real on rate cards. On actual agent runs, the verbosity issue closes the gap. If you're running tasks where the model creates 4x the output tokens of the alternative, the effective cost per task moves toward parity. Use max_tokens limits in your implementation to keep this under control.

Third-party providers offer lower rates: Together AI lists it at $1.25/M input and $3.75/M output. OpenRouter also carries it. For low-volume experimentation, these are cheaper entry points.

The model is closed-weight, with no Hugging Face upload or self-hosting path. API access requires an Alibaba Cloud or third-party account. The model ID is qwen3.7-max on Alibaba Cloud Model Studio.


Strengths and Weaknesses

Strengths

  • Leads on Terminal-Bench 2.0 and SWE-Bench Pro against the full frontier field at time of launch
  • 1M token context at $2.50/M input, the best value long-context option among frontier models
  • Cross-harness consistency: runs in Claude Code, OpenClaw, and Qwen Code without behavioral drift
  • Native MCP support and Anthropic API protocol compatibility, making migration from Claude straightforward
  • preserve_thinking feature enables coherent chain-of-thought across multi-turn agent sessions

Weaknesses

  • Text-only: no vision or audio inputs, falling behind multimodal competitors like GPT-5.5 and Gemini
  • Extremely high token verbosity inflates real-world costs beyond what the rate card shows
  • 48% attempt rate on knowledge benchmarks - abstains on roughly half of knowledge-recall questions
  • 35-hour autonomous demo was Alibaba's own internal benchmark on Alibaba's own hardware; not independently reproduced
  • Closed-weight with no self-hosting option, unlike DeepSeek V4 which ships open-source weights
  • Preview stability issues reported with documents approaching the full 1M token limit

FAQ

How does Qwen3.7-Max compare to Claude Opus 4.7?

Qwen3.7-Max scores slightly lower on the overall Artificial Analysis Intelligence Index (56.6 vs 57.3) but leads on Terminal-Bench 2.0 and SWE-Bench Pro. It costs roughly 6x less per million input tokens and shares the same 1M context window.

Is Qwen3.7-Max open source?

No. Alibaba released it as a closed-weight proprietary model with no public weights. API access only, through Alibaba Cloud Model Studio, Together AI, and OpenRouter.

Can I use Qwen3.7-Max with the Anthropic SDK?

Yes. The model natively supports the Anthropic Messages API format at the endpoint level. Switching from Claude requires changing the base URL and model ID; no prompt rewriting needed.

What is the actual cost of running Qwen3.7-Max on long agent tasks?

The official rate is $2.50/M input and $7.50/M output on Alibaba Cloud. But the model produces roughly 4x more output tokens than comparable models on agent evaluations, so effective cost per task is closer to 2-3x the rate card rather than 6x cheaper than Opus 4.7.

What agent frameworks does Qwen3.7-Max support?

It runs in Claude Code, OpenClaw, Qwen Code, and Qwen-RobotClaw. The training architecture decouples Task, Harness, and Verifier components, giving it consistent behavior across scaffolds rather than degrading when moved between frameworks.

Does Qwen3.7-Max support vision or multimodal inputs?

No. As of May 2026, it's text-only. Alibaba's multimodal offerings remain separate from the Qwen Max line.


Sources:

✓ Last verified May 28, 2026

James Kowalski
About the author AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.