MiniMax M3 Makes 1M Context Viable With Sparse Attention

Million-token context windows have been available in AI APIs for the better part of a year. The problem was making them usable: standard attention scales quadratically with sequence length, so a 1M-token request costs roughly 100x more compute than a 10K request just from the attention operation alone. MiniMax M3, released June 1, is betting a custom sparse attention architecture can break that cost curve.

Key Specs

Spec	Value
Context window	1M tokens (512K guaranteed minimum)
Input price	$0.60/M tokens ($0.30/M during launch promo)
Output price	$2.40/M tokens ($1.20/M during launch promo)
Output speed	43.9 tokens/sec (Artificial Analysis)
SWE-Bench Pro	59.0% (vendor-reported)
Parameters	Not disclosed
Inputs	Text, image, video
Weights	Promised on Hugging Face by ~June 11

How Sparse Attention Changes the Equation

Standard full attention - the kind running in virtually every transformer rolled out today - has to consider every position in the context window when computing each output token. At 1M tokens that means attending to one million key-value pairs for every produced token. Memory traffic alone becomes the bottleneck before arithmetic does.

What MSA Actually Does

MiniMax Sparse Attention (MSA) breaks the problem in two stages. A lightweight index branch first selects which KV cache blocks are actually relevant to the current query. Only those blocks get passed into the full attention computation. The approach keeps a GQA (grouped-query attention) backbone with MSA layered on top - and crucially, it doesn't compress the KV state the way competing architectures like DeepSeek's MLA do. That preserves accuracy at very long contexts rather than trading it away for speed.

The implementation detail MiniMax highlights: KV blocks serve as the outer loop, so each block gets read from memory exactly once. Access patterns stay contiguous rather than scattered, which matters more than theoretical operation count when inference is memory-bandwidth-bound.

The Numbers at 1M Context

At maximum context length, MiniMax reports:

1/20th the per-token compute of M2 (the previous generation) at 1M tokens
9x faster prefill compared to M2 at 1M tokens
15x faster decoding at 1M tokens
4x faster than Flash-Sparse-Attention under the same head configuration

MiniMax Sparse Attention GQA block architecture showing the index branch and sparse branch components The MSA architecture diagram from Hugging Face's technical breakdown, showing how the index branch pre-selects KV blocks before the attention computation runs. Source: huggingface.co

The practical result: M3's API speed at 1M context is reported at roughly 100 tokens per second output - about 3x faster than Claude Opus models at that length, according to MiniMax. Artificial Analysis independently measured 43.9 tokens per second overall (not specifically at maximum context), which puts it ahead of most frontier models in the same price bracket.

Benchmark Results

Every number in the table below comes from MiniMax's own launch materials, run on MiniMax infrastructure with agent scaffolding MiniMax configured. Independent verification was pending at launch.

Benchmark	M3	Claude Opus 4.8	GPT-5.5	Gemini 3.1 Pro
SWE-Bench Pro	59.0%	69.2%	58.6%	54.2%
Terminal-Bench 2.1	66.0%	74.6%	-	-
OSWorld-Verified	70.1%	83.4%	-	-
BrowseComp	83.5%	-	-	-
PostTrainBench	37.1 (#3)	42.4 (#1)	39.3 (#2)	-

On coding evals, M3 edges past GPT-5.5 on SWE-Bench Pro while staying 10 points behind Claude Opus 4.8's 69.2%. BrowseComp at 83.5% - compared against Claude Opus 4.7's 79.3% - is M3's clearest result. Computer use at 70.1% on OSWorld trails Opus 4.8 by 13 points. For context on where these numbers sit in the broader SWE-Bench leaderboard, M3 lands solidly in the open-weight frontier tier but not at the top.

MiniMax also ran a 12-hour autonomous agent demonstration: an ICLR paper replication spanning 18 commits and 23 experimental figures. That's meant to show long-horizon agentic capability at scale, not just a benchmark score, but it's still a demo MiniMax staged and reported internally.

MiniMax M3 CUDA kernel optimization showing inference speedup progression during iterative optimization CUDA kernel optimization run from MiniMax's launch materials, showing iterative performance improvement across optimization cycles at 1M context. Source: minimax.io

Pricing and Availability

The standard API rate is $0.60/M input tokens and $2.40/M output tokens. A 50% launch discount ran for the first week, bringing input to $0.30/M. Even at full price, that's roughly 5-10% of what Claude Opus charges at comparable context lengths.

For a 500K-input plus 100K-output agentic coding run, M3 costs around $0.54 at standard rates versus roughly $5 on frontier proprietary models. The gap is real. Subscription plans on MiniMax Code: Plus at $20/month (~1.7B tokens), Max at $50/month, Ultra at $120/month. Tokens are unified across text, image, speech, and music.

One caveat on the pricing headline: inputs above 512K tokens carry a context surcharge. The "1M context at $0.60/M" marketing figure applies only up to the 512K guarantee; longer requests cost more. The full model profile has the per-tier pricing breakdown.

M3 is also available through OpenRouter at the same rates. For long-context benchmark comparisons against other 1M-context models, M3's combination of speed and price is truly competitive - if the benchmarks hold up under independent testing.

What To Watch

Weights are still pending. MiniMax promised open weights on Hugging Face within ten days of the June 1 launch, targeting around June 11. As of this writing, nothing has appeared under the MiniMaxAI organization. June 11 hasn't passed yet, so this isn't a miss - but the "open-weight" label is doing significant work while the weights aren't actually available.

License terms are unknown. MiniMax's prior model, M2.7, shipped under a modified-MIT license blocking commercial use without written authorization from MiniMax. The company used "open-weight" language for that release too. M3's license ships with the weights, so you won't know what you're working with until they land. Don't assume permissive licensing based on the "open" framing.

Every benchmark is vendor-reported. MiniMax chose which benchmarks to include, configured the agent scaffolding, and ran everything on its own infrastructure. Artificial Analysis placed M3 at rank #1 out of 164 models in its price class on their Intelligence Index - a useful independent signal, though that index uses a composite score rather than a single task benchmark. Time-to-first-token at 2.60 seconds is slower than it looks for interactive use cases.

Chinese jurisdiction applies. MiniMax is a Shanghai-based company. API traffic falls under China's 2017 National Intelligence Law. That's an operational consideration, not a technical flaw, but it belongs in any serious deployment evaluation for teams handling sensitive code or data.

The predecessor, M2.7, shipped with verified open weights and a confirmed 59% SWE-Bench score on its own. M3's architecture gains are real - the MSA math checks out and the efficiency claims are consistent with how sparse attention works in practice. Whether the benchmark scores hold up when independent evaluators run their own tests is the open question that matters now.

Sources: