SubQ Review: 52x Faster, but Show Your Work

Subquadratic's SubQ claims the first linear-scaling LLM with a 12M-token window - but private beta access, self-reported benchmarks, and a 17-point MRCR gap make independent verification the only test that matters.

Every transformer-based model shares the same structural flaw: attention is quadratic. Double the context, and compute doesn't double - it quadruples. That ceiling is why the strongest labs in the world have stalled around 1-2 million tokens despite years of engineering against it. SubQ, from a Miami startup called Subquadratic, claims to have redesigned attention from the ground up to make that ceiling disappear. It's either the most consequential architecture shift since the transformer itself, or it's another chapter in the long history of long-context claims that dissolve under scrutiny.

TL;DR

  • 6.5/10 - Architecturally credible, but almost nothing is independently verified
  • SSA (Subquadratic Sparse Attention) claims linear O(n) scaling, with a measured 52x wall-clock speedup over FlashAttention-2 at 1M tokens on B200 GPUs
  • A 17-point gap between the lab MRCR score (83%) and the production score (65.9%) is the number that most needs explaining
  • Wait for the technical paper and public API before making architectural decisions around it

The architectural argument is coherent enough to take seriously. SubQ's SSA mechanism routes each query token toward a content-selected subset of positions rather than computing attention over every position in the sequence. Linear scaling means doubling context doubles compute instead of quadrupling it. At 12 million tokens, the company reports roughly 1,000x lower attention compute versus standard dense transformers. Three-stage training - pretraining, supervised fine-tuning, and reinforcement learning specifically tuned for long-context retrieval - aims to preserve accuracy while the sparsity does the efficiency work.
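The 1,000x figure is easy to sanity-check. Assuming dense attention scores roughly n² query-key pairs while SSA scores roughly n·k for some per-query selection budget k - a quantity Subquadratic hasn't disclosed - the claim implies a budget in the low tens of thousands at 12 million tokens:

```python
# Back-of-envelope check on the reported ~1,000x attention-compute
# reduction at 12M tokens. Assumes dense attention costs ~n^2 query-key
# pairs and SSA costs ~n*k (k positions selected per query). k is an
# inferred quantity here, not a figure the company has disclosed.
n = 12_000_000       # 12M-token context
reduction = 1_000    # company-reported compute reduction vs dense

dense_pairs = n * n  # dense attention: every query scores every key
k = n / reduction    # n^2 / (n*k) = reduction  =>  k = n / reduction
print(f"implied selected positions per query: k ~ {k:,.0f}")  # ~12,000
```

If that arithmetic holds, each query attends to roughly 0.1% of the sequence - plausible for retrieval-style workloads, and a useful number to check against the technical paper whenever it appears.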

The problem is that serious evaluation hasn't happened yet. SubQ launched on May 5, 2026, with $29M in seed funding, three products in private beta, and benchmark numbers from a third-party testing service that hasn't been named. No weights. No technical paper. No independent reproduction. What follows is an analysis of what the company says and what that actually shows.


What SSA Actually Does

The quadratic problem isn't new. Researchers have been attacking it for years with sliding window attention, fixed-block sparsity, linear recurrence models like Mamba, and hybrid approaches. Each carries trade-offs. Sliding windows miss long-range dependencies. Fixed-block sparsity misses cross-block relationships. Linear recurrences compress context rather than attending to it directly.

SSA's pitch is different. Instead of attending to a fixed local region or a predetermined pattern, SSA selects positions based on content. For a given query token, the model learns which positions are relevant and computes exact attention only over those. The selection mechanism scales linearly; the attention over the selected set is exact. This means the model can, in principle, attend to information anywhere in a 12-million-token sequence as long as the selection step routes toward it.
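Subquadratic hasn't published SSA's selection mechanism, so any concrete code is speculative. A minimal sketch of the general shape - content-scored top-k selection, then exact softmax attention over the selected subset - might look like this; note that the naive version below still computes full scores, so its selection step is itself O(n²), which is precisely the part a real linear-time design has to replace with something cheaper:

```python
# Illustrative top-k content-selected attention, NOT SubQ's actual SSA:
# each query keeps only its k most relevant positions and computes exact
# softmax attention over that subset. The full score matrix makes the
# selection step O(n^2) here; a genuinely linear mechanism needs a
# cheaper selector, which Subquadratic hasn't described.
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, top_k=64):
    """q, k, v: (seq_len, d) tensors. Returns (seq_len, d)."""
    scores = q @ k.T / q.shape[-1] ** 0.5     # relevance of every position
    idx = scores.topk(top_k, dim=-1).indices  # content-selected positions
    weights = F.softmax(scores.gather(-1, idx), dim=-1)  # exact, on subset
    sel_v = v[idx]                            # (seq_len, top_k, d)
    return (weights.unsqueeze(-1) * sel_v).sum(dim=-2)

q = k = v = torch.randn(1024, 64)
out = topk_sparse_attention(q, k, v)          # (1024, 64)
```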

The team reports wall-clock speedups on NVIDIA B200s at various context lengths:

Context Length    Speedup vs FlashAttention-2
128K              7.2x
256K              13.2x
512K              23.0x
1M                52.2x

These are wall-clock measurements on real hardware, not theoretical FLOP counts. If they hold up under independent testing, the efficiency story is genuine, and the scaling curve is the right shape for a linear-complexity claim.
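One consistency check the published numbers do allow: if SSA is O(n) and the dense baseline is O(n²), the speedup should roughly double each time the context doubles. The reported figures are close to that shape:

```python
# Sanity check on the reported speedup curve: a linear-vs-quadratic
# comparison should see the speedup roughly double per context doubling.
# Figures are Subquadratic's own B200 measurements vs FlashAttention-2.
reported = {128_000: 7.2, 256_000: 13.2, 512_000: 23.0, 1_000_000: 52.2}
ctx = sorted(reported)
for a, b in zip(ctx, ctx[1:]):
    print(f"{a:>9,} -> {b:>9,} tokens: speedup grew {reported[b] / reported[a]:.2f}x")
# prints growth factors of roughly 1.8x, 1.7x, and 2.3x per step
```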

[Screenshot: SubQ's website at launch, showing its efficiency-first positioning.] SubQ launched on May 5, 2026, offering an API, a CLI coding agent, and a free search product - all in private beta. Source: subq.ai


The Benchmarks: What the Numbers Say

Subquadratic ran three evaluations through a third-party testing service. The selection is narrow - long-context retrieval and coding only - and broader reasoning, math, multilingual, and safety benchmarks haven't been published.

Benchmark              SubQ 1M-Preview    Claude Opus 4.7    GPT-5.5
RULER 128K             95.0%              94.8%              n/a
MRCR v2 (1M tokens)    65.9%              32.2%*             74.0%
SWE-Bench Verified     81.8%              87.6%              n/a

*The MRCR figure for Claude Opus 4.7 comes from Subquadratic's press materials and appears inconsistent with results from other sources. The company's own SSA technical blog quotes Claude Opus 4.6 at 78.3% on MRCR v2 - a figure that differs substantially from the 32.2% cited in the launch announcement. These discrepancies matter and haven't been explained.

The RULER result is the cleanest data point: 95.0% at 128K versus Claude Opus 4.7's 94.8%, at a claimed cost of $8 versus roughly $2,600 for Claude Opus at comparable accuracy - a cost ratio of about 325x. If that ratio is real, it changes the economics of long-context inference at scale. The efficiency argument doesn't require SubQ to be the smartest model in the room - it requires it to be accurate enough at a fraction of the price.

The SWE-Bench Verified number is weaker. 81.8% is competitive with Claude Opus 4.6 but trails Claude Opus 4.7 (87.6%) and the SWE-Bench leaderboard's current top performers. For a model being positioned partly as a coding agent, that gap matters.

The MRCR Production Gap

The number that most demands an explanation is the MRCR v2 score at 1M tokens. Subquadratic's technical blog quotes a result of 86.2% on this benchmark. The launch announcement references 83%. The third-party testing service returned 65.9% for the production model.

That's a 17-20 point spread between the lab and the shipping product on one of the three benchmarks the company published. Subquadratic hasn't addressed it publicly. DataCamp noted the mismatch in its analysis but couldn't get clarification. The most charitable interpretation is that the research model and production model differ significantly. The less charitable one is that this is the familiar gap between best-case lab conditions and what the product actually delivers.

"SubQ is either the biggest breakthrough since the Transformer... or it's AI Theranos." - Dan McAteer, AI commentator, via byteiota.com


The Magic.dev Problem

Any serious look at SubQ has to contend with Magic.dev. In 2024, Magic raised $465M and announced a 100-million-token context window with similar efficiency arguments. Eighteen months later, there's limited public evidence of the technology working at production scale outside the company, and adoption has been quiet relative to the headline numbers. Subquadratic's $29M is smaller and the claims are narrower, but the pattern - extraordinary long-context claims, private beta, no external verification - is familiar.

The difference, if there is one, is that SSA's linear-scaling argument is more technically grounded than what Magic published. The speedup curves on B200 hardware are specific and falsifiable. The company's 11-person PhD research team, drawn from Meta, Google, Oxford, Cambridge, and ByteDance, has real credentials. CTO Alex Whedon's background at Meta includes generative AI infrastructure work that's relevant to what they're claiming.

None of that means the claims are correct. It means they're specific enough to be worth testing when access is available.

[Screenshot: SubQ's technical blog on how SSA makes long context practical.] Subquadratic's technical blog describes SSA's three core properties: linear scaling, content-dependent routing, and arbitrary position retrieval. Source: subq.ai


The Three Products

SubQ 1M-Preview API is OpenAI-compatible with tool use and full 1M-token context access. The 12M-token research result isn't in the production API - that window doesn't exist for customers yet. A 50M-token target is set for Q4 2026, which is an ambitious roadmap for a company that hasn't shipped a public product.
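"OpenAI-compatible" normally means the stock openai client works once pointed at the vendor's endpoint. The base URL and model identifier below are guesses - neither has been published outside the private beta:

```python
# Hypothetical usage sketch for an OpenAI-compatible endpoint. The
# base_url and model id are assumptions; SubQ's real values are only
# available to private-beta users.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.subq.ai/v1",  # hypothetical endpoint
    api_key="YOUR_SUBQ_KEY",
)
resp = client.chat.completions.create(
    model="subq-1m-preview",            # hypothetical model id
    messages=[{"role": "user", "content": "Summarize the attached 800K-token corpus."}],
)
print(resp.choices[0].message.content)
```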

SubQ Code is a CLI coding agent built on the same model. The company claims a 25% lower bill and 10x faster codebase exploration when SubQ Code operates as a layer underneath Claude Code, Codex, or Cursor. That's an interesting positioning - not as a replacement for established coding agents but as a long-context retrieval layer that routes token-heavy context lookups to SubQ rather than burning expensive frontier tokens. No independent measurement exists for these claims, but the use case architecture is sensible.
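The division of labor is easy to sketch in the abstract. Nothing about SubQ Code's internals is public, so both model calls below are stubs; the point is where the tokens go:

```python
# Sketch of the retrieval-layer pattern SubQ Code describes: the cheap
# long-context model ingests the whole codebase in one call, and only
# its short digest is sent to the expensive frontier agent. Both
# functions are stand-ins - no SubQ Code internals have been published.

def long_context_model(prompt: str) -> str:
    """Stub for a cheap 1M-token model (e.g., the SubQ API)."""
    return "<excerpts relevant to the question, distilled from the repo>"

def frontier_model(prompt: str) -> str:
    """Stub for an expensive coding agent (Claude Code, Codex, Cursor)."""
    return "<patch produced from the distilled context>"

def answer(question: str, codebase: str) -> str:
    # One big cheap call replaces a RAG pipeline's chunk/embed/rerank
    # stages; frontier tokens are spent on reasoning, not raw context.
    digest = long_context_model(f"Find everything relevant to: {question}\n\n{codebase}")
    return frontier_model(f"{question}\n\nRelevant context:\n{digest}")

print(answer("Why does the cache invalidate early?", "<repo dump>"))
```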

SubQ Search is a free consumer-facing research tool with, per the company, "chatbot-speed" results. It's the lowest-stakes product of the three and probably the most accessible for early testing when access opens.

All three require waitlist registration. Access is not yet public.


What to Make of It

The context window problem Subquadratic is attacking is real. Teams working with large codebases, legal corpora, or long agent sessions have built RAG pipelines, chunking strategies, and orchestration layers exactly because no single model call can hold everything. If SSA delivers actual linear scaling in production, the architectural benefit compounds as context grows - the advantage isn't just cost, it's the elimination of retrieval infrastructure that adds latency, failure modes, and engineering overhead.

The honest answer is that we can't tell yet whether it does. The benchmarks are too narrow. The benchmark numbers are inconsistent across the company's own materials. The production-vs-lab MRCR gap hasn't been explained. The research model's 12M capability isn't in the product. External researchers can't reproduce results because no weights have been released.

Strengths

  • SSA's linear-scaling architecture solves the right problem with a technically coherent approach
  • 52x wall-clock speedup over FlashAttention-2 at 1M tokens is a specific, falsifiable hardware measurement
  • OpenAI-compatible API lowers integration friction
  • Cost efficiency story, if real, changes the economics of long-context inference substantially
  • SubQ Code's positioning as a retrieval layer rather than a full agent replacement is practical

Weaknesses

  • Every benchmark is self-reported or from an unnamed third-party service; no independent reproduction
  • The 17-point MRCR production-vs-lab gap hasn't been addressed publicly
  • Competitor benchmark numbers in launch materials are inconsistent with other sources
  • 12M-token context isn't available in the production API - the headline feature isn't in the product
  • Narrow benchmark coverage: no reasoning, math, multilingual, or safety evaluations published
  • Private beta with no public access, no weights, and no technical paper
  • SWE-Bench at 81.8% trails Opus 4.7 (87.6%) on the coding tasks that matter most for SubQ Code

Verdict

SubQ is an interesting architecture story. It isn't a model you can use yet, and the evidence for its claims is too thin to rely on. The efficiency numbers would matter if confirmed - not because SubQ needs to beat GPT-5.5 on general reasoning, but because the cost case for long-context inference at scale is compelling regardless of who tops the general benchmarks. Linear attention scaling is a truly useful property if it holds.

The test is simple: publish a technical paper, open the API, and let external researchers run MRCR v2 and RULER themselves. Until that happens, the $8 vs $2,600 comparison and the 52x speedup claim are marketing copy with hardware numbers attached. They might be right. Subquadratic hasn't released enough for anyone to know.

Read our model card for SubQ for a full breakdown of specifications and benchmarks. And see our initial news coverage of the launch for the funding details and investor roster.

Score: 6.5/10 - Architecture worth watching. Evidence not yet worth betting on.


About the author
Elena Marchetti, Senior AI Editor & Investigative Journalist

Elena is a technology journalist with over eight years of experience covering artificial intelligence, machine learning, and the startup ecosystem.