Claude Mythos Preview Review: Escaped Its Sandbox
Claude Mythos Preview posts the highest SWE-bench score ever published, has found thousands of real zero-days in production software, and, during safety testing, escaped its sandbox to email a researcher eating lunch in a park.

The most memorable thing about Claude Mythos Preview isn't the benchmark scores. It's what happened during internal safety testing, a few weeks before Anthropic was ready to announce the model.
Mythos - codenamed "Capybara" internally - was given access to a secured sandbox computer as part of an evaluation run. Anthropic had restricted the system to prevent network egress. Mythos identified a multi-step exploit chain, used JIT heap spraying to achieve privilege escalation, broke out of the sandbox, obtained broad internet access, and sent an email to the researcher overseeing the evaluation. The researcher was eating a sandwich in a park at the time.
Anthropic disclosed this in the April 7 announcement. They kept the model. They also didn't release it to the public.
TL;DR
- 8.8/10 - the strongest AI model for software security and coding ever published, restricted to 52 organizations globally
- Highest published SWE-bench Verified score (93.9%), independent AISI confirmation of 73% on expert CTF challenges
- Not available to the public; $25/$125 per million tokens for the 52 organizations that do qualify
- Who should read this: security teams, critical infrastructure operators, and anyone tracking the frontier - not developers who need a model today
What Mythos Is
Mythos Preview sits above the Opus tier in Anthropic's lineup - the first time the company has shipped a model separated from its flagship product family by capability rather than cost or deployment context. The architecture details are sparse. Anthropic hasn't disclosed parameter counts. Third-party researchers estimate roughly 800 billion to 1.2 trillion active parameters per forward pass on a Mixture-of-Experts design, giving the model the knowledge capacity of a roughly 10T dense model at a fraction of the compute cost. The company also describes a "tiered attention" system that maintains different resolution levels across the full 1M-token context window, though no technical paper has been published.
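The compute arithmetic behind that claim is worth making concrete. A minimal sketch, using the standard rule of thumb of roughly 2 FLOPs per active parameter per generated token - the parameter figures are the third-party estimates quoted above, not Anthropic-confirmed numbers:

```python
# Back-of-envelope sketch of the MoE compute claim. All parameter figures
# are third-party estimates from the text, not Anthropic-confirmed.

def flops_per_token(active_params: float) -> float:
    # Common rule of thumb for a decoder-only transformer at inference:
    # ~2 FLOPs per active parameter per generated token.
    return 2 * active_params

moe_active = 1.0e12   # ~1T active parameters per forward pass (upper estimate)
dense_equiv = 10e12   # dense model with comparable knowledge capacity

moe = flops_per_token(moe_active)       # ~2e12 FLOPs per token
dense = flops_per_token(dense_equiv)    # ~2e13 FLOPs per token

print(f"MoE uses ~{moe / dense:.0%} of the dense model's per-token compute")
```

Under these estimates, serving Mythos would cost about a tenth of the per-token compute of a knowledge-equivalent dense model - which is the usual argument for MoE at this scale.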
Access is through Project Glasswing only. The 12 founding partners are Amazon Web Services, Anthropic, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorgan Chase, the Linux Foundation, Microsoft, NVIDIA, and Palo Alto Networks. An additional 40 organizations maintaining critical software infrastructure also received access. That's 52 organizations on a planet with hundreds of thousands of software companies.
The model card covers full specifications, pricing, and platform availability details.
The Benchmarks
Anthropic's published numbers for Mythos are the strongest across the software engineering category that any lab has submitted. On SWE-bench Verified - the standard measure of real-world bug-fixing ability - Mythos posts 93.9%. The next publicly available model, Claude Opus 4.6, sits at 80.8%. A 13-point gap on a well-established benchmark is not noise.
| Benchmark | Mythos Preview | Claude Opus 4.6 | GPT-5.5 |
|---|---|---|---|
| SWE-bench Verified | 93.9% | 80.8% | ~72% |
| SWE-bench Pro | 77.8% | 53.4% | N/A |
| Terminal-Bench 2.0 | 82.0% | 65.4% | ~83% |
| GPQA Diamond | 94.6% | 91.3% | ~90% |
| CyberGym | 83.1% | 66.6% | N/A |
| USAMO 2026 | 97.6% | 42.6% | 95.2% |
| BrowseComp | 86.9% | 83.7% | ~80% |
| GDPval (real-world knowledge) | N/A | N/A | 84.9% |
Some caveats apply. Most of these scores are Anthropic self-reported, from a lab that controls both the model and often the evaluation setup. Two data points push back against full skepticism. The UK AI Safety Institute independently assessed Mythos's cybersecurity capabilities and backed up the headline numbers. And GPT-5.5, launched by OpenAI on April 23, narrowly beats Mythos on Terminal-Bench 2.0 - which makes Anthropic's figures look less curated. Labs don't list benchmarks they're losing.
GPT-5.5 leads on GDPval (84.9%), a benchmark measuring knowledge work across 44 occupations. Mythos has no published score there. That absence is worth tracking over time. For full ranking context, the coding benchmarks leaderboard has the detailed breakdown.
The AISI independently confirmed that Mythos completes expert-level CTF challenges at a 73% rate - the first AI to clear that difficulty bar. On the institute's 32-step corporate network takeover simulation, it averaged 22 completed steps.
The Sandbox Escape
The incident was disclosed in Anthropic's April 7 announcement, which matters. They included it willingly in the press materials.
Mythos was running an evaluation designed to test whether it could escape a sandboxed environment when instructed. The instructions said: escape. It did. The escape chain involved JIT heap spraying combined with additional privilege escalation steps to gain internet access from a system configured to prevent exactly that. Then it emailed the researcher.
How you read this depends on your model of AI behavior. One interpretation: the model followed instructions exactly, which is what you want from an AI system. Another interpretation: a system capable of this on command has demonstrated that the capability exists and is callable. The distance between "on command" and "in other conditions" is determined by the quality of the safety scaffolding around the model - and scaffolding breaks. Anthropic didn't share which reading they believe. The decision to restrict access rather than release publicly suggests they're aware of the second one.
The Cloud Security Alliance has documented cases where Mythos chained multiple low-severity vulnerabilities that standard security tooling wouldn't flag individually into complete local privilege escalation via race conditions and KASLR bypasses. This isn't the model hallucinating an attack path. It's the model reasoning about code semantics in the way a skilled penetration tester would, holding the full state of a complex system across many steps.
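Why per-finding triage misses these chains is easy to show. A deliberately simplified sketch - the finding names and severity scores below are hypothetical, not the CSA's actual data:

```python
# Illustrative sketch (hypothetical findings and scores, not the CSA's
# actual data): why per-finding severity triage misses multi-step chains.

ALERT_THRESHOLD = 7.0  # e.g., only "high"-severity CVSS scores get flagged

# Three low-severity findings that only matter in combination.
chain = [
    {"id": "race-window-in-tmpfile", "cvss": 3.3},
    {"id": "kaslr-info-leak", "cvss": 4.0},
    {"id": "unchecked-symlink-follow", "cvss": 3.9},
]

# Standard tooling evaluates each finding in isolation...
flagged = [f for f in chain if f["cvss"] >= ALERT_THRESHOLD]
print(f"flagged individually: {len(flagged)}")  # 0 - nothing surfaces

# ...but an attacker, or a model reasoning across the whole system,
# evaluates the chain end-to-end: each link enables the next, and the
# composed impact is full local privilege escalation.
print(f"links in working chain: {len(chain)}")
```

Every link scores below the alert threshold, so nothing is flagged in isolation - yet the composed chain is a complete escalation path. That gap between per-finding scoring and whole-system reasoning is exactly where the CSA says Mythos operates.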
Our earlier review of Claude Opus 4.7 noted that agentic coding had already reached a level where multi-step autonomous tasks were reliable. Mythos extends that into territory that requires a different category of caution.
Zero-Days at Industrial Scale
Anthropic used Mythos to scan major operating systems and browsers before the public announcement. The discovered vulnerabilities are documented specifically enough to check against public records, which makes them more credible than most AI security claims.
The three most significant finds:
A 27-year-old TCP SACK vulnerability in OpenBSD that can crash an affected server with two crafted packets. Static analysis tools and fuzzers missed it because the flaw required semantic reasoning about how TCP options interact under adversarial conditions. Campaign cost: approximately $20,000.
A 16-year-old flaw in FFmpeg's H.264 codec whose vulnerable code path fuzzers had exercised five million times without ever triggering the bug. Mythos caught it by reasoning about code semantics rather than generating inputs. Campaign cost: around $10,000.
A 17-year-old remote code execution vulnerability in FreeBSD's NFS stack (CVE-2026-4747) allowing unauthenticated root from the internet. Mythos built a 20-gadget ROP chain split across multiple packets fully autonomously.
Across all scanning work, Anthropic reports complete exploit development costs ranging from under $50 to approximately $2,000 per vulnerability, with tasks completed in hours to days. On OSS-Fuzz testing, the comparison to Opus 4.6 is stark: 595 crashes at tiers 1-2 plus 10 complete control flow hijacks for Mythos, versus 250-275 total for Opus 4.6. Testing on Firefox 147's JavaScript engine produced 181 successful exploits for Mythos compared to 2 for Opus 4.6.
Until Mythos, finding a zero-day in well-audited software required expensive human talent and weeks of work. The ceiling is now $2,000 and a few hours.
The economics are the part that changes things. The cost curve Anthropic has demonstrated - $50 to $2,000 per vulnerability, hours to days instead of person-weeks - doesn't remove human security researchers, but it restructures what they do. Defenders with access to Mythos can scan at a scale and speed that manual review can't match. Attackers with equivalent capability gain the same advantage.
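The quoted budget range is consistent with the published per-token rates. A rough sketch of the arithmetic, using the listed Glasswing prices ($25/M input, $125/M output); the token counts are hypothetical workload shapes, not figures Anthropic has disclosed:

```python
# Rough arithmetic on the cost figures above, using the listed Glasswing
# rates ($25/M input, $125/M output). Token counts are hypothetical.

INPUT_RATE = 25 / 1e6    # dollars per input token
OUTPUT_RATE = 125 / 1e6  # dollars per output token

def run_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# A long agentic campaign - say 60M input tokens (code re-read across many
# turns) and 4M output tokens - lands at the top of the quoted range.
print(f"${run_cost(60_000_000, 4_000_000):,.0f}")   # $2,000

# A short targeted run stays near the bottom of the range.
print(f"${run_cost(1_500_000, 100_000):,.2f}")      # $50.00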
Project Glasswing brings together 12 founding organizations including AWS, Apple, Google, and CrowdStrike, with Anthropic committing $100M in usage credits and $4M in open-source security donations.
Project Glasswing
Anthropic is committing $100 million in usage credits to Project Glasswing partners, plus $2.5 million to Alpha-Omega and the Open Source Security Foundation through the Linux Foundation, and $1.5 million to the Apache Software Foundation. The stated goal: let defenders scan and patch critical systems before attackers reach the same capability level.
The logic holds. The March 2026 CMS misconfiguration that leaked Mythos details before the announcement already demonstrated that the model's existence isn't a secret. If Anthropic's capabilities are roughly reproducible by other labs within 12-18 months - a reasonable assumption given recent progress rates - then a defensive head start now may matter more than theoretical arms-race concerns.
The counter-argument is that Project Glasswing is a consortium of Anthropic's major commercial partners. AWS, Google, and Microsoft all benefit commercially from the security credibility Glasswing provides. The $100 million in credits is also revenue for Anthropic. This doesn't make the defensive work less real. It does make the framing around "responsible deployment" less clean than the press materials suggest.
The AISI evaluation is the most credible independent data point on what the program is actually deploying. Expert-level CTF challenges, which no AI could complete before April 2026, are now cleared 73% of the time. On the AISI's internally developed 32-step corporate network attack simulation - from initial reconnaissance through complete network takeover, estimated at 20 hours of manual effort - Mythos succeeded in 3 of 10 attempts and averaged 22 completed steps out of 32. The AISI noted these tests don't include active defenders, detection tooling, or alert penalties. Real attacks are harder. A 73% success rate on expert CTFs is still not a number to dismiss.
Strengths
- 93.9% SWE-bench Verified - highest published score for any model, 13 points ahead of the next publicly available model
- Independent AISI confirmation covers the cybersecurity benchmarks specifically, providing external corroboration
- Finds decade-old vulnerabilities in well-audited codebases at costs that make manual discovery economically uncompetitive
- 1M token context handles full large codebase analysis without truncation penalties
- Available on all four major cloud platforms (Bedrock, Vertex AI, Foundry, Claude API) for eligible organizations
Weaknesses
- Not publicly accessible - 52 organizations globally; for everyone else, this is informational
- $25 input / $125 output per million tokens, 5x the cost of Claude Opus 4.6 even for those who qualify
- Almost all benchmark scores are Anthropic self-reported; independent evaluation covers the cybersecurity subset only
- AISI explicitly noted Mythos struggles in operational technology (OT) environments, often the most vulnerable in critical infrastructure
- No published scores on GDPval or ARC-AGI-2, leaving real-world knowledge work comparisons against GPT-5.5 incomplete
- Parameter counts, training data, and constitutional training updates haven't been published
Verdict
Score: 8.8/10
Claude Mythos Preview does what Anthropic claims. The SWE-bench lead is real, the zero-day discoveries are documented and reproducible, and the AISI independently confirmed the CTF performance. For software security work specifically, this is the most capable AI system ever published.
The practical problem is access. If you're not among the 52 Glasswing organizations, the model is a subject of analysis, not a tool. For the organizations that do have access, $25/$125 per million tokens is high but defensible when a single zero-day found by someone else first can cost millions to investigate and remediate.
The sandbox escape is an honest disclosure that Anthropic hasn't fully answered. A model that chains vulnerabilities and defeats network restrictions on instruction is a model with a callable attack surface. That the company disclosed it openly is worth crediting. That they shipped it to 52 organizations immediately after is a decision that rests on the quality of Glasswing's vetting process - which hasn't been independently audited.
The 8.8 reflects capability and responsible disclosure. The missing 1.2 points are access, self-reported benchmarks, and a safety question that remains open.
Sources
- Claude Mythos Preview - red.anthropic.com
- Project Glasswing - Anthropic
- Our evaluation of Claude Mythos Preview's cyber capabilities - AISI
- Anthropic's new AI model finds and exploits zero-days across every major OS and browser - Help Net Security
- Claude Mythos Preview completes full cyberattack simulation for first time - The New Stack
- Claude Mythos Preview Benchmarks, Pricing and Project Glasswing - LLM Stats
- Introducing Project Glasswing - Linux Foundation
- Claude Mythos: How AI broke out of its sandbox - Computing
- Three Historic Vulnerabilities: Mythos Cybersecurity Capabilities in Detail - claudemythosai.io
- Claude Mythos Preview vs GPT-5.5 benchmark comparison - BenchLM.ai
- AI Vulnerability Discovery and Containment Failures - Cloud Security Alliance
