Claude Opus 4.8 Review: Reliability Over Raw Scores
Claude Opus 4.8 sets new highs on SWE-bench Pro and long-context tasks while a 4x improvement in code flaw detection may matter more than any benchmark number.

Forty-one days after Opus 4.7 shipped, Anthropic released Claude Opus 4.8 on May 28, 2026 - with a $65 billion Series H that pushed its valuation to $965 billion. The timing made the announcement feel crowded, but the model holds up fine on its own terms. Opus 4.8 leads on the two coding benchmarks that matter most right now and delivers a reliability improvement that, in practice, may be more valuable than any point gain on a leaderboard.
TL;DR
- 8.5/10 - the best model for long-context coding and code review reliability; GPT-5.5 wins Terminal-Bench and agentic turn efficiency
- Sets the field's top score on SWE-bench Pro at 69.2%, beating GPT-5.5 by 10.6 points and Gemini 3.1 Pro by 15
- Dynamic Workflows lets a single Claude Code session orchestrate up to 1,000 parallel subagents - still a research preview, but already used for a 750,000-line Rust migration
- Skip it if Gemini 3.1 Pro's price advantage ($2/$12 vs $5/$25 per million tokens) matters more to you than raw coding accuracy
See the full Claude Opus 4.8 model card for specs, pricing tables, and a benchmark breakdown from James Kowalski. This review focuses on what it's like to use.
What's Actually New
Opus 4.8 carries four changes worth tracking: a meaningful SWE-bench Pro improvement, the Dynamic Workflows research preview in Claude Code, an Effort Control API that replaces discrete effort flags with a dial, and a ~4x reduction in how often the model passes flawed code without flagging it. A fifth change - accepting role: "system" entries mid-conversation in the messages array - is a quiet API improvement that matters for prompt caching in long agent loops.
Anthropic also dropped the minimum context window size for prompt caching from 4,096 tokens to 1,024. Short prompts that couldn't cache on 4.7 now cache automatically with no code changes. That's a real cost reduction for high-volume use cases with short system prompts.
Knowledge cutoff is January 2026, unchanged from 4.7.
The Benchmark Story
The headline number is 69.2% on SWE-bench Pro. SWE-bench Pro filters out tasks where models show signs of memorization, so a high score there's harder to earn than the same percentage on the standard verified split. GPT-5.5 sits at 58.6%; Gemini 3.1 Pro at 54.2%. That's not a narrow lead.
| Benchmark | Opus 4.8 | Opus 4.7 | GPT-5.5 | Gemini 3.1 Pro |
|---|---|---|---|---|
| SWE-bench Pro | 69.2% | 64.3% | 58.6% | 54.2% |
| SWE-bench Verified | 88.6% | 87.6% | n/a | 80.6% |
| Terminal-Bench 2.1 | 74.6% | 66.1% | 78.2% | 70.3% |
| OSWorld-Verified | 83.4% | 82.8% | 78.7% | 76.2% |
| Humanity's Last Exam (tools) | 57.9% | 54.7% | 52.2% | 51.4% |
| GraphWalks BFS (1M context) | 68.1% | - | 45.4% | - |
| GDPval-AA (professional work) | 1,890 | 1,753 | 1,769 | 1,314 |
| Misalignment behavior score | 1.9 | 2.5 | - | - |
A few caveats. The OSWorld-Verified improvement from 82.8% to 83.4% is smaller than it looks - Vellum AI notes the harness itself was updated (a zoom-tool bug fix and raised max tokens per turn), so not all of the gain is a clean model improvement. GPQA Diamond is effectively saturated across all top models at 93-94% and no longer tells you anything useful. And GPT-5.5 still wins Terminal-Bench at 78.2% vs 74.6% - if your stack is terminal-heavy DevOps work, that gap is real.
The GraphWalks BFS result at 1M context (68.1% vs GPT-5.5's 45.4%) is the most underreported number in the release. Long-context reasoning at the million-token scale separates meaningfully here, and that's before most teams have workflows built to use it.
Code review quality - specifically catching flaws silently rather than flagging them - is where Opus 4.8's reliability improvements show up most clearly.
Source: pexels.com
Check the SWE-bench coding agent leaderboard and the coding benchmarks leaderboard for third-party replication numbers as they come in.
Dynamic Workflows: A 1,000-Agent Orchestrator
Dynamic Workflows is the most ambitious feature in the release and the least finished. In Claude Code, a single orchestrator session can now spawn up to 1,000 parallel subagents, each with its own full context window. The subagents run independently and the orchestrator aggregates their outputs, with a verification pass before surfacing results to the user.
The practical use case is codebase-scale work: refactoring a large module across many files, running parallel test suites against different configurations, or analyzing a large document corpus. Jarred Sumner used an early version to migrate approximately 750,000 lines of Rust code over 11 days, with human oversight at checkpoints.
That's a real result. But it comes with real caveats. Dynamic Workflows requires Enterprise, Team, or Max plan access. The token costs at scale aren't well characterized - Anthropic hasn't published cost profiles for typical orchestrator runs, and the math at 1,000 subagents adds up fast. It also isn't autonomous: the 750,000-line migration still required human decisions at key branch points.
Dynamic Workflows isn't "set it and forget it." It's a force multiplier for developers who know exactly what they want to parallelize.
The research preview label is honest. This isn't production-ready for teams without solid test suites and engineering capacity to supervise orchestrator behavior. For teams that do have that infrastructure, the ceiling here is genuinely new.
Dynamic Workflows moves multi-agent orchestration inside the model, reducing external scaffolding requirements.
Source: pexels.com
Effort Control: A Better Dial
Opus 4.8 replaces the previous discrete effort flags with five explicit levels: low, medium, high, xhigh, and max. Set via output_config: {"effort": "<level>"} in the Messages API. No beta header required, available on all plans.
The default is high, which is appropriate for most coding and reasoning tasks. xhigh is the recommended setting for long-horizon agentic work (tasks running 30+ minutes, millions of tokens). At low and medium, the model may skip reasoning on simpler problems. At high and above, it almost always thinks before responding.
Two things to know if you're migrating from Opus 4.7. First, temperature, top_p, and top_k aren't supported on Opus 4.8 - setting non-default values returns a 400 error, same as on 4.7. Second, adaptive thinking is the only mode available; the old budget_tokens: N parameter also returns a 400. The Effort Control levels replace both of those knobs with a single, simpler interface.
For multi-step agent pipelines where reasoning depth varies by subtask, this simplification is actually useful. Setting a single parameter per call is less brittle than coordinating effort flags with budget estimates.
Honesty Over Hype
The reliability improvement is the story most coverage has underplayed. Anthropic reports that Opus 4.8 is approximately four times less likely than Opus 4.7 to let code flaws pass without flagging them. The model hits its improved reliability scores mostly by abstaining when uncertain rather than by answering more questions correctly - a different strategy from what drives benchmark scores higher.
Bridgewater Associates described the change in concrete terms: the model "proactively flags issues with inputs and outputs of an analysis, something other models routinely missed and left to users to catch." In a 40-step agentic process, a flawed premise that goes undetected compounds through every later step. The undetected error rate in that scenario drops from roughly 87% to around 40% under the new model. That's a meaningful change for teams running long-horizon code review workflows.
This is also reflected in the misalignment behavior score dropping from 2.5 to 1.9. Whether that number has a precise real-world interpretation is debatable, but the direction is consistent with the code flaw flagging improvement.
One legitimate concern here: prompt injection resistance regressed. The attack success rate on Opus 4.8 is around 7%, up from 2.3% on the prior version. For applications in adversarial environments - chatbots that ingest user-controlled text, agents that process external documents - that's worth factoring into your threat model.
Pricing
Standard pricing matches Opus 4.7 exactly: $5 per million input tokens, $25 per million output tokens. The new fast mode is $10/$50 - double per token, but Anthropic claims 2.5x throughput and 3x lower cost than fast mode on prior Opus versions.
Fast mode access is waitlisted. Claude Code users can enable it right away via the /fast command; API access requires joining a queue at claude.com/fast-mode.
The output cost comparison with Gemini 3.1 Pro is stark. At $12 per million output tokens vs $25, Gemini runs roughly 2.1x cheaper on output alone. Opus 4.8 leads on coding accuracy by a meaningful margin, but for high-volume workloads where the coding advantage doesn't justify the cost multiple, Gemini 3.1 Pro is the sensible choice.
The comparison with GPT-5.5 ($25 vs $30 per million output tokens) favors Opus, but DataCamp's analysis found that Opus takes roughly 30% more turns than GPT-5.5 to complete agentic tasks. Depending on your typical completion length, GPT-5.5's per-task cost may be competitive despite the higher per-token rate.
| Tier | Input | Output |
|---|---|---|
| Standard | $5.00/M | $25.00/M |
| Fast mode | $10.00/M | $50.00/M |
| Extended context (>200K, standard) | $10.00/M | $37.50/M |
Available on Claude API, Claude.ai, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Azure AI Foundry (200K context limit on Foundry; 1M tokens everywhere else).
Strengths
- SWE-bench Pro at 69.2% leads the field by a major margin - 10.6 points over GPT-5.5, 15 over Gemini 3.1 Pro
- ~4x better code flaw detection rate in production code review workflows
- Long-context reasoning at 1M tokens is the strongest in class (GraphWalks BFS: 68.1% vs GPT-5.5's 45.4%)
- Dynamic Workflows enables genuine codebase-scale parallel agent work, even in preview
- Effort Control simplifies reasoning budget management in agentic pipelines
- Prompt cache minimum dropped to 1,024 tokens - immediate cost savings, no code changes
- Mid-conversation system messages now supported, improving long agent session flexibility
Weaknesses
- GPT-5.5 leads Terminal-Bench 2.1 at 78.2% vs 74.6% - a real gap for terminal-heavy DevOps workflows
- Prompt injection resistance regressed from 2.3% to 7% attack success rate
- Dynamic Workflows is a research preview with no GA date and unclear cost profiles at scale
- ~30% more turns than GPT-5.5 to complete agentic tasks, partially offsetting per-token cost advantage
- Fast mode access is waitlisted for API users
- Gemini 3.1 Pro is 2.1x cheaper on output for teams where cost is the primary constraint
- 41-day release cadence means prompts tuned for 4.7 may need adjustment again
Verdict
8.5/10. Claude Opus 4.8 is the strongest model for complex coding tasks right now, and the 4x code flaw detection improvement is the kind of reliability gain that compounds across long workflows in ways a benchmark score can't capture. The SWE-bench Pro lead over GPT-5.5 and Gemini 3.1 Pro is real and not narrow.
Dynamic Workflows is the feature to watch, but it's a preview, not a shipping product. Treat it as a signal of where Anthropic is going rather than something to build a pipeline around today.
If your work is primarily terminal-heavy DevOps, GPT-5.5 is the better fit at comparable pricing. If cost efficiency matters more than coding accuracy, Gemini 3.1 Pro is substantially cheaper. For long-context coding, code review, and multi-file agentic workflows where reliability under uncertainty matters, Opus 4.8 is the current standard.
The Claude Opus 4.7 review covers what changed in the previous generation. The computer use leaderboard will reflect third-party OSWorld replication numbers once they catch up with the updated harness methodology.
Sources
- Anthropic - Claude Opus 4.8 announcement
- Anthropic - What's New in Claude 4.8 (API docs)
- Anthropic - Effort Control documentation
- TechCrunch - Anthropic releases Opus 4.8 with new Dynamic Workflow tool
- Vellum AI - Claude Opus 4.8 benchmarks explained
- BenchLM - Opus 4.8 vs GPT-5.5
- BenchLM - Opus 4.8 vs Gemini 3.1 Pro
- SiliconAngle - Anthropic launches Opus 4.8 and raises $65B
- Simon Willison - Claude Opus 4.8 notes
- SWE-bench leaderboard
