Muse Spark Review: Strong on Health, Weak on Code

Meta's first proprietary frontier model leads on HealthBench Hard and scientific reasoning but trails rivals in coding and agentic tasks - with no public API yet.

Nine months ago, Meta quietly scrapped its entire AI stack. No public announcement, no press release - just a team led by newly appointed Chief AI Officer Alexandr Wang starting from scratch. The result shipped on April 8, 2026 under the name Muse Spark, and it's unlike anything Meta has released before: closed source, proprietary, already live in consumer apps, and built for a very specific kind of intelligence.

TL;DR

  • 7.8/10 - a specialist model that dominates health and science benchmarks, but has real gaps in coding and agentic tasks
  • Scores 42.8 on HealthBench Hard - the highest of any frontier model tested, by a wide margin
  • Trails GPT-5.4 and Gemini 3.1 Pro on coding (Terminal-Bench 59.0 vs 75.1) and abstract reasoning (ARC-AGI-2 42.5 vs 76.1)
  • Free at meta.ai - but no public API and no timeline for developer access

The Reset That Produced Muse Spark

Muse Spark doesn't belong to the Llama lineage. That's the first thing to understand. When Llama 4 underdelivered earlier this year, Meta didn't iterate - it reorganized. Zuckerberg brought in Scale AI co-founder Alexandr Wang under a deal that cost Meta $14.3 billion for a 49% non-voting stake in Scale AI, created a new division called Meta Superintelligence Labs, and handed Wang the title of Chief AI Officer.

Wang's team spent nine months rebuilding the pretraining stack, data pipeline, and post-training methodology from scratch. The decision to ship it as a closed model was itself a reversal of years of open-source policy - a sign that Meta believes it finally has something worth protecting.

Whether that belief is justified depends completely on what you need the model to do.

Muse Spark counting pelicans in a photo with numbered overlays - one of 16+ tools exposed through the meta.ai interface. Source: simonwillison.net

What Muse Spark Actually Is

The model is natively multimodal - trained end-to-end on text and images, not bolted together from separate encoders. Meta describes it as the first in a "Muse family," with Muse Spark functioning as the flagship entry.

On the interface side, meta.ai now exposes two modes: Instant (fast, no extended reasoning) and Thinking (longer chain-of-thought). A third mode, Contemplating, is rolling out gradually. Contemplating is the one worth watching: instead of scaling reasoning by thinking longer - the approach used by Gemini Deep Think or GPT Pro mode - it spins up multiple reasoning agents in parallel and merges their outputs. Thinking wider, not deeper.
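Meta hasn't published how Contemplating works internally, but "thinking wider" can be sketched in miniature: launch several independent reasoning runs concurrently, then merge their answers. Everything below - the function names, the deterministic toy agent, and the majority-vote merge - is an illustrative assumption, not Meta's documented design.

```python
# Hypothetical sketch of "thinking wider": run several reasoning agents
# in parallel, then merge their answers by majority vote. The names and
# the voting merge are assumptions, not Meta's published architecture.
from concurrent.futures import ThreadPoolExecutor
from collections import Counter

def reasoning_agent(question: str, seed: int) -> str:
    # Stand-in for one independent chain-of-thought run; a real agent
    # would sample the model with a distinct seed or temperature.
    return f"answer-{seed % 2}"  # toy: agents disagree deterministically

def contemplate(question: str, n_agents: int = 5) -> str:
    # Launch n_agents independent reasoning runs concurrently.
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        answers = list(pool.map(lambda s: reasoning_agent(question, s),
                                range(n_agents)))
    # Merge by vote rather than extending any single chain of thought.
    return Counter(answers).most_common(1)[0][0]

print(contemplate("Which treatment fits these symptoms?"))  # -> answer-0
```

The contrast with Deep Think-style approaches is the axis of scaling: more concurrent chains rather than a longer single chain.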

The tool ecosystem baked into the chat harness is more capable than the initial launch coverage suggested. Based on testing by Simon Willison, the interface exposes over 16 tools, including:

  • web search, full page loading, and pattern matching against page content
  • semantic search across Instagram, Threads, and Facebook (post-2025 only)
  • Python 3.9 code execution with pandas, numpy, matplotlib, and OpenCV
  • image generation in both artistic and realistic modes
  • visual grounding that returns object coordinates, bounding boxes, and counts

The Python sandbox is a real differentiator. You can create an image and then immediately run OpenCV analysis on it - color histograms, edge detection - within the same conversation thread. That's a genuinely useful workflow that most competitors don't support end-to-end.
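For readers who haven't used that workflow, here is a dependency-free sketch of the analysis half: a tiny "generated" grayscale image, an intensity histogram, and a crude edge pass. The real sandbox would use numpy and OpenCV (e.g. histogram and Canny-style filters); plain Python stands in here so the example is self-contained.

```python
# Dependency-free sketch of the sandbox workflow described above:
# take a freshly generated grayscale image, then run a histogram and a
# simple edge-detection pass on it in the same session.
from collections import Counter

# Step 1: a "generated" 4x4 grayscale image (0 = black, 255 = white),
# split into a dark left half and a bright right half.
image = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [0, 0, 255, 255],
]

# Step 2: intensity histogram - counts of each pixel value.
histogram = Counter(pixel for row in image for pixel in row)

# Step 3: crude horizontal edge detection - flag any pixel whose right
# neighbour differs sharply (a stand-in for a gradient filter).
edges = [
    [1 if abs(row[x] - row[x + 1]) > 128 else 0
     for x in range(len(row) - 1)]
    for row in image
]

print(histogram[0], histogram[255])  # 8 8
print(edges[0])                      # [0, 1, 0] - edge at the brightness boundary
```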

One caveat: the sandbox runs Python 3.9, which reached end-of-life in October 2025. That's a strange choice for a model launched in April 2026.

Benchmark Performance

The numbers tell a story of real specialization rather than across-the-board superiority.

| Benchmark | Muse Spark | GPT-5.4 | Gemini 3.1 Pro | Claude Opus 4.6 |
|---|---|---|---|---|
| AA Intelligence Index | 52 | 57 | 57 | 53 |
| HealthBench Hard | 42.8 | 40.1 | 20.6 | - |
| HLE (Contemplating mode) | 50.2% | 43.9% | 48.4% | - |
| MMMU-Pro (vision) | 80.5% | - | 82.4% | - |
| Terminal-Bench 2.0 | 59.0 | 75.1 | 68.5 | - |
| ARC-AGI-2 | 42.5 | 76.1 | 76.5 | - |
| GDPval-AA (agentic) | 1,444 ELO | 1,672 | 1,320 | 1,648 |

On the Artificial Analysis Intelligence Index v4.0, Muse Spark scores 52 - fourth overall, behind GPT-5.4 and Gemini 3.1 Pro (both at 57) and Claude Opus 4.6 (53). That's a respectable position for a first release from a new team.

The HealthBench Hard result is the number that'll define this model's reputation. A score of 42.8 against GPT-5.4's 40.1 might look narrow on paper, but Gemini 3.1 Pro's 20.6 shows how wide the gap actually is at the top. Meta worked with over 1,000 physicians to curate specialized health training data, and the benchmark reflects that investment. For open-ended health queries - differential diagnosis support, medication information, symptom triage - Muse Spark is currently the most capable frontier model available.

Scientific reasoning follows a similar pattern. On Humanity's Last Exam in Contemplating mode, Muse Spark hits 50.2%, above both GPT-5.4 Pro at 43.9% and Gemini Deep Think at 48.4%.

Where it falls short matters, too. The ARC-AGI-2 gap is sobering - 42.5 versus GPT-5.4's 76.1. Abstract reasoning at that level of deficit isn't a minor weakness. It means Muse Spark will struggle with novel problem structures that don't pattern-match against its training distribution. Coding gaps are similarly marked: Terminal-Bench 59.0 versus 75.1 for GPT-5.4 is a 16-point difference on a benchmark where the ceiling matters for real engineering tasks.

On HealthBench Hard, Muse Spark scores 42.8. Gemini 3.1 Pro scores 20.6. That isn't a benchmark gap - it's a different category.

Token Efficiency

One number that deserves more attention: Muse Spark used 58 million output tokens to complete the full Intelligence Index evaluation. Gemini 3.1 Pro used 57 million. Claude Opus 4.6 used 157 million. GPT-5.4 used 120 million.

Getting fourth-place intelligence scores at second-place efficiency is a legitimate architectural achievement, particularly for the Contemplating mode - running multiple parallel agents and then synthesizing them without producing bloated token counts suggests thoughtful design around the orchestration layer.
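One way to make the efficiency claim concrete is to divide each model's token spend by its Intelligence Index score. The "tokens per index point" metric is my own construction for illustration, not an official Artificial Analysis figure; the inputs are the numbers quoted above.

```python
# Output tokens spent on the full Intelligence Index run, paired with
# the resulting index score. "Tokens per point" below is an illustrative
# ratio, not an official benchmark statistic.
models = {
    "Muse Spark":      (58_000_000, 52),
    "Gemini 3.1 Pro":  (57_000_000, 57),
    "GPT-5.4":         (120_000_000, 57),
    "Claude Opus 4.6": (157_000_000, 53),
}

ratios = {name: tokens / score for name, (tokens, score) in models.items()}

for name in sorted(ratios, key=ratios.get):
    print(f"{name:16s} {ratios[name] / 1e6:5.2f}M tokens per index point")
```

Ranked this way, Gemini 3.1 Pro comes out most efficient at roughly 1.0M tokens per point, with Muse Spark a close second at about 1.12M and Claude Opus 4.6 last at nearly 3M.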

Visual Capabilities

Muse Spark ranks second among frontier models on MMMU-Pro at 80.5%, just behind Gemini 3.1 Pro Preview at 82.4%. More practically, the visual grounding capabilities are precise: in public testing, the model counted 25 pelicans in a photo with numbered overlays, identified 12 whiskers and 8 claws on a generated animal image, and returned nested bounding boxes showing hierarchical object relationships.
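To show what "nested bounding boxes with hierarchical object relationships" means in practice, here is a small sketch of consuming such output. The JSON-like shape, field names, and coordinates are hypothetical - Meta has not published a response schema - but the counting and containment logic is what a client would run over any hierarchy of this kind.

```python
# Hypothetical shape of a visual-grounding response with nested bounding
# boxes. The layout and field names are illustrative, not Meta's
# documented format. Boxes are (x, y, width, height) tuples.
detection = {
    "label": "pelican", "box": (40, 60, 120, 180),
    "children": [
        {"label": "beak", "box": (50, 70, 30, 15), "children": []},
        {"label": "wing", "box": (80, 120, 60, 40), "children": []},
    ],
}

def count_objects(node: dict) -> int:
    # Walk the hierarchy and count every detected object, nested or not.
    return 1 + sum(count_objects(child) for child in node["children"])

def contains(outer: tuple, inner: tuple) -> bool:
    # True if the inner box lies entirely inside the outer box.
    ox, oy, ow, oh = outer
    ix, iy, iw, ih = inner
    return ox <= ix and oy <= iy and ix + iw <= ox + ow and iy + ih <= oy + oh

print(count_objects(detection))  # 3
print(all(contains(detection["box"], c["box"])
          for c in detection["children"]))  # True
```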

Muse Spark's visual grounding mode returning object coordinates and bounding boxes - it supports point detection, box detection, and hierarchical nesting. Source: simonwillison.net

The lack of pixel-level segmentation masks is a real limitation for precision tasks like medical image analysis, which is ironic given the model's health focus. OpenCV-based alternatives are possible within the sandbox, but that's friction.

The Availability Problem

Muse Spark is free at meta.ai and in the Meta AI app. For a consumer AI assistant, that's fine. For developers, researchers, or anyone building products, the absence of a public API is a serious constraint.

There's a private API preview available to select enterprise partners, but no self-service access, no pricing, and no announced timeline. Given that GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro are all available via API today, Meta's closed preview model is effectively inaccessible to the developer community that most cares about benchmark performance.

Meta says API access is coming. Until it arrives, Muse Spark's real-world impact is limited to whatever Meta itself ships into WhatsApp, Instagram, Facebook, and its Ray-Ban glasses hardware.

The official announcement banner from Meta Superintelligence Labs, which shipped Muse Spark nine months after being formed under Alexandr Wang. Source: about.fb.com

Strengths

  • HealthBench Hard dominance - the largest margin over any competitor on any benchmark in this review cycle
  • Token efficiency - 58M tokens for the full Intelligence Index, matching Gemini at a fraction of Opus's cost
  • Parallel reasoning - Contemplating mode's multi-agent architecture is a genuine architectural innovation, not just a rebrand of extended thinking
  • Tool ecosystem - 16+ integrated tools including code execution, visual grounding, and cross-platform Meta content search
  • Vision performance - second place on MMMU-Pro at 80.5%, close to the top

Weaknesses

  • No public API - consumer-only access blocks the developer and research use cases that would benefit most from the health specialization
  • Coding gap - 16 points behind GPT-5.4 on Terminal-Bench 2.0 isn't a narrow miss
  • ARC-AGI-2 regression - 42.5 vs 76.1 for GPT-5.4 suggests weak abstract generalization that will show up in novel tasks
  • Agentic tasks - 1,444 ELO on GDPval-AA trails Claude Opus 4.6 by over 200 points
  • Python 3.9 sandbox - end-of-life runtime in a 2026 product is an odd choice

Verdict

Muse Spark is a real step forward, and not in the way Meta's previous releases have been. It doesn't try to be a universal upgrade over GPT-5.4 or Claude Opus 4.6. It's a specialist: best-in-class for health and science, competitive on vision, truly novel in its parallel reasoning architecture, and free at the consumer level.

The gaps in coding and abstract reasoning aren't minor quirks. They're structural limits that make this the wrong choice for software engineering, complex agentic workflows, or novel reasoning tasks. Meta's benchmarks for health and science are impressive, but the model scores 4th on the overall Intelligence Index for a reason.

The bigger problem is access. A health AI model without API access can't power clinician decision-support tools, health apps, or research pipelines. Until Meta opens the API, Muse Spark's most impressive capabilities live behind a chat interface that's designed for consumers, not builders.

Score: 7.8/10 - Worth using for health and scientific queries via meta.ai today. Worth monitoring closely for API availability. Not the right choice for coding, agents, or general-purpose workflows where GPT-5.4 or Claude Opus 4.6 are better fits.


About the author

Elena, Senior AI Editor & Investigative Journalist, is a technology journalist with over eight years of experience covering artificial intelligence, machine learning, and the startup ecosystem.