Muse Spark

Meta's first closed-source frontier model scores 52 on the Artificial Analysis Intelligence Index, leads on HealthBench Hard, and ships free at meta.ai - but has no public API yet.

Overview

Meta shipped Muse Spark on April 8, 2026 - the first model from its Meta Superintelligence Labs (MSL) division and the first in what the company is calling the Muse series. The release marks a sharp break from Meta's Llama lineage: where Llama was open-weight and community-focused, Muse Spark is closed-source and rolled out directly into Meta's consumer apps.

TL;DR

  • 4th on the Artificial Analysis Intelligence Index (score: 52), behind Gemini 3.1 Pro, GPT-5.4, and Claude Opus 4.6
  • Leads all tested models on HealthBench Hard (42.8) and Humanity's Last Exam in Contemplating mode (50.2%)
  • Built in nine months by Meta Superintelligence Labs under Chief AI Officer Alexandr Wang
  • Free at meta.ai with 262K context; no public API yet

The model was built over nine months by a team led by Alexandr Wang, who joined Meta in June 2025 as its first Chief AI Officer after Meta picked up a 49% stake in his company Scale AI for roughly $14 billion. Muse Spark is described as "the first step on our scaling ladder" - small and fast by design, intended to verify the new MSL architecture before the team scales further.

The competitive position is honest, if not spectacular. Muse Spark scores 52 on the Artificial Analysis Intelligence Index, placing 4th overall behind Gemini 3.1 Pro (57), GPT-5.4 (57), and Claude Opus 4.6 (53). It wins on health and scientific reasoning, loses on abstract reasoning and coding. That's a more specific story than Meta's marketing implies, and it's worth examining the numbers in detail.

Key Specifications

| Specification | Details |
|---|---|
| Provider | Meta |
| Model Family | Muse |
| Parameters | Not disclosed |
| Context Window | 262,000 tokens |
| Input Price | Free (no public API) |
| Output Price | Free (no public API) |
| Release Date | April 8, 2026 |
| License | Proprietary |
| Open Source | No |
| Input Modalities | Text, image, voice |
| Output Modalities | Text only |

Benchmark Performance

The benchmarks split into two stories depending on which mode you're evaluating. Thinking mode is Muse Spark's standard reasoning layer. Contemplating mode - which coordinates multiple specialized agents in parallel before returning a synthesized answer - is the more expensive inference path and the one that competes with Gemini 3.1 Pro Deep Think and GPT-5.4 Pro.
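Meta has not published how Contemplating mode is implemented; the description above - parallel specialized agents whose drafts are merged into one answer - is a standard fan-out/fan-in pattern. A minimal sketch of that pattern, with hypothetical agent roles and a placeholder in place of real model calls, might look like:

```python
from concurrent.futures import ThreadPoolExecutor

def run_agent(role: str, query: str) -> str:
    # Placeholder for a real model call; in a production system each
    # role would prompt a specialized reasoning agent. All role names
    # here are hypothetical, not Meta's.
    return f"[{role}] draft answer for: {query}"

def contemplate(query: str, roles=("math", "search", "critique")) -> str:
    # Fan out: run one agent per role in parallel.
    with ThreadPoolExecutor(max_workers=len(roles)) as pool:
        drafts = list(pool.map(lambda r: run_agent(r, query), roles))
    # Fan in: a real system would make one more model call here to
    # synthesize the drafts; this sketch just concatenates them.
    return "\n".join(drafts)

print(contemplate("What is 2 + 2?"))
```

The extra cost of Contemplating mode falls out of this shape directly: N parallel agent calls plus a synthesis pass, versus a single forward pass in Thinking mode.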

Figure: Meta's official benchmark comparison chart for Muse Spark in Thinking mode across health, vision, and reasoning tasks. Source: officechai.com

Thinking Mode

| Benchmark | Muse Spark | GPT-5.4 | Gemini 3.1 Pro | Claude Opus 4.6 |
|---|---|---|---|---|
| AA Intelligence Index | 52 | 57 | 57 | 53 |
| GPQA Diamond | 89.5% | 92.8% | 94.3% | 92.7% |
| HealthBench Hard | 42.8 | 40.1 | 20.6 | 14.8 |
| CharXiv Reasoning | 86.4 | 82.8 | 80.2 | - |
| MMMU-Pro (Vision) | 80.5% | - | 82.4% | - |
| ARC-AGI-2 | 42.5 | 76.1 | 76.5 | - |
| Terminal-Bench 2.0 | 59.0 | 75.1 | 68.5 | - |
| DeepSearchQA | 74.8 | - | 69.7 | - |

Contemplating Mode

| Benchmark | Muse Spark | GPT-5.4 Pro | Gemini Deep Think |
|---|---|---|---|
| Humanity's Last Exam | 50.2% | 43.9% | 48.4% |
| FrontierScience Research | 38.3% | 36.7% | 23.3% |
| IPhO 2025 Theory | 82.6% | 93.5% | 87.7% |

The health numbers are genuine. HealthBench Hard at 42.8 is more than double Gemini 3.1 Pro's 20.6 - not a marginal win but a structural difference, attributable to Meta's collaboration with over 1,000 physicians during training. The Humanity's Last Exam result in Contemplating mode is also independently interesting: 50.2% beats every tested competitor, including models purpose-built for extended reasoning.

The weaknesses are equally clear and appear in categories that matter to developers. ARC-AGI-2 at 42.5 is roughly half the score of GPT-5.4 and Gemini 3.1 Pro. Terminal-Bench 2.0 at 59.0 trails both competitors by wide margins. These gaps are acknowledged by Meta - the company says in its release materials that there is "a gap between Muse Spark and the models already available" on coding tasks. That's a significant admission when many enterprise use cases are code-heavy. For full coding comparisons see the coding benchmarks leaderboard.

One flag worth noting: most benchmark scores above come from Meta's own published evaluations. The Artificial Analysis Intelligence Index score (52) is independently measured. For categories where only Meta's numbers exist - especially health - treat them as directional rather than final.

Key Capabilities

Muse Spark is natively multimodal, accepting text, image, and voice inputs, though output is text-only at launch. The image and voice channels work in practice: Simon Willison documented the model accurately counting discrete objects (12 whiskers on a cat, 25 pelicans) with coordinate precision, which suggests the visual grounding implementation is solid rather than decorative.

The tool suite built into meta.ai is worth examining. Willison identified 16 tools exposed through the interface including container.python_execution (Python 3.9.25 with pandas, numpy, matplotlib, scikit-learn), browser.search, container.visual_grounding, media.image_gen, and subagents.spawn_agent. The last one is architecturally interesting: users can trigger Contemplating mode, which spins up multiple reasoning agents in parallel before producing a single answer. That's similar in spirit to how Grok 4.20 approaches complex queries via its parallel agent system, though Meta's implementation is distinct.
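The tool names above come from Willison's inspection of the interface; meta.ai's actual wire format and handler signatures are not public. As a purely hypothetical sketch, a dispatcher keyed by those observed names could look like:

```python
# Hypothetical tool registry keyed by the tool names Willison observed.
# The real meta.ai payload shapes and handlers are not public; these
# lambdas are stand-ins for illustration only.
TOOLS = {
    "container.python_execution": lambda args: f"ran: {args['code']}",
    "browser.search": lambda args: f"searched: {args['query']}",
    "subagents.spawn_agent": lambda args: f"spawned: {args['task']}",
}

def dispatch(call: dict) -> str:
    """Route a {"name": ..., "arguments": {...}} tool call to its handler."""
    handler = TOOLS.get(call["name"])
    if handler is None:
        raise KeyError(f"unknown tool: {call['name']}")
    return handler(call["arguments"])

print(dispatch({"name": "browser.search",
                "arguments": {"query": "HealthBench Hard"}}))
```

The dotted namespaces (`container.*`, `browser.*`, `subagents.*`) suggest tools are grouped by execution environment, which is consistent with `subagents.spawn_agent` being the entry point for Contemplating mode's parallel agents.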

Meta trained Muse Spark with input from over 1,000 physicians. The HealthBench Hard score of 42.8 - more than double the Gemini 3.1 Pro result - suggests the investment translated into model behavior.

The health capability is the most differentiated feature in the launch. Meta collaborated with physicians across the training process, and the model can parse medical images, produce interactive health displays explaining nutrition and physiology, and answer clinical questions at a level no other frontier model currently matches on HealthBench Hard. Whether that translates to real-world clinical accuracy is a separate question the benchmarks don't answer.

Meta also cites significant compute efficiency: Muse Spark reportedly reaches capabilities comparable to Llama 4 Maverick using more than 10x less compute. If that holds at scale, it has consequences for how Meta prices the API when it eventually opens.

Pricing and Availability

Muse Spark is free to use at meta.ai and via the Meta AI app. No API pricing has been announced. A private API preview is available to select partners, but no public release date has been set.

The model is rolling out to WhatsApp, Instagram, Facebook, Messenger, and Meta's Ray-Ban AI glasses over the coming weeks, replacing the Llama-based models that powered Meta AI previously.

Meta plans to open-source future versions of the Muse series, maintaining a two-track strategy: Muse for closed, competitive frontier products and Llama for open community releases. The timing on any open-source release is unspecified.

For developers, the lack of a public API is the limiting factor right now. There is no pricing to assess, no rate limits to benchmark, and no programmatic access for integration work. That may change quickly given Meta's stated commercial ambitions, but until it does, Muse Spark is a consumer product, not a developer platform.

Strengths and Weaknesses

Strengths

  • Leads all benchmarked models on HealthBench Hard (42.8) and HealthBench general tasks
  • Highest Humanity's Last Exam score in Contemplating mode (50.2%) among tested models
  • Strong vision capability - MMMU-Pro at 80.5%, second only to Gemini 3.1 Pro
  • Competitive CharXiv Reasoning score (86.4), best among tested models
  • Free access with no usage limits via meta.ai
  • Compute-efficient: comparable Llama 4 Maverick capability at over 10x lower compute cost

Weaknesses

  • No public API and no announced pricing timeline
  • Weak abstract reasoning: ARC-AGI-2 at 42.5 vs. 76.5 for Gemini 3.1 Pro
  • Below-par coding performance: Terminal-Bench 2.0 at 59.0 vs. 75.1 for GPT-5.4
  • Proprietary - no open weights, no fine-tuning, no self-hosting
  • Meta-reported benchmarks outside health domains need independent verification
  • Text-only output; no image or audio generation through the main interface

✓ Last verified April 9, 2026

About the author

James is an AI Benchmarks & Tools Analyst and software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.