
Llama 4 Maverick

Meta's Llama 4 Maverick packs 400B total parameters into a 128-expert MoE architecture with only 17B active per token. At launch it beat GPT-4o on Chatbot Arena and matched DeepSeek V3 on reasoning with less than half the active parameters.


Llama 4 Maverick is the model where Meta stopped being cautious with scale. While Scout uses 16 experts, Maverick goes to 128 - eight times the expert count - while keeping the same 17B active parameters per token. The result is a 400B-parameter model that activates less than 5% of its weights on any given forward pass, delivering an ELO score of 1417 on the LMSYS Chatbot Arena that beat GPT-4o and Gemini 2.0 Flash when it launched.

TL;DR

  • 400B total / 17B active MoE with 128 experts - same active parameters as Scout, 4x total capacity
  • Natively multimodal (text + image) via MetaCLIP early fusion - optimized for up to 5 images
  • MMLU-Pro 80.5%, GPQA Diamond 69.8%, LiveCodeBench 43.4%
  • 1M token context window - one-tenth of Scout's 10M
  • Llama 4 Community License - free up to 700M MAU

Overview

The 128-expert architecture is the defining design choice. Maverick alternates between dense layers and MoE layers, applying expert routing in roughly half of the total layers. Each MoE layer holds 128 routed experts plus a shared expert: every token passes through the shared expert and is routed to one routed expert. The total parameter count balloons to 400B, but inference compute stays fixed at 17B active - identical to Scout. The theory is that more experts provide finer-grained specialization: rather than 16 general-purpose experts, 128 experts can each develop narrower expertise in specific domains, languages, or task types.
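The routing pattern described above can be sketched in a few lines. This is a toy numpy illustration, not Meta's implementation: the dimensions are shrunk for readability, the weights are random, and the top-1-plus-shared-expert routing follows Meta's public description of the architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, D_FF, N_EXPERTS = 64, 256, 128  # toy sizes; real dims are far larger

# Per-expert FFN weights, plus one shared expert that every token passes through.
experts_w1 = rng.standard_normal((N_EXPERTS, D_MODEL, D_FF)) * 0.02
experts_w2 = rng.standard_normal((N_EXPERTS, D_FF, D_MODEL)) * 0.02
shared_w1 = rng.standard_normal((D_MODEL, D_FF)) * 0.02
shared_w2 = rng.standard_normal((D_FF, D_MODEL)) * 0.02
router_w = rng.standard_normal((D_MODEL, N_EXPERTS)) * 0.02

def moe_layer(x):
    """x: (tokens, d_model). Shared expert + top-1 routed expert per token."""
    logits = x @ router_w                           # (tokens, n_experts)
    top1 = logits.argmax(axis=-1)                   # one routed expert per token
    gate = np.exp(logits - logits.max(-1, keepdims=True))
    gate = gate / gate.sum(-1, keepdims=True)       # softmax gate weights
    out = np.maximum(x @ shared_w1, 0) @ shared_w2  # shared-expert path
    for t, e in enumerate(top1):                    # routed path: 1 of 128 experts
        h = np.maximum(x[t] @ experts_w1[e], 0) @ experts_w2[e]
        out[t] += gate[t, e] * h
    return out

tokens = rng.standard_normal((8, D_MODEL))
y = moe_layer(tokens)
print(y.shape)  # (8, 64)
```

Note how the per-token compute touches only two experts' weights regardless of N_EXPERTS - that is why 400B total parameters still cost 17B active per forward pass.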

Training consumed 22 trillion multimodal tokens across text, image, and video data, using 2.38 million H100-80GB GPU hours - less than half of Scout's 5 million hours. That asymmetry is notable: the bigger model consumed fewer GPU hours, largely because it trained on fewer tokens than Scout while still activating only 17B parameters per step. The multimodal integration uses the same MetaCLIP early-fusion approach as Scout, jointly attending to text and image tokens from the first Transformer layer.

Meta positioned Maverick as matching DeepSeek V3 on reasoning and coding at less than half the active parameters. The Chatbot Arena ELO of 1417 at launch backed up the quality claims, though Arena scores are famously noisy and sensitive to prompt style. The real question - still being answered by the community - is whether 128 experts provide meaningful quality gains over Scout's 16 on practical workloads, or whether the 4x memory footprint is buying marginal benchmark points.

Key Specifications

| Specification | Details |
| --- | --- |
| Provider | Meta |
| Model Family | Llama 4 |
| Architecture | Mixture-of-Experts with Early Fusion |
| Total Parameters | 400B |
| Active Parameters | 17B |
| Experts | 128 (alternating dense + MoE layers) |
| Context Window | 1M tokens |
| Training Data | ~22T tokens (text, image, video) |
| Training Compute | 2.38M GPU hours (H100-80GB) |
| Knowledge Cutoff | August 2024 |
| Input Modalities | Text, Image (up to 5 images) |
| Output Modalities | Text, Code |
| Supported Languages | 12 (200 in pre-training) |
| Quantization | BF16, FP8, INT4 available |
| Release Date | April 5, 2025 |
| License | Llama 4 Community License |

Benchmark Performance

| Benchmark | Llama 4 Maverick | Llama 4 Scout | Llama 3.1 405B | Llama 3.3 70B |
| --- | --- | --- | --- | --- |
| MMLU-Pro | 80.5 | 74.3 | 73.4 | 68.9 |
| GPQA Diamond | 69.8 | 57.2 | 49.0 | 50.5 |
| LiveCodeBench | 43.4 | 32.8 | 27.7 | 33.3 |
| MGSM (Multilingual) | 92.3 | 90.6 | 91.6 | 91.1 |
| MMLU (5-shot) | 85.5 | 79.6 | 85.2 | - |
| MBPP (Code, 3-shot) | 77.6 | 67.8 | 74.4 | - |
| MATH (4-shot) | 61.2 | 50.3 | 53.5 | - |
| DocVQA (test) | 94.4 | 94.4 | 94.4 | - |
| MathVista | 73.7 | 73.7 | - | 70.7 |

The gap between Maverick and Scout is consistent and substantial. MMLU-Pro jumps from 74.3 to 80.5, GPQA Diamond from 57.2 to 69.8, and LiveCodeBench from 32.8 to 43.4. That LiveCodeBench number is particularly significant - it is a 57% improvement over the previous flagship Llama 3.1 405B (27.7), achieved with 17B active parameters instead of 405B. The 128-expert MoE is doing real work here, not just inflating parameter counts.

Compared to the broader competitive landscape, Maverick sits in an interesting position. It matches or beats GPT-4o and Gemini 2.0 Flash across most benchmarks while being open-weight. Against newer models like Gemini 3.1 Pro and Claude Opus 4.6, the gap has opened up - those models were released months later with substantially better reasoning and coding scores. The August 2024 knowledge cutoff also means Maverick is working with increasingly stale information.

Key Capabilities

The 128-expert architecture gives Maverick its quality edge over Scout while maintaining the same inference cost per token. In practice, this means Maverick handles nuanced tasks - ambiguous instructions, complex multi-step reasoning, code generation with subtle constraints - more reliably than Scout. The GPQA Diamond gap (69.8 vs 57.2) is the best illustration: graduate-level scientific reasoning benefits from the finer-grained expert specialization.

Multimodal capabilities are strong across the board. DocVQA at 94.4%, MathVista at 73.7%, and native support for up to 5 images per conversation make Maverick practical for document understanding, chart analysis, and visual reasoning tasks. The early-fusion architecture means these are not afterthought capabilities - the model processes visual and text tokens through the same attention mechanism from the start.
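Early fusion, as described above, amounts to projecting vision-encoder patch embeddings into the text embedding space and concatenating the two sequences before the first attention layer. A minimal sketch with assumed toy dimensions and a hypothetical projection matrix (the real model uses a MetaCLIP-derived encoder):

```python
import numpy as np

rng = np.random.default_rng(1)
D = 64  # toy embedding width; illustrative only

text_tokens = rng.standard_normal((10, D))       # 10 text-token embeddings
image_patches = rng.standard_normal((2, 16, D))  # 2 images x 16 patch embeddings
proj = rng.standard_normal((D, D)) * 0.02        # assumed vision-to-text projection

# Flatten patches, project them into token space, and prepend to the text
# sequence - every subsequent attention layer sees one fused sequence.
fused = np.concatenate([image_patches.reshape(-1, D) @ proj, text_tokens], axis=0)
print(fused.shape)  # (42, 64)
```

Because the fused sequence enters the transformer at layer one, image tokens participate in attention exactly like text tokens, which is what distinguishes early fusion from bolting a vision adapter onto a frozen language model.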

The 1M token context window is a substantial downgrade from Scout's 10M. For most applications, 1M is more than sufficient - that is roughly 750,000 words or several hundred code files. Meta's long-context benchmarks show strong performance on MTOB (book-level translation) tasks, with 50.8% on full-book English-to-Kalamang translation. But if your use case genuinely requires multi-million-token contexts, Scout is the right choice despite the quality tradeoff.

Pricing and Availability

Maverick is available under the Llama 4 Community License - same terms as Scout, free for commercial use under 700M monthly active users. Weights are on HuggingFace in BF16 and FP8 formats.

The 400B total parameters mean BF16 deployment requires approximately 800GB of VRAM - a multi-GPU or multi-node setup. FP8 halves this to around 400GB, and INT4 brings it to approximately 200GB with some quality loss. API providers handle this complexity for you: DeepInfra offers $0.20/M input and $0.60/M output tokens, Groq charges $0.11/$0.34 at 307 tokens/second, and Together.ai runs $0.27/M blended. Prices vary by up to 2.6x across the 12+ providers offering the model.
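The memory figures above are straight arithmetic: parameter count times bytes per weight. A quick sanity check, ignoring KV cache and activation overheads that push real deployments higher:

```python
# Back-of-envelope weight memory for a 400B-parameter model at common precisions.
params = 400e9
bytes_per_param = {"BF16": 2, "FP8": 1, "INT4": 0.5}

for fmt, b in bytes_per_param.items():
    gb = params * b / 1e9
    print(f"{fmt}: ~{gb:.0f} GB")
# BF16: ~800 GB, FP8: ~400 GB, INT4: ~200 GB (weights only)
```

This is weights-only; serving at a 1M-token context adds substantial KV-cache memory on top.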

Strengths

  • 128-expert MoE delivers substantial quality gains over Scout while keeping 17B active parameters
  • GPQA Diamond (69.8) and LiveCodeBench (43.4) represent large jumps over all previous Llama models
  • Natively multimodal with strong vision benchmarks (DocVQA 94.4, MathVista 73.7)
  • Chatbot Arena ELO 1417 at launch confirmed quality in human evaluation
  • Open weights with FP8 and INT4 quantization options
  • 12+ API providers with competitive pricing

Weaknesses

  • 400B total parameters require significant memory even with quantization (~200GB INT4)
  • Knowledge cutoff of August 2024 is over 18 months stale
  • 1M context is 10x shorter than Scout - a surprising downgrade within the same model family
  • Llama 4 Community License is not truly open source - 700M MAU cap and attribution requirements
  • Newer models like Claude Opus 4.6 and Gemini 3.1 Pro have significantly surpassed Maverick on reasoning
  • The 128-expert architecture makes fine-tuning impractical for most organizations

About the author: AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.