
Llama 4 Maverick

Meta's Llama 4 Maverick packs 400B total parameters into a 128-expert MoE architecture with only 17B active per token. At launch it beat GPT-4o on Chatbot Arena and matched DeepSeek V3 on reasoning with less than half the active parameters.


Llama 4 Maverick is the model where Meta stopped being cautious with scale. While Scout uses 16 experts, Maverick goes to 128 - eight times the expert count - while keeping the same 17B active parameters per token. The result is a 400B-parameter model that activates less than 5% of its weights on any given forward pass, delivering an ELO score of 1417 on the LMSYS Chatbot Arena that beat GPT-4o and Gemini 2.0 Flash when it launched.

TL;DR

  • 400B total / 17B active MoE with 128 experts - same active parameters as Scout, 4x total capacity
  • Natively multimodal (text + image) via MetaCLIP early fusion - optimized for up to 5 images
  • MMLU-Pro 80.5%, GPQA Diamond 69.8%, LiveCodeBench 43.4%
  • 1M token context window - one-tenth of Scout's 10M
  • Llama 4 Community License - free up to 700M MAU

Overview

The 128-expert architecture is the defining design choice. Maverick alternates between dense layers and MoE layers, applying expert routing in roughly half of the total layers. Each MoE layer holds 128 routed experts plus a shared expert: every token passes through the shared expert and is routed to one routed expert. The total parameter count balloons to 400B, but inference compute stays fixed at 17B active - identical to Scout. The theory is that more experts provide finer-grained specialization: rather than 16 general-purpose experts, 128 experts can each develop narrower expertise in specific domains, languages, or task types.
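The routing pattern described above can be sketched in a few lines. This is a toy numpy illustration, not Meta's implementation: the dimensions are shrunk for readability, the weights are random, and the top-1-plus-shared-expert routing follows Meta's public description of the architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, D_FF, N_EXPERTS = 64, 256, 128  # toy sizes; real dims are far larger

# Per-expert FFN weights, plus one shared expert that every token passes through.
experts_w1 = rng.standard_normal((N_EXPERTS, D_MODEL, D_FF)) * 0.02
experts_w2 = rng.standard_normal((N_EXPERTS, D_FF, D_MODEL)) * 0.02
shared_w1 = rng.standard_normal((D_MODEL, D_FF)) * 0.02
shared_w2 = rng.standard_normal((D_FF, D_MODEL)) * 0.02
router_w = rng.standard_normal((D_MODEL, N_EXPERTS)) * 0.02

def moe_layer(x):
    """x: (tokens, d_model). Shared expert + top-1 routed expert per token."""
    logits = x @ router_w                           # (tokens, n_experts)
    top1 = logits.argmax(axis=-1)                   # one routed expert per token
    gate = np.exp(logits - logits.max(-1, keepdims=True))
    gate = gate / gate.sum(-1, keepdims=True)       # softmax gate weights
    out = np.maximum(x @ shared_w1, 0) @ shared_w2  # shared-expert path
    for t, e in enumerate(top1):                    # routed path: 1 of 128 experts
        h = np.maximum(x[t] @ experts_w1[e], 0) @ experts_w2[e]
        out[t] += gate[t, e] * h
    return out

tokens = rng.standard_normal((8, D_MODEL))
y = moe_layer(tokens)
print(y.shape)  # (8, 64)
```

Note how the per-token compute touches only two experts' weights regardless of N_EXPERTS - that is why 400B total parameters still cost 17B active per forward pass.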

Training consumed 22 trillion multimodal tokens across text, image, and video data, using 2.38 million H100-80GB GPU hours - less than half of Scout's 5 million hours. That asymmetry is notable: the bigger model consumed fewer GPU hours, largely because it trained on fewer tokens than Scout while still activating only 17B parameters per step. The multimodal integration uses the same MetaCLIP early-fusion approach as Scout, jointly attending to text and image tokens from the first Transformer layer.

Meta positioned Maverick as matching DeepSeek V3 on reasoning and coding at less than half the active parameters. The Chatbot Arena ELO of 1417 at launch backed up the quality claims, though Arena scores are famously noisy and sensitive to prompt style. The real question - still being answered by the community - is whether 128 experts provide meaningful quality gains over Scout's 16 on practical workloads, or whether the 4x memory footprint is buying marginal benchmark points.

Key Specifications

| Specification | Details |
| --- | --- |
| Provider | Meta |
| Model Family | Llama 4 |
| Architecture | Mixture-of-Experts with Early Fusion |
| Total Parameters | 400B |
| Active Parameters | 17B |
| Experts | 128 (alternating dense + MoE layers) |
| Context Window | 1M tokens |
| Training Data | ~22T tokens (text, image, video) |
| Training Compute | 2.38M GPU hours (H100-80GB) |
| Knowledge Cutoff | August 2024 |
| Input Modalities | Text, Image (up to 5 images) |
| Output Modalities | Text, Code |
| Supported Languages | 12 (200 in pre-training) |
| Quantization | BF16, FP8, INT4 available |
| Release Date | April 5, 2025 |
| License | Llama 4 Community License |

Benchmark Performance

| Benchmark | Llama 4 Maverick | Llama 4 Scout | Llama 3.1 405B | Llama 3.3 70B |
| --- | --- | --- | --- | --- |
| MMLU-Pro | 80.5 | 74.3 | 73.4 | 68.9 |
| GPQA Diamond | 69.8 | 57.2 | 49.0 | 50.5 |
| LiveCodeBench | 43.4 | 32.8 | 27.7 | 33.3 |
| MGSM (Multilingual) | 92.3 | 90.6 | 91.6 | 91.1 |
| MMLU (5-shot) | 85.5 | 79.6 | 85.2 | - |
| MBPP (Code, 3-shot) | 77.6 | 67.8 | 74.4 | - |
| MATH (4-shot) | 61.2 | 50.3 | 53.5 | - |
| DocVQA (test) | 94.4 | 94.4 | 94.4 | - |
| MathVista | 73.7 | 73.7 | - | 70.7 |

The gap between Maverick and Scout is consistent and substantial. MMLU-Pro jumps from 74.3 to 80.5, GPQA Diamond from 57.2 to 69.8, and LiveCodeBench from 32.8 to 43.4. That LiveCodeBench number is particularly significant - it is a 57% improvement over the previous flagship Llama 3.1 405B (27.7), achieved with 17B active parameters instead of 405B. The 128-expert MoE is doing real work here, not just inflating parameter counts.

Compared to the broader competitive landscape, Maverick sits in an interesting position. It matches or beats GPT-4o and Gemini 2.0 Flash across most benchmarks while being open-weight. Against newer models like Gemini 3.1 Pro and Claude Opus 4.6, the gap has opened up - those models were released months later with substantially better reasoning and coding scores. The August 2024 knowledge cutoff also means Maverick is working with increasingly stale information.

Key Capabilities

The 128-expert architecture gives Maverick its quality edge over Scout while maintaining the same inference cost per token. In practice, this means Maverick handles nuanced tasks - ambiguous instructions, complex multi-step reasoning, code generation with subtle constraints - more reliably than Scout. The GPQA Diamond gap (69.8 vs 57.2) is the best illustration: graduate-level scientific reasoning benefits from the finer-grained expert specialization.

Multimodal capabilities are strong across the board. DocVQA at 94.4%, MathVista at 73.7%, and native support for up to 5 images per conversation make Maverick practical for document understanding, chart analysis, and visual reasoning tasks. The early-fusion architecture means these are not afterthought capabilities - the model processes visual and text tokens through the same attention mechanism from the start.
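Early fusion, as described above, amounts to projecting vision-encoder patch embeddings into the text embedding space and concatenating the two sequences before the first attention layer. A minimal sketch with assumed toy dimensions and a hypothetical projection matrix (the real model uses a MetaCLIP-derived encoder):

```python
import numpy as np

rng = np.random.default_rng(1)
D = 64  # toy embedding width; illustrative only

text_tokens = rng.standard_normal((10, D))       # 10 text-token embeddings
image_patches = rng.standard_normal((2, 16, D))  # 2 images x 16 patch embeddings
proj = rng.standard_normal((D, D)) * 0.02        # assumed vision-to-text projection

# Flatten patches, project them into token space, and prepend to the text
# sequence - every subsequent attention layer sees one fused sequence.
fused = np.concatenate([image_patches.reshape(-1, D) @ proj, text_tokens], axis=0)
print(fused.shape)  # (42, 64)
```

Because the fused sequence enters the transformer at layer one, image tokens participate in attention exactly like text tokens, which is what distinguishes early fusion from bolting a vision adapter onto a frozen language model.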

The 1M token context window is a substantial downgrade from Scout's 10M. For most applications, 1M is more than sufficient - that is roughly 750,000 words or several hundred code files. Meta's long-context benchmarks show strong performance on MTOB (book-level translation) tasks, with 50.8% on full-book English-to-Kalamang translation. But if your use case genuinely requires multi-million-token contexts, Scout is the right choice despite the quality tradeoff.

Pricing and Availability

Maverick is available under the Llama 4 Community License - same terms as Scout, free for commercial use under 700M monthly active users. Weights are on HuggingFace in BF16 and FP8 formats.

The 400B total parameters mean BF16 deployment requires approximately 800GB of VRAM - a multi-GPU or multi-node setup. FP8 halves this to around 400GB, and INT4 brings it to approximately 200GB with some quality loss. API providers handle this complexity for you: DeepInfra offers $0.20/M input and $0.60/M output tokens, Groq charges $0.11/$0.34 at 307 tokens/second, and Together.ai runs $0.27/M blended. Prices vary by up to 2.6x across the 12+ providers offering the model.
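The memory figures above are straight arithmetic: parameter count times bytes per weight. A quick sanity check, ignoring KV cache and activation overheads that push real deployments higher:

```python
# Back-of-envelope weight memory for a 400B-parameter model at common precisions.
params = 400e9
bytes_per_param = {"BF16": 2, "FP8": 1, "INT4": 0.5}

for fmt, b in bytes_per_param.items():
    gb = params * b / 1e9
    print(f"{fmt}: ~{gb:.0f} GB")
# BF16: ~800 GB, FP8: ~400 GB, INT4: ~200 GB (weights only)
```

This is weights-only; serving at a 1M-token context adds substantial KV-cache memory on top.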

Strengths

  • 128-expert MoE delivers substantial quality gains over Scout while keeping 17B active parameters
  • GPQA Diamond (69.8) and LiveCodeBench (43.4) represent large jumps over all previous Llama models
  • Natively multimodal with strong vision benchmarks (DocVQA 94.4, MathVista 73.7)
  • Chatbot Arena ELO 1417 at launch confirmed quality in human evaluation
  • Open weights with FP8 and INT4 quantization options
  • 12+ API providers with competitive pricing

Weaknesses

  • 400B total parameters require significant memory even with quantization (~200GB INT4)
  • Knowledge cutoff of August 2024 is over 18 months stale
  • 1M context is 10x shorter than Scout - a surprising downgrade within the same model family
  • Llama 4 Community License is not truly open source - 700M MAU cap and attribution requirements
  • Newer models like Claude Opus 4.6 and Gemini 3.1 Pro have significantly surpassed Maverick on reasoning
  • The 128-expert architecture makes fine-tuning impractical for most organizations

About the author: AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.