Llama 4 Scout
Meta's Llama 4 Scout is a 109B-total, 17B-active MoE model with 16 experts and a 10M-token context window - the longest of any open-weight model at launch - with native multimodal support for text and images.

Llama 4 Scout is Meta's entry-level model in the Llama 4 family, and "entry-level" is doing a lot of heavy lifting when you are talking about 109 billion total parameters, 17 billion active, and a 10 million token context window. Released in April 2025, Scout was Meta's first MoE architecture in the Llama lineup - 16 experts with 2 active per forward pass - and it shipped with native multimodal capabilities baked in through early fusion rather than bolted on after the fact.
TL;DR
- 109B total / 17B active MoE with 16 experts (2 active per token) - natively multimodal (text + image)
- 10M token context window - longest of any open-weight model at launch
- MMLU-Pro 74.3%, GPQA Diamond 57.2%, LiveCodeBench 32.8%
- Fits on a single 80GB H100 with INT4 quantization (~55GB of weights + overhead)
- Llama 4 Community License - free for up to 700M monthly active users
Overview
Scout's 10M token context window was the headline number at launch, and it remains the defining feature. Meta claims perfect retrieval accuracy across all depths within that window - meaning the model can find a needle anywhere in a 10-million-token haystack. For RAG pipelines, document analysis, or codebases that span thousands of files, this is a fundamentally different capability tier than the 128K-256K windows most competitors offer.
The architecture uses a standard MoE approach - 16 routed experts per MoE layer, with each token processed by one routed expert plus a shared expert - giving 17B active parameters from the 109B total. The multimodal integration uses Meta's enhanced MetaCLIP vision encoder, which converts images into visual tokens and feeds them through early fusion, so the Transformer attends to text and image tokens jointly from the first layer. This is architecturally cleaner than adapter-based approaches, though it means the model was trained multimodal from scratch rather than fine-tuned onto a text-only backbone.
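The routing scheme can be sketched in a few lines of numpy. This is a toy illustration of top-1 routing alongside an always-on shared expert, not Meta's implementation - the hidden dimension, weight shapes, and initialization here are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

D = 64          # toy hidden dim (Scout's real hidden dim is far larger)
N_EXPERTS = 16  # routed experts per MoE layer, as in Scout

# Toy weights: one shared expert plus 16 routed experts and a router.
shared_expert = rng.standard_normal((D, D)) * 0.02
routed_experts = rng.standard_normal((N_EXPERTS, D, D)) * 0.02
router = rng.standard_normal((D, N_EXPERTS)) * 0.02

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Top-1 routed MoE with a shared expert, one token at a time.

    Every token always passes through the shared expert; the router
    additionally selects exactly one of the 16 routed experts, so only
    a small fraction of the total parameters is active per token.
    """
    out = np.empty_like(x)
    for i, tok in enumerate(x):
        logits = tok @ router
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                        # softmax over experts
        top = int(np.argmax(probs))                 # top-1 routing decision
        routed = probs[top] * (tok @ routed_experts[top])
        out[i] = tok @ shared_expert + routed       # shared + routed paths
    return out

tokens = rng.standard_normal((8, D))
y = moe_layer(tokens)
print(y.shape)  # (8, 64)
```

The key property the sketch captures is that each token touches only two expert matrices regardless of how many experts exist, which is why Scout's per-token compute tracks its 17B active parameters rather than its 109B total.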
Training consumed approximately 40 trillion tokens across text, image, and video data, with 5 million GPU hours on H100-80GB hardware. The model supports 12 languages natively (Arabic, English, French, German, Hindi, Indonesian, Italian, Portuguese, Spanish, Tagalog, Thai, Vietnamese) with pre-training coverage across 200 languages. The knowledge cutoff is August 2024 - worth noting since this is now over a year old.
Key Specifications
| Specification | Details |
|---|---|
| Provider | Meta |
| Model Family | Llama 4 |
| Architecture | Mixture-of-Experts with Early Fusion |
| Total Parameters | 109B |
| Active Parameters | 17B |
| Experts | 16 (2 active per token) |
| Context Window | 10M tokens |
| Training Data | ~40T tokens (text, image, video) |
| Training Compute | 5.0M GPU hours (H100-80GB) |
| Knowledge Cutoff | August 2024 |
| Input Modalities | Text, Image |
| Output Modalities | Text, Code |
| Supported Languages | 12 (200 in pre-training) |
| Quantization | BF16, FP8, INT4 available |
| Release Date | April 5, 2025 |
| License | Llama 4 Community License |
Benchmark Performance
| Benchmark | Llama 4 Scout | Llama 3.3 70B | Llama 3.1 405B | Llama 4 Maverick |
|---|---|---|---|---|
| MMLU-Pro | 74.3 | 68.9 | 73.4 | 80.5 |
| GPQA Diamond | 57.2 | 50.5 | 49.0 | 69.8 |
| LiveCodeBench | 32.8 | 33.3 | 27.7 | 43.4 |
| MGSM (Multilingual) | 90.6 | 91.1 | 91.6 | 92.3 |
| MMMU (Vision) | 73.4 | - | 69.4 | - |
| MathVista (Vision) | 73.7 | - | 70.7 | - |
| DocVQA (test) | 94.4 | - | 94.4 | - |
| ChartQA | 90.0 | 88.8 | 90.0 | - |
The benchmark picture for Scout is mixed. On vision tasks, it is genuinely strong - MMMU 73.4, MathVista 73.7, and DocVQA 94.4 are all competitive or best-in-class for open-weight models. The multimodal early-fusion approach pays dividends here. On text reasoning, MMLU-Pro at 74.3 edges out Llama 3.1 405B's 73.4 - a modest gain, but one achieved with a fraction of the active parameters (17B vs 405B).
The coding benchmarks are where Scout shows its age and its limitations. LiveCodeBench at 32.8% is below Llama 3.3 70B's 33.3% - meaning Scout actually regressed on code generation despite being a newer, larger model. GPQA Diamond at 57.2% is respectable but trails the bigger Maverick by over 12 points. For serious coding workloads, Scout is not the right tool.
Key Capabilities
The 10M context window is Scout's killer feature and the reason to choose it over smaller, faster alternatives. If your workload involves processing entire codebases, legal document sets, or long-form content, no other open-weight model offers this capability. Meta's testing shows perfect needle-in-a-haystack retrieval at all depths, though real-world performance on reasoning tasks at extreme context lengths is a different question that independent benchmarks have not fully answered.
Native multimodal support means Scout can analyze documents, charts, screenshots, and photographs without a separate vision pipeline. DocVQA at 94.4% and ChartQA at 90.0% indicate strong practical utility for document understanding workflows. The model handles up to 5 images per conversation in the instruction-tuned variant, which covers most document analysis scenarios.
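A request with image inputs can be assembled in the OpenAI-style chat format that several Scout hosts expose. The model id and field names below are assumptions that vary by provider; the sketch only builds the payload and enforces the 5-image limit rather than sending anything:

```python
MAX_IMAGES = 5  # instruction-tuned Scout supports up to 5 images per conversation

def build_vision_request(prompt: str, image_urls: list[str]) -> dict:
    """Build an OpenAI-style chat payload mixing text and image parts."""
    if len(image_urls) > MAX_IMAGES:
        raise ValueError(f"Scout's instruct variant accepts at most {MAX_IMAGES} images")
    content = [{"type": "text", "text": prompt}]
    content += [{"type": "image_url", "image_url": {"url": u}} for u in image_urls]
    return {
        # Hypothetical model id in HuggingFace repo style; hosts name it differently.
        "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
        "messages": [{"role": "user", "content": content}],
    }

req = build_vision_request(
    "Summarize the chart and extract the Q3 figures.",
    ["https://example.com/chart.png"],
)
print(len(req["messages"][0]["content"]))  # 2: one text part + one image part
```

Because vision is fused into the base model, the same endpoint and message shape handle text-only and mixed requests; there is no separate vision pipeline to stand up.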
On the deployment side, the 109B total parameters require approximately 218GB in BF16 - too large for a single GPU. With INT4 quantization, the weights shrink to roughly 55GB, fitting on a single 80GB H100 with headroom for activations and KV cache. FP8 quantized weights are available from Meta directly. Multiple API providers (DeepInfra, Lambda, Groq, SambaNova) offer hosted inference with throughput ranging from 69 to 776 tokens per second depending on the platform and quantization.
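The weight-memory figures above are simple arithmetic on parameter count and bytes per weight, sketched below. Real deployments add activation and KV-cache overhead on top, which grows with context length - a significant factor at Scout's 10M-token window:

```python
# Back-of-envelope weight-memory estimate for Scout's 109B total parameters.
TOTAL_PARAMS = 109e9

BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}

def weight_gb(dtype: str) -> float:
    """Gigabytes needed to hold the weights alone at a given precision."""
    return TOTAL_PARAMS * BYTES_PER_PARAM[dtype] / 1e9

for dtype in ("bf16", "fp8", "int4"):
    print(f"{dtype}: {weight_gb(dtype):.1f} GB")
# bf16: 218.0 GB, fp8: 109.0 GB, int4: 54.5 GB
```

Note that MoE sparsity does not help here: all 109B parameters must be resident because any expert can be routed to, so memory scales with total parameters while compute scales with the 17B active.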
Pricing and Availability
Scout is available under the Llama 4 Community License - free for commercial use up to 700 million monthly active users, after which you need a separate agreement with Meta. Weights are on HuggingFace and through Meta's official channels.
API pricing across providers is competitive. DeepInfra offers $0.08/M input and $0.30/M output tokens. Groq charges $0.11/$0.34. Lambda matches DeepInfra's pricing. These are among the cheapest rates for a model with 17B active parameters and multimodal capability.
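At these rates, per-request cost is straightforward to estimate. The snippet below hardcodes the quoted prices as a snapshot - provider pricing changes often, so check current rates before relying on it:

```python
# (input, output) rates in USD per million tokens, as quoted above.
RATES = {
    "deepinfra": (0.08, 0.30),
    "groq": (0.11, 0.34),
    "lambda": (0.08, 0.30),
}

def cost_usd(provider: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated cost of one request at the snapshot rates."""
    rate_in, rate_out = RATES[provider]
    return (input_tokens * rate_in + output_tokens * rate_out) / 1e6

# e.g. a 1M-token document summarized into 2K output tokens on DeepInfra:
print(f"${cost_usd('deepinfra', 1_000_000, 2_000):.4f}")  # $0.0806
```

Long-context workloads are dominated by input cost, so the low per-million input rate matters more than the output rate when you are actually using the 10M window.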
Strengths
- 10M token context window - unmatched in the open-weight space
- Strong vision benchmarks (DocVQA 94.4, ChartQA 90.0, MMMU 73.4) through native early fusion
- 17B active parameters from 109B total - efficient inference per token
- Fits on single GPU with INT4 quantization
- 200 pre-training languages with 12 natively supported post-training
- Cheap API access across multiple providers
Weaknesses
- LiveCodeBench (32.8) actually trails the older Llama 3.3 70B (33.3) - coding is not Scout's strength
- Knowledge cutoff of August 2024 is over 18 months old
- Llama 4 Community License is not truly open - 700M MAU cap and attribution requirements
- The full 109B must reside in memory even though only 17B activates per token
- GPQA Diamond (57.2) and coding benchmarks lag significantly behind Maverick and Claude Opus 4.6
- Self-reported benchmarks from Meta - initial community reception was skeptical of some claimed numbers
Related Coverage
- Llama 4 Maverick
- Open Source LLM Leaderboard
- Coding Benchmarks Leaderboard
- Open Source vs Proprietary AI Guide