Llama 4 Scout

Meta's Llama 4 Scout is a 109B-total, 17B-active MoE model with 16 experts and a 10M-token context window - the longest of any open-weight model - with native multimodal support for text and images.

Llama 4 Scout is Meta's entry-level model in the Llama 4 family, and "entry-level" is doing a lot of heavy lifting when you are talking about 109 billion total parameters, 17 billion active, and a 10 million token context window. Released in April 2025, Scout was Meta's first MoE architecture in the Llama lineup - 16 experts with 2 active per forward pass - and it shipped with native multimodal capabilities baked in through early fusion rather than bolted on after the fact.

TL;DR

  • 109B total / 17B active MoE with 16 experts (2 active per token) - natively multimodal (text + image)
  • 10M token context window - longest of any open-weight model at launch
  • MMLU-Pro 74.3%, GPQA Diamond 57.2%, LiveCodeBench 32.8%
  • Fits on a single H100 with INT4 quantization (~55GB weights + overhead)
  • Llama 4 Community License - free for up to 700M monthly active users

Overview

Scout's 10M token context window was the headline number at launch, and it remains the defining feature. Meta claims perfect retrieval accuracy across all depths within that window - meaning the model can find a needle anywhere in a 10-million-token haystack. For RAG pipelines, document analysis, or codebases that span thousands of files, this is a fundamentally different capability tier than the 128K-256K windows most competitors offer.

The architecture uses a standard MoE approach - 16 experts per MoE layer, 2 routed per token - giving 17B active parameters from the 109B total. The multimodal integration uses Meta's enhanced MetaCLIP vision encoder, which converts images into visual tokens and feeds them into the shared backbone through early fusion, so the Transformer attends to text and image tokens jointly from the first layer. This is architecturally cleaner than adapter-based approaches, though it means the model was trained multimodal from scratch rather than fine-tuned onto a text-only backbone.
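The top-2-of-16 routing described above can be sketched in a few lines. This is an illustrative toy with random weights and softmax-after-top-k gating, not Meta's actual router implementation, whose details (load balancing, shared experts, scaling) may differ:

```python
import numpy as np

def top2_route(hidden, gate_w, k=2):
    """Toy top-k expert routing as in Scout's MoE layers (16 experts, 2 active).

    hidden: (tokens, d_model) activations; gate_w: (d_model, num_experts).
    Returns chosen expert ids and their softmax-renormalized mixing weights.
    """
    logits = hidden @ gate_w                        # (tokens, 16) router scores
    top_ids = np.argsort(logits, axis=-1)[:, -k:]   # ids of the 2 best experts
    top_logits = np.take_along_axis(logits, top_ids, axis=-1)
    exp = np.exp(top_logits - top_logits.max(axis=-1, keepdims=True))
    weights = exp / exp.sum(axis=-1, keepdims=True) # softmax over the top-2 only
    return top_ids, weights

rng = np.random.default_rng(0)
ids, w = top2_route(rng.standard_normal((4, 64)), rng.standard_normal((64, 16)))
print(ids.shape, w.shape)  # (4, 2) (4, 2)
```

Each token's output is then the weighted sum of its two experts' FFN outputs, which is why only 17B of the 109B parameters do work per forward pass.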

Training consumed approximately 40 trillion tokens across text, image, and video data, with 5 million GPU hours on H100-80GB hardware. The model supports 12 languages natively (Arabic, English, French, German, Hindi, Indonesian, Italian, Portuguese, Spanish, Tagalog, Thai, Vietnamese) with pre-training coverage across 200 languages. The knowledge cutoff is August 2024 - worth noting since this is now over a year old.

Key Specifications

| Specification | Details |
| --- | --- |
| Provider | Meta |
| Model Family | Llama 4 |
| Architecture | Mixture-of-Experts with Early Fusion |
| Total Parameters | 109B |
| Active Parameters | 17B |
| Experts | 16 (2 active per token) |
| Context Window | 10M tokens |
| Training Data | ~40T tokens (text, image, video) |
| Training Compute | 5.0M GPU hours (H100-80GB) |
| Knowledge Cutoff | August 2024 |
| Input Modalities | Text, Image |
| Output Modalities | Text, Code |
| Supported Languages | 12 (200 in pre-training) |
| Quantization | BF16, FP8, INT4 available |
| Release Date | April 5, 2025 |
| License | Llama 4 Community License |

Benchmark Performance

| Benchmark | Llama 4 Scout | Llama 3.3 70B | Llama 3.1 405B | Llama 4 Maverick |
| --- | --- | --- | --- | --- |
| MMLU-Pro | 74.3 | 68.9 | 73.4 | 80.5 |
| GPQA Diamond | 57.2 | 50.5 | 49.0 | 69.8 |
| LiveCodeBench | 32.8 | 33.3 | 27.7 | 43.4 |
| MGSM (Multilingual) | 90.6 | 91.1 | 91.6 | 92.3 |
| MMMU (Vision) | 73.4 | - | 69.4 | - |
| MathVista (Vision) | 73.7 | - | 70.7 | - |
| DocVQA (test) | 94.4 | - | 94.4 | - |
| ChartQA | 90.0 | 88.8 | 90.0 | - |

The benchmark picture for Scout is mixed. On vision tasks, it is genuinely strong - MMMU 73.4, MathVista 73.7, and DocVQA 94.4 are all competitive or best-in-class for open-weight models. The multimodal early-fusion approach pays dividends here. On text reasoning, MMLU-Pro at 74.3 only modestly edges past Llama 3.1 405B's 73.4, but it does so while activating a fraction of the parameters per token (17B vs. 405B dense).

The coding benchmarks are where Scout shows its age and its limitations. LiveCodeBench at 32.8% is below Llama 3.3 70B's 33.3% - meaning Scout actually regressed on code generation despite being a newer, larger model. GPQA Diamond at 57.2% is respectable but trails the bigger Maverick by over 12 points. For serious coding workloads, Scout is not the right tool.

Key Capabilities

The 10M context window is Scout's killer feature and the reason to choose it over smaller, faster alternatives. If your workload involves processing entire codebases, legal document sets, or long-form content, no other open-weight model offers this capability. Meta's testing shows perfect needle-in-a-haystack retrieval at all depths, though real-world performance on reasoning tasks at extreme context lengths is a different question that independent benchmarks have not fully answered.
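For capacity planning against that window, a rough characters-per-token heuristic is enough to sanity-check whether a corpus fits. The ~4 chars/token ratio below is a common rule of thumb for English text and code, not a measurement of Llama 4's actual tokenizer, which varies by content and language:

```python
def fits_in_context(total_chars, window_tokens=10_000_000, chars_per_token=4):
    """Rough fit check for Scout's 10M-token window.

    Uses the common ~4 chars/token heuristic; real tokenizer counts differ,
    so treat the result as an order-of-magnitude estimate only.
    """
    tokens = total_chars // chars_per_token
    return tokens, tokens <= window_tokens

# e.g. a 25 MB monorepo dump: ~6.25M estimated tokens, comfortably inside
tokens, ok = fits_in_context(25_000_000)
print(tokens, ok)  # 6250000 True
```

At this ratio the full window holds roughly 40 MB of raw text, which is why "entire codebase in one prompt" is a realistic framing rather than marketing.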

Native multimodal support means Scout can analyze documents, charts, screenshots, and photographs without a separate vision pipeline. DocVQA at 94.4% and ChartQA at 90.0% indicate strong practical utility for document understanding workflows. The model handles up to 5 images per conversation in the instruction-tuned variant, which covers most document analysis scenarios.
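Several Scout hosts expose an OpenAI-compatible chat API that accepts `image_url` content parts alongside text. The sketch below builds such a request payload and enforces the 5-image limit mentioned above; the model id shown is the Hugging Face repo name and is an assumption here - providers may use different identifiers, so check your host's docs:

```python
def build_payload(question, image_urls,
                  model="meta-llama/Llama-4-Scout-17B-16E-Instruct"):
    """Build an OpenAI-style multimodal chat payload for a Scout host.

    Enforces the 5-image-per-conversation limit of the instruct variant.
    The model id is illustrative and provider-specific.
    """
    if len(image_urls) > 5:
        raise ValueError("instruction-tuned Scout supports up to 5 images")
    content = [{"type": "text", "text": question}]
    content += [{"type": "image_url", "image_url": {"url": u}}
                for u in image_urls]
    return {"model": model, "messages": [{"role": "user", "content": content}]}

payload = build_payload("What does this chart show?",
                        ["https://example.com/chart.png"])
print(len(payload["messages"][0]["content"]))  # 2 (one text part, one image)
```

POSTing this dict to a provider's `/chat/completions` endpoint with your API key is all a basic document-understanding workflow needs; no separate vision pipeline is involved.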

On the deployment side, the 109B total parameters require approximately 218GB in BF16 - too large for any single GPU. With INT4 quantization, the weights drop to roughly 55GB, which plus KV-cache and activation overhead fits on a single 80GB H100. FP8 quantized weights are available from Meta directly. Multiple API providers (DeepInfra, Lambda, Groq, SambaNova) offer hosted inference with throughput ranging from 69 to 776 tokens per second depending on the platform and quantization.
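The weight-memory arithmetic is just parameter count times bytes per parameter; a small helper makes the quantization trade-off concrete (weights only - KV cache, activations, and framework overhead come on top):

```python
def weight_footprint_gb(total_params_b=109, bits=16):
    """Approximate weight memory for Scout at a given quantization width.

    All 109B parameters must be resident even though only 17B are active
    per token - MoE saves compute per token, not memory.
    """
    return total_params_b * 1e9 * bits / 8 / 1e9  # decimal GB

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: ~{weight_footprint_gb(bits=bits):.1f} GB")
```

This is why the MoE design helps latency and cost per token but does nothing for the hardware floor: you pay for 109B of VRAM regardless of the 17B active figure.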

Pricing and Availability

Scout is available under the Llama 4 Community License - free for commercial use up to 700 million monthly active users, after which you need a separate agreement with Meta. Weights are on HuggingFace and through Meta's official channels.

API pricing across providers is competitive. DeepInfra offers $0.08/M input and $0.30/M output tokens. Groq charges $0.11/$0.34. Lambda matches DeepInfra's pricing. These are among the cheapest rates for a model with 17B active parameters and multimodal capability.
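Using the per-million-token rates quoted above, a quick comparison of what a concrete job costs per provider (the 2M-in / 5K-out workload is a made-up example, e.g. summarizing a large document set):

```python
# (input $/M tokens, output $/M tokens) - rates as quoted in the text above
RATES = {
    "DeepInfra": (0.08, 0.30),
    "Groq":      (0.11, 0.34),
    "Lambda":    (0.08, 0.30),
}

def job_cost(provider, input_tokens, output_tokens):
    """Total USD cost of one request at a provider's quoted rates."""
    inp, out = RATES[provider]
    return input_tokens / 1e6 * inp + output_tokens / 1e6 * out

# Example: feed 2M tokens of documents, get a 5K-token summary back
for p in RATES:
    print(f"{p}: ${job_cost(p, 2_000_000, 5_000):.4f}")
```

Even the long-context workloads Scout is built for stay in the tens-of-cents range at these rates, which is the practical appeal versus frontier closed models.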

Strengths

  • 10M token context window - unmatched in the open-weight space
  • Strong vision benchmarks (DocVQA 94.4, ChartQA 90.0, MMMU 73.4) through native early fusion
  • 17B active parameters from 109B total - efficient inference per token
  • Fits on single GPU with INT4 quantization
  • 200 pre-training languages with 12 natively supported post-training
  • Cheap API access across multiple providers

Weaknesses

  • LiveCodeBench (32.8) actually trails the older Llama 3.3 70B (33.3) - coding is not Scout's strength
  • Knowledge cutoff of August 2024 is now over a year old
  • Llama 4 Community License is not truly open - 700M MAU cap and attribution requirements
  • The full 109B parameters must reside in memory even though only 17B are active per token
  • GPQA Diamond (57.2) and coding benchmarks lag significantly behind Maverick and Claude Opus 4.6
  • Self-reported benchmarks from Meta - initial community reception was skeptical of some claimed numbers
