Llama 4 Scout

Meta's Llama 4 Scout is a 109B-total, 17B-active MoE model with 16 experts and a 10M-token context window - the longest of any open-weight model - with native multimodal support for text and images.

Llama 4 Scout is Meta's entry-level model in the Llama 4 family, and "entry-level" is doing a lot of heavy lifting when you are talking about 109 billion total parameters, 17 billion active, and a 10 million token context window. Released in April 2025, Scout was Meta's first MoE architecture in the Llama lineup - 16 experts with 2 active per forward pass - and it shipped with native multimodal capabilities baked in through early fusion rather than bolted on after the fact.

TL;DR

  • 109B total / 17B active MoE with 16 experts (2 active per token) - natively multimodal (text + image)
  • 10M token context window - longest of any open-weight model at launch
  • MMLU-Pro 74.3%, GPQA Diamond 57.2%, LiveCodeBench 32.8%
  • Fits on a single H100 with INT4 quantization (~55GB weights + overhead)
  • Llama 4 Community License - free for up to 700M monthly active users

Overview

Scout's 10M token context window was the headline number at launch, and it remains the defining feature. Meta claims perfect retrieval accuracy across all depths within that window - meaning the model can find a needle anywhere in a 10-million-token haystack. For RAG pipelines, document analysis, or codebases that span thousands of files, this is a fundamentally different capability tier than the 128K-256K windows most competitors offer.

The architecture uses a standard MoE approach - 16 experts per MoE layer, 2 routed per token - giving 17B active parameters from the 109B total. The multimodal integration uses Meta's enhanced MetaCLIP vision encoder, which converts images into visual tokens and feeds them into the shared backbone through early fusion, so the Transformer attends to text and image tokens jointly from the first layer. This is architecturally cleaner than adapter-based approaches, though it means the model was trained multimodal from scratch rather than fine-tuned onto a text-only backbone.
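The top-2-of-16 routing described above can be sketched in a few lines. This is an illustrative toy with random weights and softmax-after-top-k gating, not Meta's actual router implementation, whose details (load balancing, shared experts, scaling) may differ:

```python
import numpy as np

def top2_route(hidden, gate_w, k=2):
    """Toy top-k expert routing as in Scout's MoE layers (16 experts, 2 active).

    hidden: (tokens, d_model) activations; gate_w: (d_model, num_experts).
    Returns chosen expert ids and their softmax-renormalized mixing weights.
    """
    logits = hidden @ gate_w                        # (tokens, 16) router scores
    top_ids = np.argsort(logits, axis=-1)[:, -k:]   # ids of the 2 best experts
    top_logits = np.take_along_axis(logits, top_ids, axis=-1)
    exp = np.exp(top_logits - top_logits.max(axis=-1, keepdims=True))
    weights = exp / exp.sum(axis=-1, keepdims=True) # softmax over the top-2 only
    return top_ids, weights

rng = np.random.default_rng(0)
ids, w = top2_route(rng.standard_normal((4, 64)), rng.standard_normal((64, 16)))
print(ids.shape, w.shape)  # (4, 2) (4, 2)
```

Each token's output is then the weighted sum of its two experts' FFN outputs, which is why only 17B of the 109B parameters do work per forward pass.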

Training consumed approximately 40 trillion tokens across text, image, and video data, with 5 million GPU hours on H100-80GB hardware. The model supports 12 languages natively (Arabic, English, French, German, Hindi, Indonesian, Italian, Portuguese, Spanish, Tagalog, Thai, Vietnamese) with pre-training coverage across 200 languages. The knowledge cutoff is August 2024 - worth noting since this is now over a year old.

Key Specifications

| Specification | Details |
| --- | --- |
| Provider | Meta |
| Model Family | Llama 4 |
| Architecture | Mixture-of-Experts with Early Fusion |
| Total Parameters | 109B |
| Active Parameters | 17B |
| Experts | 16 (2 active per token) |
| Context Window | 10M tokens |
| Training Data | ~40T tokens (text, image, video) |
| Training Compute | 5.0M GPU hours (H100-80GB) |
| Knowledge Cutoff | August 2024 |
| Input Modalities | Text, Image |
| Output Modalities | Text, Code |
| Supported Languages | 12 (200 in pre-training) |
| Quantization | BF16, FP8, INT4 available |
| Release Date | April 5, 2025 |
| License | Llama 4 Community License |

Benchmark Performance

| Benchmark | Llama 4 Scout | Llama 3.3 70B | Llama 3.1 405B | Llama 4 Maverick |
| --- | --- | --- | --- | --- |
| MMLU-Pro | 74.3 | 68.9 | 73.4 | 80.5 |
| GPQA Diamond | 57.2 | 50.5 | 49.0 | 69.8 |
| LiveCodeBench | 32.8 | 33.3 | 27.7 | 43.4 |
| MGSM (Multilingual) | 90.6 | 91.1 | 91.6 | 92.3 |
| MMMU (Vision) | 73.4 | - | 69.4 | - |
| MathVista (Vision) | 73.7 | - | 70.7 | - |
| DocVQA (test) | 94.4 | - | 94.4 | - |
| ChartQA | 90.0 | 88.8 | 90.0 | - |

The benchmark picture for Scout is mixed. On vision tasks, it is genuinely strong - MMMU 73.4, MathVista 73.7, and DocVQA 94.4 are all competitive or best-in-class for open-weight models. The multimodal early-fusion approach pays dividends here. On text reasoning, MMLU-Pro at 74.3 only modestly edges past Llama 3.1 405B's 73.4, but it does so while activating a fraction of the parameters per token (17B vs. 405B dense).

The coding benchmarks are where Scout shows its age and its limitations. LiveCodeBench at 32.8% is below Llama 3.3 70B's 33.3% - meaning Scout actually regressed on code generation despite being a newer, larger model. GPQA Diamond at 57.2% is respectable but trails the bigger Maverick by over 12 points. For serious coding workloads, Scout is not the right tool.

Key Capabilities

The 10M context window is Scout's killer feature and the reason to choose it over smaller, faster alternatives. If your workload involves processing entire codebases, legal document sets, or long-form content, no other open-weight model offers this capability. Meta's testing shows perfect needle-in-a-haystack retrieval at all depths, though real-world performance on reasoning tasks at extreme context lengths is a different question that independent benchmarks have not fully answered.
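For capacity planning against that window, a rough characters-per-token heuristic is enough to sanity-check whether a corpus fits. The ~4 chars/token ratio below is a common rule of thumb for English text and code, not a measurement of Llama 4's actual tokenizer, which varies by content and language:

```python
def fits_in_context(total_chars, window_tokens=10_000_000, chars_per_token=4):
    """Rough fit check for Scout's 10M-token window.

    Uses the common ~4 chars/token heuristic; real tokenizer counts differ,
    so treat the result as an order-of-magnitude estimate only.
    """
    tokens = total_chars // chars_per_token
    return tokens, tokens <= window_tokens

# e.g. a 25 MB monorepo dump: ~6.25M estimated tokens, comfortably inside
tokens, ok = fits_in_context(25_000_000)
print(tokens, ok)  # 6250000 True
```

At this ratio the full window holds roughly 40 MB of raw text, which is why "entire codebase in one prompt" is a realistic framing rather than marketing.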

Native multimodal support means Scout can analyze documents, charts, screenshots, and photographs without a separate vision pipeline. DocVQA at 94.4% and ChartQA at 90.0% indicate strong practical utility for document understanding workflows. The model handles up to 5 images per conversation in the instruction-tuned variant, which covers most document analysis scenarios.
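Several Scout hosts expose an OpenAI-compatible chat API that accepts `image_url` content parts alongside text. The sketch below builds such a request payload and enforces the 5-image limit mentioned above; the model id shown is the Hugging Face repo name and is an assumption here - providers may use different identifiers, so check your host's docs:

```python
def build_payload(question, image_urls,
                  model="meta-llama/Llama-4-Scout-17B-16E-Instruct"):
    """Build an OpenAI-style multimodal chat payload for a Scout host.

    Enforces the 5-image-per-conversation limit of the instruct variant.
    The model id is illustrative and provider-specific.
    """
    if len(image_urls) > 5:
        raise ValueError("instruction-tuned Scout supports up to 5 images")
    content = [{"type": "text", "text": question}]
    content += [{"type": "image_url", "image_url": {"url": u}}
                for u in image_urls]
    return {"model": model, "messages": [{"role": "user", "content": content}]}

payload = build_payload("What does this chart show?",
                        ["https://example.com/chart.png"])
print(len(payload["messages"][0]["content"]))  # 2 (one text part, one image)
```

POSTing this dict to a provider's `/chat/completions` endpoint with your API key is all a basic document-understanding workflow needs; no separate vision pipeline is involved.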

On the deployment side, the 109B total parameters require approximately 218GB in BF16 - too large for any single GPU. With INT4 quantization, the weights drop to roughly 55GB, which plus KV-cache and activation overhead fits on a single 80GB H100. FP8 quantized weights are available from Meta directly. Multiple API providers (DeepInfra, Lambda, Groq, SambaNova) offer hosted inference with throughput ranging from 69 to 776 tokens per second depending on the platform and quantization.
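The weight-memory arithmetic is just parameter count times bytes per parameter; a small helper makes the quantization trade-off concrete (weights only - KV cache, activations, and framework overhead come on top):

```python
def weight_footprint_gb(total_params_b=109, bits=16):
    """Approximate weight memory for Scout at a given quantization width.

    All 109B parameters must be resident even though only 17B are active
    per token - MoE saves compute per token, not memory.
    """
    return total_params_b * 1e9 * bits / 8 / 1e9  # decimal GB

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: ~{weight_footprint_gb(bits=bits):.1f} GB")
```

This is why the MoE design helps latency and cost per token but does nothing for the hardware floor: you pay for 109B of VRAM regardless of the 17B active figure.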

Pricing and Availability

Scout is available under the Llama 4 Community License - free for commercial use up to 700 million monthly active users, after which you need a separate agreement with Meta. Weights are on HuggingFace and through Meta's official channels.

API pricing across providers is competitive. DeepInfra offers $0.08/M input and $0.30/M output tokens. Groq charges $0.11/$0.34. Lambda matches DeepInfra's pricing. These are among the cheapest rates for a model with 17B active parameters and multimodal capability.
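Using the per-million-token rates quoted above, a quick comparison of what a concrete job costs per provider (the 2M-in / 5K-out workload is a made-up example, e.g. summarizing a large document set):

```python
# (input $/M tokens, output $/M tokens) - rates as quoted in the text above
RATES = {
    "DeepInfra": (0.08, 0.30),
    "Groq":      (0.11, 0.34),
    "Lambda":    (0.08, 0.30),
}

def job_cost(provider, input_tokens, output_tokens):
    """Total USD cost of one request at a provider's quoted rates."""
    inp, out = RATES[provider]
    return input_tokens / 1e6 * inp + output_tokens / 1e6 * out

# Example: feed 2M tokens of documents, get a 5K-token summary back
for p in RATES:
    print(f"{p}: ${job_cost(p, 2_000_000, 5_000):.4f}")
```

Even the long-context workloads Scout is built for stay in the tens-of-cents range at these rates, which is the practical appeal versus frontier closed models.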

Strengths

  • 10M token context window - unmatched in the open-weight space
  • Strong vision benchmarks (DocVQA 94.4, ChartQA 90.0, MMMU 73.4) through native early fusion
  • 17B active parameters from 109B total - efficient inference per token
  • Fits on single GPU with INT4 quantization
  • 200 pre-training languages with 12 natively supported post-training
  • Cheap API access across multiple providers

Weaknesses

  • LiveCodeBench (32.8) actually trails the older Llama 3.3 70B (33.3) - coding is not Scout's strength
  • Knowledge cutoff of August 2024 is now over a year old
  • Llama 4 Community License is not truly open - 700M MAU cap and attribution requirements
  • The full 109B parameters must reside in memory even though only 17B are active per token
  • GPQA Diamond (57.2) and coding benchmarks lag significantly behind Maverick and Claude Opus 4.6
  • Self-reported benchmarks from Meta - initial community reception was skeptical of some claimed numbers
