ERNIE 5.0: Baidu's Omni-Modal 2.4T Challenger
Baidu's ERNIE 5.0 combines 2.4 trillion parameters with a native omni-modal design, landing in LMArena's global top 10 and outpacing GPT-5 High on chart and document benchmarks.

Baidu's ERNIE 5.0 is the most ambitious model the Chinese search giant has shipped yet - a trillion-parameter system that processes text, images, audio, and video not through bolted-on adapters but through a single unified architecture trained on all four modalities together from day one.
TL;DR
- Strongest on document and chart tasks: ChartQA 87.8%, outperforming GPT-5 High (78.2%) and Gemini 2.5 Pro (76.3%)
- 2.4 trillion parameters, under 3% active per inference, 128K context at $0.85/M input tokens
- LMArena text score 1460, ranked #8 globally and first among Chinese models as of January 2026 - but its benchmark numbers are vendor-published and instruction following has real-world bugs
Announced at Baidu World 2025 on November 13, 2025, and made officially available on January 22, 2026, ERNIE 5.0 represents Baidu's direct counter to Gemini 3.1 Pro and GPT-5 in multimodal reasoning. The competitive framing is aggressive: Baidu ran head-to-head comparisons across 40-plus benchmarks, claiming wins or ties on most. Independently verified results are still thin, but enough data has surfaced to evaluate the core claims.
The model sits in a crowded Chinese frontier cluster alongside GLM-5, DeepSeek V4, and Qwen 3.6 Max. What sets ERNIE 5.0 apart is the omni-modal claim - not just multimodal input, but unified generation across all four media types without separate decoders stitched together after training.
Key Specifications
| Specification | Details |
|---|---|
| Provider | Baidu |
| Model Family | ERNIE |
| Parameters | ~2.4 trillion total (estimated); ~72 billion active per inference |
| Context Window | 128K tokens input / 64K tokens output |
| Input Price | $0.85/M tokens |
| Output Price | $3.40/M tokens |
| Release Date | January 22, 2026 (preview: November 13, 2025) |
| License | Proprietary |
| Open Source | No |
| Modalities | Text, image, audio, video (input and output) |
One important caveat: Baidu hasn't officially confirmed the 2.4 trillion parameter figure. The number comes from third-party reports and is broadly cited across coverage, but the technical paper states only "trillion-parameter scale." The below-3% activation rate per inference, by contrast, is confirmed in the arXiv technical report (2602.04705).
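The two headline figures are at least mutually consistent: roughly 72 billion active parameters out of roughly 2.4 trillion total works out to about 3%, which matches the report's sub-3% claim if both round numbers are approximate. A quick check:

```python
# Sanity-check the reported MoE activation rate:
# ~72B active parameters out of ~2.4T total (both approximate).
total_params = 2.4e12
active_params = 72e9

activation_rate = active_params / total_params
print(f"{activation_rate:.1%}")  # 3.0%
```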
Benchmark Performance
Multimodal AI systems integrate varied input types - ERNIE 5.0's native architecture handles text, images, audio, and video in a single unified model.
Source: pexels.com
Baidu's published comparisons show ERNIE 5.0 ahead of GPT-5 High and Gemini 2.5 Pro on visual and document tasks. Numbers from the technical report and corroborating third-party sources:
| Benchmark | ERNIE 5.0 | GPT-5 High | Gemini 2.5 Pro |
|---|---|---|---|
| MMLU-Pro | 83.80 | ~86 | ~85 |
| HumanEval+ | 94.48 | - | - |
| IFEval | 93.35 | - | - |
| SimpleQA | 74.01 | - | 69.33 |
| ChartQA | 87.80 | 78.2 | 76.3 |
| MathVista | 82.5 | 81.3 | 82.3 |
| LMArena Text Elo | 1460 (#8 global) | - | - |
MMLU-Pro at 83.80 is competitive but not the top of the chart. The story is different for structured visual tasks: ChartQA at 87.8 is nearly ten points ahead of GPT-5 High's 78.2, and the gap over Gemini 2.5 Pro is similarly wide. For practitioners who work with PDFs, spreadsheets, and scanned documents, that margin is meaningful.
HumanEval+ at 94.48 is a strong coding number, though this comes from Baidu's own evaluation run. The coding benchmarks leaderboard uses independent evaluation pipelines, and independent HumanEval+ numbers for ERNIE 5.0 aren't yet available. Treat the coding scores as a starting point, not a settled ranking.
The LMArena score of 1460 is crowdsourced human preference data - less gameable than static benchmarks - and reaching #8 globally in January 2026 is a credible signal. Still, votes on Baidu models skew toward Chinese-language users, which can inflate scores for Chinese-first systems.
Architecture and Design
ERNIE 5.0's sparse MoE design activates fewer than 72 billion of its 2.4 trillion parameters per inference, delivering frontier performance without proportional compute cost.
Source: pexels.com
The technical paper describes ERNIE 5.0's core as a "unified autoregressive framework" with "modality-agnostic expert routing." Rather than separate encoders for each media type, the model routes all input tokens - whether text, image patches, audio codec tokens, or video frames - through the same sparse expert pool.
The "Next-Group-of-Tokens Prediction" objective is the engine behind this. It combines standard next-token prediction for text with Next-Frame-and-Scale Prediction for vision and Next-Codec Prediction for audio. Training jointly on these objectives from scratch avoids the alignment tax you get when you attach vision modules to a text-only backbone after the fact.
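The routing idea above can be sketched in a few lines. This is a toy illustration under loose assumptions, not Baidu's implementation: a real MoE router uses a learned gating network, whereas the stand-in scoring function here is arbitrary arithmetic. The point is structural - tokens from any modality pass through one shared top-k selection over one expert pool:

```python
# Toy sketch of modality-agnostic top-k expert routing: every token,
# regardless of source modality, is scored against a single shared pool.
# The scoring function is an arbitrary stand-in; a real router learns it.

def route(token_id: int, num_experts: int = 8, top_k: int = 2) -> list[int]:
    """Pick the top_k experts for a token from one shared expert pool."""
    scores = [(token_id * 31 + e * 17) % 97 for e in range(num_experts)]
    return sorted(range(num_experts), key=lambda e: -scores[e])[:top_k]

# A text token and an image-patch token use the same router and pool:
text_token, image_patch = 5, 1042
print(route(text_token))   # [2, 7]
print(route(image_patch))
```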
Three elastic training dimensions stand out as genuinely novel:
- Elastic Depth: full-depth computation 75% of the time; reduced-depth 25%. This produces a model that can adapt its compute budget at inference time.
- Elastic Width: standard expert routing 80%, subset routing 20%. Useful for running the model at reduced memory footprint.
- Elastic Sparsity: standard top-k routing 80%, reduced top-k 20%.
Together, these give ERNIE 5.0 a claimed 53.7% parameter activation at a "near-full performance" operating point, and as low as under 3% activation in maximum-efficiency mode. That's relevant for enterprise deployments where GPU cost scales with active parameters.
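One plausible reading of the three ratios above is that each training step samples an elastic configuration independently per dimension. The sketch below assumes exactly that - the actual mechanism in ERNIE 5.0's training stack is not public beyond the quoted ratios:

```python
import random

# Toy sketch: sample an "elastic" training configuration per step using
# the quoted ratios (75/25 depth, 80/20 width, 80/20 sparsity).
# Independent per-dimension sampling is an assumption, not a confirmed detail.

def sample_elastic_config(rng: random.Random) -> dict[str, str]:
    return {
        "depth":    "full"     if rng.random() < 0.75 else "reduced",
        "width":    "standard" if rng.random() < 0.80 else "subset",
        "sparsity": "standard" if rng.random() < 0.80 else "reduced_topk",
    }

rng = random.Random(0)
configs = [sample_elastic_config(rng) for _ in range(10_000)]
full_depth = sum(c["depth"] == "full" for c in configs) / len(configs)
print(f"full-depth fraction over 10k steps: {full_depth:.2f}")
```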
The model was trained with Baidu's PaddlePaddle framework on a custom parallelism stack - 4-way tensor, 12-way pipeline, and 64-way expert parallelism layered over data parallelism - using FP8 mixed precision with dynamic adaptive offloading. The FlashMask attention implementation reportedly delivers a 200% speedup over FlexAttention for heterogeneous masking patterns.
Key Capabilities
ERNIE 5.0's clearest advantage is in document understanding. OCRBench, DocVQA, and ChartQA all favor models that handle mixed text-and-image reasoning rather than treating them separately. The native architecture has a structural advantage here: the same expert pool that processes text also processes the layout, typography, and graphical elements of a document without a lossy translation step.
On coding, HumanEval+ at 94.48 is competitive at the frontier. The model also shows strong instruction-following on IFEval (93.35), though the real-world instruction compliance story is more complicated - see Weaknesses below.
Audio understanding matches Gemini 3.1 Pro in Baidu's internal tests for speech recognition, though independent audio benchmarks for ERNIE 5.0 are not yet public. The model uses hierarchical codec-based audio tokenization with knowledge distilled from Whisper for semantic enrichment - a reasonable approach given Whisper's strong multilingual speech recognition baseline.
For Chinese-language tasks, ERNIE 5.0 leads by a significant margin on ChineseSimpleQA (86.03). Baidu's training data includes substantially more high-quality Chinese text than any Western lab's dataset, which matters for professional tasks in Mandarin.
Pricing and Availability
Consumer access is free via the ERNIE Bot at ernie.baidu.com. Enterprise and developer access goes through Baidu AI Cloud's Qianfan MaaS platform.
API pricing is $0.85 per million input tokens and $3.40 per million output tokens. For context: DeepSeek V4 undercuts this clearly, and Qwen 3.6 Max also runs cheaper. ERNIE 5.0 is priced at the premium end of Chinese frontier offerings, roughly matching mid-tier Western API pricing rather than the aggressively low-cost positioning that DeepSeek popularized.
The critical limitation for non-Chinese users is API access. The Qianfan platform works globally in principle, but onboarding requires a Chinese business registration for full enterprise access. Consumer access via ERNIE Bot is available internationally. A handful of third-party aggregators like Overchat AI offer ERNIE 5.0 access outside China with no registration barriers, though at unknown markup.
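For developers who do get access, Qianfan advertises an OpenAI-compatible chat interface. The sketch below builds such a request; note that the base URL and the model identifier `ernie-5.0` are assumptions for illustration, not confirmed values - check Qianfan's documentation for the real ones:

```python
import json

# Hypothetical request sketch assuming an OpenAI-compatible chat endpoint.
# API_BASE and MODEL are placeholder assumptions, not confirmed identifiers.

API_BASE = "https://qianfan.baidubce.com/v2"   # assumed base URL
MODEL = "ernie-5.0"                            # assumed model identifier

def build_chat_request(prompt: str, api_key: str) -> tuple[str, dict, str]:
    """Return (url, headers, body) for a chat-completion style call."""
    url = f"{API_BASE}/chat/completions"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
    })
    return url, headers, body

url, headers, body = build_chat_request("Summarize this chart.", "YOUR_KEY")
# A real call would then be e.g.: requests.post(url, headers=headers, data=body)
```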
At $0.85/$3.40 per million tokens, ERNIE 5.0 sits above DeepSeek and Qwen on price but still undercuts GPT-5 High by a factor of 10 or more.
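At those published rates, a back-of-envelope cost estimate is straightforward. The workload volumes below are illustrative, not a benchmark:

```python
# Cost estimate at ERNIE 5.0's published rates:
# $0.85 per million input tokens, $3.40 per million output tokens.

def monthly_cost(input_tokens: float, output_tokens: float,
                 in_rate: float = 0.85, out_rate: float = 3.40) -> float:
    """Cost in USD for a token volume; rates are $ per million tokens."""
    return (input_tokens * in_rate + output_tokens * out_rate) / 1e6

# Example workload: 500M input + 100M output tokens per month.
cost = monthly_cost(500e6, 100e6)
print(f"${cost:,.2f}")  # $765.00
```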
Strengths and Weaknesses
Strengths
- Native omni-modal architecture - no post-hoc fusion, all four modalities trained jointly from the start
- Best-in-class on chart and document understanding benchmarks (ChartQA 87.8%)
- Efficient MoE design: under 3% parameter activation per inference reduces cost at scale
- Elastic compute dimensions allow adaptive depth and width at runtime
- Strong Chinese-language performance, the best among frontier models on ChineseSimpleQA
- Free access via ERNIE Bot for testing and light workloads
Weaknesses
- Almost all benchmark results are self-reported by Baidu; independent replication is still limited as of April 2026
- Known instruction-following bug: the model repeatedly calls tools after being told not to, confirmed by Baidu as "a known bug"
- Full API access outside China is complicated by onboarding friction on Qianfan
- Parameter count (2.4T) is estimated from third-party reports, not confirmed by Baidu
- Priced above Chinese competitors - DeepSeek and Qwen offer comparable text performance for less
- No open-source release; full model weights aren't available
Related Coverage
- GLM-5: Chinese Frontier Model on Huawei Chips - direct domestic competitor from Tsinghua's team
- DeepSeek V4 - aggressive pricing challenger from a Baidu-adjacent Chinese lab
- Qwen 3.6 Max - Alibaba's flagship competing on both price and quality
- Gemini 3.1 Pro - Western comparison point ERNIE 5.0 directly targets in benchmarks
- Baidu Loses $11B as AI Hype Gap Widens - financial context for understanding Baidu's competitive pressure
- Chatbot Arena ELO Rankings - independent crowdsourced leaderboard where ERNIE 5.0 placed #8
FAQ
Is ERNIE 5.0 available outside China?
ERNIE Bot is accessible globally for free consumer use. Full enterprise API access through Qianfan requires a Chinese business registration. Third-party aggregators provide workarounds, but direct Qianfan API access outside China is limited.
How does ERNIE 5.0 compare to DeepSeek V4?
ERNIE 5.0 wins on visual and multimodal tasks (ChartQA, DocVQA) but costs more per token. DeepSeek V4 is stronger for pure text tasks at lower pricing. ERNIE 5.0 is the better choice for document-heavy workflows; DeepSeek for cost-sensitive text generation.
What makes ERNIE 5.0 "natively omni-modal"?
The model was trained on text, images, audio, and video simultaneously from the beginning, using a single shared expert pool. This differs from models that add vision or audio adapters to a text backbone after training, which creates a lossy modality boundary.
Are the benchmark scores verified?
The technical report (arXiv 2602.04705) provides the most detailed figures, but it's authored by Baidu researchers. The LMArena text score of 1460 is crowdsourced and harder to manipulate. Independent benchmark replication from third-party labs is still pending as of April 2026.
What is the context window?
ERNIE 5.0 supports 128K tokens of input and 64K tokens of output, per the technical report's Stage 2 context extension.
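Those asymmetric limits mean a pre-flight budget check is worth doing before a call. A minimal sketch, assuming "128K" means 128,000 tokens (some vendors mean 131,072 - the report does not specify) and that token counts come from whatever tokenizer your client uses:

```python
# Pre-flight check against ERNIE 5.0's published limits:
# 128K input tokens / 64K output tokens. "K" is assumed to mean 1,000 here.

INPUT_LIMIT = 128_000
OUTPUT_LIMIT = 64_000

def fits(input_tokens: int, max_output_tokens: int) -> bool:
    """True if the request stays within both the input and output budgets."""
    return input_tokens <= INPUT_LIMIT and max_output_tokens <= OUTPUT_LIMIT

print(fits(120_000, 8_000))   # True
print(fits(130_000, 8_000))   # False: input exceeds the 128K window
```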
Does ERNIE 5.0 support video generation?
Yes. VBench semantic score of 83.40 and GenEval image score of 90.1 are in the technical report, both competitive with top-tier generation models. Video generation is available in the ERNIE Bot consumer interface.
Sources:
- ERNIE 5.0 Technical Report (arXiv 2602.04705)
- Baidu Unveils ERNIE 5.0 at Baidu World 2025 - PR Newswire
- ERNIE 5.0 Thinking Preview - Artificial Analysis
- Baidu's ERNIE 5.0: Benchmark Champion That Can't Follow Basic Instructions - AI Gazette
- ERNIE 5.0 Stakes Frontier Claim - AI CERTs
- ERNIE 5.0 Preview 1022 Release on LMArena - ERNIE Blog
- Baidu Releases ERNIE 5.0: MoE Architecture - Quasa
- Ernie 5 Technical Coverage - The Batch (DeepLearning.AI)
✓ Last verified April 21, 2026
