Machine Translation Benchmarks Leaderboard 2026

Rankings of LLMs and dedicated MT systems across FLORES-200, WMT24/25, TICO-19, and MT-GenEval benchmarks with BLEU, COMET, and human evaluation scores.


Machine translation is one of the oldest benchmarkable NLP tasks, which means it's also one of the most contested. Everyone from Google to DeepL to a handful of fine-tuned open-weight models claims their system is best. The question worth asking is: best on which benchmark, for which languages, measured how?

This leaderboard cuts through that noise. It covers the primary MT benchmarks as of April 2026 - FLORES-200/FLORES+, WMT 2024, WMT 2025, TICO-19, IWSLT 2024/2025, and MT-GenEval - and maps where frontier LLMs actually sit relative to dedicated neural MT (NMT) systems. The short answer: LLMs have overtaken specialized engines for high-resource language pairs on human evaluation, but the gap narrows quickly for low-resource languages and domain-specific content.

TL;DR

  • Gemini 2.5 Pro won WMT25 human evaluation, topping 14 of 16 tested language pairs, with GPT-4.1 close behind
  • WMT24 showed Claude 3.5 Sonnet first on human ESA scores in 9 of 11 pairs - but TOWER-v2-70B beat everyone on COMET, exposing metric gaming
  • DeepL still holds the edge on European language BLEU (62-65 vs 59-62 for GPT-4o), but frontier LLMs outrun it on COMET for nuanced content
  • For low-resource languages and specialized domains (medical, legal), fine-tuned NLLB-200 3.3B still beats 7-8B LLMs

What These Benchmarks Actually Measure

Translation benchmarks fall into two types: automatic and human. Both matter, and they often disagree - sometimes dramatically.

FLORES-200 and FLORES+

FLORES-200 is Meta AI's benchmark covering 204 languages, providing professionally translated parallel sentences across diverse domains (Wikipedia, travel guides, children's books). It's the closest thing the field has to a universal MT test set. The expanded FLORES+ variant adds more language coverage and has been used in WMT shared tasks.

The benchmark's strength is breadth - 40,000+ translation directions. Its weakness, noted in a 2025 arXiv paper, is that many translations fall below the claimed 90% quality standard, and models can score a nonzero average BLEU (0.29 in that study) simply by copying named entities, without doing any real translation. Treat FLORES+ as a useful filter, not a verdict.

WMT 2024 and WMT 2025

The Workshop on Machine Translation annual shared tasks are the closest thing to a peer-reviewed gold standard in MT evaluation. WMT 2024 covered 11 language pairs and assessed 8 LLMs alongside 4 online translation providers, all scored by professional human annotators using Error Span Annotations (ESA). WMT 2025 expanded to 30 language pairs with human evaluation on 15 of them, using ESA for most and MQM (Multidimensional Quality Metrics) for English-to-Korean and Japanese-to-Chinese.

The organizers title their 2024 paper "The LLM Era Is Here but MT Is Not Solved Yet." That framing is accurate. Frontier LLMs now win on human evaluation for general content, but they don't consistently beat specialized systems on every domain and language pair.

NTREX-128

Microsoft's NTREX-128 provides 128-language news-domain translation references from English, covering 1,997 sentences (roughly 42,000 words). It's mostly used as a complementary benchmark alongside FLORES+, particularly for evaluating coverage of lower-resource languages. COMET-22 is the main metric applied to NTREX results in recent studies, with chrF as a secondary metric for languages not well supported by COMET's training data.

TICO-19

The Translation Initiative for COVID-19 covers medical-domain translations into 35 languages, with particular focus on lower-resource African, South Asian, and Southeast Asian languages. It's a useful domain stress test because medical content demands precise terminology - models can't paraphrase their way through it.

Research published in late 2024 found that fine-tuned NLLB-200 3.3B outperforms 7-8B LLMs across three of four language directions on TICO-19, with the gap closing only for larger LLMs at 40B+ parameters. For medical content in Swahili or Hindi, the dedicated MT system still wins.

MT-GenEval

Amazon's MT-GenEval evaluates gender accuracy in translation from English into Arabic, French, German, Hindi, Italian, Portuguese, Russian, and Spanish. It uses counterfactual and contextual sentence pairs where gender is unambiguous in context, testing whether systems apply correct gender agreement across multi-sentence segments.

Base LLMs show more gender bias than NMT models according to 2024 EMNLP research. Instruction-tuned LLMs with explicit gender prompting can close the gap by up to 12%, but getting consistent results requires careful prompt engineering.
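
What "explicit gender prompting" looks like in practice is a prompt template that states each referent's gender before the source text. The wording below is a hypothetical sketch, not a template from the MT-GenEval paper - adapt it to your model and language pair:

```python
def gender_aware_prompt(source: str, target_lang: str, referents: dict) -> str:
    """Build a translation prompt that states each referent's gender
    explicitly, so the model can apply consistent gender agreement.
    This template is illustrative only."""
    hints = "; ".join(f"'{name}' is {gender}" for name, gender in referents.items())
    return (
        f"Translate the following text into {target_lang}. "
        f"Gender context: {hints}. "
        f"Use grammatically consistent gender agreement throughout.\n\n{source}"
    )

prompt = gender_aware_prompt(
    "The doctor finished her shift. She drove home.",
    "German",
    {"the doctor": "female"},
)
```

In production you would derive the referent-gender mapping from upstream context (coreference resolution or explicit metadata) rather than hard-coding it.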

IWSLT Speech Translation

The International Workshop on Spoken Language Translation assesses speech-to-text translation (cascaded: ASR then MT) and end-to-end speech translation. IWSLT 2024 and 2025 have seen cascaded systems combining Whisper large v2 with LLM translation components emerge as competitive, with BLEU scores on English-to-German reaching 24-28 depending on latency constraints.
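
The cascaded architecture is simple to express: run ASR, then feed the transcript to a text-to-text translator. Here is a minimal sketch with the two stages injected as plain callables; the stub backends stand in for real Whisper and LLM calls, which are not modeled here:

```python
from typing import Callable

def cascaded_speech_translation(
    audio: bytes,
    transcribe: Callable[[bytes], str],  # e.g. a Whisper wrapper
    translate: Callable[[str], str],     # e.g. an LLM translation call
) -> str:
    """Cascade: ASR first, then text-to-text MT on the transcript.
    Segmentation, latency control, and error handling are omitted."""
    transcript = transcribe(audio)
    return translate(transcript)

# Stub backends for demonstration:
out = cascaded_speech_translation(
    b"<fake audio>",
    transcribe=lambda a: "hello world",
    translate=lambda t: t.upper(),  # pretend "translation"
)
```

Because the stages are decoupled, either component can be swapped independently - which is exactly why cascaded systems improve whenever a better ASR or MT model ships.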


Scoring Metrics Explained

Before the tables, a quick note on what each number means - because BLEU and COMET tell different stories.

BLEU measures n-gram overlap between the model output and a reference translation. Scores range 0-100; anything above 50 for European language pairs is generally good, but BLEU punishes paraphrasing and doesn't handle synonyms well. It's a legacy metric at this point - useful for historical comparisons, not for distinguishing high-quality systems from each other.
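
The clipped n-gram precision idea behind BLEU is easy to see in code. This is a toy sentence-level sketch for illustration only - real BLEU is corpus-level and uses proper smoothing, so use sacrebleu in practice:

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """Count all contiguous n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Toy BLEU: geometric mean of clipped n-gram precisions (n=1..4)
    times a brevity penalty. Illustrative, not standards-compliant."""
    cand, ref = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        c_counts, r_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each n-gram count by its count in the reference
        clipped = sum(min(c, r_counts[g]) for g, c in c_counts.items())
        total = max(sum(c_counts.values()), 1)
        log_prec += math.log(max(clipped, 1e-9) / total)  # crude zero smoothing
    # Brevity penalty punishes candidates shorter than the reference
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return 100 * bp * math.exp(log_prec / max_n)
```

The clipping step is why copying reference words verbatim caps out: each n-gram only counts up to the number of times it appears in the reference.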

chrF++ uses character n-gram overlap and F-score, correlating better with human judgment than BLEU across morphologically rich languages (Finnish, Turkish, Czech). WMT uses it as a secondary automatic metric.
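
The character-level F-score idea is equally compact. This toy version implements plain chrF (character n-grams only); the chrF++ variant used by WMT additionally mixes in word n-grams, so again, use sacrebleu for real scoring:

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    s = text.replace(" ", "")  # chrF ignores spaces by default
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf(candidate: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Toy chrF: average character n-gram precision and recall (n=1..6),
    combined as an F-beta score. beta=2 weights recall twice as heavily."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        c, r = char_ngrams(candidate, n), char_ngrams(reference, n)
        overlap = sum(min(c[g], r[g]) for g in c)
        precisions.append(overlap / max(sum(c.values()), 1))
        recalls.append(overlap / max(sum(r.values()), 1))
    p, r = sum(precisions) / max_n, sum(recalls) / max_n
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)
```

Because it matches inside words, chrF gives partial credit for near-miss inflections - the reason it behaves better than BLEU on morphologically rich languages.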

COMET is a neural metric trained on human translation quality judgments. Scores usually range 0.75-0.95 for production-quality systems, with 0.85+ generally considered strong. COMET tracks human evaluation better than BLEU, but the WMT24 findings exposed a critical flaw: TOWER-v2-70B gamed COMET by tuning against the metric, scoring first on COMET across all 11 language pairs while losing to Claude 3.5 on 9 of 11 under human evaluation.

MetricX is Google's neural MT metric, submitted to WMT 2024's metrics shared task. MetricX-24 showed strong correlation with human judgments in the WMT24 metrics task and is a viable alternative to COMET, especially for Google's language pairs.

Human ESA (Error Span Annotation) is the gold standard for WMT. Human annotators mark each error span in translations, categorizing by type and severity. It's expensive and slow, which is why it only runs on final system outputs once per year.
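
The span-annotation idea can be sketched as a data structure plus a severity-weighted score. The penalty weights below are hypothetical (loosely modeled on MQM's minor/major convention), not the official ESA scoring protocol:

```python
from dataclasses import dataclass

# Hypothetical severity penalties for illustration; real ESA/MQM
# protocols define their own weights and categories.
PENALTY = {"minor": 1.0, "major": 5.0}

@dataclass
class ErrorSpan:
    start: int      # character offset where the error begins
    end: int        # character offset where it ends
    severity: str   # "minor" or "major"
    category: str   # e.g. "mistranslation", "omission", "grammar"

def segment_score(spans: list, base: float = 100.0) -> float:
    """Severity-weighted quality score for one translated segment:
    start from a perfect score, subtract a penalty per marked span."""
    return max(0.0, base - sum(PENALTY[s.severity] for s in spans))

spans = [
    ErrorSpan(10, 18, "major", "mistranslation"),
    ErrorSpan(40, 44, "minor", "grammar"),
]
# segment_score(spans) -> 94.0
```

The expensive part is not the arithmetic but producing the spans: each one requires a trained annotator reading source and output side by side.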


WMT 2025 Results: Overall Rankings

These are the final WMT25 human evaluation results published in late 2025. The evaluation used ESA for 13 language pairs and MQM for English-Korean and Japanese-Chinese. Systems were compared within constrained (open-source, up to 20B parameters) and unconstrained tracks.

Rank | Model / System | Provider | WMT25 Top-Cluster Pairs | Track
---- | -------------- | -------- | ----------------------- | -----
1 | Gemini 2.5 Pro | Google | 14 of 16 | Unconstrained
2 | GPT-4.1 | OpenAI | Top tier (close second) | Unconstrained
3 | Command A | Cohere | Close behind tier 1 | Unconstrained
4 | DeepSeek-V3 | DeepSeek | Mid-high tier | Unconstrained
5 | Claude 4 | Anthropic | Mid-high tier | Unconstrained
6 | Llama 4 | Meta | Lower tier | Unconstrained
7 | Qwen-3 | Alibaba | Lower tier | Unconstrained
8 | Mistral | Mistral AI | Lower tier | Unconstrained
- | Google Translate | Google | Mid-ranking | Online provider
- | Microsoft Translator | Microsoft | Mid-ranking | Online provider
- | DeepL | DeepL | Mid-ranking | Online provider
- | Shy-hunyuan-MT | Tencent | 11 of 16 (constrained win) | Constrained

Source: WMT25 General MT Shared Task findings. "Top-cluster" pairs = statistically tied for first in human evaluation cluster analysis. Automatic metric rankings are preliminary; human evaluation is authoritative.

One detail worth flagging: human references placed in the winning cluster for only 6 of 15 assessed language pairs. The best MT systems now equal or exceed professional human translators on standard test sets for common language pairs.


WMT 2024 Results: Human vs. COMET

WMT 2024 is the clearest demonstration of why you shouldn't trust a single metric. Two different metrics produce two different winners.

System | Provider | Human ESA Rank | COMET Auto Rank
------ | -------- | -------------- | ---------------
Claude 3.5 Sonnet | Anthropic | 1st (9/11 pairs) | Mid
GPT-4o | OpenAI | Strong 2nd | Strong
TOWER-v2-70B | Unbabel | Lost to Claude 3.5 on most pairs | 1st (all 11 pairs)
Google Translate | Google | Competitive | Competitive
DeepL | DeepL | Competitive | Competitive

The key finding from the WMT24 paper: TOWER-v2 was "optimized for the metric rather than actual translation quality." COMET now has a contamination problem similar to what happened to MMLU - once systems are trained against a metric, the metric stops measuring what it claims to measure.

[Image: books and printed text representing multilingual written materials. Source: pexels.com] High-resource European language pairs now see LLMs beat dedicated NMT engines on human evaluation - a shift that wasn't expected to happen this quickly.


Automatic Metric Scores by Language Pair

The tables below draw on two sources: the Localize.js blind study (September 2025; 40 phrases across six major language pairs, rated by native linguists on a 1-5 scale) and the IntlPull benchmark (2024 data; 10 language pairs, scored with BLEU and COMET).

COMET Scores (0-1 scale, IntlPull benchmark, 2024 data)

Rank | Model | Provider | COMET Score
---- | ----- | -------- | -----------
1 | GPT-4 Turbo | OpenAI | 0.847
2 | Claude 3.5 Sonnet | Anthropic | 0.841
3 | DeepL API Pro | DeepL | 0.838
4 | Gemini 1.5 Pro | Google | 0.812

Note: These are 2024 scores on an earlier model generation. Frontier 2025-2026 models (GPT-5, Gemini 2.5 Pro, Claude Opus 4.x) score meaningfully higher on COMET. See the capabilities/translation page for current model-by-model ratings.

BLEU Scores by Language Pair (DeepL vs GPT-4o, 2025 data)

Language Pair | DeepL BLEU | GPT-4o BLEU | Delta
------------- | ---------- | ----------- | -----
English-German | 64.5 | 61.2 | DeepL +3.3
English-French | 63.1 | 60.8 | DeepL +2.3
English-Spanish | 62.8 | 59.6 | DeepL +3.2
English-Chinese | 51.3 | 57.4 | GPT-4o +6.1
English-Japanese | 49.1 | 54.8 | GPT-4o +5.7

DeepL's lead on European language BLEU is real but narrowing. On non-European languages - particularly Chinese and Japanese - GPT-4o pulls ahead on BLEU, consistent with LLMs' larger multilingual training corpora.

Human Quality Ratings by Language (Localize.js study, 1-5 scale)

System | Chinese | Japanese | Spanish | German | French | Overall
------ | ------- | -------- | ------- | ------ | ------ | -------
GPT-4o Mini | 4.75 | 4.79 | 4.71 | 4.62 | 4.68 | 4.75
Claude 3.5 Sonnet | 4.92 | 4.82 | 4.69 | 4.58 | 4.65 | 4.73
DeepL | 4.51 | 4.61 | 4.79 | 4.58 | 4.67 | 4.63
Amazon Translate | 4.55 | 4.58 | 4.63 | 4.51 | 4.62 | 4.62
Microsoft Translator | 4.52 | 4.54 | 4.61 | 4.53 | 4.60 | 4.61
Google NMT | 4.41 | 4.47 | 4.61 | 4.45 | 4.34 | 4.49

Claude 3.5's 4.92 in Chinese and 4.82 in Japanese are the highest single-language scores in the study. DeepL leads for European pairs.


Domain-Specific Performance: TICO-19 (Medical)

For domain-specific translation, the picture is more complicated. This is from a December 2024 study comparing Llama models and NLLB-200 on TICO-19 medical data.

Model | En-Fr BLEU | En-Fr COMET | En-Pt BLEU | En-Pt COMET
----- | ---------- | ----------- | ---------- | -----------
NLLB-200 3.3B (fine-tuned) | 53.13 | 83.83 | 46.41 | 87.29
Llama-3.1 405B (one-shot) | 42.61 | 69.56 | 47.76 | 88.07
Llama-3 8B (fine-tuned) | 49.67 | 77.87 | 43.63 | 84.22

NLLB-200 3.3B fine-tuned on medical data beats one-shot Llama 3.1 405B by more than 10 BLEU points for English-to-French medical text. The 405B model edges ahead for English-to-Portuguese on both BLEU and COMET, but narrowly. Against smaller 7-8B LLMs, fine-tuned NLLB-200 wins in three of the four tested directions.

The practical conclusion: if you're building a medical translation pipeline and latency or cost matters, a domain-specialized NMT model like NLLB-200 or Meta's SeamlessM4T v2 is still the better call for many language pairs.

[Image: world globe illustrating the challenge of coverage across 200+ languages. Source: pexels.com] FLORES-200's 204-language coverage exposes a clear quality gradient - models that score well on English-Spanish often struggle notably on Swahili, Yoruba, or Bhojpuri.


Gender Bias: MT-GenEval Results

Gender accuracy in translation is a separate problem from overall quality. MT-GenEval tests eight language pairs where the referent's gender is clear from context and must be carried consistently through multi-sentence outputs.

The 2024 research found:

  • Base (non-instruction-tuned) LLMs show higher gender bias than NMT models
  • Instruction-tuned LLMs with explicit gender-aware prompting reduce bias by up to 12% on WinoMT
  • No current publicly assessed model has eliminated gender errors, especially for morphologically complex languages (Arabic, Russian)

This matters for anyone building translation systems for HR software, legal documents, or medical records, where misgendering is a compliance risk, not just a quality issue. None of the current frontier LLMs ship with gender-consistent translation as a default behavior.


IWSLT 2025: Speech Translation

For spoken language translation, the IWSLT 2025 simultaneous translation track showed cascaded systems (Whisper ASR + LLM) still beating end-to-end models in most conditions:

  • BeaverTalk (OSU): Whisper large v2 + fine-tuned Gemma 3, BLEU 24.64 at StreamLAAL 1837ms on English-to-German
  • CUNI: Whisper in simultaneous mode + EuroLLM, improving 2 BLEU points on Czech-English and 13-22 BLEU on English-to-German/Chinese/Japanese vs. baseline
  • End-to-end Indic track systems averaged 27.86 BLEU for Indic-to-English directions

Cascaded systems still dominate where high-quality ASR models exist (English-source content). End-to-end systems close the gap for lower-resource source languages where dedicated ASR is weak.


Key Takeaways

LLMs Have Overtaken Dedicated MT for High-Resource Languages

WMT24 and WMT25 both confirm it: frontier LLMs beat Google Translate, DeepL, and Microsoft Translator on human evaluation for general content in major European, Asian, and Slavic language pairs. Gemini 2.5 Pro at the top of WMT25 and Claude 3.5 winning WMT24 on human ESA are two data points in a consistent trend.

This doesn't mean you should replace your MT pipeline with an LLM API call. LLMs cost more per character, add latency, and behave less predictably than deterministic NMT systems. For high-volume localization of UI strings or short phrases, DeepL and Google Translate remain cost-effective and predictable.

COMET Is Being Gamed

The WMT24 TOWER-v2 result isn't an isolated oddity. It's the same dynamic that hit MMLU, HumanEval, and every other benchmark that systems optimize against directly. If you're selecting an MT system based solely on COMET, you're selecting for COMET optimization, not translation quality. Use COMET as a development signal, not a final verdict, and weight ESA and MQM human evaluation for production decisions.

Low-Resource Languages Remain a Hard Problem

No LLM reliably covers 200 languages at the same quality it handles English-German. FLORES-200 scores reveal steep quality gradients between European languages and low-resource African or South Asian languages. For those, NLLB-200 or OpusMT fine-tuned on domain data is often the practical choice, especially at inference cost constraints. See the multilingual LLM leaderboard for detail on per-language breakdowns.

Domain Specialization Still Beats Scale for Niche Content

Medical, legal, and technical translation require terminology consistency that general LLMs don't guarantee out of the box. Fine-tuning NLLB-200 3.3B on medical corpora beats one-shot Llama 3.1 405B by 10+ BLEU on TICO-19 for English-French. Scale doesn't compensate for missing domain knowledge when the content has no tolerance for paraphrase.

The Metric Problem Is Getting Worse

Three major evaluation tools now have known contamination or gaming issues: FLORES+ (named-entity copying), COMET (metric optimization, as TOWER-v2 demonstrated), and, to a lesser extent, BLEU (unreliable for comparing high-quality systems). The community is moving toward human ESA and MQM as the authoritative measures, but those are expensive. The WMT25 paper's subtitle - "Time to Stop Evaluating on Easy Test Sets" - signals that the problem is recognized. It hasn't been solved yet. For a broader view of this pattern across AI benchmarks, see the overall LLM rankings.


Practical Recommendations

High-volume European language localization - DeepL API Pro or Google Translate v3. Lower cost per character, consistent output, well-understood failure modes. Reserve LLMs for content where marketing voice or legal precision matters.

General-purpose frontier translation - Gemini 2.5 Pro or GPT-5 for the highest quality across diverse language pairs. Claude Opus 4.x if stylistic fluency is the primary criterion. Reference the capabilities/translation page for current provider pricing.

East Asian languages (Chinese, Japanese, Korean) - Frontier LLMs beat DeepL on BLEU and human scores. Qwen-MT is a competitive cost-efficient option for Chinese-centric workflows; see the Qwen model profiles for API availability.

Medical, legal, or technical domains - Fine-tune NLLB-200 or use a specialized API. General LLMs without domain adaptation will produce fluent but imprecise output. This isn't a quality issue in the subjective sense - it's a terminology consistency issue that generic COMET scores don't capture.

Low-resource languages (Swahili, Yoruba, Bhojpuri, etc.) - Meta NLLB-200 (200-language coverage, open-source), OpusMT Helsinki models, or SeamlessM4T v2. Don't assume frontier LLM quality extends to these languages.

Speech translation - Cascaded Whisper + LLM pipelines outperform end-to-end systems for English-source content. Use end-to-end only when source ASR quality is weak.


FAQ

Which model is best for machine translation in 2026?

Gemini 2.5 Pro won WMT25, the field's most rigorous human evaluation, topping 14 of 16 language pairs. GPT-4.1 and Claude 4 are in the next tier. For cost-sensitive high-volume work, DeepL remains competitive for European pairs.

Is BLEU still a useful translation metric?

BLEU is useful as a trend indicator and for historical comparisons, but poor for comparing high-quality systems. COMET correlates better with human judgment. For production decisions, use ESA or MQM human evaluation - the results often differ from automatic metrics.

Do LLMs beat Google Translate and DeepL?

Yes, on human evaluation for major language pairs, frontier LLMs now outperform Google Translate and DeepL. But for high-volume, cost-sensitive localization of simple content, the dedicated systems remain competitive and markedly cheaper.

What is FLORES-200 and why does it matter?

FLORES-200 is Meta AI's 204-language parallel evaluation corpus. It's the most widely used benchmark for multilingual MT coverage, included in WMT shared tasks. Its weakness is that it's increasingly gamed by models copying named entities, and many low-resource language translations are lower quality than claimed.

How does COMET differ from BLEU?

COMET is a neural metric trained on human quality judgments; it handles synonyms and paraphrase better than BLEU's n-gram matching. COMET scores typically range 0.75-0.95. BLEU ranges 0-100, but practical scores for good European-language MT are 50-70. COMET correlates more reliably with human judgment, but is itself vulnerable to optimization.

What is MT-GenEval and should I care?

MT-GenEval measures gender accuracy in translation for eight language pairs. If you're translating HR, medical, or legal content where misgendering creates liability, yes. No current LLM gets this right by default - you need prompt engineering or post-processing to enforce gender consistency.


Last verified: April 19, 2026

About the author: AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.