Best AI Models for Language Translation - May 2026

TL;DR

Gemini 3.1 Pro scores 61% on the OpenMark March 2026 benchmark, tied with GPT-5.4 and Kimi K2.5, and inherits the WMT25 top spot from Gemini 2.5 Pro, at $2/M input tokens
GPT-5.5 and Claude Opus 4.7 launched in April 2026 - too late for any formal translation benchmark; ranked by predecessor lineage
WMT26 won't publish results until November 2026, making OpenMark March 2026 the freshest cross-model evaluation available

Gemini 3.1 Pro is the strongest verified choice for broad translation needs as of May 2026. Its predecessor won WMT25 human evaluation across 14 of 16 language pairs, and the OpenMark March 2026 benchmark confirms 3.1 Pro is competitive at the frontier tier. At $2.00 per million input tokens, it undercuts the nearest premium alternatives by 60%.

GPT-5.5 (April 23) and Claude Opus 4.7 launched after the most recent translation benchmarks closed. Both come from model families that finished in the top three of the lechmazur round-trip evaluation in September 2025: GPT-5 placed first and Claude Opus 4.1 third. Until a fresh benchmark re-runs, their translation ranking is inferred from predecessors' scores.

WMT26 is scheduled for Budapest in November 2026 and will produce the next formal human evaluation. Until then, the OpenMark March 2026 results and lechmazur September 2025 data are what's available.

Rankings

Rank	Model	Provider	Score	Price (Input)	Verdict
1	Gemini 3.1 Pro	Google	OpenMark March 2026: 61%	$2.00/M	WMT25 lineage; 2M context; best value at frontier
2	GPT-5.5	OpenAI	Post-benchmark (Apr 23, 2026)	$5.00/M	Predecessor led lechmazur 8.690; widest rare-language ceiling
3	Claude Opus 4.7	Anthropic	Post-benchmark	$5.00/M	Pro translator approval lineage; leads Spanish, Arabic
4	Grok 4.20	xAI	Grok 4 lechmazur: 8.573/10 (Sep 2025)	$1.25/M	Leads Japanese; 4.20 is 58% cheaper than original Grok 4
5	Minimax M2.5 Lightning	MiniMax	OpenMark March 2026: 71% (deterministic)	$0.30/M	OpenMark #1 with methodology caveat; fast throughput
6	DeepSeek V4	DeepSeek	Post-benchmark (Apr 24, 2026)	$1.74/M	Open-weight frontier; self-hostable; CJK improvements
7	Qwen-MT	Alibaba	Competitive with GPT-4.1 on WMT24	$0.50/M (out)	92 languages; CJK specialist; MoE architecture
8	DeepL Pro	DeepL	BLEU: EN-DE 64.5, EN-ES 62.8	$25/M chars	EU pair leader; only 33 languages supported
9	Google Cloud Translate	Google	BLEU 47-55 (EU pairs)	$20/M chars	133 languages; strongest low-resource breadth
10	Meta NLLB-200	Meta	+44% FLORES vs prior SOTA	Free (OSS)	200 languages; only free option for rare pairs

Detailed Analysis

Gemini 3.1 Pro

Released February 19, 2026, Gemini 3.1 Pro inherits a strong translation record from its predecessor. Gemini 2.5 Pro placed at the top of WMT25 human evaluation across 14 of 16 language pairs - the annual academic gold standard run last November. The March 2026 OpenMark benchmark confirms that 3.1 Pro stays competitive: it scored 61%, level with GPT-5.4 and Kimi K2.5, in an evaluation of 25 models across six languages.

The practical case for Gemini 3.1 Pro is context length at a reasonable price. At $2.00 per million input tokens, it undercuts GPT-5.5 by 60% and handles context up to 2 million tokens - enough to process a full legal agreement without chunking. That matters because chunked documents tend to drift in terminology and tone between segments, and inconsistency across long documents is one of the costliest problems in professional translation workflows. See the long-context retrieval capability page for how it handles document-length tasks generally.

For South Asian languages - Hindi, Telugu, Bengali - Gemini models have consistently outperformed GPT-4 variants across two WMT evaluations, a pattern tied to Google's data coverage. Extended context pricing rises to $4.00/M above 200K tokens; at standard context lengths the $2.00 rate makes it the best cost-per-translation-unit among frontier models.
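The two-tier input pricing can be sketched as a small estimator. This is a rough illustration, not Google's billing logic: the function name is hypothetical, and it assumes the $4.00 rate applies to the whole prompt once it passes 200K tokens rather than only to the tokens beyond the threshold.

```python
def input_cost_usd(prompt_tokens: int,
                   base_rate: float = 2.00,
                   extended_rate: float = 4.00,
                   threshold: int = 200_000) -> float:
    """Estimate the input cost in USD for a single request.

    Rates are USD per million input tokens, taken from the article.
    ASSUMPTION: a prompt above the threshold is billed entirely at the
    extended rate. Check the provider's pricing page before relying on it.
    """
    rate = extended_rate if prompt_tokens > threshold else base_rate
    return prompt_tokens / 1_000_000 * rate


# A 100K-token contract stays on the standard tier;
# a 500K-token document set moves to the extended tier.
standard = input_cost_usd(100_000)
extended = input_cost_usd(500_000)
```

Under that assumption, keeping requests below the threshold where the document allows it halves the per-token input rate.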

GPT-5.5

OpenAI released GPT-5.5 on April 23, 2026 - two weeks after the most recent translation benchmark data closed. Its predecessor, GPT-5, led the lechmazur round-trip benchmark at 8.690 out of 10 across ten languages, 200 source texts each. GPT-5.5 costs $5.00 per million input tokens and $30.00 output, which is twice what GPT-5.4 cost.

The lechmazur round-trip approach tests English to target language and back, measuring how much meaning and voice survive the round trip. GPT-5 led or placed second in nine of ten languages. Its biggest margin over competitors showed in low-resource languages: Swahili at 8.573 vs Gemini 2.5 Pro's 8.346. GPT-5.5 is OpenAI's strongest released model and is expected to improve on those scores, but no translation-specific evaluation has published results yet. For a full capability profile, see our GPT-5.5 review.

For teams running high-stakes rare-language translation - East African languages, regional South Asian dialects - GPT-5.5 is the defensible choice pending independent translation benchmarks.

Claude Opus 4.7

Claude Opus 4.7 is Anthropic's current flagship, succeeding Claude Opus 4.1, which scored 8.559 on the lechmazur benchmark in September 2025. The Anthropic translation lineage is built on professional translator approval rather than automatic metrics. In a 2025 Lokalise study, Claude 3.5 Sonnet - two generations back - hit a 78% "good" rating from professional translators, the highest of any model tested. The WMT24 human evaluation also showed Claude 3.5 Sonnet winning 9 of 11 language pairs despite TOWER-v2-70B leading all 11 on COMET scores.

Opus 4.1 led Arabic (8.616), Spanish (8.743), and Polish (8.637) head-to-head against GPT-5 in lechmazur data. Claude Opus 4.7 pricing is $5.00 per million input and $25.00 output - matching GPT-5.5 on input but undercutting it on output by 17%. Prompt caching drops reads to $0.50/M, which matters for workflows that reuse large system prompts. Translation benchmark results for Opus 4.7 specifically will appear when lechmazur or a comparable benchmark re-assesses April 2026 releases. See our Claude Opus 4.7 review for the full picture.
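To see why cached reads matter for translation workloads, here is a minimal sketch of input cost for a batch of requests that share one large system prompt (for example a glossary plus style guide). The function and its parameters are hypothetical, and the model is deliberately simplified: the first request pays the full $5.00/M rate for the system prompt, every later request reads it at $0.50/M, and cache-write surcharges and cache expiry are ignored.

```python
def batch_input_cost(requests: int,
                     system_tokens: int,
                     doc_tokens: int,
                     input_rate: float = 5.00,
                     cached_rate: float = 0.50,
                     use_cache: bool = True) -> float:
    """USD input cost for `requests` calls sharing one system prompt.

    Rates are USD per million tokens, taken from the article.
    ASSUMPTION: only the shared system prompt is cached; each request's
    own text to translate is always billed at the full input rate.
    """
    per_million = 1_000_000
    doc_cost = requests * doc_tokens / per_million * input_rate
    if not use_cache or requests == 0:
        return doc_cost + requests * system_tokens / per_million * input_rate
    first_call = system_tokens / per_million * input_rate
    later_calls = (requests - 1) * system_tokens / per_million * cached_rate
    return doc_cost + first_call + later_calls


# 1,000 support tickets of ~500 tokens each, sharing a 20K-token glossary prompt.
uncached = batch_input_cost(1000, 20_000, 500, use_cache=False)
cached = batch_input_cost(1000, 20_000, 500, use_cache=True)
```

With these illustrative numbers the input bill falls from about $102.50 to about $12.59, because the shared prompt dominates the token count.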

Grok 4.20

xAI released Grok 4.20 on March 31, 2026, as a direct successor to the original Grok 4. Input pricing dropped from $3.00/M to $1.25/M - a 58% reduction. The original Grok 4 scored 8.573/10 on the lechmazur benchmark in September 2025, placing second overall and earning the top score specifically for Japanese (8.703). No translation benchmark has yet assessed Grok 4.20 directly.

The lechmazur margin between Grok 4 (8.573) and GPT-5 (8.690) was 0.12 points - narrow enough that Grok 4.20 is realistically competitive at the frontier tier, and the lower price makes it easier to justify for Japanese, Korean, or Chinese translation workloads. Grok 4 also led Chinese (8.634) and Turkish (8.605) in the lechmazur data. For full specs, see the Grok 4.20 model page.

Minimax M2.5 Lightning

OpenMark's March 2026 benchmark put Minimax M2.5 Lightning at 71% - first place, ahead of GPT-5.4 at 61%. That result needs context about the methodology.

OpenMark scores using a deterministic approach: contains_all for required key terms and contains_any for acceptable translation variants. It doesn't use LLM-as-judge and doesn't involve human evaluation. This measures whether a translation contains the right vocabulary, not whether it reads naturally or handles idioms correctly. The overall spread - only 29 percentage points separating first from last - suggests the test set doesn't create meaningful separation at the top.
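The limitation is easier to see in code. The following is a minimal sketch of contains_all / contains_any scoring, not OpenMark's actual implementation: the case-insensitive substring matching, the rule that both checks must pass, and the Spanish example strings are all assumptions made for illustration.

```python
def contains_all(text: str, terms: list[str]) -> bool:
    """True if every required term appears in the text (case-insensitive)."""
    lowered = text.casefold()
    return all(term.casefold() in lowered for term in terms)


def contains_any(text: str, variants: list[str]) -> bool:
    """True if at least one acceptable variant appears in the text."""
    lowered = text.casefold()
    return any(variant.casefold() in lowered for variant in variants)


def passes(translation: str, required: list[str], variants: list[str]) -> bool:
    """ASSUMPTION: an item passes only when both checks succeed."""
    return contains_all(translation, required) and contains_any(translation, variants)


def percent_passed(outcomes: list[bool]) -> float:
    """Share of passing items, as a percentage."""
    return 100 * sum(outcomes) / len(outcomes)


# Hypothetical item: "Your invoice is overdue; please pay today."
required = ["factura", "pago"]
variants = ["vencida", "atrasada"]

fluent = "Su factura está vencida; realice el pago hoy."
stilted = "Factura vencida pago hacer usted hoy."   # right words, broken grammar
wrong = "Su recibo está pendiente."                  # misses the required terms

outcomes = [passes(t, required, variants) for t in (fluent, stilted, wrong)]
```

The fluent and the stilted outputs both pass, which is exactly the gap described above: the check rewards vocabulary, not readability.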

At $0.30 per million input tokens and 100 tokens per second throughput (double most frontier models), Minimax M2.5 Lightning suits high-volume structured translation: product listings, support tickets, UI strings. For content where fluency and cultural fit drive quality, it needs human-assessed benchmark results before you build it into a production pipeline. See the Minimax M2.5 model page for its broader benchmark profile.

[Image: AI translation now spans hundreds of language pairs, but quality varies sharply by pair and content type. Source: unsplash.com]

Open-Source Options

Meta's open-source translation models are worth evaluating for teams with infrastructure capacity and non-commercial use cases.

NLLB-200 (No Language Left Behind) covers 200 languages and showed a 44% average BLEU improvement over prior state-of-the-art across 10,000 translation directions on FLORES-101. For 55 African languages and dozens of South and Southeast Asian languages, it's the only free option with usable quality. It powers Facebook and Instagram's translation layer, serving over 25 billion translations daily. Quality for major language pairs sits below frontier LLMs, but the breadth is unmatched.

SeamlessM4T v2 Large handles speech-to-speech, speech-to-text, text-to-speech, and text-to-text in a single model at 2.3B parameters. It supports 101 languages for speech input and 96 for text. On the English-target direction, it beats NLLB-3.3B; it trails by only 0.2 BLEU points going the other way. For speech translation pipelines, it beats cascaded Whisper + NLLB + TTS systems by 2.1 ASR-BLEU points. It is released under CC-BY-NC-4.0, so commercial use requires a separate agreement with Meta.

Qwen-MT from Alibaba is purpose-built for translation across 92 languages, using reinforcement learning-based training on a lightweight MoE architecture. In Chinese-English and English-German pairs, it beats GPT-4.1-mini and Gemini 2.5 Flash while staying competitive with larger closed models on WMT24 multilingual benchmarks. At $0.50/M output tokens, it's the cheapest option with verified performance on CJK pairs. It also supports customizable prompts for domain-specific terminology, which matters for legal and medical translation.

For open-source text translation in high-resource language pairs, Qwen3-235B and Llama 4 Maverick are competitive with mid-tier commercial models when self-hosted. The open-source LLM leaderboard tracks current rankings across tasks including translation.

[Image: Meta's NLLB-200 targets the long tail of low-resource languages spoken by communities far from major tech hubs. Source: unsplash.com]

API Pricing Reality Check

Specialized NMT APIs charge per character; LLM APIs charge per token. Rough conversion: 4 characters per token for European languages, 2-3 characters for CJK languages.

Service	Approx. Cost ($/M chars, EN-ES)	Language Support
DeepL Pro	$25	33 languages
Google Cloud Translate	$20	133 languages
Microsoft Translator	$10	110+ languages
GPT-5.5 (LLM)	~$8-12	50+ languages
Claude Opus 4.7 (LLM)	~$7-10	50+ languages
Gemini 3.1 Pro (LLM)	~$3-5	100+ languages
Grok 4.20 (LLM)	~$2-4	50+ languages
DeepSeek V4 (LLM)	~$2-4	30+ languages
Qwen-MT (LLM)	~$1.50-2.00	92 languages
Minimax M2.5 Lightning (LLM)	~$0.50-1.00	30+ languages

LLMs are cheaper per character than specialized NMT APIs at the same quality tier. The tradeoff is throughput - specialized NMT systems are faster at high volume. For a full breakdown of LLM API costs, see the LLM API pricing comparison.
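The per-character figures for LLMs in the table can be roughly reproduced from token prices. The sketch below is an approximation under stated assumptions: the source and its translation are about the same length in characters, prompt overhead is ignored, and each character is therefore billed once as input and once as output. The function name is hypothetical.

```python
def cost_per_million_chars(input_rate: float,
                           output_rate: float,
                           chars_per_token: float = 4.0) -> float:
    """Approximate USD cost to translate one million source characters.

    input_rate / output_rate are USD per million tokens.
    ASSUMPTION: the output is as long as the input and no system prompt
    is sent, so each character costs one input pass plus one output pass.
    """
    tokens_per_million_chars = 1_000_000 / chars_per_token
    return (input_rate + output_rate) * tokens_per_million_chars / 1_000_000


# EN-ES at roughly 4 characters per token, using the prices quoted in this article.
gpt_55 = cost_per_million_chars(5.00, 30.00)
opus_47 = cost_per_million_chars(5.00, 25.00)

# For CJK text at about 2.5 characters per token, the same prices cost more per character.
gpt_55_cjk = cost_per_million_chars(5.00, 30.00, chars_per_token=2.5)
```

With these inputs the estimates come to $8.75 for GPT-5.5 and $7.50 for Claude Opus 4.7, inside the ranges listed above. Real costs move with prompt length and with how much longer or shorter the target language runs.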

Methodology

Rankings draw from three sources, weighted differently.

OpenMark March 2026 (primary for models released in 2026) - An evaluation of 25 models across six languages and ten direction pairs. Scoring is deterministic: contains_all for required key terms, contains_any for acceptable variants. No LLM-as-judge, no human evaluation. This produces stable results but can't distinguish fluent translation from technically correct but stilted output. Use it to screen out poor performers; don't rank quality differences within the top cluster by it alone.

WMT25 Human Evaluation (primary lineage check) - The annual academic benchmark for machine translation. The 2025 edition covered 32 language pairs with professional human annotators using Error Span Annotation, rating each error as Minor or Major. Final results published November 2025. Gemini 2.5 Pro placed at the top overall. WMT26 is scheduled for November 2026 and will be the next formal human evaluation.

lechmazur Round-Trip Benchmark (secondary) - A community benchmark covering ten languages with 200 source texts each, scored by LLM-as-judge (five judges, 0-10 scale). Last updated September 15, 2025 - before Grok 4.20, Gemini 3.1 Pro, GPT-5.5, Claude Opus 4.7, and DeepSeek V4 were released. The September 2025 scores remain the best published LLM-judged data for that generation of models, but the field has moved on.
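As a rough illustration of how multi-judge scores on a 0-10 scale could roll up into one number per model, here is a minimal sketch. The averaging scheme (mean over judges, then over texts, then over languages) is an assumption, not the benchmark's documented method, and the scores are invented.

```python
from statistics import mean


def language_score(judge_scores_per_text: list[list[float]]) -> float:
    """Average the judges for each text, then average across texts."""
    return mean(mean(scores) for scores in judge_scores_per_text)


def overall_score(per_language: dict[str, list[list[float]]]) -> float:
    """ASSUMPTION: every language carries equal weight in the final score."""
    return mean(language_score(texts) for texts in per_language.values())


# Invented example: two languages, two texts each, five judge scores per text.
scores = {
    "ja": [[9, 8, 9, 8, 9], [8, 8, 8, 8, 8]],
    "sw": [[7, 8, 7, 8, 7], [9, 9, 9, 9, 9]],
}

per_language = {lang: language_score(texts) for lang, texts in scores.items()}
overall = overall_score(scores)
```

Keeping the per-language breakdown alongside the overall figure matters, since a model can trail overall and still lead a specific language.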

BLEU Scores (tertiary, NMT tools) - Used for DeepL and Google Translate comparisons, sourced from intlpull.com's 2026 benchmark. BLEU measures n-gram overlap against reference translations and correlates poorly with human judgments for nuanced content. Never compare BLEU across different test sets.

The TOWER-v2-70B lesson from WMT24 is worth repeating. That model led all 11 language pairs on COMET scores but lost to Claude 3.5 Sonnet on 9 of 11 in human evaluation. The organizers concluded TOWER had been optimized for the metric rather than actual translation quality. Scores from deterministic matching or BLEU alone should be treated as a screening filter, not a quality verdict.

For broader multilingual model comparisons, the multilingual LLM leaderboard tracks performance across reasoning, comprehension, and generation.

[Image: WMT25 used Error Span Annotation across 32 language pairs, with professional annotators rating each error as Minor or Major. Source: unsplash.com]

Historical Progression

Early 2024 - Specialized encoder-decoder NMT systems still compete with GPT-4 in formal WMT evaluations. LLMs excel at idiomatic text but trail on structured document accuracy.
WMT24, November 2024 - TOWER-v2-70B wins all 11 language pairs on automatic metrics. Human evaluation reverses the result: Claude 3.5 Sonnet wins 9 of 11. The findings paper: "The LLM Era Is Here but MT Is Not Solved Yet."
Early 2025 - Claude 3.5 Sonnet leads Lokalise's professional translator study at 78% "good" rating. Frontier LLMs establish themselves as the quality ceiling for European and East Asian pairs.
WMT25 and lechmazur, Q3-Q4 2025 - Gemini 2.5 Pro tops WMT25 human evaluation (14 of 16 pairs, November). lechmazur data from September 2025 puts GPT-5 first (8.690), Grok 4 second (8.573), Claude Opus 4.1 third (8.559), Gemini 2.5 Pro fourth (8.529).
Q1 2026 - Gemini 3.1 Pro launches February 19. OpenMark's March benchmark shows it tied at 61% with GPT-5.4 and Kimi K2.5. Grok 4.20 launches March 31 at $1.25/M - 58% cheaper than the original Grok 4.
April 2026 - GPT-5.5 (April 23) and DeepSeek V4 (April 24) launch within a day of each other. Both post-date available translation benchmarks. WMT26 (November 2026) will be the next formal human evaluation.

FAQ

What's the best AI model for translation in May 2026?

Gemini 3.1 Pro is the strongest verified choice, pairing a 61% OpenMark March 2026 score and WMT25-winning lineage with $2/M input pricing. For literary or tone-sensitive content, Claude Opus 4.7 has the strongest professional translator approval lineage. For CJK at high volume and low cost, Qwen-MT at $0.50/M output is the most cost-effective option.

Is DeepL still better than LLMs for European languages?

On BLEU scores for EU pairs, yes - DeepL scores 62-65 vs. GPT-4's 59-62. BLEU doesn't capture tone, idioms, or hallucination rates, though. Human studies show Claude and GPT produce fewer post-editing corrections per page for content where register matters.

Why is Minimax M2.5 Lightning ranked 5th despite being OpenMark's #1?

OpenMark uses deterministic key-term matching, not human evaluation. It checks whether required vocabulary is present, not whether the translation reads naturally. Until Minimax M2.5 Lightning appears in a human-evaluated benchmark, the 71% score shows that it gets key terms right, not that its output matches frontier models on fluency.

How often do translation rankings change?

Fast - major reshuffles happen every 2-3 months. GPT-5.5 and Claude Opus 4.7 launched in April 2026 with no translation benchmark yet. WMT26 will be the next formal comparison in November 2026. Plan for quarterly review of production model choices.

Can open-source models replace commercial APIs for translation?

For major language pairs, Qwen3-235B and Llama 4 Maverick are competitive with mid-tier commercial options when self-hosted. For rare or low-resource languages, NLLB-200 is often the only viable free option. For peak quality, frontier LLMs still lead.

Which benchmark should I trust for translation quality?

WMT human evaluation using Error Span Annotation is the gold standard. COMET is the best automatic metric but was gamed in WMT24. BLEU doesn't capture nuance. OpenMark's deterministic scoring is useful for initial screening but shouldn't determine final production choices.

Sources: