RAG Benchmarks Leaderboard: Retrieval Rankings 2026
Rankings of the top embedding and RAG systems across BEIR, MTEB retrieval, MIRACL, MS MARCO, KILT, HotpotQA, and RAGTruth hallucination benchmarks as of April 2026.

Retrieval-augmented generation has become the default architecture for anything that needs to answer questions from a document corpus without hallucinating facts. If you're new to the concept, our RAG explainer guide covers the basics. The benchmark ecosystem around it has matured in parallel: today there are distinct evaluation tracks for retrieval quality, multilingual coverage, multi-hop reasoning, and end-to-end faithfulness. This leaderboard covers all of them.
Two things make RAG benchmarking unusual compared to language modeling leaderboards. First, you're assessing two different components - the embedding or retrieval model, and the generation model that reads what was retrieved. A great retriever paired with a weak generator will still hallucinate. Second, scores mean different things across benchmarks. An NDCG@10 on BEIR isn't the same animal as an Exact Match score on NQ, and comparing them directly misleads more than it informs.
TL;DR
- Gemini Embedding 2 leads the MTEB retrieval track with a 68.32 average and 67.71 on retrieval tasks specifically - the widest margin Google has held on this leaderboard
- Voyage-4-large, released in January 2026 with a Mixture-of-Experts architecture, beats OpenAI text-embedding-3-large by 14% on NDCG@10 across 29 retrieval datasets
- For hallucination in RAG generation, GPT-4o and Claude 3.5 Sonnet remain the best commercial options - fine-tuned open models on RAGTruth data can approach that quality at much lower cost
What Each Benchmark Measures
RAG evaluation splits cleanly into retrieval benchmarks and end-to-end RAG benchmarks. They test different failure modes.
Retrieval benchmarks score how well a model finds the right documents given a query. The dominant metric is NDCG@10 (normalized discounted cumulative gain at rank 10), which rewards returning relevant documents at the top of the ranking and penalizes burying them lower down. Most retrieval benchmarks are zero-shot by design - models shouldn't be trained on the evaluation data.
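As a concrete reference, NDCG@10 for a single query can be computed in a few lines. This is the standard formulation of the metric, not tied to any particular benchmark harness:

```python
import math

def ndcg_at_k(relevances, k=10):
    """NDCG@k for one query.

    `relevances` is the graded relevance of each retrieved document,
    in ranked order (e.g. [3, 0, 2] means the top hit had grade 3).
    """
    def dcg(rels):
        # Gains are discounted logarithmically by rank position.
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels))

    actual = dcg(relevances[:k])
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0

# The same relevant documents score lower when buried down the ranking.
print(ndcg_at_k([3, 2, 0, 0]))  # ideal ordering -> 1.0
print(ndcg_at_k([0, 0, 2, 3]))  # relevant docs at ranks 3-4 -> lower
```

The discount term is why "returning relevant documents at the top" matters: a relevant document at rank 8 contributes far less than the same document at rank 1.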
End-to-end RAG benchmarks score the full pipeline: does the produced answer actually match the gold answer, and does it stick to what was retrieved? The key metrics here are Exact Match (EM), F1, and faithfulness scores. A model that scores well on retrieval but has high hallucination rates on RAGTruth has a generation problem, not a retrieval problem.
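The answer-level metrics are simple to state in code. This is the common SQuAD-style formulation, minus the full answer-normalization rules (which additionally strip punctuation and articles):

```python
from collections import Counter

def exact_match(prediction, gold):
    """1 if the lowercased, whitespace-normalized strings match exactly."""
    norm = lambda s: " ".join(s.lower().split())
    return int(norm(prediction) == norm(gold))

def token_f1(prediction, gold):
    """Token-overlap F1: partial credit for partially correct answers."""
    pred_toks = prediction.lower().split()
    gold_toks = gold.lower().split()
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "paris"))                     # 1
print(token_f1("the capital is Paris", "Paris"))         # 0.4
```

EM is brutally binary; F1 is the metric that distinguishes "almost right" from "wrong", which is why benchmarks report both.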
BEIR - The Generalization Test
BEIR (Benchmarking Information Retrieval) covers 18 diverse datasets including MS MARCO, Natural Questions, HotpotQA, and domain-specific corpora spanning biomedical, legal, scientific, and technical content. The benchmark is intentionally heterogeneous - a model that scores well here generalizes to out-of-distribution queries, which is what you actually want in production. BM25 scores around 42 NDCG@10 as the sparse baseline; modern dense models consistently clear 60.
MTEB Retrieval Track
The Massive Text Embedding Benchmark includes a 15-dataset retrieval sub-track that overlaps with BEIR. MTEB is more structured about evaluation conditions and covers additional task categories (STS, classification, clustering) that let you see how retrieval performance relates to other embedding capabilities. It's the most widely used comparative leaderboard for embedding models.
MIRACL - Multilingual Retrieval
MIRACL covers 18 languages with native-speaker annotation for each, testing monolingual retrieval where both queries and documents are in the same language. The dataset spans languages from Arabic and Bengali to Swahili and Telugu. A model that scores well on English MTEB can still fall apart here - many commercial APIs drop 20-30% in retrieval quality when switching from English to lower-resource languages. The evaluation metrics are nDCG@10 and Recall@100.
In May 2025, MIRACL-VISION extended the benchmark to visual document retrieval across the same 18 languages, exposing a different problem: state-of-the-art VLM embedding models lag text-based retrieval models by up to 59.7% in multilingual visual retrieval accuracy.
MS MARCO - Large-Scale Passage Ranking
MS MARCO is a large-scale dataset from real Bing queries and human-annotated passage relevance. The passage ranking task uses MRR@10 as the primary metric. Unlike BEIR, which tests generalization, MS MARCO is the closest thing to an in-distribution retrieval test for English web queries. Models optimized for MS MARCO tend to excel here but don't always transfer to domain-specific corpora.
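MRR@10 rewards only the rank of the first relevant passage per query, which a few lines make explicit:

```python
def mrr_at_10(ranked_relevance_lists):
    """Mean reciprocal rank at cutoff 10.

    Each inner list marks which of the top-10 retrieved passages
    are relevant (True/False), in ranked order.
    """
    total = 0.0
    for rels in ranked_relevance_lists:
        for rank, is_relevant in enumerate(rels[:10], start=1):
            if is_relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_relevance_lists)

# Query 1: first hit relevant (RR = 1). Query 2: third hit (RR = 1/3).
print(mrr_at_10([[True, False], [False, False, True]]))  # (1 + 1/3) / 2
```

This is why MS MARCO leaders optimize hard for putting one good passage at rank 1, a different objective than BEIR's graded NDCG.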
KILT - Knowledge-Intensive Tasks
KILT (Knowledge Intensive Language Tasks, from Meta AI) tests retrieval over a Wikipedia snapshot for 11 downstream tasks: fact-checking, open-domain QA, slot filling, entity linking, and dialogue. Systems need to retrieve supporting evidence and generate answers grounded in that evidence. CoRAG-8B, released in January 2025, currently holds state-of-the-art scores across most KILT tasks, outperforming systems built on much larger LLMs.
HotpotQA and Natural Questions
These two cover multi-hop and open-domain QA respectively. HotpotQA requires retrieving evidence from multiple documents and connecting it to answer questions that can't be answered from any single passage. NQ uses real Google search queries against Wikipedia. EM (Exact Match) and F1 are the standard metrics, and scores here reflect the combined quality of both retrieval and generation.
RAGTruth - Hallucination in Generation
RAGTruth is a corpus of nearly 18,000 RAG responses from multiple LLMs, annotated at the word level for hallucination type and severity. It's the most detailed benchmark specifically for the generation side of RAG - distinguishing between cases where the model invented unsupported facts, contradicted the retrieved context, or produced partially correct but misleading answers. The benchmark was introduced at ACL 2024 and has become the standard reference for comparing how different LLMs handle retrieved context faithfully.
Retrieval benchmarks like BEIR test generalization across domains - medical, legal, code, and web queries each expose different model weaknesses.
Retrieval Model Rankings
MTEB Retrieval + BEIR Leaders
| Rank | Model | Provider | MTEB Retrieval (nDCG@10) | BEIR Avg (nDCG@10) | Type | Pricing |
|---|---|---|---|---|---|---|
| 1 | Gemini Embedding 2 | Google | 67.71 | ~67.7 | API | $0.20/M tokens ($0.10 batch) |
| 2 | Voyage-4-large | Voyage AI | ~66.0 | ~66.0 | API | 200M free; paid tier |
| 3 | NV-Embed-v2 | NVIDIA | 62.65 | 59.36 | Open-weight | Free |
| 4 | Qwen3-Embedding-8B | Alibaba | ~62.0 | ~62.0 | Open-weight | Free |
| 5 | Cohere Embed v4 | Cohere | ~61.0 | ~61.0 | API | $0.12/M tokens |
| 6 | OpenAI text-embedding-3-large | OpenAI | ~59.0 | ~59.0 | API | $0.13/M tokens |
| 7 | BGE-M3 | BAAI | ~58.0 | ~58.0 | Open-weight | Free |
| 8 | Voyage-3.5 | Voyage AI | ~57.5 | ~57.5 | API | 200M free; paid tier |
| 9 | GTE-ModernBERT-base | Alibaba/Community | 64.38* | - | Open-weight | Free |
| 10 | BM25 | - | ~42.0 | ~42.0 | Sparse baseline | Free |
*GTE-ModernBERT-base score is on the full MTEB English average (not retrieval-only), using 149M parameters. It's included as the strongest efficient open-weight option under 150M parameters.
A few caveats on this table. Voyage AI publishes scores on their own RTEB benchmark (29 datasets across 8 domains) rather than MTEB, which makes direct comparison to MTEB scores approximate. On their own evaluation, voyage-4-large beats OpenAI text-embedding-3-large by 14% and Cohere Embed v4 by 8.2% in NDCG@10. MTEB is cross-vendor but self-reported; RTEB is vendor-controlled but more thorough. Neither is fully neutral.
MIRACL Multilingual Rankings
For multilingual retrieval, the picture looks different:
| Rank | Model | nDCG@10 avg (MIRACL 18-lang) | Notes |
|---|---|---|---|
| 1 | NVIDIA Llama-Embed-Nemotron-8B | - | Best open-weight multilingual; tops MMTEB across 250+ languages |
| 2 | Qwen3-Embedding-8B | ~70.58 (MMTEB avg) | Strong on East Asian languages |
| 3 | Cohere Embed v4 | - | Strong across 100+ languages; cross-lingual retrieval |
| 4 | Gemini Embedding 2 | - | Strong; multimodal; weaker on low-resource languages |
| 5 | BGE-M3 | ~63.0 | BAAI multilingual model |
For production multilingual work, the best free option is NVIDIA Llama-Embed-Nemotron-8B. If you need a commercial API with broad language support, Cohere Embed v4 covers 100+ languages with cross-lingual retrieval (query in English, retrieve in French, etc.) and is the only major embedding API that also handles images natively.
End-to-End RAG: Generation Quality
The generation side of RAG gets less systematic attention than retrieval, but RAGTruth and KILT give us data points.
On the RAGTruth hallucination benchmark, LLM performance varies enough to matter in production. GPT-4o and Claude 3.5 Sonnet produce the fewest hallucinations against retrieved context. Fine-tuned smaller models trained on RAGTruth data can reach competitive faithfulness at lower cost - the original paper showed that fine-tuning Llama-2-13B on RAGTruth training data matched prompt-based approaches with GPT-4. That gap has closed further with stronger base models in 2025-2026.
On KILT, CoRAG-8B (Chain-of-Retrieval Augmented Generation, January 2025) posts state-of-the-art performance on multi-hop QA tasks including HotpotQA subsets, beating systems built on LLMs three to five times larger. CoRAG uses iterative retrieval - the model retrieves, reads, decides if it needs more evidence, and retrieves again. Single-pass RAG struggles with multi-hop questions by design.
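The general shape of that loop can be sketched in a few lines. Note this is an illustrative pattern, not CoRAG's actual implementation; `retrieve` and `llm` are hypothetical callables standing in for your retriever and generation model:

```python
def iterative_rag(question, retrieve, llm, max_hops=8):
    """Retrieve-read-decide loop: keep gathering evidence until the
    model judges it sufficient, then answer. Assumes `llm` replies
    with either 'ANSWER: ...' or 'SEARCH: <follow-up query>'."""
    evidence = []
    query = question
    for _ in range(max_hops):
        evidence.extend(retrieve(query))
        decision = llm(
            f"Question: {question}\nEvidence: {evidence}\n"
            "If the evidence is sufficient, reply ANSWER: <answer>. "
            "Otherwise reply SEARCH: <follow-up query>."
        )
        if decision.startswith("ANSWER:"):
            return decision[len("ANSWER:"):].strip()
        query = decision[len("SEARCH:"):].strip()
    # Hop budget exhausted: answer with whatever was gathered.
    return llm(f"Question: {question}\nEvidence: {evidence}\nAnswer:")
```

The per-query cost scales with the number of hops, which is the latency tradeoff discussed below.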
End-to-end RAG evaluation requires tracking retrieval quality and generation faithfulness separately - strong retrieval with weak generation still produces hallucinated answers.
Key Takeaways
Gemini Embedding 2 Sets a New Bar
Google released Gemini Embedding 2 into public preview in March 2026. It's built natively on the Gemini architecture and handles text, images, video, audio, and PDFs in a single 3,072-dimensional vector space. On MTEB English, it scores 68.32 overall and 67.71 on the retrieval sub-track, with Matryoshka support down to 768 dimensions.
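Matryoshka support means the leading dimensions of the vector carry most of the signal, so you can truncate to a shorter prefix and renormalize before computing cosine similarity. A minimal NumPy sketch of the generic technique (not a Gemini SDK call):

```python
import numpy as np

def truncate_matryoshka(embedding, dims=768):
    """Keep the first `dims` components of a Matryoshka-trained
    embedding and re-normalize to unit length, so dot products
    remain valid cosine similarities on the truncated vectors."""
    v = np.asarray(embedding, dtype=np.float32)[:dims]
    return v / np.linalg.norm(v)

full = np.random.default_rng(0).normal(size=3072)  # stand-in vector
small = truncate_matryoshka(full, dims=768)
print(small.shape)  # (768,)
```

At 768 dimensions you store a quarter of the floats per document, usually at a modest accuracy cost - worth benchmarking on your own corpus before committing.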
The pricing is $0.20 per million tokens standard or $0.10 with batch processing. Indexing 1 million 500-token documents costs roughly $50 at batch rates. That's not cheap compared to open-weight models, but it's clearly less than the engineering cost of self-hosting a competitive open model. Early production reports mention 20% recall improvement and 70% latency reduction from users switching from older models - though those come from Google's own case studies, so treat them as directional.
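The arithmetic behind that estimate is simple enough to sanity-check:

```python
def indexing_cost_usd(num_docs, tokens_per_doc, price_per_million_tokens):
    """Embedding cost for a one-time corpus indexing run."""
    return num_docs * tokens_per_doc / 1_000_000 * price_per_million_tokens

# 1M documents x 500 tokens each, at the $0.10/M batch rate:
print(indexing_cost_usd(1_000_000, 500, 0.10))  # 50.0
```

Remember this is a recurring cost, not a one-off: every embedding-model switch or major corpus update means re-running it.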
The main limitation is vendor lock-in. Re-indexing a document corpus whenever you switch embedding models is a real operational burden. If you build on Gemini Embedding 2, plan for it long-term or architect an abstraction layer.
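One way to keep that option open is a thin interface between your pipeline and the vendor SDK. The sketch below uses a hypothetical `Embedder` protocol and a fake adapter for illustration - these names are not any vendor's real API:

```python
from typing import Protocol

class Embedder(Protocol):
    """Provider-agnostic interface: each vendor SDK gets its own
    adapter, so swapping models touches one module, not the pipeline."""
    model_id: str
    dimensions: int

    def embed(self, texts: list[str]) -> list[list[float]]: ...

class FakeEmbedder:
    """Deterministic stand-in for tests; a real adapter would call
    the vendor API inside embed()."""
    model_id = "fake-embedder"
    dimensions = 4

    def embed(self, texts):
        return [[float(len(t))] * self.dimensions for t in texts]

def index_corpus(embedder: Embedder, docs: list[str]):
    """Pipeline code depends only on the protocol, never the vendor."""
    return dict(zip(docs, embedder.embed(docs)))

index = index_corpus(FakeEmbedder(), ["hello", "world!"])
print(index["hello"])
```

The abstraction doesn't remove the re-indexing cost, but it does keep the switch from rippling through retrieval, ranking, and evaluation code.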
The MoE Architecture Arrives for Embeddings
Voyage-4-large, launched in January 2026, is the first production embedding model using a Mixture-of-Experts architecture. The MoE design routes different inputs to specialized sub-networks, delivering "serving costs 40% lower than comparable dense models" according to Voyage AI, while maintaining state-of-the-art accuracy. The model family includes voyage-4, voyage-4-lite, and voyage-4-nano (open-weight, Apache 2.0 on Hugging Face).
Voyage AI uses their own RTEB benchmark for evaluation rather than MTEB, which creates comparison friction. On RTEB, voyage-4-large outperforms Gemini Embedding 001 by 3.87% and OpenAI text-embedding-3-large by 14.05% in NDCG@10 across 29 datasets. Independent third-party scoring on MTEB puts the comparison closer, but Voyage models consistently sit in the top tier.
Open-Weight Models Are Genuinely Competitive
NVIDIA's NV-Embed-v2 posts 62.65 on the MTEB retrieval track - trailing Gemini Embedding 2's 67.71, but scoring higher than OpenAI text-embedding-3-large on retrieval specifically. NV-Embed-v2 reached 72.31 on the full MTEB average (56 tasks) using a two-stage contrastive instruction-tuning method. It's fully open-weight and runs on a single A100. For teams with GPU infrastructure already in place, self-hosting NV-Embed-v2 eliminates per-token costs completely.
Qwen3-Embedding-8B, Alibaba's multilingual model, scores 70.58 on the MMTEB multilingual leaderboard and supports flexible dimensions from 32 to 4096. Its code retrieval score on MTEB Code is 80.68, the highest reported for any model on that sub-benchmark. If your RAG system needs to retrieve code with natural language documentation, Qwen3-Embedding-8B is the current choice.
Our embedding model leaderboard from March 2026 covers the full MTEB picture in more depth, including dimension tradeoffs and Matryoshka support across providers.
BM25 Is Still Alive, But in a Support Role
Sparse retrieval with BM25 scores around 42 NDCG@10 on BEIR - below every modern dense model. But that doesn't mean BM25 is obsolete. Hybrid search combining BM25 with dense retrieval routinely beats either method alone, particularly on exact-match queries like product codes, person names, and technical identifiers. Dense models miss these because the learned vector space doesn't encode character-level patterns reliably. Most production RAG systems should use hybrid retrieval unless they have clear evidence that pure dense retrieval meets their precision requirements.
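A common way to implement the fusion step is reciprocal rank fusion (RRF), which merges ranked lists without needing to calibrate the two systems' raw scores. A sketch with made-up document IDs:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists (e.g. one from BM25, one from dense
    retrieval) by summing 1/(k + rank) per document. k=60 is the
    constant from the original RRF paper."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["sku-9981", "manual.pdf", "faq"]   # exact-match strength
dense_hits = ["faq", "manual.pdf", "intro"]     # semantic strength
print(reciprocal_rank_fusion([bm25_hits, dense_hits]))
```

Documents that appear high in both lists float to the top, while a document only one system found (like the exact-match product code) still survives into the fused ranking.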
Multi-Hop RAG Requires a Different Stack
Standard single-pass RAG - retrieve once, produce once - handles simple factoid questions reasonably well. It struggles with questions requiring evidence from multiple documents, temporal reasoning, or comparative analysis. On HotpotQA specifically, hybrid retrieval combining dense and lexical signals with maximum marginal relevance post-filtering achieves 20% absolute EM improvement over pure dense retrieval alone.
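The MMR post-filtering step mentioned above can be sketched over unit-normalized NumPy vectors - an illustrative implementation of the general algorithm, not the cited study's exact setup:

```python
import numpy as np

def mmr(query_vec, doc_vecs, lam=0.5, top_k=3):
    """Maximum marginal relevance: greedily pick documents relevant
    to the query but not redundant with already-picked documents.
    Vectors must be unit-normalized so dot product = cosine sim."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    sim_to_query = doc_vecs @ query_vec
    while candidates and len(selected) < top_k:
        def score(i):
            redundancy = max(
                (doc_vecs[i] @ doc_vecs[j] for j in selected), default=0.0
            )
            return lam * sim_to_query[i] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

rng = np.random.default_rng(1)
docs = rng.normal(size=(5, 8))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query = docs[0]  # doc 0 is most query-similar by construction
print(mmr(query, docs))
```

For multi-hop questions, the diversity term is what pulls in evidence from a second document instead of five near-duplicates of the first.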
CoRAG-8B's iterative retrieve-read-decide-retrieve loop is the cleanest solution for multi-hop at inference time. It requires more compute per query (5-8 retrieval passes for hard questions), but the accuracy gains are significant. For production systems handling complex enterprise queries, iterative retrieval is worth the latency cost.
RAGPerf Brings System-Level Measurement
Most RAG benchmarks measure either retrieval quality or answer quality in isolation. RAGPerf, released on arXiv in March 2026, provides an end-to-end framework that simultaneously tracks context recall, query accuracy, factual consistency, latency, throughput, and GPU/memory use across a full pipeline. It supports multiple vector databases (LanceDB, Milvus, Qdrant, Chroma, Elasticsearch) and is open-sourced at platformxlab/RAGPerf. For teams building production systems, this is the first framework that lets you see retrieval accuracy tradeoffs against actual hardware costs in a single run.
Practical Guidance
The right choice depends on which part of the pipeline you're improving.
For best retrieval accuracy (API): Gemini Embedding 2 leads on MTEB retrieval at 67.71 NDCG@10. If you don't want Google lock-in, voyage-4-large is the alternative, with strong RTEB numbers and a lower-cost MoE architecture.
For best retrieval accuracy (self-hosted): NV-Embed-v2 scores 62.65 on MTEB retrieval and is fully open-weight. Qwen3-Embedding-8B is preferred for multilingual or code-heavy workloads.
For multilingual RAG: NVIDIA Llama-Embed-Nemotron-8B leads the MMTEB multilingual benchmark and is open-weight. For a commercial API, Cohere Embed v4 covers 100+ languages with cross-lingual support and multimodal capability.
For budget-constrained pipelines: OpenAI text-embedding-3-small at $0.02/million tokens offers a reasonable retrieval score for the price. Voyage-4-nano is Apache 2.0 licensed, freely deployable, and designed for high-throughput low-cost retrieval.
For multi-hop or complex queries: Single-pass RAG won't cut it. Build iterative retrieval into your pipeline. CoRAG-8B's approach is the current benchmark reference.
For minimizing hallucination: Choose your generation model carefully. On RAGTruth evaluations, GPT-4o and Claude 3.5 Sonnet produce the fewest hallucinations when given retrieved context. Fine-tuning a smaller model on RAGTruth-style data is a viable lower-cost path. See our guide to best RAG tools in 2026 for framework-level options including LangChain, LlamaIndex, and Haystack.
What benchmark to trust: For retrieval comparison, use MTEB as the cross-vendor neutral standard. Be skeptical of vendor-internal benchmarks (including Voyage's RTEB) when they're the only numbers cited. For end-to-end faithfulness, RAGTruth is the most granular public reference. For multilingual, MIRACL and MMTEB are the appropriate benchmarks - English MTEB scores don't predict multilingual performance reliably.
FAQ
Which embedding model is best for RAG in 2026?
Gemini Embedding 2 leads on MTEB retrieval with 67.71 NDCG@10. For self-hosted options, NV-Embed-v2 and Qwen3-Embedding-8B are both competitive. For multilingual, NVIDIA Llama-Embed-Nemotron-8B leads.
What is the BEIR benchmark?
BEIR tests retrieval models across 18 diverse datasets covering web, biomedical, legal, and scientific content. It measures zero-shot generalization - models shouldn't be fine-tuned on BEIR data before evaluation. BM25 baseline is roughly 42 NDCG@10.
How does MTEB differ from BEIR?
MTEB's retrieval sub-track overlaps heavily with the BEIR datasets, and MTEB adds further task categories (STS, classification, clustering). MTEB is broader and used to compare general embedding quality. BEIR focuses exclusively on retrieval generalization.
What is RAGTruth and why does it matter?
RAGTruth is an 18,000-response dataset annotated for hallucination type and severity in RAG outputs. It's the standard benchmark for comparing how faithfully different LLMs stick to retrieved context when generating answers.
Is BM25 still relevant for RAG?
Yes. Hybrid retrieval combining BM25 with dense embeddings beats pure dense retrieval on exact-match queries. Most production RAG systems use hybrid search. Pure BM25 scores around 42 NDCG@10 on BEIR, well below modern dense models, but it handles keyword precision that dense models miss.
How often do RAG benchmark rankings change?
Retrieval rankings shift 2-4 times per year as major providers release new models. The current leaders (Gemini Embedding 2, Voyage-4-large) were both released or updated in early 2026. Check this page and the MTEB leaderboard on Hugging Face for the latest.
Sources:
- MTEB Leaderboard - Hugging Face
- BEIR Benchmark - GitHub
- Voyage 4 Model Family Announcement
- Voyage-4-large MoE Architecture Blog
- Voyage 3.5 Announcement
- NV-Embed-v2 on Hugging Face
- NVIDIA NV-Embed MTEB Announcement
- MIRACL Benchmark Project
- MMTEB: Massive Multilingual Text Embedding Benchmark - arXiv
- RAGTruth - ACL 2024
- RAGPerf: End-to-End RAG Benchmarking Framework - arXiv
- MS MARCO Passage Ranking Leaderboard
- KILT Benchmark - GitHub
- HotpotQA Dataset
✓ Last verified April 17, 2026
