Best AI Models for RAG - June 2026
Gemini 2.5 Flash still leads LIT-RAGBench English RAG accuracy at 87.0%, but the full benchmark data reveals two overlooked entries: GPT-4.1-mini at 84.1% and o4-mini at 83.9%.

TL;DR
- Gemini 2.5 Flash still leads LIT-RAGBench English at 87.0% for $0.30/1M input - no model has displaced it since March
- GPT-4.1-mini (84.1%) and o4-mini (83.9%) were in the original benchmark paper but got less coverage; both rank just below o3 (85.2%) and ahead of GPT-4.1 and GPT-5 on English RAG accuracy
- Models released after March 2026 (Gemini 3.x, GPT-5.4+, Claude Opus 4.8, Llama 4, DeepSeek V4) haven't been assessed on LIT-RAGBench yet
RAG is still a two-part problem. Retrieval finds the right chunks. Generation turns those chunks into accurate answers without adding facts that aren't there. Most benchmark noise conflates the two. This page focuses on generation quality specifically: given retrieved context, which model produces the most accurate, grounded, and appropriately cautious responses?
LIT-RAGBench (arxiv.org/abs/2603.06198), published March 2026, remains the most systematic head-to-head comparison for RAG generation. No new complete benchmark has replaced it since. The rankings here pull directly from that paper's full Table 2, including several models that early coverage didn't highlight.
For the retrieval layer, see our RAG benchmarks leaderboard and MTEB April 2026 embedding rankings.
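To make the retrieval/generation split concrete, here is a minimal sketch of the two stages. Everything in it is illustrative: the `Chunk` type, the word-overlap `retrieve` function, and the prompt wording in `build_prompt` are assumptions for this example, not anything taken from LIT-RAGBench. A production pipeline would use an embedding model and a vector store for the first stage and send the prompt to an LLM for the second.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str
    text: str


def _words(text: str) -> set[str]:
    # Lowercased word set; punctuation is dropped.
    return set(re.findall(r"\w+", text.lower()))


def retrieve(question: str, corpus: list[Chunk], k: int = 2) -> list[Chunk]:
    """Stage 1 (toy stand-in): rank chunks by words shared with the question."""
    q_words = _words(question)
    scored = [(len(q_words & _words(c.text)), c) for c in corpus]
    # Keep only chunks that share at least one word, best first.
    hits = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
    return [c for _, c in hits[:k]]


def build_prompt(question: str, chunks: list[Chunk]) -> str:
    """Stage 2 input: retrieved context plus an explicit instruction to abstain."""
    context = "\n".join(f"[{c.doc_id}] {c.text}" for c in chunks) or "(no context retrieved)"
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, reply exactly: I don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


corpus = [
    Chunk("policy", "Refunds are accepted within 30 days of purchase"),
    Chunk("shipping", "Orders ship in two business days"),
]
question = "How many days are refunds accepted?"
print(build_prompt(question, retrieve(question, corpus, k=1)))
```

The generation model only ever sees what `build_prompt` hands it, which is why this page scores generation on the assumption that the context is already fixed.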
Rankings Table
LIT-RAGBench covers five RAG task categories in English and Japanese: Information Integration, Reasoning, Logic, Table comprehension, and Abstention (correctly refusing when retrieved context doesn't support an answer). The score column is the English Main Average across the first four categories. Abstention is reported separately - it's a distinct capability from accuracy, and the two scores need to be read together.
| Rank | Model | Provider | LIT-RAGBench EN | Abstention EN | Price Input/1M | Verdict |
|---|---|---|---|---|---|---|
| 1 | Gemini 2.5 Flash | Google | 87.0% | 88.3% | $0.30 | Best accuracy-per-dollar; table comprehension leader at 90.3% |
| 2 | o3 | OpenAI | 85.2% | 83.3% | $2.00 | Top Japanese RAG; best multi-hop reasoning |
| 3 | GPT-4.1-mini | OpenAI | 84.1% | 86.7% | $0.40 | Near-o3 accuracy at one-fifth the cost |
| 4 | o4-mini | OpenAI | 83.9% | 90.0% | $1.10 | Best Logic score (90.0%); strong abstention |
| 5 | GPT-4.1 | OpenAI | 83.6% | 93.3% | $2.00 | 100% on Information Integration - only model to hit that mark |
| 6 | GPT-5 | OpenAI | 82.8% | 93.3% | $1.25 | Consistent across categories; high abstention like GPT-4.1 |
| 7 | Qwen3-235B (Thinking) | Alibaba | 81.5% | 91.7% | Free (self-host) | Best open-source with thinking mode; better abstention than Instruct |
| 8 | Qwen3-235B (Instruct) | Alibaba | 80.6% | 90.0% | Free (self-host) | Top open-source for standard inference pipelines |
| 9 | Gemini 2.5 Pro | Google | 78.4% | 88.3% | $1.25 | Underperforms Flash on this benchmark; wrong model for RAG |
| 10 | GPT-5-mini | OpenAI | 78.0% | 91.7% | ~$0.40 | Strong abstention; solid mid-tier option |
| 11 | GPT-5-nano | OpenAI | 73.7% | 81.7% | ~$0.15 | Cheapest paid tier; viable for simple document Q&A |
| 12 | Gemma 3 27B | Google | 68.0% | 85.0% | Free (self-host) | Decent open-source option; clearly below Qwen3 |
| 13 | Claude Sonnet 4.6 | Anthropic | 65.0% | 96.7% | $3.00 | Lowest accuracy among API models; highest abstention - right pick for medical/legal |
| 14 | Llama 3.3 70B | Meta | 61.8% | 91.7% | Free (self-host) | Viable open-source base; well behind Qwen3 |
| 15 | Llama 3.1 8B | Meta | 39.6% | 83.3% | Free (self-host) | Only viable where compute cost is the only concern |
Models released after March 2026 - Gemini 3.x, GPT-5.4+, Claude Opus 4.7/4.8, Llama 4 Scout/Maverick, DeepSeek V4, Cohere Command A+ - haven't been assessed on LIT-RAGBench. No head-to-head RAG comparison exists for those models at time of writing.
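One way to read the two score columns together is to treat abstention as a floor and accuracy as the ranking key. The sketch below does that with the values from the table above. The `shortlist` helper and the three floor values are hypothetical choices for illustration, not part of the benchmark.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    accuracy: float    # LIT-RAGBench EN main average, %
    abstention: float  # Abstention EN, %


# Values copied from the rankings table.
TABLE = [
    Row("Gemini 2.5 Flash", 87.0, 88.3),
    Row("o3", 85.2, 83.3),
    Row("GPT-4.1-mini", 84.1, 86.7),
    Row("o4-mini", 83.9, 90.0),
    Row("GPT-4.1", 83.6, 93.3),
    Row("GPT-5", 82.8, 93.3),
    Row("Qwen3-235B (Thinking)", 81.5, 91.7),
    Row("Qwen3-235B (Instruct)", 80.6, 90.0),
    Row("Gemini 2.5 Pro", 78.4, 88.3),
    Row("GPT-5-mini", 78.0, 91.7),
    Row("GPT-5-nano", 73.7, 81.7),
    Row("Gemma 3 27B", 68.0, 85.0),
    Row("Claude Sonnet 4.6", 65.0, 96.7),
    Row("Llama 3.3 70B", 61.8, 91.7),
    Row("Llama 3.1 8B", 39.6, 83.3),
]


def shortlist(rows, min_abstention):
    """Drop models below the abstention floor, then rank the rest by accuracy."""
    kept = [r for r in rows if r.abstention >= min_abstention]
    return sorted(kept, key=lambda r: r.accuracy, reverse=True)


# Hypothetical floors: none, moderate, strict.
for floor in (0.0, 90.0, 95.0):
    best = shortlist(TABLE, floor)[0]
    print(f"floor {floor:.0f}%: {best.model} ({best.accuracy}% accuracy)")
```

With these floors the top pick moves from Gemini 2.5 Flash to o4-mini to Claude Sonnet 4.6, which is the practical meaning of reading the two columns together.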
Detailed Analysis
Gemini 2.5 Flash - Still Leads, Still Surprising
Three months after LIT-RAGBench was published, Gemini 2.5 Flash holds the top position at 87.0% English Main Average. The surprise isn't that it leads - Flash-class models often do well on structured, context-bound tasks where instruction following matters more than open-ended generation. The surprise is the margin. It sits 1.8 points above the next API model (o3 at 85.2%), and that gap has held up as more models from the paper's full dataset got analyzed.
Flash's advantage shows up most on table comprehension tasks: 90.3% English, the highest category score of any model in the benchmark. RAG pipelines answering questions from tables, financial reports, or structured documents will see that difference directly.
At $0.30/1M input tokens, it's also the second-cheapest paid option. For high-volume production RAG, that pricing gap against o3 ($2.00/1M) or GPT-4.1 ($2.00/1M) adds up fast. If you're looking for a single model recommendation, Gemini 2.5 Flash is it.
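A rough sketch of how that pricing gap scales with volume. The prices come from the rankings table; the workload (100,000 queries per day, 4,000 input tokens each) and the `monthly_input_cost` helper are made-up assumptions, and output tokens are not modeled at all.

```python
# Input prices in USD per 1M tokens, taken from the rankings table.
PRICE_PER_M = {
    "Gemini 2.5 Flash": 0.30,
    "GPT-4.1-mini": 0.40,
    "o3": 2.00,
    "GPT-4.1": 2.00,
}


def monthly_input_cost(price_per_million: float, queries_per_day: int,
                       tokens_per_query: int, days: int = 30) -> float:
    """Input-token spend only; output tokens are billed separately and ignored here."""
    total_tokens = queries_per_day * tokens_per_query * days
    return price_per_million * total_tokens / 1_000_000


# Hypothetical workload: 100k queries/day, 4k tokens of prompt plus context each.
for model, price in PRICE_PER_M.items():
    cost = monthly_input_cost(price, queries_per_day=100_000, tokens_per_query=4_000)
    print(f"{model}: ${cost:,.0f}/month")
```

Under those assumptions the workload is 12 billion input tokens a month: roughly $3,600 on Gemini 2.5 Flash versus $24,000 on o3 or GPT-4.1.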
GPT-4.1-mini and o4-mini - Two Overlooked Results
The biggest practical update since March isn't a new model. It's two models that were in the original LIT-RAGBench paper but got less attention in early coverage.
GPT-4.1-mini scores 84.1% English Main Average, making it the third-most-accurate model in the benchmark - above o4-mini (83.9%), GPT-4.1 (83.6%), and GPT-5 (82.8%). At around $0.40/1M input tokens, it comes within 1.1 points of o3 (85.2%) at one-fifth of o3's $2.00/1M input price. Teams running standard document Q&A pipelines should assess this model before defaulting to o3.
o4-mini at 83.9% has a different profile. Its Logic score is 90.0% - tied with o3's English Logic score for the highest Logic result in the benchmark. Its abstention score (90.0%) also clears o3's (83.3%), meaning it refuses more reliably when the retrieved context doesn't support an answer. For workflows where both reasoning accuracy and controlled refusal matter, o4-mini at ~$1.10/1M is worth a close look.
GPT-4.1 - Perfect on Information Integration
One finding from LIT-RAGBench Table 2 worth flagging: GPT-4.1 scores 1.000 (100%) on English Information Integration tasks. That's the only perfect category score in the entire benchmark - no other model hits it, including the higher-ranked Gemini 2.5 Flash.
Information Integration tests whether a model correctly combines facts from multiple retrieved chunks to answer a question that can't be answered from any single source. GPT-4.1 also scores 93.3% on abstention, equal to GPT-5 and second only to Claude Sonnet 4.6's 96.7%. If your primary RAG workload involves synthesizing across multiple documents rather than answering single-document factual questions, the full profile here is more competitive than its fifth-place rank (83.6%) suggests.
Claude Sonnet 4.6 - The Abstention Argument Still Holds
Claude Sonnet 4.6 scores 65.0% on English Main Average - the lowest of any API model in LIT-RAGBench. Reading that in isolation makes it look like a poor choice.
Its abstention rate is 96.7%, the highest of any tested model and 3.4 points above the next-highest results (GPT-4.1 and GPT-5 at 93.3%). Claude refuses to answer more often than any other model when retrieved context doesn't support a clear response. In domains where a confidently wrong answer is worse than an explicit refusal - medical QA, legal document search, financial data retrieval - that behavior matters more than the accuracy rank. The 65% figure partly reflects that the benchmark penalizes refusal in cases where other models produce plausible-sounding wrong answers.
Anthropic released Claude Opus 4.8 in May 2026, reportedly with improved citation precision for retrieval workflows. It hasn't been evaluated on LIT-RAGBench, so it can't be placed in the ranking table with verified numbers.
Also see long-context retrieval scores if you're considering Claude for RAG pipelines over very large document corpora - that's a separate dimension where the Opus models lead.
Qwen3-235B - Open-Source Leader, Two Modes
Qwen3-235B-A22B in Instruct mode stays at 80.6% English Main Average. The full Table 2 data adds Thinking mode results: 81.5%, with abstention improving from 90.0% to 91.7%.
The tradeoff is latency. Thinking mode produces chain-of-thought reasoning tokens before the final answer. High-throughput pipelines with latency requirements should stick with Instruct mode. Workflows that can wait and where the marginal accuracy gain matters should test Thinking mode.
For the retrieval layer, Qwen3-Embedding-8B (70.58 MTEB composite) remains the top open-source embedding option. That makes Qwen3 a coherent end-to-end self-hosted RAG stack: Qwen3-235B for generation, Qwen3-Embedding-8B for retrieval, no external API calls needed.
Gemini 2.5 Pro - Worse Than Flash on RAG
Gemini 2.5 Pro at 78.4% English Main Average sits 8.6 points below Gemini 2.5 Flash. For most tasks, Pro leads Flash on reasoning and multimodal evaluations. LIT-RAGBench is one of the clear exceptions.
This isn't an oddity. Pro was optimized for complex reasoning and long-context synthesis. Flash was optimized for instruction following and high-throughput tasks. RAG generation - answering questions from provided context with minimal embellishment - sits closer to the Flash design point. If you're building a RAG pipeline on Google infrastructure, don't assume Pro is the safe choice because it's the "bigger model." On this specific task it isn't.
Gemini Embedding 2, released March 2026, maps text, images, video, audio, and documents into a single vector space - enabling multimodal RAG pipelines with a unified index.
Source: blog.google
Methodology
The primary benchmark is LIT-RAGBench (arxiv.org/abs/2603.06198), published March 2026. It covers five RAG task categories in both English and Japanese. Models are scored on their ability to answer questions strictly from provided context, with accuracy measured against gold-standard answers. All scores in the rankings table come directly from Table 2 of that paper.
LIT-RAGBench has two properties that make it more reliable than most RAG comparisons: it uses a bilingual dataset not widely published before the paper (lower contamination risk), and it separates Abstention from accuracy rather than conflating them.
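The separation of Abstention from accuracy can be sketched as follows. This is not the paper's scoring code. It assumes each benchmark item has already been judged correct or incorrect, that the Main Average is the unweighted mean of the four non-abstention category accuracies, and that the category labels are the hypothetical strings used below.

```python
from collections import defaultdict

# Hypothetical labels for the four categories that feed the main average.
MAIN_CATEGORIES = ("integration", "reasoning", "logic", "table")


def score(results):
    """results: iterable of (category, is_correct) pairs, one per benchmark item.

    Returns (main_average, abstention_accuracy, per_category) as 0-1 fractions.
    """
    totals = defaultdict(lambda: [0, 0])  # category -> [correct, seen]
    for category, is_correct in results:
        totals[category][0] += int(is_correct)
        totals[category][1] += 1
    per_category = {c: ok / n for c, (ok, n) in totals.items()}
    # Assumption: unweighted mean over the four main categories only.
    main_avg = sum(per_category[c] for c in MAIN_CATEGORIES) / len(MAIN_CATEGORIES)
    return main_avg, per_category.get("abstention"), per_category


# Made-up judgments for a model that never refuses.
judged = [
    ("integration", True), ("integration", True),
    ("reasoning", True), ("reasoning", False),
    ("logic", True), ("logic", True),
    ("table", True), ("table", False),
    ("abstention", False), ("abstention", False),
]
main_avg, abstention, _ = score(judged)
print(f"main average: {main_avg:.2f}, abstention: {abstention:.2f}")
```

In this toy data the model keeps a 0.75 main average while scoring 0.00 on abstention. A single blended number would hide that split, which is why the two columns are reported side by side.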
Known gaps: The benchmark was run on models available before March 2026. Several significant models released since then haven't been assessed: Gemini 3.x, GPT-5.4/5.5, Claude Opus 4.7/4.8, Llama 4 Scout/Maverick, DeepSeek V4, and Cohere Command A+. General benchmark performance for those models exists but can't substitute for a dedicated RAG evaluation - Gemini 2.5 Pro's underperformance vs Flash demonstrates exactly why general capability rankings don't predict RAG scores.
Deprecated scores removed: RAGAS Faithfulness scores from Prem AI's March evaluation were removed in this update. That evaluation's dataset composition and retrieval settings were undocumented, making cross-model comparisons unreliable. RGB (arxiv.org/abs/2309.01431) covers 2023-era models only and has no current relevance.
No model in LIT-RAGBench exceeds 90% English Main Average. The benchmark is a genuine test.
Retrieval Layer - Embedding Rankings Updated
Generation gets most of the attention in RAG discussions. It shouldn't. Bad retrieval hands the LLM the wrong information, and no generation model recovers from that. The best answer the model can produce is still wrong if the retrieval layer sent the wrong chunks.
MTEB composite rankings as of May 2026 show KaLM-Embedding-Gemma3-12B at 72.32 composite score, now ahead of Qwen3-Embedding-8B (70.58). On the MTEB retrieval track specifically, Gemini Embedding 2 leads at 67.71 NDCG@10 - that distinction matters because composite MTEB includes classification, clustering, and STS tasks that don't reflect retrieval specifically.
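For readers unfamiliar with the metric, NDCG@10 rewards placing relevant documents near the top of the first ten results. The sketch below uses one common formulation (linear gain, log2 rank discount). The exact implementation behind the published MTEB numbers may differ in details, and the function names here are illustrative.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: gain at rank i (1-based) divided by log2(i + 1)."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))


def ndcg_at_k(ranked_ids, relevance, k=10):
    """ranked_ids: retriever output, best first.
    relevance: doc_id -> graded relevance for one query (missing ids count as 0).
    """
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    # Ideal ranking: all judged documents sorted by relevance, truncated to k.
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Two relevant documents; the retriever ranks an irrelevant one between them.
relevance = {"a": 1, "b": 1}
print(round(ndcg_at_k(["a", "x", "b"], relevance), 3))
print(ndcg_at_k(["a", "b", "x"], relevance))
```

The first ranking scores about 0.92 and the second scores 1.0, so the metric penalizes the misplaced document without treating the retrieval as a complete miss.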
For cross-lingual retrieval (multilingual RAG), Gemini Embedding 2 leads cross-lingual R@1 at 0.997, ahead of Qwen3-VL-2B (0.988) and Jina Embeddings v4 (0.985). OpenAI text-embedding-3-large (0.967) and Cohere Embed v4 (0.955) trail by clear margins on that task. If your RAG pipeline serves non-English users, the embedding model choice matters as much as the generation model choice.
BM25 scores around 41.2% on BEIR - more than 26 points below Gemini Embedding 2 on the same benchmark. If you're still using BM25 as your primary retrieval layer, that's the highest-leverage improvement available anywhere in your RAG stack.
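To show what the lexical baseline actually does, here is a compact Okapi BM25 sketch in plain Python. It uses the widely used smoothed IDF variant with `k1=1.5` and `b=0.75` as assumed defaults. The tokenizer and the three sample documents are illustrative, and the code expects a non-empty document list.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in docs against query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "car repair manual for engines",
    "automobile maintenance guide",
    "pasta cooking recipes",
]
print(bm25_scores("car repair", docs))
```

Only the first document gets a non-zero score. The second is about the same topic but shares no terms with the query, so BM25 scores it the same as the cooking document. That vocabulary mismatch is the gap embedding-based retrieval is meant to close.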
See the RAG benchmarks leaderboard for a full retrieval model breakdown.
A RAG pipeline splits into retrieval (embedding model + vector store) and generation (LLM). The rankings table above covers generation quality only - see the retrieval leaderboard for embedding model rankings.
Source: evidentlyai.com
Historical Progression
Mid-2023 - RGB Benchmark published. GPT-3.5 scores 55% on Information Integration tasks (English). Counterfactual robustness is below 7% for all models tested - retrieved context errors pass through at high rates.
Early 2024 - Cohere Command R and Command R+ release with native grounded generation and inline citations. First models designed specifically for RAG rather than adapted from general-purpose base models.
2024 - RAGAS framework gains wide adoption as an evaluation toolkit. RAGTruth (18,000 examples) and RAGBench (100,000 examples) publish hallucination taxonomies. The community splits between retrieval-focused evaluation (BEIR, MTEB) and generation-focused evaluation (RAGAS, RGB).
Late 2025 - Context window expansion across major providers triggers the "long context vs RAG" debate. FRAMES and RULER benchmarks establish that long context doesn't replace RAG for most production cases - retrieval stays more cost-efficient and more accurate on most workloads.
March 2026 - LIT-RAGBench publishes comprehensive five-category evaluation across English and Japanese. Gemini 2.5 Flash leads English at 87.0%. Full model table includes o4-mini, GPT-4.1-mini, both Qwen3-235B modes, and Gemini 2.5 Pro. Gemini Embedding 2 launches native multimodal embedding.
April 2026 - DeepSeek V4 (1.6T MoE, 1M token context) releases. Strong general benchmark performance but no dedicated RAG evaluation at time of writing.
May 2026 - Cohere Command A+ releases (218B MoE, Apache 2.0, designed to run on 2x H100s). Positions as the lowest-latency open MoE option for enterprise RAG pipelines. KaLM-Embedding-Gemma3-12B reaches 72.32 MTEB composite, overtaking Qwen3-Embedding-8B for the top open-weight embedding position.
FAQ
What's the best model for RAG right now?
Gemini 2.5 Flash leads LIT-RAGBench English at 87.0% for $0.30/1M input. For multi-hop or complex reasoning tasks, o3 at 85.2% handles them better. Qwen3-235B is the top free self-hosted option at 80.6% in Instruct mode (81.5% in Thinking mode).
What's the best budget option for RAG?
GPT-4.1-mini at approximately $0.40/1M input scores 84.1% - near-o3 accuracy at one-fifth o3's cost. Gemini 2.5 Flash at $0.30 leads the full benchmark. Self-hosted Qwen3-235B is free if you have the GPU infrastructure.
Is Claude good for RAG?
Claude Sonnet 4.6 scores 65.0% on LIT-RAGBench but leads all models on abstention at 96.7%. If your application can't afford hallucinated answers (medical, legal, financial), that refusal behavior may matter more than the accuracy rank.
Is open-source competitive for RAG?
Yes, at the generation layer. Qwen3-235B-A22B Instruct scores 80.6% - above Gemini 2.5 Pro (78.4%), GPT-5-mini (78.0%), and Claude Sonnet 4.6 (65.0%). Thinking mode pushes it to 81.5%. For teams with GPU infrastructure, Qwen3 is a serious option.
What about Gemini 3.x, GPT-5.4, or other newer models?
No dedicated RAG benchmark data exists for models released after March 2026. General benchmarks are available but don't predict RAG scores well - Gemini 2.5 Pro's 8.6-point underperformance vs Flash on LIT-RAGBench is exactly why you shouldn't assume the strongest general model wins on RAG-specific tasks.
How often do these rankings change?
Fast. New frontier models release quarterly, and dedicated RAG benchmarks lag by several months. The LIT-RAGBench scores cover models available through March 2026. Check the last-verified date at the bottom of this page and the hallucination benchmarks leaderboard for updates.
Sources:
- LIT-RAGBench: Benchmarking Generator Capabilities of LLMs in RAG
- LIT-RAGBench paper HTML - full Table 2 results
- 7 RAG Benchmarks Explained
- Gemini Embedding 2: Natively Multimodal Embedding Model
- MTEB Multilingual Benchmark Scores - May 2026
- FaithJudge: Evaluating Hallucination in RAG Systems
- RGB: Benchmarking Large Language Models in RAG
- RAGBench: Explainable Benchmark for RAG Systems
- Cohere RAG Documentation
- LLM API Pricing 2026
✓ Last verified June 10, 2026
