Retrieval-augmented generation is the dominant architecture for production LLM deployments that need accurate, up-to-date answers from private or domain-specific knowledge bases. The model is only one component — retrieval quality, chunking strategy, and reranking all matter — but model selection has an outsized effect on the quality of the final output, particularly for tasks where hallucination is costly.
Most benchmark suites evaluate models in isolation: closed-book QA where the model answers from parametric knowledge. RAG flips this contract. The model is given retrieved context and must read, synthesize, and answer from it — staying faithful to the source even when the source conflicts with its training data. That's a different capability profile than general reasoning, and the ranking reflects it.
What RAG Actually Requires
A model deployed in a RAG pipeline needs to do several things well simultaneously:
Faithfulness to retrieved context — The model must produce answers grounded in the retrieved passages, not answers constructed from training data. When the retrieved context says one thing and the model's training data implies another, the correct behavior is to follow the context. Models that override retrieved content with memorized knowledge are the central failure mode in RAG deployments.
Citation quality — Strong RAG models don't just produce accurate answers; they attribute claims to specific retrieved passages in ways that are both accurate (the cited passage actually supports the claim) and complete (claims without support are flagged, not silently omitted). Citation quality is hard to benchmark and easy to fake — models that produce plausible-looking citations that don't match retrieved content are worse than models that decline to cite.
Multi-document synthesis — Real retrieval pipelines return multiple passages, often with overlapping, redundant, or contradictory information. The model must identify where documents agree, where they conflict, and produce a synthesized answer that reflects the actual state of the evidence — not just the first passage.
Reading long contexts without degrading — RAG context windows are often large. Retrieved passages plus conversation history can run to 20K–80K tokens in production systems. Models that maintain accuracy and faithfulness at long contexts are materially more valuable in RAG pipelines than models with equivalent short-context capability.
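Most of these requirements land in the prompt before the model ever runs. Below is a minimal sketch of grounded-prompt assembly in Python; the passage-numbering scheme and instruction wording are illustrative assumptions, not any specific framework's API:

```python
# Sketch: assemble a RAG prompt that enforces grounding and citations.
# Passages are numbered so the model can cite them as [1], [2], ...
# The instruction text is one reasonable phrasing, not a standard.

def build_rag_prompt(question: str, passages: list[str]) -> str:
    numbered = "\n\n".join(
        f"[{i}] {p}" for i, p in enumerate(passages, start=1)
    )
    return (
        "Answer using ONLY the passages below. Cite each claim with "
        "the passage number in brackets, e.g. [2]. If the passages do "
        "not contain the answer, say so explicitly.\n\n"
        f"Passages:\n{numbered}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_rag_prompt(
    "When was the plant commissioned?",
    ["The plant was commissioned in 1998.", "Capacity is 40 MW."],
)
```

Explicit numbering also makes citation checking mechanical downstream: a cited `[n]` can be matched back to exactly one passage.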
Faithfulness and accuracy are the primary signals for RAG performance in our data. Models that score high on factual grounding — staying close to what the evidence supports — outperform models with higher general reasoning scores that tend to substitute memorized knowledge for retrieved context.
How RAG Differs from Open-Ended QA
The intuition that "smarter models are better at RAG" is partially wrong. Highly capable models trained on large corpora have strong opinions about many factual questions — opinions encoded in their weights. In RAG settings, these opinions can interfere with faithful use of retrieved context.
Retrieval replaces memory, but only if the model lets it. A model that "knows" the answer to a question from training may produce that answer even when the retrieved context says something different. The failure is subtle — the output often looks well-reasoned and the incorrect answer may be plausible — but it defeats the core purpose of RAG.
Instruction-following quality is underrated. A prompt that says "answer only from the provided context; if the context is insufficient, say so" is an instruction. Models that follow it accurately, even when their parametric knowledge would let them produce a more complete-sounding answer, are more trustworthy in RAG pipelines than models with higher general capability that override the constraint.
Context window position matters. "Lost in the middle" is an empirically documented failure mode where models underweight information that appears in the middle of long contexts relative to information at the beginning or end. For RAG systems where relevant passages may appear anywhere in the retrieved set, this matters practically.
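One common mitigation is to reorder retrieved passages so the highest-ranked ones sit at the edges of the context, where attention is empirically strongest, pushing the least relevant passages into the weak middle region. A hand-rolled sketch; the alternating placement is one convention, not a library call:

```python
# Sketch: "edges-first" reordering to mitigate lost-in-the-middle.
# Input is assumed sorted by retrieval score, best first.
# Even-indexed passages fill the front, odd-indexed fill the back,
# so the best passage opens the context and the second-best closes it.

def edges_first(passages: list[str]) -> list[str]:
    front, back = [], []
    for i, p in enumerate(passages):
        (front if i % 2 == 0 else back).append(p)
    return front + back[::-1]

order = edges_first(["p1", "p2", "p3", "p4", "p5"])
# "p1" lands first, "p2" lands last, "p5" (lowest-ranked) in the middle
```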
The Benchmark Landscape
FRAMES (Factuality, Retrieval, And reasoning MEasurement Set), from Google DeepMind, is the most comprehensive benchmark for RAG-adjacent evaluation. It tests multi-hop reasoning over retrieved documents and requires both accurate synthesis and appropriate grounding.
Knowledge-intensive QA benchmarks (TriviaQA, NQ, PopQA in open-book settings) measure how well models use provided context to answer factual questions — the closest standard benchmark proxy for RAG faithfulness.
Accuracy dimension scores in our leaderboard incorporate factual grounding signals across benchmarks, making them a strong proxy for RAG performance. Models with high accuracy scores tend to stay close to evidence, which is exactly the behavior RAG requires.
Long-context benchmarks (RULER, "needle in a haystack" variants) measure whether retrieved information is actually used when embedded in long context windows — directly relevant to production RAG where retrieved passages appear at arbitrary positions in long prompts.
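A needle-in-a-haystack probe is straightforward to run against your own pipeline: bury a known fact at a controlled depth in filler text and check whether the model's answer recovers it. A minimal sketch, where the needle and filler strings are placeholders:

```python
# Sketch: build a needle-in-a-haystack probe context.
# depth in [0, 1]: 0 places the needle at the start, 1 at the end,
# 0.5 in the middle -- the region where models degrade most.

def make_probe(needle: str, filler: str, n_fillers: int, depth: float) -> str:
    chunks = [filler] * n_fillers
    pos = int(depth * n_fillers)
    chunks.insert(pos, needle)
    return "\n".join(chunks)

ctx = make_probe(
    needle="The access code is 7741.",
    filler="The committee met again and adjourned without a decision.",
    n_fillers=100,
    depth=0.5,
)
# Feed ctx plus "What is the access code?" to the model under test,
# then check that "7741" appears in the reply. Sweep depth over
# [0.0, 0.25, 0.5, 0.75, 1.0] to map positional degradation.
```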
RAG performance is highly pipeline-dependent. Retrieval quality, chunking strategy, and embedding model choices often have as much impact as model selection. Benchmark RAG performance on your specific corpus and retrieval configuration before committing to a production model — general benchmarks are proxies, not predictions.
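For corpus-specific evaluation, a cheap lexical-overlap proxy can flag obviously ungrounded answers before reaching for NLI models or LLM judges. A sketch; the word-length cutoff and any pass/fail threshold are assumptions to tune per corpus:

```python
# Sketch: crude grounding proxy -- the fraction of the answer's
# content words that appear anywhere in the retrieved context.
# Words of <= 3 characters are dropped as stopword noise; real
# harnesses use NLI or judge models, this is a first-pass filter.

def grounding_score(answer: str, context: str) -> float:
    ctx_words = set(context.lower().split())
    ans_words = [w for w in answer.lower().split() if len(w) > 3]
    if not ans_words:
        return 1.0  # nothing substantive to check
    hits = sum(w in ctx_words for w in ans_words)
    return hits / len(ans_words)

context = "Records show the plant opened to the public in 1998."
grounded = grounding_score("The plant opened in 1998.", context)
ungrounded = grounding_score("Capacity is 40 MW.", context)
```

Scores near 1.0 mean the answer's vocabulary is drawn from the context; low scores are a signal to inspect the answer for parametric-knowledge leakage, not proof of hallucination.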
Current Rankings
Knowledge base Q&A (with citations) — business productivity category
| # | Model | Score |
|---|---|---|
| 1 | google/gemini-2.5-pro | 34.9 |
| 2 | openai/gpt-5-2025-08-07 | 33.8 |
| 3 | google/gemini-3.1-pro-preview | 33.3 |
| 4 | google/gemini-3-pro-preview | 32.3 |
| 5 | anthropic/claude-sonnet-4.6 | 28.9 |
| 6 | openai/gpt-5-mini-2025-08-07 | 28.4 |
| 7 | xai/grok-4-0709 | 27.8 |
| 8 | google/gemini-3.1-flash-lite-preview | 23.3 |
| 9 | google/gemini-3-flash-preview | 23.1 |
| 10 | openai/gpt-5.2-2025-12-11 | 22.6 |
| 11 | openai/gpt-5.4-2026-03-05 | 21.7 |
| 12 | anthropic/claude-sonnet-4 | 21.2 |
| 13 | openai/gpt-4.1-20250414 | 20.2 |
| 14 | google/gemini-2.5-flash | 20.0 |
| 15 | anthropic/claude-opus-4-5-20251101 | 19.7 |
Reading These Rankings
Accuracy-dimension models dominate. The models at the top of these rankings consistently have high accuracy dimension scores, which weight factual grounding and benchmark performance on knowledge-intensive tasks. This pattern is more consistent for RAG than for any other use case we track.
Reasoning model advantage is use-case-specific. For multi-hop RAG where answering requires chaining across multiple retrieved passages — synthesizing facts that no single passage contains — extended-thinking models have a meaningful advantage. For single-passage QA, the advantage is much smaller and may not justify the added latency and cost.
Context window depth varies. Some models maintain accuracy across 32K–128K token contexts; others degrade measurably as context length increases. For RAG pipelines that retrieve more than a few passages, verify the model's long-context performance rather than assuming headline context window sizes reflect usable depth.
Open-weight models are competitive. Several strong open-weight models rank competitively for RAG tasks, particularly for instruction-following faithfulness. For private data where sending retrieved context to external APIs raises security concerns, open-weight alternatives deployed on-premises are worth evaluating seriously.
Related Use Cases
- Document summarization — Multi-document synthesis without the retrieval dependency
- Accuracy rankings — Full model rankings on the Accuracy dimension, the primary signal for RAG performance
Full use-case rankings at /use-cases. Methodology at /methodology.