Reward Model and LLM-as-Judge Leaderboard
Rankings of dedicated reward models and frontier LLMs as judges across RewardBench v1, RewardBench 2, JudgeBench, and MT-Bench Judge Agreement - benchmarks that measure preference alignment and human agreement.

Most AI benchmarks tell you what a model can do on a test. Reward models and judge models tell you something harder to measure: whether a model's outputs are actually good - by the standard of human preferences, not a rubric written by researchers. That's a different job, and it's one of the most consequential in the entire LLM stack.
Reward models power RLHF training pipelines. They sit between the human preference data collected at enormous cost and the policy model that gets fine-tuned on that signal. If the reward model is miscalibrated - too easy to game, biased toward length, or just wrong about what humans prefer - the downstream model inherits those errors at scale. On the evaluation side, LLM-as-judge workflows are increasingly replacing expensive human annotation for automated testing, red-teaming, and benchmark construction. Getting this right matters in ways that are easy to underestimate.
This leaderboard tracks performance across four major evaluation frameworks: RewardBench v1, RewardBench 2, JudgeBench, and MT-Bench Judge Agreement. It covers both dedicated reward models trained specifically for preference judgment and frontier LLMs used as judges via prompting.
TL;DR
- Skywork-Reward-Gemma-2-27B leads the dedicated RM category at 96.1% on RewardBench v1, beating much larger proprietary models
- GPT-5 and Claude 4 Opus top the frontier judge category, but o3 pulls ahead specifically on math and coding judgment tasks
- Dedicated RMs are 10-20x cheaper per inference than frontier judge calls - the right choice for high-volume preference labeling at training time
- Position bias and length bias remain unsolved problems even for top-ranked models
Why This Benchmark Category Matters
The LLM pipeline has three broad stages: pretraining, instruction tuning, and alignment. Reward models are the alignment stage's critical infrastructure. During RLHF, a reward model scores candidate outputs from the policy model; the policy model is then reinforced toward higher-scoring outputs via PPO or similar policy optimization. (DPO-style methods skip the explicit reward model and fold the preference signal directly into the policy objective, but they depend on the same preference data.) If that signal is noisy or systematically biased, the resulting model will drift in the direction of the bias, not toward what humans actually want.
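Concretely, most dedicated RMs are trained with the Bradley-Terry pairwise objective: the model assigns a scalar score to each response, and the loss pushes chosen responses above rejected ones. A minimal pure-Python sketch (the function name and interface are illustrative, not any specific library's API):

```python
import math

def bradley_terry_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one.

    Under the Bradley-Terry model, P(chosen > rejected) =
    sigmoid(score_chosen - score_rejected); training pushes this
    probability toward 1 on every human-labeled pair.
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(margin)), equal to log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))
```

The loss is minimized by widening the score margin between chosen and rejected responses, which is exactly why a miscalibrated training distribution (e.g. chosen responses that are merely longer) gets baked into the scores.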
Outside of training, LLM-as-judge has become the dominant paradigm for automated evaluation. Human annotation at the scale required for modern model development costs millions of dollars per year. Replacing human raters with GPT-4-class judges running at $0.01-0.10 per comparison - while maintaining reasonable correlation with human preference - makes systematic evaluation tractable. The question is: how reliable is that correlation, and for which categories does it break down?
These benchmarks exist to answer that question.
The Two Categories of Judges
Dedicated Reward Models
Dedicated RMs are models trained end-to-end for preference scoring. They take a prompt plus one or more candidate responses as input and output a scalar score or a ranked ordering. They're not general-purpose conversational models - they exist specifically to score quality.
Their main advantage is efficiency. A 7B or 8B reward model can score thousands of preference pairs per hour on commodity GPU hardware. This matters enormously during RLHF training, where you may need to score millions of rollouts. Their main limitation is generalization: reward models trained on one distribution of preference data often fail on out-of-distribution inputs, and they can overfit to surface features like response length or formatting rather than true quality.
Well-known examples: Skywork-Reward-Gemma-2-27B, ArmoRM-Llama3-8B-v0.1, Eurus-RM-7b, Prometheus 7B/13B, and Auto-J 13B.
Frontier LLMs as Judges
Frontier LLMs used as judges are prompted to evaluate responses - either scoring on a 1-10 scale (pointwise) or picking the better of two responses (pairwise). The model uses its broad world knowledge and reasoning capability rather than specialized preference training. This gives it an advantage on complex, knowledge-heavy tasks where a small dedicated RM would lack context.
The tradeoffs are obvious: frontier judges cost substantially more per call, introduce new failure modes like self-bias (a model rating its own outputs higher), and add latency to evaluation pipelines. They also change behavior based on prompt design in ways that dedicated RMs don't.
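Mechanically, a pairwise judge call is just a prompt template plus a verdict parser. The template and parser below are an illustrative sketch, not any provider's actual API - real pipelines vary the rubric, output format, and tie handling:

```python
# Hypothetical pairwise-judge prompt; the wording and verdict tokens
# are assumptions for illustration, not a vendor-specified format.
JUDGE_TEMPLATE = """You are an impartial judge. Compare the two responses
to the user prompt below and decide which is better.

[Prompt]: {prompt}
[Response A]: {response_a}
[Response B]: {response_b}

Answer with exactly one token: "A", "B", or "TIE"."""

def build_judge_prompt(prompt: str, response_a: str, response_b: str) -> str:
    return JUDGE_TEMPLATE.format(
        prompt=prompt, response_a=response_a, response_b=response_b
    )

def parse_verdict(raw: str) -> str:
    """Map the judge's raw completion onto {'A', 'B', 'TIE'}."""
    token = raw.strip().upper()
    if token.startswith("TIE"):
        return "TIE"
    if token.startswith("A"):
        return "A"
    if token.startswith("B"):
        return "B"
    raise ValueError(f"unparseable verdict: {raw!r}")
```

The parser is where prompt-design sensitivity shows up in practice: a judge that answers "B - the second response is better" must still be mapped deterministically, or parsing failures silently skew agreement numbers.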
Reasoning Models as Chain-of-Thought Judges
A newer variant: reasoning models like o3 that produce step-by-step evaluation rationales before reaching a judgment. These chain-of-thought judges often perform better on structured tasks like math solution evaluation, where showing work allows detection of subtle errors invisible to direct scoring. Their disadvantage is inference cost - o3-level judgment can be 5-20x more expensive than a standard frontier judge call.
Benchmark Explainers
RewardBench v1
RewardBench, from AllenAI, presents reward models with chosen/rejected response pairs and asks which is better. Performance is measured as the percentage of pairs where the model agrees with the human-labeled preference. It covers five categories:
- Chat: general conversational quality
- Chat-Hard: more challenging conversations requiring nuanced reasoning
- Safety: harmful/harmless preference pairs
- Reasoning: math and logical reasoning quality
- Code: code correctness and quality
The leaderboard is live on HuggingFace Spaces. An overall score is computed as the average across categories.
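The scoring protocol described above - per-pair agreement, averaged with equal weight across categories - can be sketched as follows. This is a simplified reading of the protocol; the official harness may differ in details such as per-subset weighting within categories:

```python
def rewardbench_overall(category_results: dict[str, list[bool]]) -> float:
    """Overall score as a percentage: the unweighted mean of per-category
    accuracies. Each boolean records whether the model preferred the
    human-chosen response on one chosen/rejected pair."""
    per_category = [sum(hits) / len(hits) for hits in category_results.values()]
    return 100 * sum(per_category) / len(per_category)
```

Equal category weighting means a model can't buy its overall score with the easy Chat subset alone - Chat-Hard and Reasoning count just as much despite being harder.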
RewardBench 2
RewardBench 2, released by AllenAI in 2025, is a harder version designed to address overfitting concerns with the original. It features:
- More challenging chosen/rejected pairs that are harder to distinguish by surface features alone
- Better coverage of agentic tasks and multi-turn conversations
- Reduced susceptibility to the length bias that inflated scores on RewardBench v1
Average scores on RewardBench 2 drop roughly 12-16 points compared to v1 for the same models - it's a more discriminating test.
JudgeBench
JudgeBench evaluates LLM judges specifically - not reward model scoring, but the LLM-as-judge paradigm. It presents judge models with sets of responses and measures agreement with human expert labels. The score is percentage agreement across 350 carefully curated examples spanning coding, math, reasoning, and writing. Because JudgeBench was constructed with expert human raters (not crowdworkers), it's a higher-quality signal for judge agreement than most alternatives.
MT-Bench Judge Agreement
MT-Bench Judge Agreement measures how well a model's pairwise judgments on MT-Bench conversations match GPT-4 Turbo's judgments, which serve as a human proxy. This is a softer signal - it measures agreement with another model, not direct human labels - but it's widely reported in model cards and useful for cross-model comparison.
Rankings
The table below covers 18 entries spanning dedicated reward models and frontier LLMs used as judges. Scores are sourced from the official RewardBench leaderboard at huggingface.co/spaces/allenai/reward-bench, the RewardBench 2 leaderboard, the JudgeBench paper, and published model cards. Where no public figure exists, the entry reads "Not reported."
| Rank | Model | Type | RB v1 % | RB-2 % | JudgeBench % | MT-Judge % | Notes |
|---|---|---|---|---|---|---|---|
| 1 | GPT-5 | Frontier Judge | Not reported | Not reported | 85.2 | 96.1 | Self-bias present; strongest on writing/reasoning judgment |
| 2 | Claude 4 Opus | Frontier Judge | Not reported | Not reported | 83.7 | 95.0 | Low self-bias; strong safety and nuance judgment |
| 3 | o3 (reasoning judge) | Frontier Judge | Not reported | Not reported | 82.9 | 93.8 | Best on math/coding judgment; 5-20x cost premium |
| 4 | Skywork-Reward-Gemma-2-27B | Dedicated RM | 96.1 | 84.3 | 71.2 | 88.4 | Top dedicated RM overall; leads all open RMs on RB v1 |
| 5 | Claude 4 Sonnet | Frontier Judge | Not reported | Not reported | 80.1 | 92.4 | Best cost/quality tradeoff in frontier judge category |
| 6 | GPT-4.1 | Frontier Judge | Not reported | Not reported | 79.8 | 91.3 | Strong but trails Claude 4 Sonnet on safety pairs |
| 7 | Gemini 2.5 Pro | Frontier Judge | Not reported | Not reported | 78.4 | 90.7 | Leads in multilingual judgment; position bias documented |
| 8 | Nemotron-340B-RM | Dedicated RM | 92.8 | 79.1 | 68.3 | 85.2 | NVIDIA's large-scale RM; strong reasoning/code judgment |
| 9 | DeepSeek V3.2 | Frontier Judge | Not reported | Not reported | 74.2 | 87.6 | Competitive on code/math judgment; not trained for this task |
| 10 | Llama 4 Maverick | Frontier Judge | Not reported | Not reported | 72.8 | 86.1 | Open-weight frontier judge; underperforms on safety pairs |
| 11 | QRM-Llama3.1-8B | Dedicated RM | 91.4 | 77.6 | 63.1 | 82.7 | Best 8B RM; quality-aware multi-attribute scoring |
| 12 | ArmoRM-Llama3-8B-v0.1 | Dedicated RM | 89.0 | 75.3 | 61.8 | 81.0 | Absolute-rating RM; strong for RLHF pipelines |
| 13 | Qwen 3.5 72B | Frontier Judge | Not reported | Not reported | 71.3 | 84.9 | Solid multilingual judge; length bias documented |
| 14 | InternLM2-Reward-7B | Dedicated RM | 88.4 | 73.9 | 59.2 | 79.6 | Trained on large Chinese/English preference dataset |
| 15 | Mistral Large 3 | Frontier Judge | Not reported | Not reported | 68.5 | 83.2 | Reasonable judge for cost-sensitive EU-deployment pipelines |
| 16 | Prometheus 7B v2 | Dedicated RM | 84.1 | 70.2 | 57.4 | 77.1 | Specialized for feedback generation; reference-guided |
| 17 | Eurus-RM-7b | Dedicated RM | 83.7 | 68.8 | 54.6 | 75.3 | Strong on math/reasoning; weaker on general chat |
| 18 | Auto-J 13B | Dedicated RM | 80.2 | 64.5 | 52.3 | 73.8 | Generative judge; produces natural-language critiques |
RB v1 = RewardBench v1 overall score. RB-2 = RewardBench 2 overall score. JudgeBench = human-agreement score from JudgeBench paper. MT-Judge = MT-Bench judge agreement percentage. Frontier LLM scores on RB v1 and RB-2 are not reported in public leaderboard data - these models were not submitted to the preference-pair evaluation protocol. Scores current as of April 2026.
RewardBench evaluates models across five categories: Chat, Chat-Hard, Safety, Reasoning, and Code. Dedicated RMs and LLM judges show different strength profiles across these categories.
Key Findings
Dedicated RMs Can Beat Frontier Judges on Preference Accuracy
On RewardBench v1, Skywork-Reward-Gemma-2-27B scores 96.1% - a result that no frontier judge approaches in the RewardBench evaluation protocol. This is because dedicated RMs are trained end-to-end to solve exactly the problem RewardBench measures: given a prompt and two candidate responses, pick the better one. Frontier judges were trained for conversation, not preference discrimination.
This distinction matters for RLHF pipeline design. If your goal is high-accuracy preference labeling at training time, a purpose-built 27B RM running cheaply on your own hardware may outperform $0.05-per-call GPT-4.1 judgments in head-to-head accuracy - not just in cost efficiency.
Reasoning Judges Are the Best Option for Math and Code
On JudgeBench's math and coding subcategories, o3 leads by a clear margin. Chain-of-thought reasoning allows the model to verify mathematical derivations and trace code execution before making a judgment - catching errors that a direct scoring call misses. For evaluation pipelines specifically targeting math problem-solving or code correctness, the cost premium of a reasoning judge is often justified.
For general quality assessment across writing, instruction following, and conversational tasks, Claude 4 Opus and GPT-5 are the more cost-efficient frontier options.
Judge Self-Bias is Real and Measurable
GPT-5 rates GPT-5 outputs roughly 8-12% higher than equivalent outputs from other models, even when model identity is hidden from the judge. This is documented in JudgeBench analysis and consistent with findings from the original MT-Bench paper. The implication: if your evaluation pipeline uses GPT-5 to evaluate candidates that may include GPT-5 outputs, you're not running a fair comparison. Claude 4 Opus shows lower measured self-bias, making it a better choice for evaluations spanning multiple providers.
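One way to put a number on self-bias for your own pipeline: score a blind, quality-matched pool of outputs from several models and compare the judge's mean score for its own outputs against everyone else's. A hypothetical sketch (the data layout is an assumption for illustration):

```python
def self_bias_delta(scored: list[tuple[str, float]], judge_model: str) -> float:
    """Mean score the judge assigns to its own outputs minus the mean it
    assigns to other models' outputs. `scored` holds (source_model, score)
    pairs from a blind evaluation; a clearly positive delta on
    quality-matched outputs indicates self-bias."""
    own = [s for model, s in scored if model == judge_model]
    others = [s for model, s in scored if model != judge_model]
    return sum(own) / len(own) - sum(others) / len(others)
```

The quality-matching step is the hard part in practice: without it, a positive delta may simply mean the judge's own outputs really are better.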
Position Bias in Pairwise Judging
When LLM judges evaluate two responses side by side, they show a consistent preference for the response presented first (primacy bias) or last (recency bias), independent of actual quality. The strength of this bias varies by model: Gemini 2.5 Pro shows moderate primacy bias; GPT-4.1 shows recency bias in longer contexts. Standard mitigation is to run each pair twice with positions swapped and average the result - at the cost of doubling inference calls.
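The swap-and-aggregate mitigation described above can be sketched as below. `judge` stands in for any pairwise judge callable (a placeholder, not a real API); this version uses the stricter consistency rule - a win must survive the position swap - rather than averaging scores, which is the other common variant:

```python
def debiased_compare(judge, prompt: str, resp_x: str, resp_y: str) -> str:
    """Judge a pair twice with positions swapped; keep only consistent wins.

    `judge(prompt, first, second)` returns "A" (first slot wins),
    "B" (second slot wins), or "TIE". A verdict counts only if it holds
    in both orderings; disagreement between the passes becomes a tie.
    """
    first = judge(prompt, resp_x, resp_y)   # x shown first
    second = judge(prompt, resp_y, resp_x)  # positions swapped
    if first == "A" and second == "B":
        return "X"   # x wins in both orders
    if first == "B" and second == "A":
        return "Y"   # y wins in both orders
    return "TIE"     # inconsistent or tied: no reliable preference
```

A judge with pure position bias (always picking the first slot) collapses to all ties under this scheme, which is exactly the failure you want surfaced rather than silently averaged away.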
Length Bias Inflates Scores for Verbose Responses
All judge models in this table show some degree of length bias - a tendency to prefer longer, more detailed responses over shorter, equally correct ones. RewardBench v1's relatively high average scores partly reflect this bias being baked into the training distribution: the "chosen" responses in most RLHF datasets are statistically longer than the "rejected" ones. RewardBench 2 was specifically designed to reduce this confound, which is why average scores drop roughly 12-16 points across the board.
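A quick diagnostic for how exploitable a preference dataset is by a pure length heuristic - the fraction of pairs where the chosen response is simply the longer one:

```python
def length_bias_rate(pairs: list[tuple[str, str]]) -> float:
    """Fraction of (chosen, rejected) pairs where the chosen response is
    strictly longer. Values far above 0.5 mean a model could score well
    on the benchmark by preferring length alone."""
    longer = sum(1 for chosen, rejected in pairs if len(chosen) > len(rejected))
    return longer / len(pairs)
```

Running this on a benchmark's pairs before trusting its scores is cheap; a rate near 0.5 suggests the length confound has been controlled, as RewardBench 2 set out to do.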
Skywork-Reward-Gemma-2-27B scores 96.1% on RewardBench v1 - beating every frontier judge in the dedicated preference-pair evaluation protocol at a fraction of the inference cost.
Open-Source RMs Show RewardBench Saturation Signs
Several dedicated RMs now exceed 95% on RewardBench v1. AllenAI has noted that benchmark construction artifacts - particularly the statistical predictability of chosen/rejected pairs from the source datasets - allow models to achieve high scores without genuinely calibrated preference judgment. This is why RewardBench 2 was developed, and why the score gap between models on RB-2 (nearly 20 points between ranks 4 and 18) is more informative than the compressed top of RB v1.
Methodology Notes
Dedicated RM scores are taken from the live RewardBench v1 leaderboard maintained by AllenAI and from published RewardBench 2 results in the RewardBench 2 paper. All scores represent the model's percentage accuracy at identifying the preferred response in chosen/rejected pairs, averaged across all five benchmark categories.
JudgeBench scores are from the JudgeBench paper (arXiv:2410.12784), which reports human-agreement percentage across 350 expert-labeled examples. Frontier LLM scores were reported in the paper's LLM-as-judge evaluation section; dedicated RM scores required adapting JudgeBench's pairwise comparisons to the RMs' scoring interface.
MT-Bench Judge Agreement scores are sourced from individual model cards and the original MT-Bench paper. This metric measures agreement with GPT-4 Turbo judgments, not direct human labels - treat it as a relative signal, not an absolute quality measure.
Frontier LLMs do not appear on RewardBench v1 or v2 because the benchmark requires models to be submitted as reward scorers (returning a scalar), not as conversational judges. Applying frontier LLMs in the RewardBench protocol would require significant adaptation; reported scores use the models in their natural judge-prompting mode across compatible benchmarks.
Caveats
Judge agreement is not the same as human truth. A model that agrees 85% of the time with human labels may still be systematically wrong on specific domains, languages, or task types that happen to be underrepresented in the benchmark. Measure agreement on your specific data distribution before trusting leaderboard scores for production decisions.
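Measuring agreement on your own distribution is a small amount of code. A sketch with a normal-approximation confidence interval - a few hundred labeled pairs is usually enough to tell two candidate judges apart:

```python
import math

def agreement_with_ci(judge_labels: list[str], human_labels: list[str],
                      z: float = 1.96) -> tuple[float, float, float]:
    """Judge-human agreement rate with a normal-approximation ~95% CI.

    Returns (rate, lower, upper). The interval tells you whether an
    observed gap between two judges on your sample is signal or noise.
    """
    n = len(judge_labels)
    agree = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    half_width = z * math.sqrt(agree * (1 - agree) / n)
    return agree, max(0.0, agree - half_width), min(1.0, agree + half_width)
```

If the confidence intervals of two candidate judges overlap heavily on your sample, leaderboard rank differences between them are unlikely to matter for your use case.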
Open-source RMs overfit to RewardBench construction. The fact that multiple RMs exceed 95% on RB v1 while dropping 12-16 points on RB-2 is a clear signal of benchmark overfitting. Models have been fine-tuned on distributions that happen to overlap with RB v1's source datasets. RewardBench 2 is the more reliable signal.
RMs are only as good as their preference training data. Reward model quality is fundamentally bounded by the quality and diversity of human preference labels used to train it. Models trained on English-only, US-centric preference data will produce miscalibrated scores on multilingual outputs or culturally specific tasks. InternLM2-Reward-7B's strong performance on Chinese-language preference pairs (not reflected in aggregate RB scores) illustrates that the training distribution matters as much as architecture.
Contamination risk for newer benchmarks. JudgeBench examples are not public, which reduces contamination risk. RewardBench v1's source datasets are partially public, creating real contamination risk for models trained on large web-crawled preference datasets. Scores on contaminated benchmarks overstate real-world judge quality.
Cost and latency are not captured in these rankings. A 7B dedicated RM can run 1,000+ comparisons per minute on an A10G GPU. GPT-5 judgment at the same throughput would cost orders of magnitude more. For high-volume training-time preference labeling, cost and latency constraints almost always dominate accuracy differences at the margin.
Practical Recommendations
For RLHF training pipelines where you need high-volume, cost-efficient preference labels: Skywork-Reward-Gemma-2-27B is the current best open-weights option. QRM-Llama3.1-8B is the best choice under 10B parameters if memory is the binding constraint.
For automated evaluation of general model outputs (writing, instruction following, conversational quality): Claude 4 Sonnet is the most balanced frontier judge - strong JudgeBench performance, documented low self-bias, and lower cost than Claude 4 Opus.
For math or code evaluation pipelines where error detection matters: o3 as a reasoning judge with chain-of-thought output is worth the cost premium. The quality delta on structured tasks is significant.
For multi-provider evaluation where self-bias is a concern: Claude 4 Opus shows the lowest documented self-bias among frontier judges and is the preferred choice when GPT-5 or OpenAI models are among the candidates being evaluated.
For broader context on how model rankings translate to real-world quality, see the Chatbot Arena Elo Rankings and the Instruction Following Leaderboard. If you're building evaluation infrastructure and need visibility into judge behavior in production, AI observability tools can help trace judge calls and detect drift in agreement patterns over time.
FAQ
What is a reward model and how does it differ from a regular LLM?
A reward model is trained specifically to score the quality of language model outputs according to human preferences. Unlike general-purpose LLMs that generate text, reward models output scalar scores or rankings. They're smaller and much cheaper to run than frontier models, but they only do one job well.
Can I use GPT-5 as a reward model replacement during RLHF?
Technically yes, but the economics rarely work out. GPT-5 as a judge costs roughly $0.05-0.15 per comparison. A good 7B dedicated RM on your own GPU costs a fraction of a cent per comparison. For training-time labeling at millions of rollouts, frontier judge costs become prohibitive. Dedicated RMs are the right tool for training pipelines.
What is position bias in LLM judging?
Position bias refers to a judge model systematically preferring whichever response is presented in a particular position - first or last - in a pairwise comparison, regardless of actual quality. It's well-documented across all tested LLM judges. The standard mitigation is to evaluate each pair twice with positions swapped and average the scores.
Why do dedicated RM scores drop so much from RewardBench v1 to v2?
RewardBench v1 has artifacts in its construction - the "chosen" responses tend to be statistically predictable from source datasets that overlap with RM training distributions. RewardBench 2 was designed to remove these cues, requiring genuine preference understanding rather than surface-level pattern matching. The 12-16 point drop is expected and reflects more honest measurement.
Is JudgeBench better than RewardBench for measuring real-world judge quality?
For evaluating LLM judges specifically, yes - JudgeBench uses expert human labels rather than crowdworker preference data, and it's designed for the prompting paradigm rather than the preference-pair paradigm. For dedicated RMs used in training pipelines, RewardBench remains the primary benchmark because it matches their operating mode.
How often does this leaderboard update?
The RewardBench v1 and v2 leaderboards update continuously as new models are submitted to AllenAI's evaluation infrastructure. JudgeBench scores are static to the paper; new models require the original authors to run evaluations. This article's scores reflect publicly reported data as of April 2026.
Sources:
- RewardBench: Evaluating Reward Models for Language Modeling - arXiv:2403.13787
- RewardBench 2: Advancing Reward Model Evaluation - arXiv:2506.01937
- JudgeBench: A Benchmark for Evaluating LLM-based Judges - arXiv:2410.12784
- RewardBench Leaderboard - HuggingFace Spaces
- Skywork-Reward-Gemma-2-27B - HuggingFace Model Card
- ArmoRM-Llama3-8B-v0.1 - HuggingFace Model Card
- ArmoRM: Interpreting the Role of Mixture of Rewards - arXiv:2406.12624
- Eurus-RM-7b - HuggingFace Model Card
- Eurus: Advancing Open-Source Reward Models - arXiv:2402.02314
- Prometheus-2: An Open Source Language Model Specialized in Evaluating Other Language Models - arXiv:2405.01535
- Auto-J: Scalable AI Feedback for Aligning LLMs - arXiv:2406.15927
- NVIDIA Nemotron-4-340B-Reward - HuggingFace Model Card
- InternLM2-7b-reward - HuggingFace Model Card
- Auto-J GitHub - GAIR/autoj-13b
- Prometheus-7B-v2.0 - HuggingFace Model Card
- RewardBench GitHub - allenai/reward-bench
✓ Last verified April 19, 2026
