Articles Tagged "LLM"

Medical LLM Leaderboard 2026: MedQA, USMLE, PubMedQA

Rankings of AI models on medical QA benchmarks - MedQA USMLE, MedMCQA, PubMedQA, MMLU-Medical, HealthBench, and more - where a wrong answer has clinical consequences.

Reward Model and LLM-as-Judge Leaderboard 2026 Ranked

Rankings of dedicated reward models and frontier LLMs as judges across RewardBench, RewardBench-2, and JudgeBench - benchmarks that measure preference alignment and human agreement.

Scientific Reasoning LLM Leaderboard 2026: GPQA Ranks

Rankings of AI models on STEM benchmarks: GPQA Diamond, SciBench, OlympiadBench-Science, MMLU-STEM, ARC-Challenge, and ChemQA/Physics Olympiad as of April 2026.

Structured Output JSON Schema Leaderboard 2026

Rankings of LLMs and constrained decoding frameworks on JSON schema adherence benchmarks including JSONSchemaBench and BFCL v3, covering native APIs and open-source constraint engines.

Summarization LLM Leaderboard 2026: ROUGE and Faithfulness

Rankings of the top LLMs on summarization benchmarks - ROUGE-L, BERTScore, FActScore, and human preference across CNN/DailyMail, XSum, GovReport, QMSum, and BookSum as of April 2026.

Text-to-SQL LLM Leaderboard 2026: Spider and BIRD Ranked

Rankings of the best LLMs and agent pipelines on BIRD, Spider 2.0, CoSQL, and SParC text-to-SQL benchmarks, with execution accuracy scores and analysis.

MoE Routing, Prompt Gambles, and Where Reasoning Breaks

Three new papers challenge assumptions in MoE routing design, prompt optimization workflows, and LLM reasoning chains - all published this week on arXiv.

Hallucination Benchmarks Leaderboard: April 2026

Rankings of the top AI models on factuality and hallucination benchmarks: TruthfulQA, SimpleQA, FACTS Grounding, Vectara HHEM, HaluEval, HalluLens, and AA-Omniscience as of April 2026.

Function Calling Benchmarks Leaderboard 2026

Rankings of top LLMs on function calling and tool use benchmarks including BFCL v3, tau-bench, ToolBench, and FinTrace as of April 2026.

OpenAI Releases GPT-Rosalind for Drug Discovery

OpenAI launched GPT-Rosalind on April 16, a frontier reasoning model for drug discovery that outranked human experts on RNA prediction and competes directly with Google DeepMind's AlphaFold.

LLM Chaos, AI Peer Review, and Auto Fine-Tuning

Three papers today: floating-point chaos in transformers, GPT-5 reviewing 22,977 AAAI papers, and an agent system that automates LLM fine-tuning and outperforms human experts.

Arcee Trinity

Arcee Trinity-Large-Thinking is a 400B sparse MoE open-source reasoning model that ranks #2 on PinchBench at $0.85/M output tokens, 28x cheaper than Claude Opus 4.6.
