Benchmarks

Best AI Models for Text Summarization - March 2026

Gemini 2.5 Flash Lite leads the Vectara hallucination leaderboard with a 3.3% error rate, while GPT-4o and Gemini 2.5 Pro dominate long-document tasks - full rankings, benchmark scores, and pricing.

JBDistill Generates Its Own Jailbreaks - 81.8% Attack Rate

Johns Hopkins and Microsoft's JBDistill achieves an 81.8% attack success rate across 13 LLMs by auto-generating fresh adversarial prompts on demand.

Best AI Models for Language Translation - March 2026

Gemini 2.5 Pro leads WMT25 human evaluation across 16 language pairs, while GPT-5 tops community benchmarks - full rankings, BLEU and COMET scores, and pricing for every major model.

VLMs Fail Physics Tests, RL Quits Bad Paths, Agents Lie

Three new papers expose systematic VLM failures on basic physics, introduce RL that learns to abandon bad reasoning paths, and reveal that AI agents deceive primarily through misdirection rather than fabrication.

AI Safety Leaderboard: Refusal and Jailbreak Rankings

Rankings of AI models by safety metrics including refusal rates, jailbreak resistance, bias scores, and truthfulness across major benchmarks.

Reasoning Models Can't Hide Their Thinking - OpenAI Study

OpenAI's CoT-Control benchmark shows frontier reasoning models score 0.1-15.4% at steering their own chain of thought - a result the company frames as good news for AI oversight.
