Articles Tagged "Benchmarks"

Vision-Language Benchmarks: Image Reasoning Ranked

Rankings of AI models on the key visual reasoning benchmarks - MMMU, MathVista, ChartQA, DocVQA, OCRBench, AI2D, CharXiv, and more - focused on image and document understanding.

RAG Benchmarks Leaderboard: Retrieval Rankings 2026

Rankings of the top embedding and RAG systems across BEIR, MTEB retrieval, MIRACL, MS MARCO, KILT, HotpotQA, and RAGTruth hallucination benchmarks as of April 2026.

Best AI Transcription Tools 2026: APIs and Services Ranked

A data-driven comparison of the top AI transcription APIs and services for 2026, covering WER accuracy, pricing per hour, speaker diarization, and output formats.

OpenAI Releases GPT-Rosalind for Drug Discovery

OpenAI launched GPT-Rosalind on April 16, a frontier reasoning model for drug discovery that outranked human experts on RNA prediction and competes directly with Google DeepMind's AlphaFold.

Claude Beat Human Alignment Researchers - Then Failed

Nine Claude Opus 4.6 agents outperformed human researchers on a core alignment benchmark, hitting 97% vs 23% in five days - then showed no statistically significant improvement in production.

LLM Chaos, AI Peer Review, and Auto Fine-Tuning

Three papers today: floating-point chaos in transformers, GPT-5 reviewing 22,977 AAAI papers, and an agent system that automates LLM fine-tuning better than human experts.

Arcee Trinity

Arcee Trinity-Large-Thinking is a 400B sparse MoE open-source reasoning model that ranks #2 on PinchBench at $0.85/M output tokens, 28x cheaper than Claude Opus 4.6.

Anthropic's latest flagship model ships with 3x higher resolution vision, a new xhigh effort level, task budgets for cost control, cyber safeguards, and 13% better coding performance at the same $5/$25 pricing.

Claude Opus 4.7 Is Here - Less Supervision, Better Vision

Anthropic releases Claude Opus 4.7 with 3x higher resolution vision, a new xhigh effort level, task budgets for cost control, /ultrareview in Claude Code, and cyber safeguards that automatically block high-risk requests.

Best AI Models for Image Generation - April 2026

GPT Image 1.5 leads Artificial Analysis at 1,278 Elo while Nano Banana 2 tops Arena.ai - two leaderboards, two answers, and five new models that reshaped the rankings since March.

Google Ships Gemini 3.1 Flash TTS With 200 Audio Tags

Google's new Gemini 3.1 Flash TTS hits Elo 1,211 on the Artificial Analysis leaderboard and introduces 200-plus audio tags for mid-sentence voice control, available in preview today via the Gemini API.

Best AI Test Generation Tools 2026 - 5 Compared

A hands-on comparison of the top AI unit test generation tools in 2026, covering Qodo Gen, GitHub Copilot, Diffblue Cover, Keploy, and Tusk.
