Articles Tagged "Benchmarks"

OpenAI Launches GPT-5.5 for Agents and Work

OpenAI's first fully retrained base model since GPT-4.5 ships today to ChatGPT and Codex, leading Terminal-Bench 2.0 at 82.7% but at double the per-token price.

Embedding Model Leaderboard: MTEB Rankings April 2026

April 2026 rankings of the top embedding models by MTEB score - Gemini Embedding 001, NV-Embed-v2, Qwen3-Embedding-8B, and the new Jina v4 multimodal release compared for RAG and search.

MIT's Recursive Language Models Bypass the Context Ceiling

MIT researchers show that treating long documents as a Python environment - and letting models recursively spawn sub-models to explore them - beats RAG and extended context windows on every benchmark tested.

Best AI Models for Math Reasoning - April 2026

Gemini 3.1 Pro leads GPQA Diamond at 94.1% and HLE at 44.7% as AIME 2025 saturates; Claude Opus 4.7 and Kimi K2.6 join the top tier in April 2026.

Bad Science, Poisoned Tools, and Aligned Reasoning

Three new papers show that AI scientific agents skip evidence, that tool-integrated agents are vulnerable to adversarial poisoning, and that reasoning-model safety can be fixed with 1,000 examples.

Qwen3.6-27B

Qwen3.6-27B is a 27B dense open-weight multimodal model from Alibaba, released under Apache 2.0, that scores 77.2% on SWE-bench Verified - beating Alibaba's own 397B MoE.

GLM-5.1

Z.ai's GLM-5.1 is an open-weight 754B MoE model that tops SWE-Bench Pro with 58.4, sustains 8-hour autonomous coding sessions, is released under the MIT license, and costs $0.95/M input tokens.

Leaner Reasoning, Fragile Agents, and Model Self-Audit

Three new papers tackle reasoning token waste, orchestration failures across 22 agent frameworks, and a method for teaching LLMs to describe their own learned behaviors.

ERNIE 5.0

Baidu's ERNIE 5.0 combines 2.4 trillion parameters with native omni-modal design, landing in LMArena's global top 10 and outpacing GPT-5 High on chart and document benchmarks.

EXAONE 4.5

LG AI Research's first open-weight vision-language model packs 33B parameters, 262K context, and STEM scores above GPT-5-mini - but ships under a non-commercial license.

Qwen3.6-Max-Preview

Alibaba's first closed-weights flagship Qwen ships with a 256K context window, tops six agentic coding benchmarks, and ranks third on the Artificial Analysis Intelligence Index.

GPT-Rosalind

OpenAI's first domain-specific reasoning model for biology and drug discovery, launched April 16, 2026 as a US-only research preview with a 0.751 BixBench score.
