Articles Tagged "Benchmarks"

GPT-5.4 mini

OpenAI's mid-range model in the GPT-5.4 family delivers near-flagship coding and agentic performance at $0.75/M input tokens with a 400K context window.

AI Research Bias, Deception Probes, and Code Exploits

AI agents reproduce 72% of human research ideological bias, lie detectors improve with model scale, and Mastermind beats iterative vulnerability agents by 7 points.

LongCat-2.0 Review: China's Stealth Coder

Meituan's 1.6T open-source coding model secretly topped OpenRouter for two months before revealing itself - and the price-to-performance math is hard to argue with.

Agent Safety Gaps, Memory Learning, and Leaner Inference

Three new papers expose how production agent frameworks fail under attack, why RLVR training discards useful cross-episode signals, and how calibrated confidence cuts inference compute by 12x.

GPT-5.6 Sol Review: Strong Model, Thin Access

OpenAI's GPT-5.6 Sol tops Terminal-Bench 2.1 at 91.9% with its multi-agent Ultra mode, but reward-hacking findings and government-gated access keep it out of reach for nearly everyone.

LongCat-2.0

Meituan's 1.6T-parameter open-source MoE coding model, trained end-to-end on 50,000 domestic Chinese ASICs, with native 1M token context and a 59.5 SWE-bench Pro score.

Meituan's LongCat-2.0 Was Topping OpenRouter in Disguise

Meituan open-sources LongCat-2.0, a 1.6T MoE model trained on 50,000 Chinese ASICs that secretly topped OpenRouter under the alias Owl Alpha.

Science Agents, Jailbreak Defense, and Open-World Failures

Three papers from today's arXiv: graph-native RL generates traceable scientific hypotheses, HARC defeats jailbreaks by coupling internal safety directions, and ICML 2026's OpenAgent shows how distributional shift breaks tool-use agents.

Holo3-35B-A3B

H Company's open-weight sparse MoE vision-language model purpose-built for desktop computer use, scoring 82.6% on OSWorld-Verified with only 3B active parameters.

Best AI for Web Browsing and Computer Use - July 2026

Claude Fable 5 leads OSWorld-Verified at 85% after its 19-day US suspension ended July 1, while open-source Holo3 at 82.6% and Claude Sonnet 5 at $2/M tokens reshape the value calculus.

Agent Phase Collapse, Reasoning Exits, Preference Gaps

Three new arXiv papers map capability cliffs in agent world models, the narrow benefit of learned reasoning stops, and a 56% accuracy ceiling when agents help users build preferences.

Claude Sonnet 5 Review: Near-Opus at Half the Price

Anthropic's Sonnet 5 is the first mid-tier model that genuinely competes with Opus-class agents on coding and computer use, released June 30 at $2/$10 per million tokens.
