Articles Tagged "Benchmarks"

LongCat-2.0

Meituan's 1.6T-parameter open-source MoE coding model, trained end-to-end on 50,000 domestic Chinese ASICs, with native 1M token context and a 59.5 SWE-bench Pro score.

Meituan's LongCat-2.0 Was Topping OpenRouter in Disguise

Meituan open-sources LongCat-2.0, a 1.6T MoE model trained on 50,000 Chinese ASICs that secretly topped OpenRouter under the alias Owl Alpha.

Science Agents, Jailbreak Defense, and Open-World Failures

Three papers from today's arXiv: graph-native RL generates traceable scientific hypotheses, HARC defeats jailbreaks by coupling internal safety directions, and ICML 2026's OpenAgent shows how distributional shift breaks tool-use agents.

Holo3-35B-A3B

H Company's open-weight sparse MoE vision-language model purpose-built for desktop computer use, scoring 82.6% on OSWorld-Verified with only 3B active parameters.

Best AI for Web Browsing and Computer Use - July 2026

Claude Fable 5 leads OSWorld-Verified at 85% after its 19-day US suspension ended July 1, while the open-source Holo3 at 82.6% and Claude Sonnet 5 at $2/M tokens reshape the value calculus.

Agent Phase Collapse, Reasoning Exits, Preference Gaps

Three new arXiv papers map capability cliffs in agent world models, the narrow benefit of learned reasoning stops, and a 56% accuracy ceiling when agents help users build preferences.

Claude Sonnet 5 Review: Near-Opus at Half the Price

Anthropic's Sonnet 5 is the first mid-tier model that genuinely competes with Opus-class agents on coding and computer use, released June 30 at $2/$10 per million tokens.

Chatbot Arena Elo Rankings: Who Wins the Human Vote?

Updated July 2026 Chatbot Arena Elo rankings from Arena.ai: 7M+ votes across 368 models, Claude Opus 4.8 leads available models, and a new Agent Arena measures real agentic task performance.

Claude Sonnet 5

Anthropic's latest Sonnet-class model brings near-Opus coding performance to mid-tier pricing, with major agentic search and computer use gains over Sonnet 4.6.

Claude Sonnet 5 Is Anthropic's New Agentic Default

Anthropic's Claude Sonnet 5 becomes the default model across all plans, promising near-Opus agentic performance at a third less than Sonnet 4.6's standard price.

Grok 4.5 Enters Beta with Unverified Opus Claims

Elon Musk announced that Grok 4.5 is in private beta at SpaceX and Tesla, claiming it rivals Claude Opus. The only benchmarks cited, however, are internal ones run at Musk's own companies.

GLM-5.2 Review: Best Open-Weight Coder at 1/6 Cost

Z.ai's GLM-5.2 delivers frontier coding performance with open weights and an MIT license at roughly one-sixth the cost of GPT-5.5, but can it replace Claude Opus 4.8?