Articles Tagged "Reasoning"

Claude Opus 4.8 Review: Reliability Over Raw Scores

Claude Opus 4.8 sets new highs on SWE-bench Pro and long-context tasks, but a 4x improvement in code flaw detection may matter more than any benchmark number.

Reasoning Leaks, Hard Limits, and Self-Aware LLMs

Three new papers expose how reasoning traces can be extracted from supposedly hidden model internals, where chain-of-thought hits an architectural ceiling, and how RL teaches models to know when to quit.

Cut CoT Costs, Fix Agent Memory, Test Clinical AI

Three papers: smarter CoT trimming cuts reasoning length by 50%, a plug-in context manager rescues frozen agents on long tasks, and a 960K-item clinical benchmark exposes LLM gaps in hospitals.

Reasoning Capitulation, Faster Guardrails, Curation Risk

Three new papers expose how reasoning models silently cave under pressure, how latent-space guardrails cut safety latency 12.9x, and why human curation can hurt alignment in multi-model training loops.

Claude Opus 4.8

Anthropic's May 2026 flagship model delivers 69.2% on SWE-bench Pro, dynamic parallel workflows in research preview, and Effort Control - all at $5/$25 pricing.

Smarter Trees, Hidden Attacks, Drug Design Gaps

Three new papers cover 4x KV cache savings for tree reasoning, latent-space jailbreaks that bypass safety on 15 models, and GPT-5.4's 40% ceiling on drug design tasks.

OpenAI Disproves 80-Year Erdős Math Conjecture

An internal OpenAI reasoning model produced an original proof disproving the Erdős unit distance conjecture, the first time an AI has autonomously solved a major open problem in mathematics.

Fix 8% of Tokens, Dodge Memory Attacks, Cut Agent Costs

New research pinpoints the 8% of tokens driving reasoning failures, exposes memory laundering in agent systems, and cuts web agent inference costs 1.9x.

SU-01

SU-01 is a 30B-A3B MoE reasoning model from Shanghai AI Lab that achieves gold-medal performance on IMO 2025, USAMO 2026, and IPhO 2024/2025 using a three-stage training recipe and test-time scaling.

Olympiad Gold, Broken Memories, and Attention Loss

A 30B model earns IMO gold, memory consolidation silently corrupts agents, and a new metric predicts when LLMs lose track of their instructions.

Reasoning Bias, Behavior Cues, and Tool Interpretability

New research shows reasoning length amplifies position bias, behavior cues cut wasted tokens by 50% while boosting safety, and sparse autoencoders can predict tool failures from model internals.

Reasoning Model API Pricing Compared - May 2026

Updated May 2026: DeepSeek V4-Flash reasoning now $0.28/MTok output (8x cheaper than R1), o3-pro launched at $20/$80, Grok 4 retires May 15 - verified pricing across 11 models.
