Articles Tagged "LLM"

Agent Phase Collapse, Reasoning Exits, Preference Gaps

Three new arXiv papers map capability cliffs in agent world models, the narrow benefit of learned reasoning stops, and a 56% accuracy ceiling when agents help users build preferences.

Agent Languages, Sampling Ceilings, and Abstention

Three new papers on agents inventing symbolic languages to cut reasoning tokens by 3-6x, sampling ceilings that waste inference compute, and context engineering to double agentic abstention rates.

Claude Sonnet 5

Anthropic's latest Sonnet-class model brings near-Opus coding performance to mid-tier pricing, with major agentic search and computer use gains over Sonnet 4.6.

Claude Sonnet 5 Is Anthropic's New Agentic Default

Anthropic's Claude Sonnet 5 becomes the default model across all plans, promising near-Opus agentic performance at a third less than Sonnet 4.6's standard price.

Tandem Training, World Models, and Efficient Agents

Three new arXiv papers on making RL reasoning legible across models, fixing broken world model latent states, and training small agents to beat their teachers.

Grok 4.5

Grok 4.5 is xAI's 1.5-trillion-parameter V9 model in private beta at SpaceX and Tesla, with supplemental training on Cursor coding data and early evals claiming performance near Claude Opus 4.8.

Refusal Gaps, Prompt Bleed, and Scaling's Logic Limit

Three new papers reveal how LLM safety hinges on persona training, how prompt modules interfere in deployed agents, and why scaling alone cannot reach symbolic reasoning.

Grok 4.3 Review: xAI Bets on Price Over Prestige

Grok 4.3 slashes prices by up to 83%, adds native video input and voice cloning, and carves out a credible position as the most cost-efficient frontier model, with real caveats on coding and latency.

Quantization's Hidden Tax, Cliff Tokens, Smarter Memory

Three new arXiv papers reveal hidden costs in quantized reasoning models, single-token failure triggers, and a new framework that cuts agent memory errors by up to 79%.

AI Diagnosis, Cache Efficiency, and Agent Security

Three papers from today's arXiv: a 32B medical model beats DeepSeek-R1 in rare disease diagnosis, a KV cache method keeps 97% accuracy with 3% memory, and a new benchmark red-teams agentic AI systems.

Best LLM Gateways 2026 - LiteLLM, Portkey, and 5 More

A hands-on comparison of seven LLM gateway and routing tools - LiteLLM, Portkey, Helicone, OpenRouter, Martian, Cloudflare AI Gateway, and Bifrost.

ERNIE 5.1

Baidu's ERNIE 5.1 is a text-focused MoE model with 800B parameters that claims the top Chinese model slot on LMArena and was built at 6% of comparable training costs.