Articles Tagged "Reasoning"

Autonomous Research, Broken Reasoning, Smarter Agents

Three new papers: AlphaLab runs autonomous GPU research campaigns, open-weight reasoning models collapse under text reformatting, and HiL-Bench reveals agents can't decide when to ask for help.

Grok 4.20 Review: Four Minds Are Better Than One

xAI's Grok 4.20 replaces the single-model approach with four specialized agents that debate before every answer - a bold architectural bet that pays off in some areas and stumbles in others.

Arcee's Trinity-Large: 398B Open Reasoning at $0.90

Arcee AI ships Trinity-Large-Thinking, a 398B sparse MoE reasoning model under Apache 2.0 that hits 91.9% on PinchBench for $0.85 per million output tokens on OpenRouter.

Clinical AI Harm, Smarter Reasoning, and Safer Agents

Three papers: AI safety measures withhold critical clinical guidance from patients, SAT cuts reasoning tokens by 40%, and conformal prediction blocks wrong multi-agent consensus.

EXAONE 4.5: LG's Open VLM Beats GPT-5-mini on STEM

LG AI Research released EXAONE 4.5, a 33B open-weight vision-language model that posts higher STEM scores than GPT-5-mini and Claude 4.5 Sonnet - but a non-commercial license caps its real-world reach.

Blind Refusal, Broken Steps, and Free Uncertainty

Three papers expose safety training's moral blind spot, two distinct failure modes inside reasoning models, and a 10x cheaper way to know when a reasoning model is guessing.

Muse Spark

Meta's first closed-source frontier model scores 52 on the Artificial Analysis Intelligence Index, leads on HealthBench Hard, and ships free at meta.ai - but has no public API yet.

Meta Muse Spark Launches, Ranks 4th Among Frontier Models

Meta Superintelligence Labs releases Muse Spark, its first AI model built from scratch in nine months, landing 4th on the Artificial Analysis Intelligence Index.

MedGemma 1.5, Smarter MCTS, and Auditing AI Agents

Google's MedGemma 1.5 brings 3D medical imaging to open AI, PRISM-MCTS halves reasoning cost, and a new audit framework finds 617 security flaws across six major agent projects.

Google Gemma 4

Gemma 4 is Google DeepMind's most capable open model family: four variants from 2B to 31B, Apache 2.0 license, multimodal across text/image/video/audio, and the 31B Dense ranking #3 on Chatbot Arena against all open-weight models globally.

Decisions Before Thinking, Smaller RL Models, Agent Collusion

Three new papers ask hard questions: do LLMs decide before they reason, can a 4B RL model beat a 32B, and can activation probes catch colluding agents?

Grok 4.20

Grok 4.20 is xAI's current flagship LLM with a 2M-token context window, native multi-agent mode, and reasoning toggle at $2.00/M input tokens.

← Previous