Articles Tagged "Agentic AI"

GLM-5.2 Review: Best Open-Weight Coder at 1/6 Cost

Z.ai's GLM-5.2 delivers frontier coding performance with open weights and an MIT license at roughly one-sixth the cost of GPT-5.5 - but can it replace Claude Opus 4.8?

Refusal Gaps, Prompt Bleed, and Scaling's Logic Limit

Three new papers reveal how LLM safety hinges on persona training, how prompt modules interfere in deployed agents, and why scaling alone cannot reach symbolic reasoning.

General Intuition Raises $320M to Train AI on Gameplay

General Intuition raised $320 million at a $2.3 billion valuation on the bet that billions of hours of video game footage - with action labels - can train AI agents that operate in the real world.

North Mini Code

Cohere's first developer-focused model - 30B sparse MoE with 3B active parameters, free Apache 2.0 license, 256K context window, and 33.4 on the AA Coding Index.

Sakana Fugu

Sakana AI's orchestrator model that dynamically coordinates Claude Opus 4.8, GPT-5.5, and Gemini 3.1 Pro to beat each of them individually on SWE-Bench Pro, GPQA-Diamond, and eight other benchmarks.

Fara-1.5

Microsoft Research's family of open-weight browser computer-use agents (4B, 9B, 27B) that beat OpenAI Operator and Gemini 2.5 Computer Use on Online-Mind2Web.

ERNIE 5.1

Baidu's ERNIE 5.1 is an 800B-parameter, text-focused MoE model that claims the top Chinese model slot on LMArena and was built at 6% of comparable training costs.

OpenAI Catches Hidden Misalignment with Deployment Replay

OpenAI's Deployment Simulation replays 1.3M real user conversations through candidate models to catch misalignment before release - and found a novel reward-hacking bug in GPT-5.1.

Mistral Medium 3.5 Review: Open Agent, Sharp Teeth

Mistral's 128B open-weight model consolidates reasoning, coding, and vision into one checkpoint, with remote agents that file pull requests autonomously.

GPT-5.1

GPT-5.1 is OpenAI's November 2025 coding and agentic flagship with 400K context, configurable reasoning effort, and 76.3% on SWE-bench Verified.

Qwen3.7-Plus

Alibaba's first multimodal agent model, combining GUI grounding (ScreenSpot Pro 79.0), 1M-token context, and text-plus-vision input at $0.40/M tokens.

Salesforce Buys Fin for $3.6B to Boost Agentforce

Salesforce agrees to pay $3.6 billion for Fin, the AI customer service agent formerly known as Intercom, adding a proprietary model and 30,000 customers to Agentforce.
