Articles Tagged "Benchmarks"

Anthropic Releases Fable 5 Despite Its Own AI Safety Warning

Anthropic opens Mythos-class capabilities to the public with Claude Fable 5 at $10/$50 per million tokens, days after calling for a global AI pause.

MCP Exploit Risk, Sycophancy Scores, and Agent Self-Harm

New research reveals MCP error messages triple agent attack success rates, ranks eight models on sycophancy with Claude scoring best, and finds self-evolving agents make 30-42% false edits.

MAI-Code-1-Flash

Microsoft's first in-house coding model, a 137B sparse MoE built natively for GitHub Copilot that beats Claude Haiku 4.5 on SWE-bench Pro by 16 points.

Claude Opus 4.8 Leads SWE-bench Pro, Adds Parallel Agents

Anthropic's Claude Opus 4.8 scores 69.2% on SWE-bench Pro and ships hundreds of parallel subagents in Claude Code, with pricing unchanged at $5 per million input tokens.

Safety Evals Break Under Attack, Agents Work 87% Faster

Three papers: strategic attack timing exposes gaps in AI control evaluations, Perplexity's agents slash task time by 87%, and Lean4 formal proofs make agent workflows more reliable.

Ministral 3 8B

Mistral AI's mid-tier open-weight edge model - 8B parameters, 256K context, Apache 2.0 license, built for agentic pipelines and cost-sensitive production workloads.

Devstral 2

Mistral's open-weight coding agent model - 123B parameters, 256K context window, 72.2% on SWE-bench Verified, priced at $0.40/M input tokens.

Grok Build 0.1

Grok Build 0.1 is xAI's first model built specifically for agentic coding workflows, with a 256K context window, native MCP support, and always-on reasoning at $1/M input tokens.

Ministral 3 14B

Mistral AI's largest Ministral 3 model - 14B parameters, 256K context, Apache 2.0 license, multimodal, built for local deployment and agentic workflows.

GPT-Rosalind Review: The Gated Drug Discovery Model

OpenAI's life sciences reasoning model gets a June update with global access and new NGS plugins - strong benchmarks, but still locked behind a Trusted Access Program with no public pricing.

MiniMax M3 Makes 1M Context Viable With Sparse Attention

MiniMax M3 uses sparse attention to cut long-context inference cost 20x, topping GPT-5.5 on coding benchmarks at a fraction of the price.

NVIDIA Nemotron 3 Ultra 550B-A55B

NVIDIA's 550B open-weight MoE model with 55B active parameters, hybrid Mamba-Transformer architecture, and 1M token context - the top-scoring US open model on the Artificial Analysis Intelligence Index.
