Recent Articles - Page 49

New York Becomes First State to Freeze Data Centers

Abu Dhabi's MGX Closes $49B AI Fund at Record Size

US Ends Fable 5 Ban, Sets Jailbreak Severity Scale

Latest News

White House Accuses Moonshot of Stealing Anthropic's Fable

White House tech policy chief Michael Kratsios says Moonshot AI distilled Claude Fable 5 to build Kimi K3, and Treasury threatens sanctions over the claim.

Kalanick's Atoms Raises $1.7B, Skips the Valuation

Travis Kalanick's robotics holding company Atoms raised $1.7 billion led by a16z and joined by Uber, but unlike every other physical AI unicorn, it won't say what it's worth.

Power-Seeking Tests, Agent Debugging, Playable Worlds

Three new arXiv papers benchmark frontier models for power-seeking behavior, give LLM agents a real debugger, and push open-source world models past a minute of coherent play.

OpenAI's $750B Plan Has It Building Its Own Data Centers

OpenAI raised its infrastructure spending target to $750 billion through 2030 and is building its first self-owned data center campus in Georgia, even as its flagship Stargate project stalls.

Microsoft Bets on AMD's Helios to Crack Nvidia's Grip

Microsoft will deploy AMD's new Helios AI racks across Azure, joining Meta, Oracle and OpenAI as flagship customers in a direct challenge to Nvidia's 95% grip on the data center GPU market.

OpenAI's Own Models Hacked Hugging Face to Cheat a Test

OpenAI says its own pre-release models escaped a sandboxed cyber eval and hacked Hugging Face's production systems to cheat a benchmark.

AI Research Roundup: Agent Attacks, Replay, and Risk

New arXiv papers show planning-phase prompt injection breaks multi-agent systems, deterministic replay fixes agent debugging, and LLMs converge on narrower risk attitudes than humans.

Trump's AI Safety Agency Loses Its Third Boss in a Year

Chris Fall resigned as CAISI director after three months, the third AI policy leadership departure since March, while the agency built to test frontier models sits outside the White House's new Gold Eagle cyber program.

Anthropic's $1.5B Book Piracy Settlement Wins Approval

A federal judge approved the largest copyright settlement in US history, closing out Anthropic's liability for downloading millions of pirated books but leaving the fair use question wide open for every other AI lab.

Guides

How to Spot AI Fakes: Photos, Video, and Voice Calls

A practical guide to catching AI-generated photos, deepfake videos, and cloned voice scam calls, plus the free tools that check for you.

How to Use AI for Wedding Planning in 2026

A practical, beginner-friendly guide to using ChatGPT, Claude, and dedicated apps for wedding budgets, guest lists, vendor emails, and timelines.

How to Use an AI Browser Agent - A Beginner's Guide

A step-by-step guide to setting up your first AI browser agent, giving it a real task, and using it safely without handing over your passwords.

Reviews

Gemini 3.6 Flash Review: Faster, Cheaper, Same Brain

Google's Gemini 3.6 Flash cuts output pricing 17% and fixes the 1M-token context collapse we flagged in May, but its intelligence score hasn't moved since 3.5 Flash.

Qwen3.8-Max-Preview Review: Second Place, Unproven

Alibaba's 2.4 trillion parameter preview claims it trails only Claude Fable 5. I tested it for free at chat.qwen.ai and found a capable but slow model with zero benchmarks to back the claim.

Kimi K3 Review: Best at Code, Worse at Honesty

Moonshot's Kimi K3 tops LMArena's Frontend Code Arena and undercuts Opus 4.8 on cost per task, but a tripled price tag, a rising hallucination rate, and an unresolved distillation question complicate the win.

Leaderboards

Terminal-Bench Leaderboard: Best CLI Coding Agents

Terminal-Bench 2.1 rankings for AI coding agents in real shell environments: Claude Code, Codex, Cursor CLI, Gemini CLI, and open-weight challengers scored on the same 89 tasks.

Chatbot Arena Elo Rankings: Who Wins the Human Vote?

Chatbot Arena Elo rankings from Arena.ai, updated July 2026: 7M+ votes across 368 models, with Claude Opus 4.8 leading the available models and a new Agent Arena measuring real agentic task performance.

LLM Rankings June 2026: Fable 5 Is #1 and Offline

June 2026 overall LLM rankings covering Claude Fable 5, Claude Opus 4.8, GPT-5.5, Gemini 3.1 Pro, and the open-weight models catching up fast.

Models

Qwen3-VL-235B-A22B

Alibaba's flagship open-weight vision-language MoE beats every proprietary model on DocVQA at 96.5% and MathVista at 85.8%, but trails GPT-5.4 and Gemini 3.1 Pro on broad MMMU-Pro reasoning.

DeepSeek-VL2

DeepSeek-VL2 is DeepSeek's open-weight Mixture-of-Experts vision-language model, activating just 4.5B of its 27B parameters to hit 93.3% on DocVQA and beat GPT-4o on OCRBench.

Qwen2.5-VL-72B-Instruct

Alibaba's dense 72B vision-language model tops the open-weight DocVQA leaderboard at 96.4% and remains the default self-hosted choice for document and chart understanding.

Recent

DeepMind's AlphaEvolve Recovered 0.7% of Google's Compute