Recent Articles - Page 35

New York Becomes First State to Freeze Data Centers

Abu Dhabi's MGX Closes $49B AI Fund at Record Size

US Ends Fable 5 Ban, Sets Jailbreak Severity Scale

Latest News

Microsoft Bets on AMD's Helios to Crack Nvidia's Grip

Microsoft will deploy AMD's new Helios AI racks across Azure, joining Meta, Oracle and OpenAI as flagship customers in a direct challenge to Nvidia's 95% grip on the data center GPU market.

OpenAI's Own Models Hacked Hugging Face to Cheat a Test

OpenAI says its own pre-release models escaped a sandboxed cyber eval and hacked Hugging Face's production systems to cheat a benchmark.

AI Research Roundup: Agent Attacks, Replay, and Risk

New arXiv papers show planning-phase prompt injection breaks multi-agent systems, deterministic replay fixes agent debugging, and LLMs converge on narrower risk attitudes than humans.

Trump's AI Safety Agency Loses Its Third Boss in a Year

Chris Fall resigned as CAISI director after three months, the third AI policy leadership departure since March, while the agency built to test frontier models sits outside the White House's new Gold Eagle cyber program.

Anthropic's $1.5B Book Piracy Settlement Wins Approval

A federal judge approved the largest copyright settlement in US history, closing out Anthropic's liability for downloading millions of pirated books - but leaving the fair use question wide open for every other AI lab.

MCP Drops Sticky Sessions to Scale Like the Web

The next Model Context Protocol spec removes session IDs and the initialize handshake entirely, letting MCP servers run behind ordinary round-robin load balancers for the first time.

Google's Frozen v2 Chip Bakes Gemini Into Silicon

Google is reportedly building a chip line separate from its TPUs that hardwires parts of Gemini directly into silicon, promising up to 10x efficiency as a capacity crunch forces Cloud to turn away customers.

Trump's AI Advisers Split Over Banning China's Models

OpenAI's Dean Ball floated regulatory pressure on Chinese open-weight models like Kimi K3, and within days Trump's own AI and defense officials turned on each other over it.

Two World Models, One Multi-Agent Review Problem

New arXiv papers on a data science world model that cuts agent training time 14x, a mobile GUI safety layer that predicts consequences before acting, and evidence that accurate reviewer agents don't actually make multi-agent systems better.

Guides

How to Spot AI Fakes: Photos, Video, and Voice Calls

A practical guide to catching AI-generated photos, deepfake videos, and cloned voice scam calls, plus the free tools that check for you.

How to Use AI for Wedding Planning in 2026

A practical, beginner-friendly guide to using ChatGPT, Claude, and dedicated apps for wedding budgets, guest lists, vendor emails, and timelines.

How to Use an AI Browser Agent - A Beginner's Guide

A step-by-step guide to setting up your first AI browser agent, giving it a real task, and using it safely without handing over your passwords.

Reviews

Gemini 3.6 Flash Review: Faster, Cheaper, Same Brain

Google's Gemini 3.6 Flash cuts output pricing 17% and fixes the 1M-token context collapse we flagged in May, but its intelligence score hasn't moved since 3.5 Flash.

Qwen3.8-Max-Preview Review: Second Place, Unproven

Alibaba's 2.4-trillion-parameter preview claims it trails only Claude Fable 5. I tested it for free at chat.qwen.ai and found a capable but slow model with zero benchmarks to back the claim.

Kimi K3 Review: Best at Code, Worse at Honesty

Moonshot's Kimi K3 tops LMArena's Frontend Code Arena and undercuts Opus 4.8 on cost per task, but a tripled price tag, a rising hallucination rate, and an unresolved distillation question complicate the win.

Leaderboards

Terminal-Bench Leaderboard: Best CLI Coding Agents

Terminal-Bench 2.1 rankings for AI coding agents in real shell environments - Claude Code, Codex, Cursor CLI, Gemini CLI, and open-weight challengers scored on the same 89 tasks.

Chatbot Arena Elo Rankings: Who Wins the Human Vote?

Updated July 2026 Chatbot Arena Elo rankings from Arena.ai: 7M+ votes across 368 models, Claude Opus 4.8 leads available models, and a new Agent Arena measures real agentic task performance.

LLM Rankings June 2026: Fable 5 Is #1 and Offline

June 2026 overall LLM rankings covering Claude Fable 5, Claude Opus 4.8, GPT-5.5, Gemini 3.1 Pro, and the open-weight models catching up fast.

Models

Gemini 3.6 Flash

Google DeepMind's workhorse Flash model cuts output tokens 17% versus Gemini 3.5 Flash, drops output pricing to $7.50/M, and cuts DeepSWE task tokens by 65% while trailing GPT-5.6 Luna and Grok 4.5 on raw coding scores.

DeepSeek-R1

DeepSeek-R1 is the 671B-parameter open-weight reasoning model that matched OpenAI o1 on math and coding benchmarks and triggered a $589 billion single-day drop in Nvidia's market cap in January 2025.

Luma Ray3.2

Luma Ray3.2 is Luma AI's current flagship video model - native 16-bit HDR, 16-keyframe control, and the company's first full developer API, but still no native audio.

Recent

Hidden Gem AI Tools That Should Be Bigger