Articles Tagged "AI Safety"

New York's RAISE Act Is Law - AI Labs Have Until 2027

New York's RAISE Act is now on the books, requiring frontier AI developers to publish safety protocols, report incidents within 72 hours, and submit to annual audits by January 2027.

Jensen Huang Says AGI Is Here - The Evidence

Nvidia's CEO told Lex Fridman he thinks AGI has been achieved. We checked the claim against its own definition, the research consensus, and what billions of dollars in legal agreements actually say.

OpenAI Foundation Names Leaders, Pledges $1B

OpenAI's nonprofit arm announced a $1 billion grant commitment for 2026, hired a full leadership team including co-founder Wojciech Zaremba, and outlined four focus areas from disease research to children's mental health.

Interpretability Limits, Dark Models, Persona Traps

Three new papers expose a gap between what AI models know and what they do - and why that gap is harder to close than anyone assumed.

Anthropic's 81K Study: AI Hopes, Fears, and the Gap

Anthropic's largest qualitative study, covering 80,508 users across 159 countries, reveals the gap between what people hope AI will do and what it actually delivers.

Multi-Agent Constitution, Sleeper Defense, Skill RL

Three new arXiv papers tackle constitutional AI rule learning, sleeper agent defense for multi-agent pipelines, and skill-evolving reinforcement learning for math reasoning.

Linux Foundation Raises $12.5M Against AI Bug Slop

Seven AI and cloud companies pool $12.5M through OpenSSF and Alpha-Omega to build tools that help open-source maintainers cope with a flood of AI-generated vulnerability reports they can't triage.

Enterprise Agents Stall, Safety Gates, Smarter Tool Use

New research shows enterprise AI agents top out at 37.4% success, a deterministic safety gate beats commercial solutions, and an ICLR 2026 paper cuts RL compute by 81%.

AI Models Are Gaming Safety Evaluations, Report Warns

The International AI Safety Report 2026, led by Yoshua Bengio with 100+ experts from 30+ countries, finds frontier models increasingly detect test conditions and behave differently in real deployment - undermining pre-deployment safety evaluation.

JBDistill Generates Its Own Jailbreaks - 81.8% Attack Rate

Johns Hopkins and Microsoft's JBDistill achieves an 81.8% attack success rate across 13 LLMs by auto-generating fresh adversarial prompts on demand.

Reasoning Traps, LLM Chaos, and Steering Curves

Three papers this week: why better reasoning creates safety risks, why multi-agent systems behave chaotically even at zero temperature, and why straight-line activation steering is broken.

Anthropic Launches Institute as Powerful AI Looms

Anthropic has consolidated its red team, societal impacts, and economic research teams into a new body called the Anthropic Institute, warning that extremely powerful AI is arriving faster than most expect.
