Articles Tagged "AI Safety"

Bad Science, Poisoned Tools, and Aligned Reasoning

Three new papers show AI scientific agents skip evidence, tool-integrated agents are vulnerable to adversarial poisoning, and reasoning model safety can be fixed with 1,000 examples.

Leaner Reasoning, Fragile Agents, and Model Self-Audit

Three new papers tackle reasoning token waste, orchestration failures across 22 agent frameworks, and a method for teaching LLMs to describe their own learned behaviors.

GPT-5.4-Cyber

OpenAI's GPT-5.4-Cyber is a cyber-permissive fine-tune of GPT-5.4 Thinking that supports binary reverse engineering, scores 88.23% on professional CTFs, and is gated through the Trusted Access for Cyber program.

Distillation Leaks, Weak Agents, and Research Sabotage

New papers show distillation silently transfers unsafe behaviors, weak agents bottleneck multi-agent pipelines, and frontier AI can't reliably audit sabotaged ML research.

GitHub Bans Engineer Who Shipped 500 Agent PRs in 72 Hours

A Korean CTO ran a 13-step agent harness against 100+ major open-source repos over three days, landing 500+ commits and 130+ PRs - some merged by Kubernetes, Hugging Face, and Ollama maintainers. Then GitHub banned his account for spam, confirming that platform abuse detection cannot yet tell a disciplined harness from a bot.

Tesla Hid Thousands of Fatal Autopilot Incidents, RTS Says

Swiss broadcaster RTS reopens the 2023 Tesla Files leak in the context of the confirmed $243M Miami verdict. The combined record: 2,400+ concealed sudden-acceleration complaints, 1,000+ undisclosed crashes, and a federal court that found Tesla knew.

Stanford 2026 AI Index - Cash In, Transparency Out

Stanford's 2026 AI Index shows global investment hitting $581B in 2025, while foundation model transparency scores fell by a third as capabilities raced ahead of governance.

LLM Jailbreak and Red-Team Resistance Leaderboard

Rankings of 14 frontier LLMs by adversarial robustness - how well they resist jailbreaks, prompt injection, and harmful-behavior elicitation across HarmBench, AdvBench, StrongREJECT, JailbreakBench, and AgentHarm.

World ID 4.0 Brings Human Verification to Tinder and Zoom

Sam Altman's World project launched World ID 4.0 at a San Francisco event on April 17, signing Tinder, Zoom, DocuSign, and Okta as partners while introducing Agent Kit to authorize AI agents.

Claude Beat Human Alignment Researchers - Then Failed

Nine Claude Opus 4.6 agents outperformed human researchers on a core alignment benchmark, hitting 97% vs 23% in five days - then showed no statistically significant improvement in production.

Cal.com Closes Its Source Code, Blames AI Hackers

Cal.com moved its core codebase to a private repo after five years of open source, arguing AI tools make public code 5-10x easier to exploit. The community isn't buying it.

GPT-5.4-Cyber Review: Defensive AI, Controlled Access

OpenAI's GPT-5.4-Cyber is a fine-tuned defensive cybersecurity model with binary reverse engineering, lowered refusal thresholds, and restricted access through the Trusted Access for Cyber program.
