Articles Tagged "Prompt Injection"

Google AI Overviews Treat 'Disregard' as a Command

Google's new AI Overviews respond to words like 'disregard,' 'ignore,' and 'dismiss' as LLM instructions rather than vocabulary queries, leaving users with blank search results.

AI Security Research and Incident Coverage

Tracking AI supply-chain attacks, agent exploits, prompt injection, model leaks, and the real-world incidents shaping AI security today.

Unsafe Agents, Rising AI Tides, and Training Traps

Three new papers on agent prompt injection attack rates, MIT's broad-based AI automation finding, and a silent normalization-optimizer coupling failure in LLM training.

DeepMind Maps Six Attack Traps Targeting AI Agents

A Google DeepMind paper introduces the first systematic taxonomy of adversarial traps that can hijack autonomous AI agents - and every category already has working proof-of-concept exploits.

JBDistill Generates Its Own Jailbreaks - 81.8% Attack Rate

Johns Hopkins and Microsoft's JBDistill achieves 81.8% attack success rate across 13 LLMs by auto-generating fresh adversarial prompts on demand.

AI Safety Leaderboard: Refusal and Jailbreak Rankings

Rankings of AI models by safety metrics including refusal rates, jailbreak resistance, bias scores, and truthfulness across major benchmarks.

22 Bytes Poison ML Malware Detectors via Label Spoofing

EURECOM researchers show that injecting 22 to 55 bytes into benign Android apps tricks antivirus engines into mislabeling them, poisoning the ML training datasets that millions of researchers depend on.

OBLITERATUS Strips AI Safety From Open Models in Minutes

A new open-source toolkit called OBLITERATUS can surgically remove refusal mechanisms from 116 open-weight LLMs using abliteration - no fine-tuning, no training data, just geometry.

Perplexity's Comet Browser Can Leak Your Local Files

Zenity Labs found that a malicious calendar invite could hijack Perplexity's Comet browser into reading local files and exfiltrating their contents to an attacker-controlled server - no clicks required.

AI Models Can Now Jailbreak Other AI Models Autonomously - 97% Success Rate, No Human Involved

Researchers from Stuttgart and ELLIS Alicante gave four reasoning models a single instruction - 'jailbreak this AI' - and walked away. The models planned their own attacks, adapted in real time, and broke through safety guardrails 97.14% of the time across 9 target models.

Hacker Jailbroke Claude to Steal 150GB of Mexican Government Data

An unknown attacker used over 1,000 prompts to jailbreak Anthropic's Claude, generating exploit code that breached six Mexican government agencies and exfiltrated 195 million taxpayer records.

RoguePilot - How a Hidden Comment in a GitHub Issue Could Steal Your Entire Repository

Orca Security reveals RoguePilot, a supply chain attack that weaponizes GitHub Issues to hijack Copilot in Codespaces and exfiltrate repository tokens.