Vibe Coding Is a Security Catastrophe: 69 Vulnerabilities Found Across 5 Major AI Coding Tools
A systematic security audit of Claude Code, Codex, Cursor, Replit, and Devin found 69 vulnerabilities in 15 test applications - zero CSRF protection, zero security headers, and SSRF in every single tool.

TL;DR
- Security firm Tenzai tested 5 AI coding tools by building 3 identical apps each - found 69 vulnerabilities across all 15 apps
- Every single tool introduced Server-Side Request Forgery (SSRF) vulnerabilities. Zero apps implemented CSRF protection. Zero apps set security headers
- Carnegie Mellon found that 61% of AI-generated code is functionally correct but only 10.5% is secure
- Escape.tech discovered 2,000+ vulnerabilities and 400+ exposed secrets in 5,600 publicly deployed vibe-coded applications
The numbers are in, and they are worse than the pessimists predicted. A December 2025 study by security startup Tenzai systematically tested five of the most popular AI coding tools - Claude Code, OpenAI Codex, Cursor, Replit, and Devin - by having each build three identical web applications from pre-defined prompts. The result: 69 vulnerabilities across 15 applications, with patterns so consistent they suggest the problem is structural, not incidental.
The Tenzai Audit
Researcher Ori David designed the test to isolate each tool's security baseline. Same applications, same prompts, different agents. The breakdown:
| Agent | Total Vulnerabilities | Critical |
|---|---|---|
| Claude Code | 16 | 4 |
| OpenAI Codex | 13 | 1 |
| Cursor | 13 | 0 |
| Replit | 13 | 0 |
| Devin | 14 | 1 |
What They Got Right
Credit where it is due: none of the tools produced exploitable SQL injection or cross-site scripting in the traditional sense. They consistently used parameterized queries and relied on framework-level sanitization. The "solved" vulnerability classes - the ones with generic, pattern-based defenses - are genuinely handled well.
What They Got Catastrophically Wrong
The failures clustered in three areas that share a common trait: they require contextual understanding that AI does not have.
Authorization logic. The most common failure. Codex skipped validation for non-shopper roles entirely. Claude Code enforced permission checks only for logged-in users, so unauthenticated requests bypassed them altogether, enabling unrestricted product deletion.
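The Claude Code bug follows a pattern worth seeing concretely. A minimal Python sketch of that failure mode (all names hypothetical; this is not Tenzai's test code):

```python
from dataclasses import dataclass

@dataclass
class User:
    role: str

def delete_product_vulnerable(user, product_id, db):
    # Role is checked only when a user object exists...
    if user is not None and user.role != "admin":
        raise PermissionError("admins only")
    # ...so an anonymous request (user=None) falls through and deletes.
    db.discard(product_id)

def delete_product_fixed(user, product_id, db):
    # Deny by default: require an authenticated admin before acting.
    if user is None or user.role != "admin":
        raise PermissionError("admins only")
    db.discard(product_id)
```

The fix is a one-line inversion, from allow-by-default to deny-by-default, which is exactly the kind of contextual judgment the study found agents lack.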
Business logic. Four of five agents allowed negative order quantities. Three allowed negative product prices. These are not obscure edge cases - they are the first thing a human QA tester checks.
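These checks are one line each, which is what makes their absence notable. A hedged sketch of the missing validation (hypothetical names, not the audited code):

```python
def add_to_cart_vulnerable(cart, product, quantity, price):
    # No bounds check: quantity=-3 produces a negative line total,
    # effectively crediting the attacker at checkout.
    cart.append({"product": product, "total": quantity * price})

def add_to_cart_fixed(cart, product, quantity, price):
    # The first thing a human QA tester would try: non-positive values.
    if quantity <= 0 or price <= 0:
        raise ValueError("quantity and price must be positive")
    cart.append({"product": product, "total": quantity * price})
```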
Server-Side Request Forgery. All five agents introduced SSRF in a URL preview feature, letting attackers coerce the server into fetching arbitrary internal URLs - reaching internal services, bypassing firewalls, and leaking credentials. Five out of five. One hundred percent.
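A standard mitigation is to resolve the preview target and refuse anything that is not publicly routable. A stdlib-only sketch, with the caveat that a production defense must also pin the resolved IP for the actual fetch (DNS rebinding) and cap redirects:

```python
import ipaddress
import socket
from urllib.parse import urlparse

def is_safe_preview_url(url: str) -> bool:
    """Return True only if every resolved address is publicly routable."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        infos = socket.getaddrinfo(parsed.hostname, None)
    except socket.gaierror:
        return False
    for *_, sockaddr in infos:
        addr = ipaddress.ip_address(sockaddr[0])
        # Rejects loopback, RFC 1918 ranges, and link-local addresses
        # such as the cloud metadata endpoint 169.254.169.254.
        if not addr.is_global:
            return False
    return True
```

None of the five agents generated anything resembling this check unprompted.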
The Missing Basics
The "ugly" category is arguably worse than the vulnerabilities themselves:
- CSRF protection: 0 of 15 apps implemented it (2 attempted, both failed)
- Security headers: 0 of 15 apps set CSP, X-Frame-Options, HSTS, X-Content-Type-Options, or proper CORS
- Rate limiting: 1 of 15 apps attempted it - and the implementation was bypassable via the X-Forwarded-For header
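The X-Forwarded-For bypass is worth spelling out: if the limiter keys requests on a client-supplied header, the attacker simply mints a new identity per request. A minimal sketch (hypothetical code, not the audited implementation):

```python
from collections import Counter

class RateLimiter:
    """Fixed-window request counter keyed by caller identity (sketch only)."""
    def __init__(self, limit):
        self.limit = limit
        self.counts = Counter()

    def allow(self, key):
        self.counts[key] += 1
        return self.counts[key] <= self.limit

def client_key_vulnerable(headers, remote_addr):
    # Trusting X-Forwarded-For lets the attacker choose their own key:
    # a fresh spoofed value per request resets the count every time.
    return headers.get("X-Forwarded-For", remote_addr)

def client_key_fixed(headers, remote_addr):
    # Key on the transport-level peer address (or on the X-Forwarded-For
    # entry appended by a proxy you control), never on a raw client header.
    return remote_addr
```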
"Coding agents cannot be trusted to design secure applications," Tenzai concluded. "They seem to be very prone to business logic vulnerabilities. While human developers bring intuitive understanding that helps them grasp how workflows should operate, agents lack this 'common sense.'"
The researchers also tested whether security-focused prompts could fix the problem. They added explicit vulnerability warnings and risk identification instructions. The result: "minimal vulnerability reduction."
The Broader Data
Tenzai's study is not an outlier. Multiple independent assessments have converged on the same conclusion.
Carnegie Mellon: 61% Correct, 10.5% Secure
The SusVibes benchmark from Carnegie Mellon tested SWE-Agent with Claude 4 Sonnet on 200 real-world feature-request tasks. The finding: 61% of solutions were functionally correct, but only 10.5% were secure. Even augmenting prompts with explicit vulnerability hints could not close that gap.
Veracode: 45% Vulnerability Rate
The 2025 GenAI Code Security Report tested 80 coding tasks across 100+ LLMs in four languages. AI introduced OWASP Top 10 vulnerabilities in 45% of cases. Java had it worst at over 70%. CWE-80 (cross-site scripting) showed failure rates of 86%, with no improvement even in the latest models including GPT-5.
CodeRabbit: AI Code Introduces 1.7x More Issues
Analysis of 470 GitHub PRs (320 AI-co-authored, 150 human-only) found AI-generated code produces:
- 2.74x more XSS vulnerabilities
- 1.91x more insecure object references
- 1.88x more improper password handling
- 8x more excessive I/O operations
- 3x more readability problems
Escape.tech: 2,000+ Vulns in the Wild
The most alarming data comes from Escape.tech, which scanned 5,600 publicly available applications built on vibe coding platforms (Lovable, Base44, Create.xyz, Vibe Studio, Bolt.new). They found:
- 2,000+ vulnerabilities
- 400+ exposed secrets (API keys, tokens)
- 175 instances of PII exposure including medical records, IBANs, and phone numbers
- Exposed authentication tokens in JavaScript bundles
- Misconfigured Row-Level Security policies in Supabase
The researchers described their results as "lower-bound estimates" because they used intentionally conservative passive scanning.
The AI IDE Vulnerability Crisis
The tools themselves are not just generating insecure code - they are insecure. Security researcher Ari Marzouk disclosed 30+ vulnerabilities across 24 CVEs in the AI coding tools developers use daily:
| CVE | Tool | Severity | Issue |
|---|---|---|---|
| CVE-2025-54135 | Cursor | 8.6 | Auto-executes MCP config changes even when user rejects suggestion |
| CVE-2025-55284 | Claude Code | High | DNS exfiltration via prompt injection reads .env files |
| SpAIware | Windsurf | High | Memory-persistent data exfiltration survives across sessions |
| IDEsaster | 12 tools | Multiple | JSON schema exfiltration, config-based RCE, workspace overrides |
The Cursor vulnerability (CurXecute) is particularly striking: when the agent suggests an edit to ~/.cursor/mcp.json, the edit lands on disk and triggers command execution even if the user rejects the suggestion in the UI. A malicious Slack message, when summarized by Cursor's AI, was demonstrated to rewrite MCP config files and execute arbitrary commands with developer privileges within minutes.
What the Data Does Not Tell You
These studies test default behavior - what happens when a developer prompts an AI tool without explicitly requesting secure code. Databricks' AI Red Team found that self-reflection prompts can improve security by 60-80% for Claude and up to 50% for GPT-4o. The tools can find their own vulnerabilities when asked.
But that is precisely the problem vibe coding was supposed to solve. The entire premise is that developers - or non-developers - can describe what they want and get working software. Requiring them to also know which security prompts to add defeats the purpose.
As Palo Alto Networks Unit 42 put it: "AI agents are optimized to provide a working answer, fast. They are not inherently optimized to ask critical security questions."
The data is unambiguous. AI coding tools produce functionally correct software at unprecedented speed. They also produce software riddled with authorization flaws, missing security controls, and business logic errors that no human developer would ship. The 69 vulnerabilities in Tenzai's study are not bugs to be fixed in the next model release. They are a structural consequence of tools that optimize for "does it work?" while ignoring "is it safe?" Until the incentive structure changes - or security becomes a native part of the generation pipeline rather than an afterthought - every vibe-coded application is a penetration tester's dream.
Sources:
- Output from vibe coding tools prone to critical security flaws, study finds - CSO Online
- Passing the Security Vibe Check: The Dangers of Vibe Coding - Databricks
- Securing Vibe Coding Tools: Scaling Productivity Without Scaling Risk - Unit 42
- Vibe coding could cause catastrophic 'explosions' in 2026 - The New Stack
- Is Vibe Coding Safe? - arXiv (Carnegie Mellon)
- Veracode 2025 GenAI Code Security Report
- Security risks of vibe coding and LLM assistants - Kaspersky