OpenAI Launches Codex Security, 14 Days After Anthropic

OpenAI launches Codex Security in research preview, scanning 1.2M commits and finding 11,353 critical and high-severity vulnerabilities. The AI vulnerability arms race is officially on.

TL;DR

  • OpenAI launches Codex Security in research preview on March 6, exactly 14 days after Anthropic's Claude Code Security announcement
  • During beta, scanned 1.2 million commits and flagged 792 critical plus 10,561 high-severity vulnerabilities
  • Earned 14 CVEs across OpenSSH, GnuTLS, GOGS, Chromium, PHP, libssh, and gpg-agent
  • Available to ChatGPT Enterprise, Business, and Edu customers - free for the first month
  • Open-source maintainers get free access through the new Codex for OSS program

Fourteen days. That's how long it took for OpenAI to respond after Anthropic's Claude Code Security sent cybersecurity stocks into freefall and forced the industry to confront what AI vulnerability scanning actually looks like at scale. On March 6, OpenAI launched Codex Security in research preview - and brought receipts.

The tool, which evolved from the Aardvark private beta that launched last October, scanned 1.2 million commits across external repositories during its beta period. It surfaced 792 critical-severity and 10,561 high-severity findings. OpenAI also disclosed 14 CVEs across major open-source projects, including GnuTLS, Chromium, PHP, and OpenSSH.

How Codex Security Works

The system operates in three stages, each designed to reduce the noise that plagues traditional static analysis.

Stage 1: Threat modeling. Codex Security clones a repository into an isolated container and analyzes the project's architecture - file structure, trust boundaries, authentication flows, data handling patterns. It produces an editable threat model in natural language that describes what the system does and where it's most exposed. Teams can review and adjust this model before scanning begins.

Stage 2: Context-aware scanning. Using the threat model as a foundation, the agent scans for vulnerabilities and classifies each finding based on real-world exploitability rather than abstract pattern matching. A SQL injection in a test fixture gets triaged differently than one in a production API endpoint. This is where the noise reduction claims come from.

Stage 3: Sandbox validation. Flagged issues get pressure-tested in a sandboxed environment. The agent attempts to construct proof-of-concept exploits to confirm exploitability, then ranks validated findings by severity and generates remediation code with explanations.

The pipeline can take hours or days depending on repository size - this isn't a quick linting pass.
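The three stages can be sketched as a simple pipeline. This is a hypothetical illustration, not OpenAI's actual API - every name (`ThreatModel`, `Finding`, the stage functions) is invented, and the heuristics are stand-ins for what the agent does with model reasoning.

```python
# Hypothetical sketch of the three-stage Codex Security pipeline described
# above. All names and heuristics are illustrative, not OpenAI's real API.
from dataclasses import dataclass, field

@dataclass
class Finding:
    title: str
    location: str                  # file path of the flagged code
    severity: int                  # higher = more severe
    exploit_confirmed: bool = False

@dataclass
class ThreatModel:
    # Stage 1 output: an editable description of where the system is exposed.
    production_paths: list = field(default_factory=list)

def stage1_threat_model(repo_files):
    """Analyze project structure to decide which paths are actually exposed.
    A real run reasons over trust boundaries and auth flows; this stand-in
    just treats non-test files as production surface."""
    tm = ThreatModel()
    for path in repo_files:
        if "test" not in path:
            tm.production_paths.append(path)
    return tm

def stage2_scan(repo_files, tm):
    """Flag issues, triaged by real-world exploitability: the same bug in a
    test fixture ranks below one in a production endpoint."""
    findings = []
    for path in repo_files:
        sev = 9 if path in tm.production_paths else 2
        findings.append(Finding("SQL injection", path, sev))
    return findings

def stage3_validate(findings, min_severity=5):
    """Pressure-test high-severity findings; keep only confirmed ones,
    ranked by severity."""
    validated = []
    for f in findings:
        if f.severity >= min_severity:
            f.exploit_confirmed = True  # stand-in for a real PoC attempt
            validated.append(f)
    return sorted(validated, key=lambda f: -f.severity)

repo = ["api/users.py", "tests/test_users.py"]
report = stage3_validate(stage2_scan(repo, stage1_threat_model(repo)))
for f in report:
    print(f.severity, f.location)
```

The structural point the sketch captures: stage 2's triage depends on stage 1's threat model, which is why the flagged SQL injection in the test fixture never reaches stage 3.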

The CVE Haul

OpenAI's 14 CVE assignments span seven projects:

| Project | Vulnerability | CVE |
|---|---|---|
| GnuTLS | Heap buffer overread in SCT extension parsing | CVE-2025-32989 |
| GnuTLS | Double-free in otherName SAN export | CVE-2025-32988 |
| GOGS | Two-factor authentication bypass | CVE-2025-64175 |
| GOGS | Unauthenticated access bypass | CVE-2026-25242 |
| gpg-agent | Stack buffer overflow (2 findings) | Pending |
| OpenSSH | Not disclosed (in responsible disclosure) | Pending |
| PHP | Not disclosed | Pending |
| Chromium | Not disclosed | Pending |
| libssh | Not disclosed | Pending |

The GnuTLS findings are technically interesting. A heap buffer overread in certificate transparency SCT parsing and a double-free during Subject Alternative Name export are both the kind of memory safety bugs that C codebases accumulate over decades. Neither is the sort of pattern that shows up in signature databases.

The GOGS vulnerabilities are arguably more impactful. CVE-2025-64175 bypasses two-factor authentication entirely, and CVE-2026-25242 allows unauthenticated access. GOGS is a self-hosted Git service used by organizations that specifically chose it to avoid depending on GitHub. Authentication bypasses in that context are severe.

Beta Metrics

OpenAI is claiming significant improvements over the Aardvark beta:

  • 84% noise reduction between initial rollout and the current version
  • 50%+ decrease in false positive rates across all monitored repositories
  • 90% reduction in over-reported severity levels
  • Critical vulnerabilities appeared in fewer than 0.1% of scanned commits

These numbers describe internal improvement over time, not absolute accuracy against a benchmark. That's an important distinction. The 50% false positive reduction means "50% fewer false positives than our first version," not "50% fewer than Snyk."
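A toy calculation (with invented numbers) makes the distinction concrete: halving a false positive rate from a noisy baseline can still leave most findings as noise.

```python
# Illustration with invented numbers: a relative improvement says nothing
# about absolute accuracy. These rates are hypothetical, not OpenAI's.
baseline_fp_rate = 0.95                   # hypothetical first-version rate
improved_fp_rate = baseline_fp_rate / 2   # a "50% decrease" in the rate
print(improved_fp_rate)                   # 0.475 - nearly half of findings are still noise
```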

Head-to-Head With Claude Code Security

The timing makes comparison inevitable. Semgrep published an independent evaluation that tested both tools against modern web application vulnerabilities. Their findings:

| Metric | Claude Code Security | Codex Security |
|---|---|---|
| Vulnerabilities found | 46 | 21 |
| True positive rate | 14% | 18% |
| False positive rate | 86% | 82% |

Claude finds more issues but with lower precision. Codex finds fewer but is slightly more accurate. Both tools have false positive rates above 80%, which means the majority of flagged issues aren't real vulnerabilities. That's comparable to or worse than existing SAST tools for this particular benchmark, though the nature of what they find - semantic, multi-file vulnerabilities versus pattern matches - is qualitatively different.
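Converting Semgrep's rates into absolute counts shows how small the true-positive pools are. This assumes the percentages are simple precision over reported findings, which Semgrep's table implies but doesn't state outright.

```python
# Back-of-the-envelope math on Semgrep's published numbers, assuming the
# true positive rate is precision over reported findings.
def true_positives(reported, tp_rate):
    return round(reported * tp_rate)

claude = true_positives(46, 0.14)   # Claude Code Security
codex = true_positives(21, 0.18)    # Codex Security
print(claude, codex)                # 6 4
```

In other words, on this benchmark each tool surfaced only a handful of real vulnerabilities - roughly six for Claude and four for Codex.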

The headline numbers tell a different story. Anthropic's red team found 500+ high-severity vulnerabilities in open-source projects. OpenAI's beta flagged 11,353 across 1.2 million commits. These aren't directly comparable - different codebases, different thresholds, different disclosure timelines - but both numbers are large enough to confirm that AI-powered scanning finds things traditional tools miss.

Availability and Pricing

Codex Security is rolling out in research preview to ChatGPT Enterprise, Business, and Edu customers through the Codex web interface. The first month is free. Pricing after that hasn't been disclosed.

OpenAI also launched Codex for OSS, a program giving open-source maintainers free ChatGPT Pro and Plus accounts, code review support, and Codex Security access. The vLLM inference engine team is among the first participants. OpenAI says it plans to expand the program in the coming weeks.

Claude Code Security, by comparison, is in limited research preview for Claude Enterprise and Team customers, with free expedited access for open-source maintainers.

Both companies are making the same bet: give security tools away to open source, lock in enterprise customers with the paid version.

What This Race Means

Two weeks between major AI security tool launches from the two leading foundation model companies isn't coincidence. Both Anthropic and OpenAI see vulnerability scanning as a wedge into enterprise security budgets - and both are willing to give the product away initially to establish their models as the default security layer in development workflows.

The competitive pressure has one clear beneficiary: open-source maintainers who now have two free, AI-powered security scanners to choose from. The cybersecurity industry that watched billions evaporate from its market cap on February 20 now faces a second entrant with its own CVE track record.

For enterprise buyers, the choice between Codex Security and Claude Code Security will likely come down to which AI platform they already use for coding. OpenAI has Codex and ChatGPT Enterprise; Anthropic has Claude Code. Security scanning becomes a feature that deepens platform lock-in rather than a standalone product.

The false positive rates from both tools - above 80% in Semgrep's testing - suggest neither is ready to replace human security review. But that was never the pitch. The pitch is finding the bugs that humans and traditional tools miss, then letting humans triage the results. On that metric, 14 CVEs across seven major open-source projects in a single beta period is hard to argue with.


About the author: Sophie, AI Infrastructure & Open Source Reporter

Sophie is a journalist and former systems engineer who covers AI infrastructure, open-source models, and the developer tooling ecosystem.