Suleiman Claims AI Takes White-Collar Jobs in 18 Months

As AI companies line up to demo their latest capabilities at developer conferences this week, a claim made by Microsoft AI CEO Mustafa Suleiman in February is getting renewed attention - and fresh scrutiny.

"We're going to have a human-level performance on most, if not all, professional tasks within the next 12 to 18 months. Most tasks that involve sitting down at a computer - whether you're a lawyer or an accountant or a project manager or a marketing person - most of those tasks will be fully automated by AI within the next year or 18 months."
Mustafa Suleiman, Microsoft AI CEO, Financial Times interview, February 2026

TL;DR


What Suleiman said	AI reaches human-level on most white-collar tasks in 12-18 months from February 2026
What we found	Experienced developers worked 19% slower with AI; legal AI tools show error rates of 17-34%; benchmark scores don't transfer to real work
The context	Suleiman runs Copilot, which holds roughly 1.2% of the AI chatbot market and needs enterprise deals to grow

The Claim

Suleiman made his prediction in a Financial Times interview, citing two main pieces of evidence: the exponential growth in training compute over the past 15 years, and the observation that "many software engineers report that they are now using AI-assisted coding for the vast majority of their code production." He framed the shift as already underway, not hypothetical.

The timeline matters. Twelve to eighteen months from February 2026 puts the automation threshold between February and August 2027. That is a specific claim with a testable deadline.
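The deadline window follows from simple month arithmetic. A minimal sketch, assuming the clock starts in February 2026 as stated above (the helper name `add_months` is ours, for illustration only):

```python
def add_months(year: int, month: int, delta: int) -> tuple[int, int]:
    """Return the (year, month) that falls `delta` months after the given month."""
    index = year * 12 + (month - 1) + delta  # months counted from year 0
    return index // 12, index % 12 + 1

# February 2026 is the interview date given in the article.
start = (2026, 2)
earliest = add_months(*start, 12)  # lower bound of the 12-to-18-month window
latest = add_months(*start, 18)    # upper bound of the window

print("Window opens:", earliest)
print("Window closes:", latest)
```

Under that assumption, the window runs from February 2027 to August 2027, matching the dates in the text.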

Suleiman co-founded DeepMind, whose AlphaFold 2 system made a landmark advance on the protein-structure prediction problem in 2020, so his track record on AI capability calls is not trivial. He joined Microsoft in March 2024 to lead Copilot and AI products. The question is whether this particular claim reflects what the data actually shows about AI performance on professional tasks.

Image: Mustafa Suleiman, Microsoft AI CEO, at a speaking event. Source: commons.wikimedia.org

The Evidence

Software Engineering: The Primary Pillar Falls

Suleiman's most concrete evidence was software engineering. He argued that current models "can code better than the vast majority of human coders." But a study by METR (Model Evaluation and Threat Research) published in July 2025 - seven months before Suleiman's February claim - directly tested that assumption.

METR recruited 16 experienced open-source developers and gave them 246 real programming tasks, half with AI assistance (Cursor Pro with Claude 3.5 and 3.7 Sonnet) and half without. The result: developers using AI took 19% longer to complete tasks.

The researchers were careful to note this covered senior developers on complex, real-world work rather than toy problems. Before the study, the developers expected AI to make them roughly 24% faster. Even after it, they estimated that AI had sped them up by about 20%. The actual measured result was a 19% slowdown.
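To make the perception gap concrete, here is a small sketch. It assumes each percentage is a change in task completion time, and it uses a hypothetical 60-minute baseline task; neither the baseline nor the helper name comes from the METR study.

```python
BASELINE_MINUTES = 60.0  # hypothetical task length without AI (illustrative assumption)

def time_with_ai(baseline: float, change_pct: float) -> float:
    """Completion time after a percentage change; negative values mean faster."""
    return baseline * (1 + change_pct / 100)

forecast = time_with_ai(BASELINE_MINUTES, -24)   # what developers expected beforehand
perceived = time_with_ai(BASELINE_MINUTES, -20)  # what they believed afterwards
measured = time_with_ai(BASELINE_MINUTES, +19)   # what the study measured

# Gap between what developers believed happened and what was measured.
perception_gap = measured - perceived

print(f"forecast:  {forecast:.1f} min")
print(f"perceived: {perceived:.1f} min")
print(f"measured:  {measured:.1f} min")
print(f"perception gap: {perception_gap:.1f} min")
```

On these assumptions, a one-hour task that developers believed took about 48 minutes with AI actually took a little over 71, a gap of more than 23 minutes per hour of baseline work.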

This gap - between what AI can do on isolated code challenges and what it does to experienced developers working on real projects - is exactly the gap Suleiman's prediction elides.

Legal: The Bar Exam Isn't the Practice of Law

AI systems have scored well on bar exams for years. The bar exam is a fixed, text-based test with correct answers. Legal practice isn't. An MIT Technology Review analysis from December 2025 looked at this gap in detail, noting that "an LLM still can't think like a lawyer."

The actual error rates from legal AI tools in production are stark. Lexis+ AI - LexisNexis's own legal research tool - runs at a 17% error rate. Westlaw's equivalent runs at 34%. These are purpose-built systems, fine-tuned for the legal domain and rolled out by the companies whose business depends on them working.

Scale AI's Professional Reasoning Benchmark, published in November 2025, tested leading models on difficult legal problems designed by practicing lawyers. The best model scored 37%. The researchers found that models "frequently made inaccurate legal judgments" and, when they reached correct conclusions, "used incomplete or opaque reasoning processes."

None of the AmLaw 100 firms - the 100 largest law firms in the United States - are planning to reduce attorney headcount despite widely reported claims of "100x productivity gains" on specific narrow tasks.

Broader Professional Work: The Real-World Gap

Image: A professional office environment with workers collaborating. Caption: The distance between benchmark performance and real professional task completion remains sizable across sectors. Source: pexels.com

Across professions, the pattern is consistent: AI performs well on isolated, well-defined sub-tasks and falls down on the longer-horizon reasoning, judgment, and integration of context that actual professional work requires.

A 2026 benchmark specifically designed to test end-to-end professional task completion - not capability on isolated questions - found frontier models hitting roughly a 4% success rate on real work assignments. This wasn't a test of whether AI can write a contract clause; it was a test of whether AI can complete a full professional assignment under realistic conditions.

Thomson Reuters's 2025 report on AI in accounting and auditing found "marginal productivity improvements" and documented cases where AI made some workers measurably less productive.

Claim vs Reality

Profession	Suleiman's Claim	Independent Evidence
Software engineering	"Codes better than vast majority of humans"	19% slower on real tasks (METR, Jul 2025)
Legal	"Fully automated within 18 months"	17-34% error rates; 37% on hard problems
Accounting	"Fully automated within 18 months"	Marginal gains; some workers measurably less productive
All white-collar	"Human-level on most, if not all, tasks"	~4% success on end-to-end real task completion

What They Left Out

The Commercial Incentive

Suleiman is the person responsible for selling Microsoft Copilot to enterprises. Copilot held approximately 1.2% of the AI chatbot market as of early 2026, despite running inside Windows, Office, and Azure - products used by hundreds of millions of people. Enterprise adoption has been slower than Microsoft projected.

Making a dramatic claim about AI's imminent automation of professional work is, functionally, a sales argument for enterprise AI investment. That doesn't make the claim false, but it's context the Financial Times interview didn't foreground.

Benchmarks Are Not Jobs

Suleiman's evidence rests heavily on benchmark performance and the observation that many engineers now use AI for significant portions of their code. Both facts are real. They don't show what he claims they show.

Benchmark improvements in software engineering - where scores have moved from roughly 60% to near-perfect on SWE-Bench Verified in eighteen months - haven't translated into the real-world productivity gains those numbers suggest. The gap between "can complete this task when given the full context and a well-specified problem" and "can do this job" is where the automation timeline falls apart.

The METR study was published seven months before Suleiman's February prediction. His characterization of software engineers benefiting massively from AI coding tools was the opposite of what METR's controlled study found.

Compute Scaling Is Not a Job Completion Plan

Suleiman's core evidence is that training compute has grown a trillionfold over fifteen years and will grow another thousandfold in the next three. This is likely true. It isn't, by itself, evidence that professional tasks will be automated on any particular timeline.
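It helps to see what those multiples imply per year. The sketch below assumes smooth, constant exponential growth across each period, which is a simplification of ours rather than anything Suleiman stated; the function names are illustrative.

```python
import math

def annual_factor(total_multiple: float, years: float) -> float:
    """Constant yearly growth factor that compounds to `total_multiple` over `years`."""
    return total_multiple ** (1 / years)

def doubling_time_months(factor_per_year: float) -> float:
    """Months needed to double at a constant yearly growth factor."""
    return 12 * math.log(2) / math.log(factor_per_year)

past = annual_factor(1e12, 15)   # a trillionfold over fifteen years
future = annual_factor(1e3, 3)   # a thousandfold over the next three

print(f"past:   {past:.2f}x per year, doubling every {doubling_time_months(past):.1f} months")
print(f"future: {future:.2f}x per year, doubling every {doubling_time_months(future):.1f} months")
```

On that assumption, the historical figure works out to roughly 6.3x per year and the projected one to 10x per year. The projection therefore implies compute growth speeding up, but the calculation says nothing about how much of that compute converts into reliable task completion.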

Scaling has produced remarkable capabilities. It has not closed the gap between task performance on benchmarks and reliable end-to-end execution on professional work. Those are different problems.

Researchers at Karpathy-adjacent labs and independent groups have analyzed AI job exposure in more nuanced ways, finding that the distribution of vulnerability across professions is far more uneven than Suleiman's broad claim suggests. Anthropic's own research on AI job risk and economic exposure reached similar conclusions about the patchy, task-level nature of current automation potential.


Suleiman's claim may be right in the very long run. On the specific 12-to-18-month timeline from February 2026, the evidence consistently points in the other direction. The data from METR, from legal AI deployments, from enterprise productivity studies, and from real-world task benchmarks all show a major gap between what frontier models can do in controlled settings and what they reliably deliver in professional work. Closing that gap will require more than scaling. Suleiman hasn't explained why the data already available before he made his prediction should not count.

Sources: