OpenAI o1 Outperforms ER Doctors in Harvard Trial
A peer-reviewed Science study puts OpenAI o1 through 76 live emergency room cases - and the model beats expert physicians on initial triage with 67.1% accuracy against 55.3% and 50.0%.

A Harvard-led team published results Friday in Science that are hard to wave away: OpenAI's o1-preview model correctly identified or closely approximated the right diagnosis in 67.1% of emergency room cases at initial triage - compared to 55.3% and 50.0% for two experienced physicians given the same information.
The study, titled "Performance of a large language model on the reasoning tasks of a physician," is the most rigorous head-to-head test of a reasoning LLM against real clinicians in a live hospital setting published in a major peer-reviewed journal.
TL;DR
- OpenAI o1-preview hit 67.1% exact-or-close diagnostic accuracy at ER triage vs 55.3% and 50.0% for expert physicians on 76 Beth Israel Deaconess cases
- With more patient data, the gap narrows: o1 reaches 82%, humans 70-79% (not statistically significant)
- Treatment planning is the blowout number: o1 scored 89% vs 34% for 46 doctors using conventional search tools
- Limitations are real - text-only inputs, no imaging, hallucination rate unmeasured
- Lead authors call for randomized controlled trials, not immediate clinical deployment
How the Study Was Designed
The team, led by Peter G. Brodeur and Thomas A. Buckley at Harvard Medical School with collaborators at Stanford, ran o1-preview through three stages of a standard emergency department visit at Beth Israel Deaconess Medical Center in Boston.
The Three Stages
Stage 1 - Initial triage: The model received only what a triage nurse records: vital signs, demographics, and a few sentences about why the patient came in. Seventy-six real cases. Two independent physicians, blinded to whether each assessment came from the AI or from human reviewers, scored the outputs.
Stage 2 - First physician contact: Richer data, including the doctor's notes from the initial exam. The performance gap here shrank considerably.
Stage 3 - Admission decisions: Full chart data available to both AI and doctors, including lab results and imaging notes (though not the images themselves).
The treatment planning component was separate: o1 and 46 physicians were each given five complex case studies and asked to recommend management plans. The evaluators scored the outputs without knowing the source.
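To make the stage-1 setup concrete, here is a minimal sketch of what "triage-level information only" could look like when posed to a reasoning model through the OpenAI Python SDK. The field names, vitals, and prompt wording are illustrative assumptions - the paper's actual prompting protocol isn't described in this article.

```python
# Illustrative sketch only - the study's actual prompting protocol is not
# described here. Field names, vitals, and wording are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Stage 1 gives the model only what a triage nurse records:
# vital signs, demographics, and a short statement of the complaint.
triage_note = {
    "age": 67,
    "sex": "male",
    "vitals": {"hr": 112, "bp": "94/60", "rr": 24, "spo2": "91% RA", "temp_c": 38.4},
    "chief_complaint": "Two days of worsening shortness of breath and productive cough.",
}

prompt = (
    "Based only on the triage information below, give a ranked differential "
    "diagnosis and state the single most likely diagnosis.\n\n"
    f"Age/sex: {triage_note['age']}-year-old {triage_note['sex']}\n"
    f"Vitals: {triage_note['vitals']}\n"
    f"Chief complaint: {triage_note['chief_complaint']}"
)

response = client.chat.completions.create(
    model="o1-preview",  # the model evaluated in the study
    messages=[{"role": "user", "content": prompt}],  # o1-preview takes user messages only
)
print(response.choices[0].message.content)
```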
The Numbers
| Stage | o1-preview | Expert Physicians | Statistically significant |
|---|---|---|---|
| Initial triage (limited info) | 67.1% exact/close | 55.3% / 50.0% | Yes |
| With more data (full chart) | 82% | 70-79% | No |
| Correct diagnosis in differential | 78.3% of cases | - | - |
| Treatment planning | 89% | 34% | Yes |
The treatment planning gap is the figure that'll get the most attention. A score of 89% versus 34% across 46 physicians isn't noise - it's a structural difference in how the model approaches management reasoning versus how doctors using standard search tools do the same task.
Buckley noted that o1 was especially strong at the earliest triage stage, "when there was the least information available" - exactly the conditions that most often lead to missed diagnoses in real emergency departments.
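One note on reading the significance column: the model and the physicians assessed the same 76 cases, so paired tests - which focus on the cases where one got it right and the other didn't - are the natural frame, and they can reach significance where a naive comparison of two independent percentages would not. The sketch below uses McNemar's test with hypothetical per-case counts, chosen only so the marginal accuracies match the reported 67.1% and 55.3% (they are not taken from the paper), and it treats each case as simply right or wrong, which simplifies the study's graded "exact or close" scoring.

```python
# Illustrative only: hypothetical per-case counts, NOT the study's data.
# Paired comparison of model vs. one physician on the same 76 triage cases.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Rows: model correct / wrong. Columns: physician correct / wrong.
# Marginals chosen to match the reported accuracies: 51/76 = 67.1%, 42/76 = 55.3%.
table = np.array([
    [38, 13],
    [4, 21],
])

result = mcnemar(table, exact=True)  # exact binomial test on the discordant pairs (13 vs 4)
print(f"McNemar p-value: {result.pvalue:.3f}")
```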
Why This Matters Now
Emergency departments are under persistent pressure. Overcrowding, staff shortages, and time constraints make initial triage one of the highest-stakes and most error-prone moments in acute care. A model that can reliably suggest a correct differential at triage - before a physician sees the patient - could meaningfully change how undifferentiated complaints get handled.
The study's authorship gives it weight. Adam Rodman, one of the co-authors, is known for his work on clinical reasoning education. Arjun Manrai co-leads the Computational Health Informatics Program at Boston Children's. These aren't AI researchers claiming a health win - they're clinicians with track records publishing in the world's most prestigious general science journal.
The result also joins a growing body of evidence. Last year, Superpower launched an AI doctor platform with persistent patient memory and biomarker tracking - a consumer-side bet that AI can beat the 15-minute clinical visit at routine longitudinal care. The Harvard study is the strongest peer-reviewed evidence yet that a similar gap shows up in acute settings.
What It Does Not Tell You
No images, no body language, no clinical gestalt
The study gave o1 text. Emergency medicine runs on information that text doesn't capture: a patient's color, respiratory effort, how they hold their arm, the smell of ketones. The AI didn't see a single chest X-ray. It didn't hear labored breathing. Physicians working in the real ER use all of this, constantly.
Study co-author Manrai was direct about this: "AI models can get things wrong" and "can be sycophantic" - generating confident-sounding wrong answers rather than flagging uncertainty. The study didn't measure hallucination rate at all. Buckley acknowledged the team "didn't formally measure the hallucination rate of these models." For a diagnostic tool, that's not a footnote - it's a central safety question.
The 76-case ceiling
Seventy-six cases is enough for a proof-of-concept study but not enough to draw conclusions about performance across the range of presentations an emergency physician sees in a career. The cases came from one hospital in Boston. Performance on a different patient population, or on cases that skew toward rare presentations, is unknown.
Physicians weren't using AI tools themselves
The human doctors in the study were limited to "conventional resources, such as search engines." That's a realistic baseline for most clinicians today, but it also means the study doesn't answer whether a physician augmented by o1 would beat either alone. That's the question that actually matters for deployment, and Rodman pointed to exactly this in his comments - he sees a "triadic care model" involving "the doctor, patient, and AI system" rather than AI replacing the physician role.
Outcomes weren't tracked
The study measured diagnostic and treatment reasoning accuracy, not patient outcomes. A model that scores well on paper may recommend a course of action that, in practice, misses the clinical picture or leads a physician toward overconfidence. Rodman was explicit that the field needs "extraordinarily strong evidence, such as a randomized controlled trial" before clinical deployment.
The Model Caveat Worth Reading Closely
The model tested here is o1-preview - not o1, not o3, not anything from the current GPT-5.x generation. OpenAI's reasoning model lineup has moved considerably since o1-preview was the state of the art. This matters in two directions: the results likely understate what current models could achieve, but it also means the study isn't assessing a model anyone would actually deploy today.
The research team's own framing is the correct one. This study is a signal to design rigorous clinical trials - not a green light for hospitals to start routing triage notes to an API endpoint.
What Comes Next
The Science publication will accelerate two parallel tracks. One is regulatory: the FDA has been cautious about LLM-based clinical decision support tools, and this study gives developers a credible evidence base to begin formal approval pathways. The other is research: with a published benchmark now in place, replication studies using current models and larger patient samples are the obvious next step.
Independent evaluation will matter here. Several recent studies have noted that AI benchmarks in medicine are susceptible to the same kinds of problems that have affected model benchmarks elsewhere - evaluation leakage, cherry-picked tasks, and claims that don't hold under third-party testing. The treatment planning figure, in particular, will attract scrutiny.
For now, the study stands as the strongest published evidence yet that reasoning-capable LLMs have reached a threshold where controlled clinical testing is warranted. The question is no longer whether AI can match text-based diagnostic reasoning. It's whether text-based diagnostic reasoning is the part of medicine that most needs improving.
The researchers have called for randomized controlled trials as the immediate next step. Until those exist, the 89% vs 34% treatment planning figure will do a lot of work in boardrooms and grant applications. Whether it does work in emergency departments depends on trials that don't yet exist.