Item: GPT-Rosalind
Author: Elena Marchetti

OpenAI's GPT-Rosalind launched in April as the company's first domain-specific reasoning model, built from the ground up for biology and drug discovery rather than fine-tuned from a general base. Two months later, on June 3, 2026, a significant update arrived: global access opened for the first time, two new genomics plugins shipped, and benchmarks against GPT-5.5 gave researchers harder numbers to work with. I've spent the past week digging through the benchmarks, the partner announcements, the access conditions, and the safety disclosures. The picture is complicated.

TL;DR

7.0/10 - genuinely strong benchmark results, but the access wall and undisclosed pricing make it irrelevant to most researchers today
Top BixBench score (0.751 Pass@1), 31% fewer tokens than GPT-5.5 on genomics tasks, and new NGS Analysis plugin in June
Trusted Access Program is still gated - you need a research governance structure, enterprise security controls, and a legitimate biology use case to qualify
Best for large pharma and well-funded academic institutions; the free Codex Life Sciences plugin is the more realistic option for everyone else

What OpenAI Actually Built

GPT-Rosalind is named after Rosalind Franklin, the British chemist whose X-ray crystallography work was essential to understanding DNA structure - a pointed choice that signals how seriously OpenAI wants the model to be taken. The model is trained specifically for the multi-step reasoning that dominates early drug discovery: synthesizing literature across hundreds of papers, interpreting sequencing outputs, designing CRISPR experiments, and chaining database queries into coherent analytical workflows.

OpenAI hasn't disclosed the parameter count, architecture, or context window, which blocks independent analysis. What's clear is that this isn't a fine-tuned wrapper on GPT-5.5. The company describes it as a reasoning model built for life sciences from training, not adapted after the fact. Whether that distinction produces meaningfully different emergent behavior - or is mostly a marketing claim - is hard to verify without access.

The accompanying Codex Life Sciences plugin is a separate product, free to any Codex user, connecting general-purpose models to more than 50 public biological databases: AlphaFold protein structures, Bgee gene expression data, BindingDB ligand-target affinity records, PubMed literature, and dozens of others. Two new plugins arrived with the June update - Life Sciences Research and Life Sciences NGS Analysis - extending coverage into next-generation sequencing workflows. These plugins work with standard Codex models, not just GPT-Rosalind. For most researchers, the plugin is the actual shipped product.

What Changed in June

The April launch restricted access to US enterprise customers only. The June 3 update opened the Trusted Access Program globally - organizations outside the US can now apply, provided they meet the governance and security requirements. Novo Nordisk joined the early partner roster, which already included Amgen, Moderna, the Allen Institute, Thermo Fisher Scientific, Oracle Health, NVIDIA, Benchling, UCSF School of Pharmacy, Los Alamos National Laboratory, and Dyno Therapeutics.

A new LifeSciBench evaluation framework also shipped. It covers six research workflow categories: evidence handling, analysis, design and optimization, scientific reasoning, validation and operations, and translation and communication. OpenAI hasn't published the full results yet, but the framework itself is an acknowledgment that BixBench alone doesn't capture the breadth of real research work.

Interactive viewers for biological file formats - sequences, alignments, protein structures - let researchers inspect raw evidence mid-session without switching tools. That sounds minor, but it addresses a real friction point: current workflows involve constant context-switching between the model and domain-specific viewers.

Genomics sequencing labs like this one at the Cancer Genomics Research Laboratory run the workflows GPT-Rosalind targets: sequencing interpretation, protocol design, and multi-step bioinformatics analysis chained inside one session. Source: unsplash.com

Benchmarks: Impressive Margins, Modest Absolutes

Three benchmarks matter here, and they tell different stories.

BixBench (built by FutureHouse and maintained by Edison Scientific) hands an agent an empty Jupyter notebook and 53 real-world bioinformatics scenarios with 296 questions. GPT-Rosalind scored 0.751 Pass@1 at launch, the top result among the published comparisons:

Model	BixBench (Pass@1)
GPT-Rosalind	0.751
GPT-5.4	0.732
GPT-5	0.728
Grok 4.2	0.698
Gemini 3.1 Pro	0.550

The margin over Gemini 3.1 Pro is large, at roughly 20 points. The margin over GPT-5.4 is 1.9 points - real, but not the kind of gap that alone justifies a separate model.
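The margins quoted here follow directly from the table. As a minimal sketch of the arithmetic in Python, using only the published Pass@1 figures (the function and variable names are illustrative, not part of any benchmark tooling):

```python
# BixBench Pass@1 scores as reported in the table above.
bixbench = {
    "GPT-Rosalind": 0.751,
    "GPT-5.4": 0.732,
    "GPT-5": 0.728,
    "Grok 4.2": 0.698,
    "Gemini 3.1 Pro": 0.550,
}


def margin_points(scores, leader, other):
    """Gap between two models in percentage points (Pass@1 x 100)."""
    return round((scores[leader] - scores[other]) * 100, 1)


# Gap between GPT-Rosalind and every other model in the table.
margins = {
    model: margin_points(bixbench, "GPT-Rosalind", model)
    for model in bixbench
    if model != "GPT-Rosalind"
}

for model, gap in margins.items():
    print(f"GPT-Rosalind vs {model}: +{gap} points")
```

Laid out this way, the spread is easier to see: the three closest competitors all sit within about five points of GPT-Rosalind, and only Gemini 3.1 Pro trails by a wide margin.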

MedChemBench and GeneBench are where the June update added useful data, reported alongside a LabWorkBench comparison. Against GPT-5.5:

Benchmark	GPT-Rosalind	GPT-5.5	Token delta vs. GPT-5.5
MedChemBench	27.5%	25.1%	-7.2%
GeneBench	21.6%	20.4%	-31%
LabWorkBench	63.2%	55.8%	-5.3%

The absolute numbers on MedChemBench and GeneBench are modest - 27.5% and 21.6% respectively. OpenAI's own framing is accurate: "acceleration for expert researchers, not autonomous drug design." These aren't pass rates that show the model can run a drug discovery pipeline independently. They indicate it performs meaningfully better than a general-purpose frontier model on specialist tasks, while using fewer tokens.
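To put the modest absolutes next to the token savings, here is a rough Python sketch of the arithmetic. Two things in it are assumptions rather than anything OpenAI reports: the token delta is read as GPT-Rosalind's token usage relative to GPT-5.5 on the same tasks, and the "score per token" ratio is an illustrative metric of my own.

```python
# Figures from the June comparison table:
# (GPT-Rosalind score %, GPT-5.5 score %, token delta %).
results = {
    "MedChemBench": (27.5, 25.1, -7.2),
    "GeneBench": (21.6, 20.4, -31.0),
    "LabWorkBench": (63.2, 55.8, -5.3),
}


def summarize(rosalind, gpt55, token_delta_pct):
    """Return (absolute gain in points, relative gain in %, score per token vs GPT-5.5)."""
    absolute_gain = rosalind - gpt55
    relative_gain = (rosalind / gpt55 - 1) * 100
    # Assumed meaning of the delta: Rosalind tokens per GPT-5.5 token.
    token_ratio = 1 + token_delta_pct / 100
    # Illustrative metric only: score obtained per token, relative to GPT-5.5.
    score_per_token = (rosalind / gpt55) / token_ratio
    return round(absolute_gain, 1), round(relative_gain, 1), round(score_per_token, 2)


summary = {name: summarize(*row) for name, row in results.items()}

for name, (abs_gain, rel_gain, efficiency) in summary.items():
    print(f"{name}: +{abs_gain} pts, +{rel_gain}% relative, {efficiency}x score per token")
```

On these figures GeneBench shows the smallest accuracy gain of the three but by far the largest efficiency gain, which is why the token delta deserves as much attention as the headline scores.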

The Dyno Therapeutics result from April is still the most interesting data point, and the hardest to reproduce. Dyno gave the model unpublished RNA sequences - data that couldn't have appeared in training - and GPT-Rosalind's best-of-ten submissions ranked above the 95th percentile of human experts on sequence-to-function prediction, and around the 84th percentile on sequence generation. Dyno supplied the evaluation, so contamination concerns are lower than in most vendor benchmarks. It's also not reproducible outside Dyno's environment. The RNA result is genuinely impressive and genuinely isolated.

For broader context on how GPT-Rosalind sits relative to other science-focused models, the scientific reasoning LLM leaderboard tracks cross-model science benchmarks with third-party methodology.

Drug discovery research still requires wet-lab validation regardless of what any AI model produces. GPT-Rosalind targets the computational stages that precede bench work. Source: unsplash.com

The Access Problem

GPT-Rosalind isn't available to the general public. It isn't available in ChatGPT Plus, the standard API, or any self-serve tier. Access requires applying to the Trusted Access Program - showing legitimate biology research with public benefit, strong internal governance, enterprise-grade security controls, and passing an application review that screens against misuse.

No pricing has been disclosed. The model runs free during the research preview for accepted organizations, but "free during preview" isn't a budget anyone can plan around. OpenAI says pricing and broader availability will come "as the program expands" with no date attached.

The global expansion in June is a real improvement - international research institutions can now apply. But the structural access problem remains. Independent researchers, small biotech firms, and academic labs without enterprise agreements can't use it. The equity concern is genuine: defining "legitimate research" is already hard inside the US, and harder still across institutions with different governance norms.

"GPT-Rosalind represents an important step in helping scientific teams use advanced AI to reason across complex biological evidence, data, and workflows," said Moderna CEO Stéphane Bancel.

Partner quotes from CEOs at Amgen, Moderna, and the Allen Institute are predictably positive. What's missing from the launch coverage is any independently verified result from a team that didn't have a pre-existing relationship with OpenAI.

The Codex Plugin Is the Real Shipped Product

If you're not in the Trusted Access Program, the Life Sciences plugin for Codex is what actually shipped for you. It connects general models to the same 50+ databases, enables protein structure lookup via AlphaFold, surfaces ligand-target affinity data from BindingDB, and runs literature queries across PubMed inside a Codex session.

This is useful. Bioinformatics engineers who currently context-switch between a notebook, a database browser, and an LLM can run most of that inside one Codex session with the plugin. The domain-specific training of GPT-Rosalind itself isn't available here - you're pairing GPT-5.4 or similar with better tool access, not getting the specialized reasoning model. For many workflows, that's a meaningful difference. For others, it probably isn't.
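For a sense of the manual step the plugin is meant to absorb, here is a minimal Python sketch of a hand-built PubMed search against NCBI's public E-utilities ESearch endpoint. It is not how the Codex plugin works internally, which OpenAI hasn't documented; the helper name and the example query are hypothetical, chosen only for illustration.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint (public and documented by NCBI).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_search_url(term, max_results=20):
    """Build an ESearch URL that returns matching PubMed IDs as JSON."""
    params = {
        "db": "pubmed",          # search the PubMed database
        "term": term,            # query string, Entrez syntax
        "retmax": max_results,   # cap on the number of IDs returned
        "retmode": "json",       # ask for JSON instead of the default XML
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Hypothetical query for illustration; no network request is made here.
url = build_pubmed_search_url("CRISPR base editing AND review[pt]", max_results=5)
print(url)
```

Fetching that URL, parsing the ID list, and then pulling abstracts is the kind of glue code a researcher writes today before any reasoning happens; the plugin's pitch is that those lookups happen inside the session instead.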

The NGS Analysis plugin from June adds next-generation sequencing workflows to that stack, which extends the coverage considerably for genomics teams.

Competition: Isomorphic, Chai, Anthropic

Three competitors worth comparing against GPT-Rosalind each take a different approach.

Isomorphic Labs' IsoDDE pipeline isn't an AI tool - it's a co-development arrangement. Isomorphic holds approximately $3 billion in pharma partnerships and delivers drug candidates under commercial terms. GPT-Rosalind provides enterprise access to a research productivity tool. These aren't in direct competition.

Chai Discovery's Chai-2 specializes in antibody design, reporting a 16% hit rate across 52 targets. Rosalind functions as a general research agent rather than a molecule-specific generator. Different category.

Anthropic's acquisition of Coefficient Bio positions Claude as a competitor through a different path: ecosystem plugins (PubMed, Benchling, 10x Genomics, ChEMBL) rather than domain-specific fine-tuning. Claude's advantage is general availability - any researcher can use it today, without a qualification process. GPT-Rosalind's advantage, for those who qualify, is deeper training on the domain.

Safety: The Gating Isn't Just Marketing

The access restrictions aren't purely commercial. Anthropic's 2025 red-team work found that frontier models meaningfully improved the quality of bioweapon acquisition plans produced by motivated expert actors. GPT-Rosalind's governance review and biosecurity refusal training are documented parts of the qualification flow, not afterthoughts.

Independent biosecurity research group SecureBio found that "safeguard robustness to circumvention by highly motivated expert actors remained uncertain." That assessment was cited in coverage and not disputed by OpenAI. Gating the model is a reasonable policy response to that uncertainty. It also means the gating will remain in place until robustness improves, which caps the addressable user base indefinitely.

OpenAI's framing - that the model is for expert researchers, not autonomous drug design - is accurate. The absolute benchmark scores confirm it. But the dual-use concern isn't removed by stating the intended use case.

Strengths

Top BixBench score - 0.751 Pass@1 ahead of every published comparable, with a 20-point gap over Gemini 3.1 Pro
31% fewer tokens than GPT-5.5 on genomics tasks - meaningful cost reduction for pipelines processing large biological datasets
Dyno RNA result uses unpublished sequences, reducing contamination concerns compared with most vendor benchmarks
Codex plugin ships free to any Codex user, with 50+ scientific databases and now NGS Analysis coverage
Partner roster confirms the market - Amgen, Moderna, Allen Institute, Los Alamos, and UCSF are serious research shops
Biosecurity gating is principled - the access requirements map to real dual-use concerns, not just commercial exclusivity

Weaknesses

Trusted Access Program still closed to most - no self-serve tier, no public API, no ChatGPT Plus integration
Pricing undisclosed - impossible to budget for or assess cost-efficiency in real research settings
Architecture, parameters, context window all hidden - independent analysis isn't possible
MedChemBench and GeneBench absolute scores are modest - 27.5% and 21.6% mean the model accelerates experts, it doesn't replace them
All benchmarks are vendor-selected or partner-evaluated - no third-party independent replication yet
$17 billion invested in AI drug discovery since 2019, zero AI-developed drugs through large-scale clinical trials - the gap between benchmark wins and therapeutic outcomes remains wide
SecureBio found safeguard robustness uncertain - the biosecurity justification for gating is also a signal the model's safety properties aren't fully characterized

Verdict

GPT-Rosalind is the strongest publicly benchmarked AI model for biology research tasks, and the June update's global expansion makes it at least theoretically accessible to more institutions. The Codex Life Sciences plugin is genuinely useful for anyone working in bioinformatics today, regardless of whether they qualify for the full model.

The problem is that "best model for qualified enterprise pharma teams" describes a tiny slice of the research community. Most academic biologists, independent researchers, and smaller biotech companies can't use it. Pricing is still a blank. The absolute benchmark scores on the harder tasks are encouraging, not transformative.

For any institution that does qualify, the combination of domain-specific reasoning and the new NGS Analysis plugin makes GPT-Rosalind worth testing against your specific workflows. For everyone else, the free Codex plugin is where this launch actually landed.

Score: 7.0/10

GPT-Rosalind Review: The Gated Drug Discovery Model