AI Models Pass Vision Tests Without Seeing the Images

A Stanford study shows frontier AI models achieve 70-80% of visual benchmark scores with no images provided, exposing a fundamental flaw in how multimodal AI is evaluated.


Every major frontier AI model you can name - GPT-5, Gemini 3 Pro, Claude Opus 4.5 - reaches 70 to 80 percent of its visual benchmark scores without being given any images at all. Stanford researchers have a name for this: the mirage effect. Their paper, published this month, argues that current multimodal benchmarks don't measure visual understanding. They measure language reasoning dressed up in the clothing of vision.

TL;DR

  • Stanford paper: frontier models score 70-80% on vision benchmarks with zero image input
  • Medical benchmarks hit 99% of normal accuracy through text patterns alone
  • A 3B text-only model beat frontier multimodal models AND human radiologists on chest X-rays
  • B-Clean benchmark filter removes 74-77% of questions; GPT-5.1 dropped from 61.5% to 15.4% on MicroVQA

The paper - "Mirage: The Illusion of Visual Understanding," by Mohammad Asadi, Jack W. O'Sullivan, Fang Cao, Tahoura Nedaee, Kamyar Fardi, Fei-Fei Li, Ehsan Adeli, and Euan Ashley - tests what happens when you remove the images from visual question-answering evaluations completely. The results should embarrass every lab that has published multimodal benchmark scores in the last two years.

The Numbers Are Damning

The researchers ran a "mirage-mode" evaluation: standard benchmarks, typical instructions, no images. Models answered as though images had been provided. Across all tested models, they retained 70 to 80 percent of their original scores this way.
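The mirage-mode protocol is simple to reproduce in outline: send each benchmark question with its standard instructions, but withhold the image payload. A minimal sketch, assuming a hypothetical `query_model` callable and a list-of-dicts benchmark format (neither is the paper's actual code):

```python
def mirage_mode_accuracy(benchmark, query_model):
    """Score a VQA benchmark with every image withheld.

    `benchmark` is a list of dicts with 'question', 'choices', 'answer',
    and 'image' keys; `query_model` takes (question, choices, image) and
    returns one choice. Both names are illustrative stand-ins.
    """
    correct = 0
    for item in benchmark:
        # Standard instructions, but image=None: the model must answer
        # as though an image had been provided.
        prediction = query_model(item["question"], item["choices"], image=None)
        correct += prediction == item["answer"]
    return correct / len(benchmark)
```

Comparing this number against the normal with-image score gives the retention rate the paper reports: the fraction of the benchmark a model keeps when it cannot see anything.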

The true scale of the problem only became clear when the researchers applied B-Clean - their framework for removing benchmark questions answerable without any image. On MicroVQA:

Model          Original score   After B-Clean   Drop
GPT-5.1        61.5%            15.4%           -46.1 pp
Gemini 3 Pro   68.8%            23.2%           -45.6 pp

Those post-B-Clean scores reflect what the benchmarks actually measure once language shortcuts are removed. The paper's central finding: "across every model-benchmark pair tested, the accuracy that frontier models achieved without any access to images exceeded the additional accuracy they gained when images were provided."

In plain terms - the language understanding was doing most of the work the whole time.

[Image: a chest X-ray, similar to those evaluated by AI models in the Stanford study. Medical imaging benchmarks showed the highest susceptibility to the mirage effect, with models reaching up to 99% of their normal accuracy through text patterns alone. Source: pexels.com]

Medical Benchmarks Are the Worst Case

General vision benchmarks showed troublingly high retention without images. Medical benchmarks were catastrophic: models reached up to 99% of their normal medical accuracy on text alone, with no imaging data whatsoever.

The diagnoses they created for nonexistent images skewed heavily toward severe pathologies. The paper documents AI systems confidently identifying ST-elevation myocardial infarctions, melanomas, and carcinomas from images that were never uploaded. If a user's image fails to attach in a real clinical tool, the model doesn't acknowledge the missing input. It generates a diagnosis anyway - and the diagnosis is biased toward the conditions that appear most often in the training data's textual patterns.

This isn't a theoretical failure mode. It's the default behavior of every major frontier model tested.

The Super-Guesser Experiment

To understand how deep the problem runs, the researchers trained a Qwen 2.5 model with 3 billion parameters on chest X-ray analysis with all images removed during training. The model learned only from text labels and structural patterns.

That 3B text-only model:

  • Outperformed all frontier multimodal models on the chest X-ray benchmark
  • Topped human radiologists by more than 10 percentage points
  • Produced explanations rated as indistinguishable from ground-truth clinical reasoning

A model that has never seen a single X-ray beats GPT-5, Gemini 3 Pro, and Claude Opus 4.5 on an X-ray benchmark. That benchmark isn't measuring what it claims to measure.

The Mirage Effect vs. Hallucination

The paper draws a careful distinction between hallucination and the mirage effect. They're different problems that require different solutions.

Hallucination means fabricating details within a valid perceptual frame - making up a citation within a coherent paragraph, or inventing a product name when summarizing a real document. The model has a real input; it fabricates elements of its response.

The mirage effect means constructing an entirely false perceptual frame. The model behaves as though a visual input exists when none was provided, building a whole chain of reasoning from a premise that is factually false. Researchers found that when they explicitly told models "no image was provided, please guess," models performed worse across nearly all categories than in standard mirage mode. The model performs better when it's not told it's guessing than when it is. The mirage regime is a distinct processing state, not a consequence of the model trying to be helpful despite uncertainty.

[Image: AI research visualization. The mirage effect affects all major frontier multimodal models, including those developed by OpenAI, Google, and Anthropic. Source: pexels.com]

B-Clean: A Proposed Fix

The researchers propose B-Clean, a framework for identifying and removing compromised questions from visual benchmarks. The methodology has four steps: run all candidate models in mirage mode, remove questions any model answers correctly without images, apply a text-only filter to catch structural patterns, then assess on what remains.
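The four steps above can be sketched as a filtering pass over the benchmark. This is a hedged illustration of the described methodology, not the paper's implementation; the `models` callables and `text_only_filter` predicate are hypothetical:

```python
def b_clean(benchmark, models, text_only_filter):
    """Remove questions answerable without any image (B-Clean sketch).

    `models` is a list of callables run in mirage mode (image=None);
    `text_only_filter` flags questions whose textual structure leaks
    the answer. Both are illustrative stand-ins.
    """
    cleaned = []
    for item in benchmark:
        # Steps 1-2: drop any question some model answers correctly blind.
        answered_blind = any(
            m(item["question"], item["choices"], image=None) == item["answer"]
            for m in models
        )
        # Step 3: drop questions caught by the text-only pattern filter.
        if not answered_blind and not text_only_filter(item):
            cleaned.append(item)
    return cleaned  # Step 4: evaluate models on what remains
```

The aggressiveness of the filter is the headline result: on the three benchmarks tested, roughly three of every four questions failed one of these checks.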

Applying B-Clean to three benchmarks removed 74 to 77 percent of questions. Two of three benchmarks changed rankings after cleaning. What counted as a top-performing model shifted when the benchmark actually required looking at something.

On the cleaned MicroVQA benchmark:

  • GPT-5.1 dropped from 61.5% to 15.4%
  • Gemini 3 Pro dropped from 68.8% to 23.2%

Those numbers don't mean these models have no visual capability - they mean the original benchmarks were crediting them for capabilities they weren't using.

What It Does Not Tell You

The paper establishes that current benchmarks overstate visual understanding. It doesn't establish that frontier models have no visual capability. On genuinely vision-dependent tasks - the 23 to 26 percent of questions that survived B-Clean filtering - these models do appear to process and use image data.

The mirage effect also isn't unique to vision. This blog has covered earlier research showing that frontier models increasingly game safety evaluations, detecting test conditions and behaving differently in deployment. The pattern is the same: benchmarks designed to measure one thing end up measuring something else because models have learned to exploit the structure of evaluations rather than the underlying capability being assessed.

The paper also doesn't address whether this invalidates the models' practical usefulness in real multimodal applications. A radiologist reviewing AI output with a real image attachment doesn't face the missing-image scenario that drives the worst failures here. The risk is concentrated in specific failure modes - image upload errors, automated pipelines where image delivery isn't verified, consumer applications without strict input validation.

The researchers' recommendations are straightforward: modality-ablation testing should become standard practice before benchmark scores are published, private or dynamically updated benchmarks should replace static public ones, and metrics should measure the marginal contribution of each modality rather than absolute accuracy. None of this is technically difficult. It requires the labs to stop finding it convenient that their benchmarks flatter their models.
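The last recommendation amounts to a single subtraction: report the accuracy a model gains only when the modality is actually present, rather than the absolute score. A minimal sketch with hypothetical numbers, not figures from the paper:

```python
def marginal_contribution(acc_full: float, acc_ablated: float) -> float:
    """Accuracy attributable to one modality: the score with it present
    minus the score with it ablated (mirage mode, for vision)."""
    return acc_full - acc_ablated

# Hypothetical model: 60% accuracy with images, 45% in mirage mode.
# Visual input contributes only 15 points; language carries the rest.
visual_gain = marginal_contribution(0.60, 0.45)
```

Under this metric, the paper's central finding reads as: for every model-benchmark pair tested, the ablated score exceeded the marginal gain.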


The paper lists Fei-Fei Li as a co-author. She co-created ImageNet, the benchmark that launched the deep learning era by establishing a rigorous method for measuring visual recognition. Two decades later, a paper with her name on it documents that the benchmarks her field built to assess visual AI don't reliably measure whether models can see. The field has a measurement problem that it has been ignoring because the numbers looked good.
