AI Vision Input Limits - What Every Provider Hides

A technical comparison of how Claude, GPT-4o, Gemini, Grok, Pixtral, Qwen, and DeepSeek handle image inputs - resizing pipelines, token math, and undocumented gotchas.


You send a detailed annotated screenshot to an AI model. The annotations are tiny - 12px text on a 4K capture. The model responds confidently, but gets every annotation wrong. It isn't hallucinating in the usual sense. It literally can't read them, because the provider silently downsized your image before the model ever saw it.

TL;DR

  • Every major vision API silently resizes your images before processing, often to much smaller dimensions than you'd expect
  • Token costs vary wildly: the same 2048x2048 image costs ~1,590 tokens on Claude, ~765 on GPT-4o (high detail), and just 280 on Gemini 3 (LOW)
  • Pixtral is the only model that processes images at native resolution with no resizing
  • Pre-resize your images to the provider's actual processing resolution to save money and reduce latency

I spent a week digging through documentation, API responses, and GitHub issues to map out exactly what happens to your images after you hit "send." Some of it's well-documented. Much of it isn't.

The Master Comparison

Before we get into the details, here's a complete reference table comparing every major provider's image handling as of March 2026.

| Provider | Max Input | Processing Resolution | Method | Token Formula | Detail Modes | Well Documented? |
| --- | --- | --- | --- | --- | --- | --- |
| Claude (API) | 8000x8000 / 5MB | 1568px long edge (~1.15MP) | Aspect-preserving downscale | (w x h) / 750 | None | Yes |
| GPT-4o | No hard pixel cap / 20MB | 2048 box, then 768 shortest side, 512x512 tiles | 3-step tile pipeline | 85 + (170 x tiles) | low / high / auto | Yes |
| Gemini 3 | Not specified | Token-budget system | Resolution parameter | 280-2240 tokens per level | LOW / MED / HIGH / ULTRA_HIGH | Yes |
| Grok | 20MB | 448x448 tiles | Tile decomposition | (tiles + 1) x 256 | None | Partial |
| Pixtral | Varies by deployment | Native resolution | 16x16 patches + 2D RoPE | Proportional to pixels | None | Yes |
| Qwen2.5-VL | Configurable | Dynamic (~1MP default) | 28x28 patches | 4-16,384 visual tokens | User-configurable | Yes |
| Llama 4 | Provider-dependent | 448px tiles | Tile + global thumbnail | Provider-dependent | Provider-dependent | Partial |
| DeepSeek VL2 | Not specified | 384x384 tiles | Dynamic tiling (with caveats) | 729 embeddings/tile | None | Poor |

Claude - Two Different Pipelines

Anthropic's documentation is among the clearest in the industry. The official vision docs spell out the numbers: maximum input is 8000x8000 pixels (or 5MB via API, 10MB via claude.ai). Claude downscales any image that exceeds 1568 pixels on its long edge, and also caps total area at roughly 1.15 megapixels (about 1,600 tokens), preserving the aspect ratio in both cases.

The token formula is straightforward: (width x height) / 750. A 1092x1092 image (the largest 1:1 image that avoids a resize) uses roughly 1,590 tokens. There are no "detail modes" to toggle - you get one resolution, and it's consistent.
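As a sanity check, Claude's downscale-then-count math can be sketched in a few lines of Python. This is an approximation: the docs give the area cap as "roughly 1.15 megapixels," modeled here as the ~1,600-token ceiling (1,600 x 750 pixels).

```python
import math

MAX_EDGE = 1568          # documented long-edge cap
MAX_PIXELS = 1_200_000   # ~1,600 tokens x 750 px/token (docs cite ~1.15 MP)

def claude_image_tokens(width: int, height: int) -> int:
    """Estimate Claude's token cost for an image, modeling both
    documented downscale rules: long-edge cap and total-area cap."""
    scale = min(
        1.0,
        MAX_EDGE / max(width, height),
        math.sqrt(MAX_PIXELS / (width * height)),
    )
    w, h = round(width * scale), round(height * scale)
    return round(w * h / 750)
```

For a 2048x2048 input this lands near 1,590-1,600 tokens, consistent with the 1092x1092 no-resize ceiling for square images.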


What isn't in the official docs is how client tools handle images before they even reach the API. Claude Code, Anthropic's CLI tool, has its own image preprocessing. Multiple GitHub issues (#20738, #12735) document that images get resized before transmission. If you're using Claude Code's Read tool to analyze screenshots, the image that reaches the model may be significantly smaller than what the API itself would accept.

For developers building on Claude's API directly, the 1568px limit is the real ceiling. For Claude Code users, be aware that your images may be compressed further. If you need pixel-level accuracy, use the API directly rather than relying on client-side tools.

Claude's Resolution Table

Anthropic publishes maximum dimensions that won't trigger a downscale:

| Aspect Ratio | Max Size (no resize) |
| --- | --- |
| 1:1 | 1092x1092 |
| 3:4 | 951x1268 |
| 2:3 | 896x1344 |
| 9:16 | 819x1456 |
| 1:2 | 784x1568 |

Anything larger gets scaled down, increasing latency with no quality benefit.

GPT-4o - The Three-Step Tile Pipeline

OpenAI's approach is the most thoroughly documented tiling system in the industry. GPT-4o processes images through three distinct stages:

Step 1: Scale the image so it fits within a 2048x2048 bounding box, preserving aspect ratio.

Step 2: Scale further so the shortest side equals 768 pixels.

Step 3: Divide the result into 512x512 tiles.

Token math works out to 85 + (170 x number_of_tiles). The base 85 tokens cover a low-resolution overview of the full image. Each 512x512 tile adds 170 tokens for detailed analysis.

A concrete example: a 2048x4096 screenshot. Step 1 scales it to 1024x2048. Step 2 scales to 768x1536. That's 6 tiles (2 wide, 3 tall at 512px each), costing 85 + (170 x 6) = 1,105 tokens.
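The three steps translate directly into a token estimator. This is a sketch of the documented pipeline, not OpenAI's code:

```python
import math

def gpt4o_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Token estimate for GPT-4o's documented 3-step tile pipeline."""
    if detail == "low":
        return 85  # low detail skips tiling entirely
    # Step 1: fit within a 2048x2048 bounding box, preserving aspect
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Step 2: shrink so the shortest side is at most 768px
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # Step 3: count 512x512 tiles, partial tiles included
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles
```

Running the example above: `gpt4o_image_tokens(2048, 4096)` gives 1,105, and a square 2048x2048 image costs 765.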

OpenAI offers three detail modes. low bypasses the tile pipeline completely, processing everything at 512x512 for a flat 85 tokens. high runs the full pipeline. auto lets OpenAI decide based on the image. For document understanding tasks, high mode is nearly always worth the extra tokens.

Google Gemini 3 - Token Budgets Instead of Tiles

Google took a different approach with Gemini 3's media_resolution parameter. Instead of a fixed tile pipeline, you set a token budget per image:

  • LOW: 280 tokens
  • MEDIUM: 560 tokens
  • HIGH: 1,120 tokens (default)
  • ULTRA_HIGH: 2,240 tokens

The model internally decides how to allocate those tokens across the image. Gemini 3 also introduced per-part control - you can set different resolutions for different images in the same request. Send a complex diagram at ULTRA_HIGH and a simple contextual photo at LOW, all in one API call.

Gemini 2.5 models use a different system: 256 tokens per image at MEDIUM, with "Pan & Scan" at HIGH that dynamically allocates around 2,048 tokens. The model families aren't interchangeable in how they handle resolution.

This is the most flexible system of any provider, but it requires you to make explicit decisions about quality-cost tradeoffs per image.
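If you're mixing levels in one request, a trivial helper makes per-request image costs explicit. The level names and token counts come from the docs; the function itself (and its name) is just my bookkeeping:

```python
# Gemini 3 media_resolution levels and their fixed per-image token costs
GEMINI3_TOKENS = {"LOW": 280, "MEDIUM": 560, "HIGH": 1120, "ULTRA_HIGH": 2240}

def gemini3_request_tokens(levels: list[str]) -> int:
    """Total image-token cost for one request, one level per image."""
    return sum(GEMINI3_TOKENS[level] for level in levels)
```

A complex diagram at ULTRA_HIGH plus two context photos at LOW comes to 2,240 + 280 + 280 = 2,800 tokens, versus 3,360 if all three defaulted to HIGH.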

The Rest of the Field

Grok (xAI)

Grok breaks images into 448x448 tiles, with each tile consuming 256 tokens. An extra tile is added for overhead, making the formula (tiles + 1) x 256. Only JPG and PNG formats are supported - no WebP, no GIF. The maximum file size is 20MB, but xAI's documentation doesn't specify a pixel dimension cap. Token usage is reported in the API response, which helps with cost tracking, but because xAI doesn't fully document how images map to tiles, you can't reliably predict costs before sending.
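You can still make a reasonable estimate by assuming a simple ceiling grid over 448px tiles. The (tiles + 1) x 256 formula is documented; the exact tiling of non-square images is not, so treat this as an estimate rather than a guarantee:

```python
import math

def grok_image_tokens(width: int, height: int) -> int:
    """Estimated Grok token cost, assuming a ceil-based 448px tile grid
    plus one documented overhead tile."""
    tiles = math.ceil(width / 448) * math.ceil(height / 448)
    return (tiles + 1) * 256
```

Under this assumption, a 448x448 image costs 512 tokens and an 896x896 image costs 1,280.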

Pixtral (Mistral)

Pixtral is the exception to everything in this article. It processes images at their native resolution and aspect ratio with no resizing, no padding, no tiling into fixed grids. This is possible because of its 2D RoPE (Rotary Position Embedding) implementation, which replaces traditional absolute position embeddings with relative ones that work in two dimensions.


Each 16x16 pixel patch becomes a token, with the total count scaling linearly with the image's pixel count. For images up to 1024x1024, you get the full resolution with no artifacts from resizing. Beyond that, practical memory limits apply depending on your deployment. If your use case demands pixel-perfect fidelity - reading tiny text in screenshots, analyzing fine-grained diagrams - Pixtral's architecture is truly different from every other option on this list.
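A back-of-envelope Pixtral estimate is just patch arithmetic. This sketch ignores the row-break tokens the model inserts between patch rows, so real counts run slightly higher:

```python
def pixtral_image_tokens(width: int, height: int) -> int:
    """Approximate Pixtral token count: one token per 16x16 patch at
    native resolution (row-separator tokens not included)."""
    return (width // 16) * (height // 16)
```

A 1024x1024 image works out to 64 x 64 = 4,096 patch tokens - far more than Claude's ~1,590 for the same pixels, which is the price of native-resolution fidelity.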

Qwen2.5-VL

Alibaba's Qwen2.5-VL offers the most configurable resolution handling. You set min_pixels and max_pixels parameters, and the model resizes images to maintain aspect ratio within those bounds. The default range produces 4 to 16,384 visual tokens.

The pixel-to-token conversion uses 28x28 patches (the ViT encoder uses 14x14 base patches, then merges 2x2 patches into one token). You can explicitly control the tradeoff. Set min_pixels = 256 * 28 * 28 and max_pixels = 1280 * 28 * 28 for a range of 256-1,280 tokens per image. This granularity is unmatched by closed-source providers, though it requires more engineering effort from the developer.
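The resize logic can be sketched after the smart_resize helper in Qwen's repository; rounding details may differ from the official implementation:

```python
import math

PATCH = 28  # merged patch size: 14px ViT patches merged 2x2

def qwen_resize(width: int, height: int,
                min_pixels: int = 4 * 28 * 28,
                max_pixels: int = 16384 * 28 * 28):
    """Qwen2.5-VL-style dynamic resize: snap each side to a multiple of
    28, then rescale if the area falls outside [min_pixels, max_pixels].
    Returns (new_width, new_height, visual_tokens)."""
    w = round(width / PATCH) * PATCH
    h = round(height / PATCH) * PATCH
    if w * h > max_pixels:
        beta = math.sqrt((width * height) / max_pixels)
        w = math.floor(width / beta / PATCH) * PATCH
        h = math.floor(height / beta / PATCH) * PATCH
    elif w * h < min_pixels:
        beta = math.sqrt(min_pixels / (width * height))
        w = math.ceil(width * beta / PATCH) * PATCH
        h = math.ceil(height * beta / PATCH) * PATCH
    return w, h, (w // PATCH) * (h // PATCH)
```

A 1024x1024 input snaps to 1036x1036 and yields 37 x 37 = 1,369 visual tokens; a tiny image gets upscaled until it meets the 4-token floor.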

Llama 4 (Meta)

Llama 4 Scout and Maverick use a MetaCLIP-based vision encoder that processes images in 448x448 tiles (similar to Grok's tile size) with 14px patches. A global thumbnail - the entire image resized to 448x448 - is appended to the local tiles. Training involved up to 48 images simultaneously, but the models perform best with 8 or fewer.

Because Llama 4 is an open model, the actual resolution handling depends on where you deploy it. Groq, Together, and other inference providers each apply their own preprocessing and constraints. There's no single authoritative spec.

DeepSeek VL2 - The Three-Image Cliff

DeepSeek VL2 uses a SigLIP-based vision encoder with 384x384 tiles. When you send 1-2 images, it applies dynamic tiling - dividing high-resolution images into local tiles plus a global thumbnail. Each tile produces 729 visual embeddings.

Send 3 or more images, and the behavior changes completely. Every image gets padded to a single 384x384 tile with no dynamic tiling at all. Your detailed 4K screenshot gets the same treatment as a 400px thumbnail. This cliff is documented in the paper but easy to miss in practice. If you're building a multi-image comparison tool, this limitation rules out DeepSeek VL2 for workflows involving three or more high-resolution inputs.
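The cliff is easy to express in code. The 3+ path is the documented single-tile fallback; the 1-2 image tile math (ceiling grid plus global thumbnail) is my reading of the paper, not verbatim from it:

```python
import math

def deepseek_vl2_embeddings(images: list[tuple[int, int]]) -> list[int]:
    """Per-image visual-embedding counts for a list of (width, height)
    pairs, modeling DeepSeek VL2's three-image cliff."""
    # 3+ images: every image padded to a single 384x384 tile
    if len(images) >= 3:
        return [729 for _ in images]
    # 1-2 images: dynamic tiling, local tiles plus a global thumbnail
    return [(math.ceil(w / 384) * math.ceil(h / 384) + 1) * 729
            for w, h in images]
```

A lone 768x768 image produces 5 tiles (3,645 embeddings), but the same image sent alongside two others collapses to a single tile's 729.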

What This Means in Practice

The practical consequences come down to three rules.

First, always check your provider's actual processing resolution. If Claude caps at 1568px on the long edge, sending a 4000px image wastes bandwidth and adds latency. The model sees the same thing either way.

Second, pre-resize images yourself. Don't let the provider's pipeline make decisions for you. A 1568px image you resized with proper interpolation will look better than a 4000px image that the API's server-side pipeline downscaled. You also save on transfer time and, for providers that charge by the byte, on data costs.

Third, for annotation and OCR tasks, crop regions instead of sending full images. If you need the model to read 12px text in a corner of a 4K screenshot, crop that corner and send it as a separate image. A 500x500 crop at native resolution beats a 4000x3000 screenshot downscaled to 1568px where your target text is now 3 pixels tall.
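Putting these rules into practice takes only a few lines with Pillow (a third-party library, assumed installed). The 1568px cap here is Claude's; swap in your own provider's number:

```python
CLAUDE_LONG_EDGE = 1568  # Claude's documented long-edge cap

def fit_long_edge(width: int, height: int,
                  cap: int = CLAUDE_LONG_EDGE) -> tuple[int, int]:
    """Target size that caps the long edge while preserving aspect."""
    scale = min(1.0, cap / max(width, height))
    return round(width * scale), round(height * scale)

def preresize(src_path: str, dst_path: str,
              cap: int = CLAUDE_LONG_EDGE) -> None:
    """Resize locally with high-quality interpolation before upload."""
    from PIL import Image  # third-party (Pillow); assumed installed
    img = Image.open(src_path)
    img.resize(fit_long_edge(*img.size, cap), Image.LANCZOS).save(dst_path)
```

A 4000x3000 capture becomes 1568x1176 - the same pixels the model would see anyway, but resized with Lanczos on your terms and uploaded at a fraction of the bytes.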

Recommendations by Use Case

| Use Case | Best Provider | Why |
| --- | --- | --- |
| Screenshot annotation (small text) | Pixtral or Claude (API, pre-cropped) | Native resolution or known 1568px ceiling |
| Document OCR | GPT-4o (high detail) or Gemini 3 (ULTRA_HIGH) | Tile system preserves text; token budget flexibility |
| Photo analysis | Claude or Gemini 3 (HIGH) | Good balance of quality and cost |
| Multi-image comparison (3+) | GPT-4o or Claude | Avoid DeepSeek VL2's 3-image cliff |
| Cost-sensitive batch processing | Gemini 3 (LOW) at 280 tokens/image | Lowest per-image token cost |
| Maximum configurability | Qwen2.5-VL (self-hosted) | Full control over min/max pixels |

FAQ

Does sending a larger image improve AI vision accuracy?

No. Every provider except Pixtral downscales to a fixed processing resolution. Sending 4000px when the model processes at 1568px just wastes bandwidth and adds latency.

Which AI model handles image resolution best?

For raw fidelity, Pixtral processes images at native resolution. For practical API use, GPT-4o's tiling system and Gemini 3's token budgets both offer strong control over quality-cost tradeoffs.

How many tokens does an image cost in Claude?

Claude uses the formula (width x height) / 750. A 1092x1092 image costs roughly 1,590 tokens. Images larger than 1568px on any edge get downscaled first.

Can I control image resolution in the Gemini API?

Yes. Gemini 3 accepts a media_resolution parameter with four levels: LOW (280 tokens), MEDIUM (560), HIGH (1,120 default), and ULTRA_HIGH (2,240). You can set it per-image.

Why does DeepSeek VL2 perform worse with multiple images?

With 3 or more images, DeepSeek VL2 drops dynamic tiling and pads every image to a single 384x384 tile. This is a known architectural trade-off documented in the model's paper.

Does GPT-4o's "auto" detail mode always pick the right resolution?

No guarantee. When precision matters, set detail to "high" explicitly. The "auto" mode optimizes for cost-quality balance, which may not match your specific use case.


Last verified March 29, 2026

About the author: AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.