AI Vision Input Limits - What Every Provider Hides
A technical comparison of how Claude, GPT-4o, Gemini, Grok, Pixtral, Qwen, and DeepSeek handle image inputs - resizing pipelines, token math, and undocumented gotchas.

You send a detailed annotated screenshot to an AI model. The annotations are tiny - 12px text on a 4K capture. The model responds confidently, but gets every annotation wrong. It isn't hallucinating in the usual sense. It literally can't read them, because the provider silently downsized your image before the model ever saw it.
TL;DR
- Every major vision API silently resizes your images before processing, often to much smaller dimensions than you'd expect
- Token costs vary wildly: the same 2048x2048 image costs ~1,590 tokens on Claude, ~765 on GPT-4o (high detail), and just 280 on Gemini 3 (low)
- Pixtral is the only model that processes images at native resolution with no resizing
- Pre-resize your images to the provider's actual processing resolution to save money and reduce latency
I spent a week digging through documentation, API responses, and GitHub issues to map out exactly what happens to your images after you hit "send." Some of it's well-documented. Much of it isn't.
The Master Comparison
Before we get into the details, here is a complete reference table comparing every major provider's image handling as of March 2026.
| Provider | Max Input | Processing Resolution | Method | Token Formula | Detail Modes | Well Documented? |
|---|---|---|---|---|---|---|
| Claude (API) | 8000x8000 / 5MB | 1568px long edge (~1.15MP) | Aspect-preserving downscale | (w x h) / 750 | None | Yes |
| GPT-4o | No hard pixel cap / 20MB | 2048 box, then 768 shortest side, 512x512 tiles | 3-step tile pipeline | 85 + (170 x tiles) | low / high / auto | Yes |
| Gemini 3 | Not specified | Token-budget system | Resolution parameter | 280-2240 tokens per level | LOW / MED / HIGH / ULTRA_HIGH | Yes |
| Grok | 20MB | 448x448 tiles | Tile decomposition | (tiles + 1) x 256 | None | Partial |
| Pixtral | Varies by deployment | Native resolution | 16x16 patches + 2D RoPE | Proportional to pixels | None | Yes |
| Qwen2.5-VL | Configurable | Dynamic (~1MP default) | 28x28 patches | 4-16,384 visual tokens | User-configurable | Yes |
| Llama 4 | Provider-dependent | 448px tiles | Tile + global thumbnail | Provider-dependent | Provider-dependent | Partial |
| DeepSeek VL2 | Not specified | 384x384 tiles | Dynamic tiling (with caveats) | 729 embeddings/tile | None | Poor |
Claude - Two Different Pipelines
Anthropic's documentation is among the clearest in the industry. The official vision docs spell out the numbers: maximum input is 8000x8000 pixels (or 5MB via API, 10MB via claude.ai). If your image's long edge tops 1568 pixels, Claude downscales it while preserving the aspect ratio until it fits within ~1.15 megapixels.
The token formula is straightforward: (width x height) / 750. A 1092x1092 image (the sweet spot for 1:1 aspect ratio) uses roughly 1,590 tokens. There are no "detail modes" to toggle - you get one resolution, and it's consistent.
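The downscale-then-charge logic can be sketched in a few lines of Python. This is a back-of-envelope estimator, not an Anthropic SDK call, and the ~1.15 MP area cap is approximate per the docs:

```python
import math

def claude_processed_size(width: int, height: int,
                          max_edge: int = 1568,
                          max_pixels: float = 1.15e6) -> tuple[int, int]:
    """Dimensions Claude actually processes, aspect ratio preserved."""
    scale = min(1.0,
                max_edge / max(width, height),             # long-edge cap
                math.sqrt(max_pixels / (width * height)))  # ~1.15 MP area cap
    return round(width * scale), round(height * scale)

def claude_image_tokens(width: int, height: int) -> int:
    """Token cost after any downscale: (w x h) / 750."""
    w, h = claude_processed_size(width, height)
    return math.ceil(w * h / 750)

print(claude_image_tokens(800, 600))    # 640 tokens, no resize triggered
print(claude_image_tokens(4000, 3000))  # 1534 tokens after the downscale
```

Note that the downscale caps the bill: past ~1.15 MP, sending more pixels costs the same tokens but buys no extra detail.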
What isn't in the official docs is how client tools handle images before they even reach the API. Claude Code, Anthropic's CLI tool, has its own image preprocessing. Multiple GitHub issues (#20738, #12735) document that images get resized before transmission. If you're using Claude Code's Read tool to analyze screenshots, the image that reaches the model may be significantly smaller than what the API itself would accept.
For developers building on Claude's API directly, the 1568px limit is the real ceiling. For Claude Code users, be aware that your images may be compressed further. If you need pixel-level accuracy, use the API directly rather than relying on client-side tools.
Claude's Resolution Table
Anthropic publishes maximum dimensions that won't trigger a downscale:
| Aspect Ratio | Max Size (no resize) |
|---|---|
| 1:1 | 1092x1092 |
| 3:4 | 951x1268 |
| 2:3 | 896x1344 |
| 9:16 | 819x1456 |
| 1:2 | 784x1568 |
Anything larger gets scaled down, increasing latency with no quality benefit.
GPT-4o - The Three-Step Tile Pipeline
OpenAI's approach is the most thoroughly documented tiling system in the industry. GPT-4o processes images through three distinct stages:
Step 1: Scale the image so it fits within a 2048x2048 bounding box, preserving aspect ratio.
Step 2: Scale further so the shortest side equals 768 pixels.
Step 3: Divide the result into 512x512 tiles.
Token math works out to 85 + (170 x number_of_tiles). The base 85 tokens cover a low-resolution overview of the full image. Each 512x512 tile adds 170 tokens for detailed analysis.
A concrete example: a 2048x4096 screenshot. Step 1 scales it to 1024x2048. Step 2 scales to 768x1536. That's 6 tiles (2 wide, 3 tall at 512px each), costing 85 + (170 x 6) = 1,105 tokens.
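The three steps translate directly into a small calculator. This is illustrative Python mirroring the documented math, not OpenAI's own code:

```python
import math

def gpt4o_high_detail_tokens(width: int, height: int) -> int:
    # Step 1: fit within a 2048x2048 bounding box (downscale only).
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Step 2: scale so the shortest side is 768 px (downscale only).
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # Step 3: count 512x512 tiles; partial tiles count as whole tiles.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles

print(gpt4o_high_detail_tokens(2048, 4096))  # 6 tiles -> 1105
print(gpt4o_high_detail_tokens(1024, 1024))  # 4 tiles -> 765
```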
OpenAI offers three detail modes. low bypasses the tile pipeline completely, processing everything at 512x512 for a flat 85 tokens. high runs the full pipeline. auto lets OpenAI decide based on the image. For document understanding tasks, high mode is nearly always worth the extra tokens.
Google Gemini 3 - Token Budgets Instead of Tiles
Google took a different approach with Gemini 3's media_resolution parameter. Instead of a fixed tile pipeline, you set a token budget per image:
- LOW: 280 tokens
- MEDIUM: 560 tokens
- HIGH: 1,120 tokens (default)
- ULTRA_HIGH: 2,240 tokens
The model internally decides how to allocate those tokens across the image. Gemini 3 also introduced per-part control - you can set different resolutions for different images in the same request. Send a complex diagram at ULTRA_HIGH and a simple contextual photo at LOW, all in one API call.
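Budgeting a mixed-resolution request is then simple arithmetic. The level names below mirror the documented media_resolution values, but the function is just a cost sketch, not the SDK:

```python
# Per-image token budgets for Gemini 3's media_resolution levels.
GEMINI3_IMAGE_TOKENS = {
    "LOW": 280,
    "MEDIUM": 560,
    "HIGH": 1120,        # default
    "ULTRA_HIGH": 2240,
}

def gemini3_request_image_tokens(levels: list[str]) -> int:
    """Total image-token budget for a request with per-part resolution levels."""
    return sum(GEMINI3_IMAGE_TOKENS[level] for level in levels)

# One complex diagram at ULTRA_HIGH plus two context photos at LOW:
print(gemini3_request_image_tokens(["ULTRA_HIGH", "LOW", "LOW"]))  # 2800
```

Compare that 2,800-token mix to sending all three at the HIGH default (3,360 tokens): per-part control pays off quickly in multi-image requests.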
Gemini 2.5 models use a different system: 256 tokens per image at MEDIUM, with "Pan & Scan" at HIGH that dynamically allocates around 2,048 tokens. The model families aren't interchangeable in how they handle resolution.
This is the most flexible system of any provider, but it requires you to make explicit decisions about quality-cost tradeoffs per image.
The Rest of the Field
Grok (xAI)
Grok breaks images into 448x448 tiles, with each tile consuming 256 tokens. An extra tile is added for overhead, making the formula (tiles + 1) x 256. Only JPG and PNG formats are supported - no WebP, no GIF. The maximum file size is 20MB, but xAI's documentation doesn't specify a pixel dimension cap. Token usage is reported in the API response, which helps with cost tracking after the fact; to predict costs upfront, you have to estimate the tile count yourself.
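A rough upfront estimator, assuming partial tiles round up to whole tiles - xAI doesn't publish the exact rounding rules, so that part is an assumption:

```python
import math

def grok_image_tokens(width: int, height: int, tile: int = 448) -> int:
    """Estimate Grok's image cost: 256 tokens per 448x448 tile, plus one
    overhead tile. Partial-tile rounding is assumed, not documented."""
    tiles = math.ceil(width / tile) * math.ceil(height / tile)
    return (tiles + 1) * 256

print(grok_image_tokens(448, 448))   # 1 tile + overhead -> 512
print(grok_image_tokens(1024, 768))  # 3x2 = 6 tiles + overhead -> 1792
```

Cross-check the estimate against the token usage reported in the API response before relying on it for billing.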
Pixtral (Mistral)
Pixtral is the exception to everything in this article. It processes images at their native resolution and aspect ratio with no resizing, no padding, no tiling into fixed grids. This is possible because of its 2D RoPE (Rotary Position Embedding) implementation, which replaces traditional absolute position embeddings with relative ones that work in two dimensions.
Each 16x16 pixel patch becomes a token, with the total count scaling linearly with the image's pixel count. For images up to 1024x1024, you get the full resolution with no artifacts from resizing. Beyond that, practical memory limits apply depending on your deployment. If your use case demands pixel-perfect fidelity - reading tiny text in screenshots, analyzing fine-grained diagrams - Pixtral's architecture is truly different from every other option on this list.
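A back-of-envelope patch count makes the linear scaling concrete. Edge rounding and any special separator tokens are glossed over here, so treat this as an approximation:

```python
import math

def pixtral_image_tokens(width: int, height: int, patch: int = 16) -> int:
    """Approximate Pixtral's token count: one token per 16x16 patch,
    scaling linearly with pixel count (no resizing, no tiling)."""
    return math.ceil(width / patch) * math.ceil(height / patch)

print(pixtral_image_tokens(1024, 1024))  # 64 x 64 patches -> 4096 tokens
print(pixtral_image_tokens(512, 512))    # 32 x 32 patches -> 1024 tokens
```

Note the cost of that fidelity: a native 1024x1024 image costs roughly 4x what a tiled provider would charge, which is the tradeoff for skipping the downscale.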
Qwen2.5-VL
Alibaba's Qwen2.5-VL offers the most configurable resolution handling. You set min_pixels and max_pixels parameters, and the model resizes images to maintain aspect ratio within those bounds. The default range produces 4 to 16,384 visual tokens.
The pixel-to-token conversion uses 28x28 patches (the ViT encoder uses 14x14 base patches, then merges 2x2 patches into one token). You can explicitly control the tradeoff. Set min_pixels = 256 * 28 * 28 and max_pixels = 1280 * 28 * 28 for a range of 256-1,280 tokens per image. This granularity is unmatched by closed-source providers, though it requires more engineering effort from the developer.
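A simplified sketch of that resolution control - preserve aspect ratio, stay inside the pixel budget, snap edges to multiples of 28 - based on the published parameters, not the reference implementation:

```python
import math

def qwen_resize(width: int, height: int,
                min_pixels: int = 4 * 28 * 28,
                max_pixels: int = 16384 * 28 * 28) -> tuple[int, int]:
    """Pick processing dimensions within [min_pixels, max_pixels],
    with each edge snapped to a multiple of 28 (one token per 28x28 patch)."""
    area = width * height
    if area > max_pixels:
        # Floor when shrinking so the result stays within the budget.
        scale = math.sqrt(max_pixels / area)
        w = max(28, math.floor(width * scale / 28) * 28)
        h = max(28, math.floor(height * scale / 28) * 28)
    elif area < min_pixels:
        scale = math.sqrt(min_pixels / area)
        w = math.ceil(width * scale / 28) * 28
        h = math.ceil(height * scale / 28) * 28
    else:
        w = max(28, round(width / 28) * 28)
        h = max(28, round(height / 28) * 28)
    return w, h

def qwen_visual_tokens(width: int, height: int, **bounds) -> int:
    w, h = qwen_resize(width, height, **bounds)
    return (w // 28) * (h // 28)

# Cap a 4K frame at 1280 visual tokens, matching the range above:
print(qwen_visual_tokens(3840, 2160, max_pixels=1280 * 28 * 28))  # 1222
```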
Llama 4 (Meta)
Llama 4 Scout and Maverick use a MetaCLIP-based vision encoder that processes images in 448x448 tiles (similar to Grok's tile size) with 14px patches. A global thumbnail - the entire image resized to 448x448 - is appended to the local tiles. Training involved up to 48 images simultaneously, but the models perform best with 8 or fewer.
Because Llama 4 is an open model, the actual resolution handling depends on where you deploy it. Groq, Together, and other inference providers each apply their own preprocessing and constraints. There's no single authoritative spec.
DeepSeek VL2 - The Three-Image Cliff
DeepSeek VL2 uses a SigLIP-based vision encoder with 384x384 tiles. When you send 1-2 images, it applies dynamic tiling - dividing high-resolution images into local tiles plus a global thumbnail. Each tile produces 729 visual embeddings.
Send 3 or more images, and the behavior changes completely. Every image gets padded to a single 384x384 tile with no dynamic tiling at all. Your detailed 4K screenshot gets the same treatment as a 400px thumbnail. This cliff is documented in the paper but easy to miss in practice. If you're building a multi-image comparison tool, this limitation rules out DeepSeek VL2 for workflows involving three or more high-resolution inputs.
What This Means in Practice
The practical consequences come down to three rules.
First, always check your provider's actual processing resolution. If Claude caps at 1568px on the long edge, sending a 4000px image wastes bandwidth and adds latency. The model sees the same thing either way.
Second, pre-resize images yourself. Don't let the provider's pipeline make decisions for you. A 1568px image you resized with proper interpolation will look better than a 4000px image that the API's server-side pipeline downscaled. You also save on transfer time and, for providers that charge by the byte, on data costs.
Third, for annotation and OCR tasks, crop regions instead of sending full images. If you need the model to read 12px text in a corner of a 4K screenshot, crop that corner and send it as a separate image. A 500x500 crop at native resolution beats a 4000x3000 screenshot downscaled to 1568px where your target text is now 3 pixels tall.
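A quick way to sanity-check rule three before sending: estimate the rendered text height after a long-edge downscale. This assumes simple aspect-preserving scaling to a Claude-style 1568px cap; megapixel caps can shrink text even further:

```python
def scaled_text_height(text_px: float, long_edge: int,
                       processed_long_edge: int = 1568) -> float:
    """Height of text after an aspect-preserving downscale to the
    provider's long-edge cap (downscale only, never upscale)."""
    scale = min(1.0, processed_long_edge / long_edge)
    return text_px * scale

# 12px annotations on a 4000px-wide capture, sent whole vs. cropped:
print(scaled_text_height(12, 4000))  # ~4.7 px after downscaling: unreadable
print(scaled_text_height(12, 500))   # 12.0 px in a 500px crop: untouched
```

If the result lands below roughly 8-10px, crop before sending.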
Recommendations by Use Case
| Use Case | Best Provider | Why |
|---|---|---|
| Screenshot annotation (small text) | Pixtral or Claude (API, pre-cropped) | Native resolution or known 1568px ceiling |
| Document OCR | GPT-4o (high detail) or Gemini 3 (ULTRA_HIGH) | Tile system preserves text; token budget flexibility |
| Photo analysis | Claude or Gemini 3 (HIGH) | Good balance of quality and cost |
| Multi-image comparison (3+) | GPT-4o or Claude | Avoid DeepSeek VL2's 3-image cliff |
| Cost-sensitive batch processing | Gemini 3 (LOW) at 280 tokens/image | Lowest per-image token cost |
| Maximum configurability | Qwen2.5-VL (self-hosted) | Full control over min/max pixels |
FAQ
Does sending a larger image improve AI vision accuracy?
No. Every provider except Pixtral downscales to a fixed processing resolution. Sending 4000px when the model processes at 1568px just wastes bandwidth and adds latency.
Which AI model handles image resolution best?
For raw fidelity, Pixtral processes images at native resolution. For practical API use, GPT-4o's tiling system and Gemini 3's token budgets both offer strong control over quality-cost tradeoffs.
How many tokens does an image cost in Claude?
Claude uses the formula (width x height) / 750. A 1092x1092 image costs roughly 1,590 tokens. Images larger than 1568px on the long edge (or above ~1.15 megapixels) get downscaled first.
Can I control image resolution in the Gemini API?
Yes. Gemini 3 accepts a media_resolution parameter with four levels: LOW (280 tokens), MEDIUM (560), HIGH (1,120 default), and ULTRA_HIGH (2,240). You can set it per-image.
Why does DeepSeek VL2 perform worse with multiple images?
With 3 or more images, DeepSeek VL2 drops dynamic tiling and pads every image to a single 384x384 tile. This is a known architectural trade-off documented in the model's paper.
Does GPT-4o's "auto" detail mode always pick the right resolution?
No guarantee. When precision matters, set detail to "high" explicitly. The "auto" mode optimizes for cost-quality balance, which may not match your specific use case.
Sources:
- Claude Vision Documentation - Anthropic
- OpenAI Images and Vision Guide - OpenAI
- Gemini Media Resolution Parameter - Google
- Grok Image Understanding - xAI
- Pixtral 12B Announcement - Mistral AI
- Qwen2.5-VL Model Card - Alibaba/Qwen
- DeepSeek-VL2 Paper - DeepSeek
- Llama 4 Model Documentation - Meta/HuggingFace
- Claude Code Image Resize Feature Request - GitHub
- GPT-4o Vision Guide - GetStream
✓ Last verified March 29, 2026
