Cohere Command A Vision: 112B Multimodal Model

Cohere Command A Vision is a 112B multimodal model that leads on document and OCR benchmarks, beating GPT-4.1 on seven of nine visual understanding benchmarks in Cohere's internal evaluations.

Cohere released Command A Vision on July 31, 2025 as a 112B multimodal extension of its Command A text model. Built for enterprise document processing, it integrates a SigLIP2 vision encoder with the existing 111B Command A text tower to produce a model that processes documents, charts, and tables at competitive accuracy with far lower deployment costs than closed API alternatives.

TL;DR

  • Best-in-class on document benchmarks: DocVQA 95.9%, OCRBench 86.9%, ChartQA 90.9%
  • 112B parameters, 128K context window, deployable on 2x A100 GPUs; CC-BY-NC license with commercial access via Cohere sales
  • Beats GPT-4.1 on 7 of 9 vision benchmarks, but trails it by 9.5 points on MMMU (65.3% vs 74.8%)

Overview

Command A Vision fills a specific gap in Cohere's product line: a self-hostable multimodal model aimed at enterprises with document-heavy workloads. The base Command A model launched in March 2025 as a 111B text model with a 256K context window, strong tool-use, and pricing at $2.50 per million input tokens. The vision variant adds a visual understanding stack on top of that foundation rather than training a new architecture from scratch.

The result is a model with a narrower language footprint for vision tasks - six languages (English, Portuguese, Italian, French, German, Spanish) versus the 23 supported by the text model - but notably stronger document-oriented performance than most competitors. On DocVQA, a benchmark measuring accuracy on real-world document question-answering, Command A Vision scores 95.9% against GPT-4.1's 88.6% and Llama 4 Maverick's 94.4%. These are Cohere's internal evaluations, not third-party reproductions, so treat competitor numbers as directional.

Where the model doesn't compete as well is in general visual reasoning. Its MMMU score of 65.3% puts it behind both GPT-4.1 (74.8%) and Llama 4 Maverick (73.4%). MMMU measures college-level multi-discipline reasoning from images - the kind of task where document specialization doesn't help. If your use case centers on reading PDFs, extracting tables, or parsing charts, Command A Vision looks strong. If you need a general-purpose visual reasoner, GPT-4.1 or Llama 4 Maverick may be better options.

Key Specifications

| Specification | Details |
| --- | --- |
| Provider | Cohere / CohereLabs |
| Model Family | Command A |
| Parameters | 112B (111B text tower + vision components) |
| Context Window | 128K tokens (vision mode) |
| Input Price | Not publicly disclosed |
| Output Price | Not publicly disclosed |
| Release Date | July 31, 2025 |
| License | CC-BY-NC (commercial use via Cohere sales) |
| GPU Requirement | Minimum 2x A100 80GB GPUs |
| Vision Encoder | SigLIP2-patch16-512 |
| Max Image Tiles | 12 tiles of 512x512 + 1 global thumbnail |
| Max Visual Tokens | 3,328 per image |
| Languages (vision) | English, French, German, Italian, Portuguese, Spanish |

Benchmark Performance

All scores below are from Cohere's internal evaluation suite published with the July 2025 release. Competitor scores that were not available from official sources are Cohere's best-effort reproductions. See the multimodal benchmarks leaderboard for third-party tracking.

| Benchmark | Command A Vision | GPT-4.1 | Llama 4 Maverick | Pixtral Large |
| --- | --- | --- | --- | --- |
| ChartQA | 90.9% | 82.7% | 90.0% | 88.1% |
| DocVQA | 95.9% | 88.6% | 94.4% | 93.3% |
| OCRBench | 86.9% | 83.4% | 80.0% | 74.1% |
| TextVQA | 84.8% | 71.1% | 81.6% | 79.3% |
| MMMU (CoT) | 65.3% | 74.8% | 73.4% | 64.0% |
| AI2D | 94.0% | 86.5% | 84.4% | 93.8% |
| InfoVQA | 82.9% | 70.0% | 77.1% | 59.9% |
| MathVista | 73.5% | 72.2% | 73.7% | 69.4% |
| RealWorldQA | 73.6% | 78.0% | 70.4% | 69.3% |
| **Average** | 83.1% | 78.6% | 80.5% | 76.8% |

The pattern is consistent across document-centric benchmarks. Command A Vision leads GPT-4.1 by 7.3 points on DocVQA and by 3.5 points on OCRBench. These are not marginal differences. On ChartQA, which tests chart and graph comprehension, it leads Llama 4 Maverick by 0.9 points and GPT-4.1 by 8.2 points.
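The headline margins and averages follow directly from the table; a quick check confirms the equal-weighted averages Cohere reports. The score lists below simply transcribe the table rows in order:

```python
# Recompute the averages and margins from the benchmark table above.
# Scores are Cohere's internal numbers; nine benchmarks, equal weighting.
command_a = [90.9, 95.9, 86.9, 84.8, 65.3, 94.0, 82.9, 73.5, 73.6]
gpt_4_1   = [82.7, 88.6, 83.4, 71.1, 74.8, 86.5, 70.0, 72.2, 78.0]

def avg(scores: list[float]) -> float:
    return round(sum(scores) / len(scores), 1)

print(avg(command_a))          # 83.1 -- matches the reported average
print(avg(gpt_4_1))            # 78.6
print(round(95.9 - 88.6, 1))   # 7.3 -- DocVQA margin over GPT-4.1
print(round(86.9 - 83.4, 1))   # 3.5 -- OCRBench margin over GPT-4.1
```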

The weakness is general reasoning under MMMU. A 9.5-point gap behind GPT-4.1 is substantial for anyone building science, medicine, or law workflows where the model needs to reason about diagrams rather than extract text from them. Cohere made an architectural trade-off here: the SigLIP2 encoder excels at reading document structure, not at interpreting complex visual scenes.


Key Capabilities

Command A Vision's strongest use cases are structured document processing tasks: reading scanned PDFs, extracting data from financial tables, parsing government forms, and interpreting charts in research reports. The InfoVQA score of 82.9% - 12.9 points ahead of GPT-4.1 - shows it handles infographic-heavy documents better than anything in the comparison set.

The image tiling architecture supports high-resolution documents without downsampling. Images map to up to 12 tiles of 512x512 pixels plus a global thumbnail, totaling up to 3,328 visual tokens per image with a maximum of 20 images per request. Each tile resolves to 256 tokens, which means a dense document page retains sufficient pixel-level detail for character-level OCR. The recommended maximum is 2048x1536 pixels (3 megapixels) per image.
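The token arithmetic above can be sketched as a small function. This is a simplified model of the published numbers (512x512 tiles, 256 tokens per tile, a 12-tile cap, plus one global thumbnail); Cohere's actual tiler may fit tiles to aspect ratio differently, so treat the function name and logic as illustrative:

```python
import math

# Published tiling spec for Command A Vision (simplified sketch).
TILE_SIZE = 512        # pixels per tile edge
TOKENS_PER_TILE = 256  # visual tokens each tile resolves to
MAX_TILES = 12         # cap before the global thumbnail

def visual_tokens(width: int, height: int) -> int:
    """Estimate visual tokens for one image under the tiling scheme."""
    tiles = math.ceil(width / TILE_SIZE) * math.ceil(height / TILE_SIZE)
    tiles = min(tiles, MAX_TILES)
    # Every image also gets one global thumbnail worth of tokens.
    return (tiles + 1) * TOKENS_PER_TILE

print(visual_tokens(2048, 1536))  # 12 tiles + thumbnail -> 3328
print(visual_tokens(512, 512))    # 1 tile + thumbnail  -> 512
```

At the recommended 2048x1536 maximum, the image fills all 12 tiles exactly, which is where the 3,328-token ceiling comes from.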

What the model explicitly doesn't do: tool use and function calling, which are available in the base Command A and the Command A Reasoning variant. There's also no image generation - this is a text-output model only. Cohere doesn't yet ship a single model that combines vision with tool-calling: Command A Reasoning handles agentic workflows but lacks vision, and Command A Vision handles images but can't call tools.

Pricing and Availability

Cohere doesn't publish per-token pricing for Command A Vision. The base Command A text model is priced at $2.50 per million input tokens and $10.00 per million output tokens through the Cohere API and OpenRouter - but those rates don't automatically apply to the vision variant.
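For rough budgeting, the base-model rates can serve as a placeholder until vision pricing is published. The sketch below ASSUMES Command A's text rates ($2.50 / $10.00 per million tokens) apply, which the vision variant may not honor:

```python
# Back-of-envelope cost estimator. Command A Vision has no published
# per-token price, so these rates are the base Command A text-model
# rates used as a PLACEHOLDER assumption.
BASE_INPUT_PER_M = 2.50    # USD per 1M input tokens (base Command A)
BASE_OUTPUT_PER_M = 10.00  # USD per 1M output tokens (base Command A)

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one request at the assumed base rates."""
    return round(
        input_tokens / 1e6 * BASE_INPUT_PER_M
        + output_tokens / 1e6 * BASE_OUTPUT_PER_M,
        4,
    )

# Example: a 10-page scan at ~3,328 visual tokens per page, a short
# 200-token prompt, and a 500-token answer.
print(estimate_cost(10 * 3328 + 200, 500))
```

At these assumed rates, document workloads land well under a cent per page, which is the economic argument for self-hosting only at high volume.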

The model weights are available under a CC-BY-NC license for non-commercial research via HuggingFace at CohereLabs/command-a-vision-07-2025. Commercial deployment requires contacting Cohere's sales team. This puts it in a similar position to Meta's Llama licensing for commercial use - technically open weights, practically enterprise-gated.

Deployment needs at least two A100 80GB GPUs. An H100 with 4-bit quantization also works as a single-GPU option. This is heavier than running a 70B model but lighter than what most 110B+ dense models require, partly because the architecture was designed with two-GPU deployment in mind from the start.
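The GPU requirement is consistent with simple weight-memory arithmetic. Note that 112B parameters at fp16 exceed 160GB, so the documented 2x A100 deployment presumably relies on reduced precision; the exact serving configuration is an assumption here, and the figures below ignore KV cache and activation overhead:

```python
# Rough weight-memory arithmetic behind the published GPU requirements.
# Ignores KV cache, activations, and framework overhead.
def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    """Model weight footprint in decimal GB."""
    return round(params_billion * bytes_per_param, 1)

print(weight_gb(112, 2))    # fp16/bf16: 224.0 GB -> exceeds 2x A100 80GB
print(weight_gb(112, 1))    # int8:      112.0 GB -> fits 2x A100 80GB
print(weight_gb(112, 0.5))  # 4-bit:      56.0 GB -> fits a single H100 80GB
```

The 4-bit row matches the single-H100 option Cohere documents; the int8 row is the plausible (unconfirmed) explanation for the two-A100 minimum.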


Strengths and Weaknesses

Strengths

  • Leads the comparison group on DocVQA (95.9%), OCRBench (86.9%), and InfoVQA (82.9%) - strong document processing credentials
  • Open weights under CC-BY-NC enable local or private-cloud deployment without per-token API costs at scale
  • Supports up to 20 images per request with high-resolution tiling up to 2048x1536 pixels per image
  • 83.1% average across 9 benchmarks beats GPT-4.1 (78.6%), Pixtral Large (76.8%), and Llama 4 Maverick (80.5%)

Weaknesses

  • MMMU (CoT) score of 65.3% trails GPT-4.1 by 9.5 points - weaker general visual reasoning
  • No tool use or function calling in the vision variant
  • Pricing for commercial API use isn't public; requires sales contact
  • Vision language support is limited to 6 languages versus 23 for the base text model
  • Requires 2x A100 GPUs for inference - not a small-scale option

FAQ

Is Command A Vision free to use?

The model weights are free for non-commercial use under CC-BY-NC. Commercial deployment requires a Cohere enterprise contract. There's no confirmed public API pricing for the vision variant.

How many images can I send per request?

Up to 20 images per request, with a total payload cap of 20MB. Each image supports up to 12 tiles at 512x512 pixels plus one global thumbnail.
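Those limits are easy to enforce client-side before a request goes out. A minimal pre-flight check, assuming the 20MB cap is binary megabytes (the exact unit isn't specified) and using illustrative names rather than Cohere SDK code:

```python
# Client-side pre-flight check for the documented request limits:
# at most 20 images and a 20MB total payload. Illustrative, not SDK code.
MAX_IMAGES = 20
MAX_PAYLOAD_BYTES = 20 * 1024 * 1024  # assuming MB means MiB here

def validate_batch(image_sizes_bytes: list[int]) -> list[str]:
    """Return a list of limit violations (empty list means OK)."""
    errors = []
    if len(image_sizes_bytes) > MAX_IMAGES:
        errors.append(
            f"too many images: {len(image_sizes_bytes)} > {MAX_IMAGES}"
        )
    total = sum(image_sizes_bytes)
    if total > MAX_PAYLOAD_BYTES:
        errors.append(f"payload too large: {total} > {MAX_PAYLOAD_BYTES}")
    return errors

print(validate_batch([1_000_000] * 19))  # [] -- within both limits
```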

Does it support tool use?

No. Tool use and function calling are available in the base Command A and Command A Reasoning models, but not in the vision variant as of the July 2025 release.

How does it compare to GPT-4.1 for documents?

It leads GPT-4.1 by 7.3 points on DocVQA (95.9% vs 88.6%) and 3.5 points on OCRBench (86.9% vs 83.4%). For general visual reasoning (MMMU), GPT-4.1 is 9.5 points ahead.

What languages does Command A Vision support?

Six languages for vision tasks: English, French, German, Italian, Portuguese, and Spanish. The underlying text model supports 23 languages, but the vision benchmarking and recommended use is narrower.

What GPU do I need to run it locally?

Minimum of two A100 80GB GPUs. A single H100 with 4-bit quantization is also supported.


Last verified March 17, 2026

About the author

AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.