Google Gemma 3 27B

Google Gemma 3 27B is a 27B dense multimodal model supporting text and vision with a 128K context window, 140+ languages, and single-GPU deployment - the most capable open model at its size class.

TL;DR

  • 27B dense model with native multimodal support (text + images) - runs on a single high-end GPU
  • 128K context window with 140+ language support - trained on 14 trillion tokens
  • MMLU 78.6%, MMLU-Pro 67.5%, GPQA Diamond 42.4% - strong for its size, not frontier-class
  • Gemma Terms of Use (permissive, commercial OK) - available for self-hosting and via multiple API providers
  • The best multimodal open model you can run on consumer hardware

Overview

Google DeepMind released Gemma 3 27B on March 12, 2025, and the positioning is clear: this is the most capable open model that fits on a single GPU. At 27 billion parameters (dense, not MoE), it can run in bfloat16 on a single 80GB A100 or H100, or with quantization on even smaller setups. That constraint - single GPU, full multimodal, 128K context - defines where Gemma 3 27B competes. It is not trying to be a frontier model. It is trying to be the best model you can actually deploy on your own hardware without a multi-node cluster.
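
The single-GPU claim is simple arithmetic: in bfloat16 each parameter takes two bytes, so the weights alone land just under 55 GB. A quick sanity check (activations and KV cache add overhead on top of this):

```python
# Rough VRAM estimate for serving Gemma 3 27B in bfloat16.
# Back-of-the-envelope only; real usage depends on batch size,
# context length, and the serving framework.

PARAMS = 27.4e9          # parameter count from the spec sheet
BYTES_PER_PARAM = 2      # bfloat16 = 2 bytes per parameter

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9
print(f"weights alone: {weights_gb:.1f} GB")   # ~54.8 GB
```

Activations and KV cache add a few more GB, which is why the practical requirement is quoted as ~55-60 GB and the model fits comfortably on an 80 GB A100 or H100.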

The benchmarks reflect this positioning. MMLU at 78.6% and MMLU-Pro at 67.5% are strong for a 27B model - better than many 70B-class alternatives. On the LMArena chatbot leaderboard, Gemma 3 27B posts an Elo of 1338, outperforming Llama 3.1 405B in human preference despite being 15x smaller. The multimodal benchmarks are where it differentiates: DocVQA 85.6%, ChartQA 76.3%, AI2D 79.0%. These are not frontier scores, but they are competitive with models 3-5x larger. For document understanding, chart analysis, and visual question answering on a single GPU, nothing else in the open-weight space matches this combination.

Where Gemma 3 27B falls short is reasoning depth. GPQA Diamond at 42.4% puts it below Mistral Large 3 (43.9%) and far below the proprietary frontier (Claude Opus 4.6 at 91.3%). The gap is expected - you cannot get PhD-level reasoning from 27B parameters today. Similarly, coding benchmarks (HumanEval, SWE-bench) sit below what the Qwen 3.5 series achieves at similar parameter counts. The value of Gemma 3 27B is not in pushing benchmark ceilings. It is in delivering usable multimodal performance within a hardware budget that individual developers and small teams can actually afford.

Key Specifications

| Specification | Details |
|---|---|
| Provider | Google DeepMind |
| Model Family | Gemma 3 |
| Architecture | Transformer (dense), Grouped-Query Attention (GQA) |
| Parameters | 27.4B |
| Training Data | 14 trillion tokens |
| Context Window | 128,000 tokens input / 8,192 tokens output |
| Attention Pattern | 5:1 local (1024 window) to global attention ratio |
| Input Modalities | Text, images (normalized to 896x896, 256 tokens each) |
| Output Modality | Text |
| Languages | 140+ |
| Release Date | March 12, 2025 |
| License | Gemma Terms of Use (permissive, commercial use allowed) |
| VRAM Requirement | ~55-60 GB (bfloat16), lower with quantization |
| Training Hardware | TPUv4p, TPUv5p, TPUv5e |

Benchmark Performance

| Benchmark | Gemma 3 27B | Qwen 3.5-27B | Llama 3.3 70B | Mistral Large 3 |
|---|---|---|---|---|
| MMLU (5-shot) | 78.6% | 82.1% | 79.2% | - |
| MMLU-Pro | 67.5% | 74.8% | 68.9% | 73.1% |
| GPQA Diamond | 42.4% | 62.5% | 46.7% | 43.9% |
| HumanEval (pass@1) | 78.5% | 88.4% | 82.0% | 92.0% |
| DocVQA (vision) | 85.6% | 84.2% | - | 82.0% |
| ChartQA (vision) | 76.3% | 74.8% | - | 72.5% |
| MMMU (visual reasoning) | 64.2% | 82.3% | - | 62.5% |
| MGSM (multilingual math, 2-shot) | 74.3% | 82.6% | 78.4% | 80.1% |
| LMArena Elo | 1338 | 1420 | 1352 | 1418 |

The benchmark story for Gemma 3 27B is one of solid all-around performance with specific multimodal strengths. It wins or ties on DocVQA and ChartQA - document understanding and chart interpretation are where the vision encoder earns its keep. On text-only benchmarks, the newer Qwen 3.5-27B generally outperforms it, which is expected given the 10-month gap in release dates. Llama 3.3 70B is competitive on text tasks but lacks multimodal capability entirely, and requires roughly 2.5x more compute.

The 5:1 local-to-global attention ratio is worth noting. Gemma 3 uses mostly local attention (1024-token windows) with periodic full attention layers. This reduces compute on long sequences but can limit the model's ability to reason across distant parts of a long context. For most practical workloads under 32K tokens, the architecture performs well. For 128K-token inputs that require cross-document reasoning, the architectural tradeoff may show.
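
To make the tradeoff concrete, here is a sketch of the KV-cache saving. The 5:1 split and the 1024-token window come from the spec sheet; the layer count below is a hypothetical round number chosen to divide evenly, not the model's actual depth:

```python
# Sketch of why the 5:1 local/global attention split shrinks the KV cache.
# Local layers only need to cache the last LOCAL_WINDOW positions;
# global layers cache the full sequence.

LOCAL_WINDOW = 1024        # local attention window from the spec sheet
NUM_LAYERS = 60            # hypothetical depth, divisible by 6 for the 5:1 split
seq_len = 128_000          # full context window

# In every block of 6 layers, 5 are local and 1 is global.
local_layers = NUM_LAYERS * 5 // 6
global_layers = NUM_LAYERS - local_layers

# Per-layer cache size is proportional to the number of cached positions.
cached_positions = local_layers * min(seq_len, LOCAL_WINDOW) + global_layers * seq_len
all_global = NUM_LAYERS * seq_len

print(f"KV cache vs all-global attention: {cached_positions / all_global:.1%}")
```

At the full 128K context the hybrid pattern caches under a fifth of what an all-global stack would, which is where the long-context efficiency comes from - at the cost of only one layer in six seeing the whole sequence.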

Key Capabilities

Multimodal Vision. Gemma 3 27B processes images natively - they are normalized to 896x896 pixels and encoded into 256 tokens each. This means you can send a document scan, a chart screenshot, or a photograph alongside text and get coherent cross-modal responses. DocVQA at 85.6% means the model can reliably extract information from document images. ChartQA at 76.3% means it can interpret charts and answer questions about visual data. For teams building document processing pipelines, visual QA systems, or multimodal chatbots on a single GPU, this is the practical sweet spot.
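
Because every image costs a fixed 256 tokens regardless of source resolution, budgeting a mixed prompt is straightforward. A minimal sketch (the helper name is ours for illustration, not part of any API):

```python
# Each image is encoded as a fixed 256 tokens (after 896x896 normalization),
# so the context budget for a mixed text+image prompt is simple arithmetic.

TOKENS_PER_IMAGE = 256
CONTEXT_WINDOW = 128_000

def remaining_text_budget(num_images: int) -> int:
    """Tokens left for text after accounting for the images in the prompt."""
    return CONTEXT_WINDOW - num_images * TOKENS_PER_IMAGE

# e.g. a 20-page document scan sent as 20 images:
print(remaining_text_budget(20))   # 128000 - 5120 = 122880
```

Even a 100-image batch consumes only 25,600 tokens, so image count is rarely the binding constraint in document pipelines.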

Multilingual Breadth. With training covering 140+ languages and MGSM (multilingual math) at 74.3%, Gemma 3 27B offers genuine multilingual utility. The Global-MMLU-Lite score of 75.7% and XQuAD score of 76.8% confirm that performance holds across non-English languages. This matters for applications serving global user bases where a single model needs to handle queries in any language without separate per-language deployments.

Efficient Deployment. The 27B dense architecture is straightforward to serve - no MoE routing complexity, no expert balancing issues. In bfloat16, it fits on a single A100 80GB or H100. With 4-bit quantization (GPTQ or AWQ), it can run on consumer GPUs with 24GB VRAM, though with quality degradation on harder tasks. The model is compatible with the standard HuggingFace Transformers pipeline (>= v4.50.0), vLLM, and most common serving frameworks. This simplicity of deployment is a genuine advantage over MoE architectures that require more careful optimization.
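
The 24 GB consumer-GPU claim follows from the same arithmetic as the bfloat16 figure: at 4 bits per weight the checkpoint shrinks to roughly a quarter of its bf16 size. A rough estimate (real GPTQ/AWQ checkpoints keep some tensors in higher precision, and the KV cache is extra):

```python
# Why 4-bit quantization brings Gemma 3 27B into 24 GB consumer-GPU range.
# Rough estimate only; quantization metadata and the KV cache add overhead.

PARAMS = 27.4e9
BITS = 4

weights_gb = PARAMS * BITS / 8 / 1e9
print(f"4-bit weights: {weights_gb:.1f} GB")   # ~13.7 GB
```

That leaves roughly 10 GB of headroom on a 24 GB card for the KV cache and activations at moderate context lengths.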

Pricing and Availability

Gemma 3 27B is distributed under the Gemma Terms of Use, which permit commercial use subject to Google's prohibited-use policy. The model weights are downloadable from HuggingFace after accepting the license agreement.

For managed API access, several providers offer Gemma 3 27B hosting. Google AI Studio provides access, and third-party providers offer it starting around $0.04/M input tokens and $0.15/M output tokens. For comparison, self-hosting on a single A100 instance (roughly $2-3/hour on cloud providers) is often more cost-effective for sustained workloads.
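
Whether self-hosting actually beats the API comes down to sustained throughput. A back-of-the-envelope break-even using the prices quoted above (the 10:1 input:output token mix is an assumed workload, not a measurement):

```python
# Rough break-even between third-party API pricing and self-hosting,
# using the figures quoted above. Workload mix is an assumption.

API_INPUT_PER_M = 0.04     # $/M input tokens
API_OUTPUT_PER_M = 0.15    # $/M output tokens
GPU_HOUR = 2.50            # mid-range single-A100 cloud price, $/hour

# Assume a 10:1 input:output token mix (input-heavy, e.g. document workloads).
blended_per_m = (10 * API_INPUT_PER_M + 1 * API_OUTPUT_PER_M) / 11

breakeven_m_tokens_per_hour = GPU_HOUR / blended_per_m
print(f"self-hosting wins above ~{breakeven_m_tokens_per_hour:.0f}M tokens/hour")
```

The break-even sits at high throughput, which is why the cost case for self-hosting rests on sustained, well-batched utilization - or on data-control requirements rather than price alone.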

The model is also available through Ollama for local deployment: ollama pull gemma3:27b. Quantized versions are available on HuggingFace for lower VRAM requirements.

Strengths

  • Best multimodal open model at the single-GPU size class (DocVQA 85.6%, ChartQA 76.3%)
  • 128K context window with 140+ language support in a 27B dense model
  • Runs on a single A100/H100 in bfloat16 - no multi-node infrastructure needed
  • LMArena Elo of 1338 outperforms Llama 3.1 405B in human preference
  • Clean deployment story - standard Transformers, vLLM, Ollama all supported
  • Permissive license allows commercial deployment

Weaknesses

  • GPQA Diamond (42.4%) and MMLU-Pro (67.5%) lag behind newer models like Qwen 3.5-27B
  • Output limited to 8,192 tokens - a hard cap that constrains long-form generation
  • Vision limited to images - no video or audio processing
  • Released March 2025 - already showing age against mid-to-late 2025 models on text benchmarks
  • Gemma Terms of Use is not a standard OSI-approved open-source license
  • Local-to-global attention ratio may limit very long-context cross-document reasoning

About the author: James, AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.