Google Gemma 3 27B

Google Gemma 3 27B is a 27B dense multimodal model supporting text and vision with a 128K context window, 140+ languages, and single-GPU deployment - the most capable open model at its size class.

TL;DR

  • 27B dense model with native multimodal support (text + images) - runs on a single high-end GPU
  • 128K context window with 140+ language support - trained on 14 trillion tokens
  • MMLU 78.6%, MMLU-Pro 67.5%, GPQA Diamond 42.4% - strong for its size, not frontier-class
  • Gemma Terms of Use (permissive, commercial OK) - available for self-hosting and via multiple API providers
  • The best multimodal open model you can run on consumer hardware

Overview

Google DeepMind released Gemma 3 27B on March 12, 2025, and the positioning is clear: this is the most capable open model that fits on a single GPU. At 27 billion parameters (dense, not MoE), it can run in bfloat16 on a single 80GB A100 or H100, or with quantization on even smaller setups. That constraint - single GPU, full multimodal, 128K context - defines where Gemma 3 27B competes. It is not trying to be a frontier model. It is trying to be the best model you can actually deploy on your own hardware without a multi-node cluster.
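
The single-GPU claim is simple arithmetic: in bfloat16 each parameter takes two bytes, so the weights alone land just under 55 GB. A quick sanity check (activations and KV cache add overhead on top of this):

```python
# Rough VRAM estimate for serving Gemma 3 27B in bfloat16.
# Back-of-the-envelope only; real usage depends on batch size,
# context length, and the serving framework.

PARAMS = 27.4e9          # parameter count from the spec sheet
BYTES_PER_PARAM = 2      # bfloat16 = 2 bytes per parameter

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9
print(f"weights alone: {weights_gb:.1f} GB")   # ~54.8 GB
```

Activations and KV cache add a few more GB, which is why the practical requirement is quoted as ~55-60 GB and the model fits comfortably on an 80 GB A100 or H100.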

The benchmarks reflect this positioning. MMLU at 78.6% and MMLU-Pro at 67.5% are strong for a 27B model - better than many 70B-class alternatives. On the LMArena chatbot leaderboard, Gemma 3 27B posts an Elo of 1338, outperforming Llama 3.1 405B in human preference despite being 15x smaller. The multimodal benchmarks are where it differentiates: DocVQA 85.6%, ChartQA 76.3%, AI2D 79.0%. These are not frontier scores, but they are competitive with models 3-5x larger. For document understanding, chart analysis, and visual question answering on a single GPU, nothing else in the open-weight space matches this combination.

Where Gemma 3 27B falls short is reasoning depth. GPQA Diamond at 42.4% puts it below Mistral Large 3 (43.9%) and far below the proprietary frontier (Claude Opus 4.6 at 91.3%). The gap is expected - you cannot get PhD-level reasoning from 27B parameters today. Similarly, coding benchmarks (HumanEval, SWE-bench) sit below what the Qwen 3.5 series achieves at similar parameter counts. The value of Gemma 3 27B is not in pushing benchmark ceilings. It is in delivering usable multimodal performance within a hardware budget that individual developers and small teams can actually afford.

Key Specifications

| Specification | Details |
|---|---|
| Provider | Google DeepMind |
| Model Family | Gemma 3 |
| Architecture | Transformer (dense), Grouped-Query Attention (GQA) |
| Parameters | 27.4B |
| Training Data | 14 trillion tokens |
| Context Window | 128,000 tokens input / 8,192 tokens output |
| Attention Pattern | 5:1 local (1024 window) to global attention ratio |
| Input Modalities | Text, images (normalized to 896x896, 256 tokens each) |
| Output Modality | Text |
| Languages | 140+ |
| Release Date | March 12, 2025 |
| License | Gemma Terms of Use (permissive, commercial use allowed) |
| VRAM Requirement | ~55-60 GB (bfloat16), lower with quantization |
| Training Hardware | TPUv4p, TPUv5p, TPUv5e |

Benchmark Performance

| Benchmark | Gemma 3 27B | Qwen 3.5-27B | Llama 3.3 70B | Mistral Large 3 |
|---|---|---|---|---|
| MMLU (5-shot) | 78.6% | 82.1% | 79.2% | - |
| MMLU-Pro | 67.5% | 74.8% | 68.9% | 73.1% |
| GPQA Diamond | 42.4% | 62.5% | 46.7% | 43.9% |
| HumanEval (pass@1) | 78.5% | 88.4% | 82.0% | 92.0% |
| DocVQA (vision) | 85.6% | 84.2% | - | 82.0% |
| ChartQA (vision) | 76.3% | 74.8% | - | 72.5% |
| MMMU (visual reasoning) | 64.2% | 82.3% | - | 62.5% |
| MGSM (multilingual math, 2-shot) | 74.3% | 82.6% | 78.4% | 80.1% |
| LMArena Elo | 1338 | 1420 | 1352 | 1418 |

The benchmark story for Gemma 3 27B is one of solid all-around performance with specific multimodal strengths. It wins or ties on DocVQA and ChartQA - document understanding and chart interpretation are where the vision encoder earns its keep. On text-only benchmarks, the newer Qwen 3.5-27B generally outperforms it, which is expected given the 10-month gap in release dates. Llama 3.3 70B is competitive on text tasks but lacks multimodal capability entirely, and requires roughly 2.5x more compute.

The 5:1 local-to-global attention ratio is worth noting. Gemma 3 uses mostly local attention (1024-token windows) with periodic full attention layers. This reduces compute on long sequences but can limit the model's ability to reason across distant parts of a long context. For most practical workloads under 32K tokens, the architecture performs well. For 128K-token inputs that require cross-document reasoning, the architectural tradeoff may show.
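
To make the tradeoff concrete, here is a sketch of the KV-cache saving. The 5:1 split and the 1024-token window come from the spec sheet; the layer count below is a hypothetical round number chosen to divide evenly, not the model's actual depth:

```python
# Sketch of why the 5:1 local/global attention split shrinks the KV cache.
# Local layers only need to cache the last LOCAL_WINDOW positions;
# global layers cache the full sequence.

LOCAL_WINDOW = 1024        # local attention window from the spec sheet
NUM_LAYERS = 60            # hypothetical depth, divisible by 6 for the 5:1 split
seq_len = 128_000          # full context window

# In every block of 6 layers, 5 are local and 1 is global.
local_layers = NUM_LAYERS * 5 // 6
global_layers = NUM_LAYERS - local_layers

# Per-layer cache size is proportional to the number of cached positions.
cached_positions = local_layers * min(seq_len, LOCAL_WINDOW) + global_layers * seq_len
all_global = NUM_LAYERS * seq_len

print(f"KV cache vs all-global attention: {cached_positions / all_global:.1%}")
```

At the full 128K context the hybrid pattern caches under a fifth of what an all-global stack would, which is where the long-context efficiency comes from - at the cost of only one layer in six seeing the whole sequence.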

Key Capabilities

Multimodal Vision. Gemma 3 27B processes images natively - they are normalized to 896x896 pixels and encoded into 256 tokens each. This means you can send a document scan, a chart screenshot, or a photograph alongside text and get coherent cross-modal responses. DocVQA at 85.6% means the model can reliably extract information from document images. ChartQA at 76.3% means it can interpret charts and answer questions about visual data. For teams building document processing pipelines, visual QA systems, or multimodal chatbots on a single GPU, this is the practical sweet spot.
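
Because every image costs a fixed 256 tokens regardless of source resolution, budgeting a mixed prompt is straightforward. A minimal sketch (the helper name is ours for illustration, not part of any API):

```python
# Each image is encoded as a fixed 256 tokens (after 896x896 normalization),
# so the context budget for a mixed text+image prompt is simple arithmetic.

TOKENS_PER_IMAGE = 256
CONTEXT_WINDOW = 128_000

def remaining_text_budget(num_images: int) -> int:
    """Tokens left for text after accounting for the images in the prompt."""
    return CONTEXT_WINDOW - num_images * TOKENS_PER_IMAGE

# e.g. a 20-page document scan sent as 20 images:
print(remaining_text_budget(20))   # 128000 - 5120 = 122880
```

Even a 100-image batch consumes only 25,600 tokens, so image count is rarely the binding constraint in document pipelines.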

Multilingual Breadth. With training covering 140+ languages and MGSM (multilingual math) at 74.3%, Gemma 3 27B offers genuine multilingual utility. The Global-MMLU-Lite score of 75.7% and XQuAD score of 76.8% confirm that performance holds across non-English languages. This matters for applications serving global user bases where a single model needs to handle queries in any language without separate per-language deployments.

Efficient Deployment. The 27B dense architecture is straightforward to serve - no MoE routing complexity, no expert balancing issues. In bfloat16, it fits on a single A100 80GB or H100. With 4-bit quantization (GPTQ or AWQ), it can run on consumer GPUs with 24GB VRAM, though with quality degradation on harder tasks. The model is compatible with the standard HuggingFace Transformers pipeline (>= v4.50.0), vLLM, and most common serving frameworks. This simplicity of deployment is a genuine advantage over MoE architectures that require more careful optimization.
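
The 24 GB consumer-GPU claim follows from the same arithmetic as the bfloat16 figure: at 4 bits per weight the checkpoint shrinks to roughly a quarter of its bf16 size. A rough estimate (real GPTQ/AWQ checkpoints keep some tensors in higher precision, and the KV cache is extra):

```python
# Why 4-bit quantization brings Gemma 3 27B into 24 GB consumer-GPU range.
# Rough estimate only; quantization metadata and the KV cache add overhead.

PARAMS = 27.4e9
BITS = 4

weights_gb = PARAMS * BITS / 8 / 1e9
print(f"4-bit weights: {weights_gb:.1f} GB")   # ~13.7 GB
```

That leaves roughly 10 GB of headroom on a 24 GB card for the KV cache and activations at moderate context lengths.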

Pricing and Availability

Gemma 3 27B is distributed under the Gemma Terms of Use, which permit commercial use subject to Google's prohibited-use policy. The model weights are downloadable from HuggingFace after accepting the license agreement.

For managed API access, several providers offer Gemma 3 27B hosting. Google AI Studio provides access, and third-party providers offer it starting around $0.04/M input tokens and $0.15/M output tokens. For comparison, self-hosting on a single A100 instance (roughly $2-3/hour on cloud providers) is often more cost-effective for sustained workloads.
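
Whether self-hosting actually beats the API comes down to sustained throughput. A back-of-the-envelope break-even using the prices quoted above (the 10:1 input:output token mix is an assumed workload, not a measurement):

```python
# Rough break-even between third-party API pricing and self-hosting,
# using the figures quoted above. Workload mix is an assumption.

API_INPUT_PER_M = 0.04     # $/M input tokens
API_OUTPUT_PER_M = 0.15    # $/M output tokens
GPU_HOUR = 2.50            # mid-range single-A100 cloud price, $/hour

# Assume a 10:1 input:output token mix (input-heavy, e.g. document workloads).
blended_per_m = (10 * API_INPUT_PER_M + 1 * API_OUTPUT_PER_M) / 11

breakeven_m_tokens_per_hour = GPU_HOUR / blended_per_m
print(f"self-hosting wins above ~{breakeven_m_tokens_per_hour:.0f}M tokens/hour")
```

The break-even sits at high throughput, which is why the cost case for self-hosting rests on sustained, well-batched utilization - or on data-control requirements rather than price alone.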

The model is also available through Ollama for local deployment: ollama pull gemma3:27b. Quantized versions are available on HuggingFace for lower VRAM requirements.

Strengths

  • Best multimodal open model at the single-GPU size class (DocVQA 85.6%, ChartQA 76.3%)
  • 128K context window with 140+ language support in a 27B dense model
  • Runs on a single A100/H100 in bfloat16 - no multi-node infrastructure needed
  • LMArena Elo of 1338 outperforms Llama 3.1 405B in human preference
  • Clean deployment story - standard Transformers, vLLM, Ollama all supported
  • Permissive license allows commercial deployment

Weaknesses

  • GPQA Diamond (42.4%) and MMLU-Pro (67.5%) lag behind newer models like Qwen 3.5-27B
  • Output limited to 8,192 tokens - a hard cap that constrains long-form generation
  • Vision limited to images - no video or audio processing
  • Released March 2025 - already showing age against mid-to-late 2025 models on text benchmarks
  • Gemma Terms of Use is not a standard OSI-approved open-source license
  • Local-to-global attention ratio may limit very long-context cross-document reasoning

About the author: James, AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.