Google TPU v6e Trillium

TL;DR

  • Google's 6th-generation TPU - codenamed Trillium - delivering 4.7x compute performance improvement over TPU v5e per chip
  • 32GB HBM per chip with ~1,600 GB/s bandwidth, deployable in pods of up to 256 chips via high-bandwidth ICI interconnect
  • Purpose-built for Transformer workloads with hardware-native support for BF16, INT8, and FP8 data types
  • Cloud-only availability through Google Cloud Platform - no on-premises purchase option
  • Powers internal Google services including Gemini model training and YouTube recommendations

Overview

The TPU v6e Trillium is Google's current-generation Tensor Processing Unit, announced at Google Cloud Next 2024 and made generally available on Google Cloud Platform in Q4 2024. Where NVIDIA and AMD sell discrete accelerators that customers rack in their own data centers, Google's TPU is a vertically integrated play - the silicon, interconnect, software stack, and cloud infrastructure are all Google-designed and Google-operated. You cannot buy a Trillium chip. You rent compute time through Google Cloud.

This architectural approach has trade-offs, but the results speak for themselves. Google claims Trillium delivers a 4.7x improvement in peak compute performance per chip compared to TPU v5e, with a 67% improvement in energy efficiency. Those are substantial generational gains. The 32GB of HBM per chip is modest by GPU standards - the AMD MI300X has 192GB - but TPU pods compensate with scale. A 256-chip Trillium pod provides 8TB of aggregate HBM, connected by Google's proprietary Inter-Chip Interconnect (ICI) at bandwidths that exceed what GPU clusters achieve with InfiniBand or NVLink.

The software ecosystem is simultaneously Trillium's biggest advantage and its most significant limitation. If you are in the JAX/XLA world, TPUs are first-class citizens with years of optimization behind them. Google's own Gemini models are trained on TPU pods. But if your stack is built on PyTorch and CUDA, moving to TPUs requires real engineering work. The JAX ecosystem has grown substantially, and PyTorch/XLA exists as a bridge, but the developer experience is not seamless. For organizations already committed to Google Cloud and JAX-based workflows, Trillium is likely the most cost-effective training accelerator available. For everyone else, the migration cost is a real factor.

Key Specifications

| Specification | Details |
| --- | --- |
| Manufacturer | Google |
| Product Family | Cloud TPU |
| Generation | v6e (Trillium) |
| Process Node | Not disclosed |
| Chip Type | TPU (ASIC) |
| Matrix Units (MXUs) | Not disclosed (improved over v5e) |
| Supported Data Types | BF16, FP8, INT8 |
| Memory per Chip | 32GB HBM |
| Memory Bandwidth | ~1,600 GB/s (estimated) |
| Peak Compute (per chip) | 4.7x TPU v5e (Google's claim) |
| Energy Efficiency | 67% improvement over v5e (Google's claim) |
| Interconnect | ICI (Inter-Chip Interconnect) |
| Max Pod Size | 256 chips |
| Aggregate Pod Memory | Up to 8TB HBM |
| TDP | Not disclosed |
| Availability | Google Cloud Platform only |
| Target Workload | Training and Inference |
| Release Date | Q4 2024 (GA) |
| Pricing | Cloud-only (on-demand and reserved instances) |

Performance Benchmarks

| Benchmark / Metric | TPU v6e Trillium | TPU v5e | NVIDIA H100 SXM | AMD MI300X |
| --- | --- | --- | --- | --- |
| Peak Compute (relative) | 4.7x v5e | 1x (baseline) | ~3.5x v5e (est.) | ~3x v5e (est.) |
| Memory per Chip | 32GB | 16GB | 80GB | 192GB |
| Memory Bandwidth (est.) | ~1,600 GB/s | ~800 GB/s | 3,350 GB/s | 5,300 GB/s |
| Max Pod Aggregate Memory | 8TB (256 chips) | 4TB (256 chips) | Cluster-dependent | Cluster-dependent |
| Gemini Training (relative perf) | Baseline | ~4.7x slower | N/A (not used) | N/A (not used) |
| MLPerf Training (selected tasks) | Competitive | - | Competitive | Competitive |
| Energy Efficiency vs v5e | 67% better | Baseline | N/A | N/A |
| Pricing Model | Cloud on-demand | Cloud on-demand | Purchase or cloud | Purchase or cloud |

Direct chip-to-chip comparisons between TPUs and GPUs are inherently misleading because the architectures are fundamentally different. TPUs are optimized for matrix multiplication throughput on specific data types, while GPUs are more general-purpose. Google's performance claims are relative to their own previous generation, and independent MLPerf results provide the most apples-to-apples comparison available.

On MLPerf Training benchmarks, Google's TPU submissions have historically been competitive with NVIDIA's best GPU-based systems, particularly at large scale where TPU pod interconnect bandwidth provides an advantage. The key insight is that single-chip performance matters less than aggregate pod performance for the workloads Google cares about - pre-training runs that use hundreds or thousands of chips simultaneously.

Key Capabilities

Purpose-Built Transformer Acceleration. Unlike GPUs, which must balance general-purpose compute with AI-specific workloads, TPUs are ASICs designed from the ground up for matrix multiplication and attention operations. Trillium's Matrix Processing Units (MXUs) are optimized for the specific operation patterns found in Transformer models - dense matrix multiplies, softmax, layer normalization. This specialization means that for Transformer training and inference specifically, TPUs can achieve higher utilization rates than GPUs, which carry transistor budget for graphics, ray tracing, and other capabilities that AI workloads never use.

Pod-Scale Interconnect. Google's ICI (Inter-Chip Interconnect) is purpose-built for TPU-to-TPU communication within a pod. Unlike InfiniBand networks that connect discrete servers, ICI provides direct chip-to-chip links with bandwidth and latency characteristics that approach what NVLink provides within a single server, but extended across an entire 256-chip pod. This enables data parallelism and model parallelism strategies that would suffer from communication bottlenecks on GPU clusters. For training runs on large models - 100B+ parameters - the interconnect is often the bottleneck, not the compute, and this is where TPU pods have a structural advantage.
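
A back-of-the-envelope sketch of why interconnect dominates at this scale: a ring all-reduce over p chips moves roughly 2(p-1)/p times the gradient size through each chip per step. The bandwidth figure below is an illustrative assumption, not a disclosed ICI spec:

```python
def allreduce_bytes_per_chip(grad_bytes: float, num_chips: int) -> float:
    """Bytes each chip sends in a ring all-reduce: 2*(p-1)/p * message size."""
    p = num_chips
    return 2 * (p - 1) / p * grad_bytes

# Gradients for a 70B-parameter model in BF16 (2 bytes per parameter).
grad_bytes = 70e9 * 2
per_chip = allreduce_bytes_per_chip(grad_bytes, num_chips=256)

# Lower bound on all-reduce time at an assumed per-chip link bandwidth.
assumed_bw = 100e9  # 100 GB/s, illustrative only
print(f"{per_chip / 1e9:.0f} GB per chip, >= {per_chip / assumed_bw:.1f} s per all-reduce")
```

The takeaway: per-step communication volume barely grows with pod size (the 2(p-1)/p factor saturates near 2), so per-chip link bandwidth, not chip count, sets the communication floor.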

Integrated Software and Hardware Co-Design. The JAX/XLA compiler stack is co-designed with TPU hardware. When you write a JAX program and compile it with XLA for TPU targets, the compiler can make hardware-specific optimizations that are impossible with the more generic GPU compilation pipelines. This includes optimal tensor layout, communication scheduling across pod interconnects, and memory management that is aware of the specific HBM configuration. The result is that well-optimized JAX code on TPUs often achieves higher hardware utilization than equivalent PyTorch code on GPUs - not because the hardware is inherently better, but because the software-hardware integration is tighter.

Pricing and Availability

TPU v6e Trillium is available exclusively through Google Cloud Platform. There is no on-premises purchase option. Pricing follows Google Cloud's standard model with on-demand, committed-use, and spot (preemptible) tiers.

| Configuration | On-Demand (per chip-hour) | 1-Year Committed | 3-Year Committed |
| --- | --- | --- | --- |
| TPU v6e (single chip) | ~$3.22 | ~$2.25 (30% discount) | ~$1.61 (50% discount) |
| TPU v6e pod slice (8 chips) | ~$25.76 | ~$18.03 | ~$12.88 |
| TPU v6e pod (256 chips) | ~$824.32 | ~$577.02 | ~$412.16 |

Prices are approximate and vary by region. Spot pricing can offer 60-80% discounts but with preemption risk. For sustained training workloads, committed-use discounts make TPUs highly cost-competitive with equivalent GPU cloud instances.

| Cloud Accelerator | Approx. On-Demand Cost | Memory | Notes |
| --- | --- | --- | --- |
| TPU v6e (per chip-hour) | ~$3.22 | 32GB | Google Cloud only |
| NVIDIA H100 (per GPU-hour) | ~$3.50-$5.00 | 80GB | AWS, Azure, GCP |
| NVIDIA A100 80GB (per GPU-hour) | ~$2.50-$3.50 | 80GB | AWS, Azure, GCP |
| AMD MI300X (per GPU-hour) | ~$2.50-$3.50 | 192GB | Azure, OCI |

The per-chip cost of Trillium is competitive with H100 GPU instances, but direct cost comparisons require normalizing for actual workload throughput. A single TPU v6e chip with 32GB is not equivalent to a single H100 with 80GB. The fair comparison is at the system level - how much does it cost to train a given model to a given quality in a given time on each platform.
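
One way to frame that system-level comparison is dollars per unit of useful compute rather than per chip. The peak-TFLOPS and utilization numbers below are illustrative assumptions, not official figures:

```python
def cost_per_effective_tflops(hourly_cost: float, peak_tflops: float, mfu: float) -> float:
    """Dollars per hour per TFLOP/s of *useful* compute (peak * utilization)."""
    return hourly_cost / (peak_tflops * mfu)

# Illustrative assumptions only: peak TFLOPS and MFU are not official figures.
platforms = {
    "TPU v6e":  dict(hourly=3.22, peak=900, mfu=0.55),
    "H100 SXM": dict(hourly=4.25, peak=990, mfu=0.40),
}
for name, p in platforms.items():
    rate = cost_per_effective_tflops(p["hourly"], p["peak"], p["mfu"])
    print(f"{name}: ${rate:.4f} per effective TFLOP/s-hour")
```

Under these assumptions the cheaper sticker price is not what decides the ranking; the utilization term moves the result as much as the hourly rate does.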

Architecture Deep Dive

Understanding Trillium's architecture requires looking at it differently from GPU-based accelerators. TPUs are not general-purpose processors - they are matrix multiplication engines with everything else stripped away or minimized.

Matrix Processing Units (MXUs). The core compute element in every TPU is the Matrix Processing Unit - a systolic array that executes dense matrix multiplications at high throughput. Trillium's MXUs operate on BF16 and FP8 inputs, producing BF16 or FP32 outputs. Google has not disclosed the exact MXU count or dimensions for Trillium, but the 4.7x compute improvement over v5e implies either more MXUs per chip, wider systolic arrays, or higher clock frequencies - likely a combination of all three.

| TPU Component | TPU v5e | TPU v6e Trillium | Notes |
| --- | --- | --- | --- |
| Matrix Processing Units | Present | Improved (4.7x compute) | Systolic array architecture |
| Vector Processing Units | Present | Improved | Handles non-matmul operations |
| HBM per Chip | 16GB | 32GB | 2x capacity increase |
| HBM Bandwidth | ~800 GB/s | ~1,600 GB/s | 2x bandwidth increase |
| Data Types | BF16, INT8 | BF16, FP8, INT8 | FP8 added in v6e |
| ICI Links | Present | Improved | Higher bandwidth, lower latency |
| Scalar Core | Present | Improved | Control flow, addressing |

Systolic Array Execution Model. TPU MXUs use a systolic array architecture where data flows through a 2D grid of multiply-accumulate units. Input matrices enter from the edges, and results accumulate as data passes through. This design is extremely efficient for the dense matrix multiplications that dominate Transformer workloads - the ratio of compute operations to memory accesses is very high, which is why TPUs can achieve utilization rates above 50% on Transformer training while GPUs typically achieve 30-45%. The trade-off is that operations that do not map well to systolic arrays (irregular memory access patterns, sparse operations, branching logic) run on the scalar and vector units, which are much slower.
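
The compute-to-memory ratio described above is arithmetic intensity, which can be checked directly. A quick sketch showing why large dense matmuls suit a systolic array while skinny ones go bandwidth-bound:

```python
def matmul_arithmetic_intensity(m: int, n: int, k: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte moved for a dense (m,k) x (k,n) matmul in BF16."""
    flops = 2 * m * n * k                                # multiply-accumulate = 2 FLOPs
    bytes_moved = (m * k + k * n + m * n) * bytes_per_elem
    return flops / bytes_moved

# Large Transformer-style matmul: thousands of FLOPs per byte -> compute-bound.
print(matmul_arithmetic_intensity(8192, 8192, 8192))
# Skinny matmul (batch of 1): about 1 FLOP per byte -> bandwidth-bound.
print(matmul_arithmetic_intensity(1, 8192, 8192))
```

This is why the same chip can sit near peak utilization during pre-training yet be memory-bound during small-batch decoding.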

ICI Interconnect Architecture. Google's Inter-Chip Interconnect is the architectural feature that most differentiates TPUs from GPU clusters. ICI provides direct chip-to-chip links in a 2D or 3D torus topology, with bandwidth and latency characteristics comparable to on-board interconnects like NVLink, but extending across an entire pod of up to 256 chips.

| Interconnect Property | TPU v6e ICI | NVIDIA NVLink 4.0 (H100) | InfiniBand NDR |
| --- | --- | --- | --- |
| Topology | 2D/3D torus | NVSwitch crossbar | Fat tree |
| Scope | Up to 256 chips | 8 GPUs (single node) | Cluster-wide |
| Latency (chip-to-chip) | ~microseconds | ~microseconds (intra-node) | ~5-10 microseconds (inter-node) |
| Aggregate Bandwidth (per chip) | Not disclosed (high) | 900 GB/s | 400 Gb/s |
| Communication Model | All-reduce, all-gather native | NCCL library | NCCL + MPI |

The ICI advantage is most pronounced for workloads that require frequent all-reduce operations across many chips - exactly the communication pattern that dominates distributed Transformer training. On a GPU cluster, inter-node all-reduce goes over InfiniBand with higher latency and lower bandwidth than intra-node NVLink. On a TPU pod, every chip communicates at ICI speed regardless of position in the pod. This eliminates the "network cliff" that GPU clusters experience when scaling beyond a single node.

Memory System Design. Trillium's 32GB of HBM per chip is modest by GPU standards, but the memory architecture is optimized for the access patterns of Transformer workloads. TPU memory controllers are designed for large, contiguous reads and writes rather than the random access patterns that GPUs must handle for general-purpose computing. This specialization allows higher effective memory bandwidth utilization - Google has stated that Trillium achieves above 80% memory bandwidth utilization on Transformer training, compared to the 60-70% typical for GPUs on similar workloads.

Multi-Chip Configuration and Slice Topology. Trillium chips are deployed in "slices" - groups of chips connected by ICI. Common configurations:

| Configuration | Chips | Aggregate Memory | Aggregate Compute | Typical Use Case |
| --- | --- | --- | --- | --- |
| v6e-1 | 1 | 32GB | 1x | Small model inference, fine-tuning |
| v6e-4 | 4 | 128GB | 4x | Mid-size model training, inference |
| v6e-8 | 8 | 256GB | 8x | 70B model training/inference |
| v6e-16 | 16 | 512GB | 16x | 100B+ model training |
| v6e-64 | 64 | 2TB | 64x | Large-scale pre-training |
| v6e-256 | 256 | 8TB | 256x | Frontier model pre-training |

The ICI interconnect within a slice provides much higher bandwidth than inter-slice communication. For workloads that fit within a single slice, the performance characteristics are near-ideal. Cross-slice communication, when needed, goes through Google's data center network and introduces additional latency. Careful workload partitioning to minimize cross-slice communication is a key optimization for TPU users.

XLA Compilation Pipeline. The JAX-to-XLA-to-TPU compilation pipeline is worth understanding because it differs fundamentally from how GPU code executes. When a JAX function is first called with a specific input shape, XLA compiles the entire computation graph into a TPU-native binary. This compilation can take seconds to minutes for complex models, but the resulting binary is highly optimized - all memory allocations are pre-planned, all tensor layouts are hardware-optimal, and all communication operations are scheduled to overlap with compute. Subsequent calls with the same input shapes execute the cached binary with zero compilation overhead. The trade-off is that dynamic input shapes (varying sequence lengths, changing batch sizes) trigger recompilation, which is why TPU workloads perform best with fixed or bucketed input shapes.
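
The bucketing pattern this implies can be sketched in plain Python (the bucket sizes and helper names here are illustrative, not a Google API):

```python
def bucket_length(seq_len: int, buckets=(128, 256, 512, 1024, 2048)) -> int:
    """Round a sequence length up to a fixed bucket so compiled shapes repeat."""
    for b in buckets:
        if seq_len <= b:
            return b
    raise ValueError(f"sequence length {seq_len} exceeds largest bucket")

def pad_to_bucket(tokens: list, pad_id: int = 0) -> list:
    """Pad a token list to its bucket; each bucket then compiles exactly once."""
    target = bucket_length(len(tokens))
    return tokens + [pad_id] * (target - len(tokens))

print(len(pad_to_bucket(list(range(300)))))  # padded up to the 512 bucket
```

With a handful of buckets, XLA compiles a handful of binaries up front instead of recompiling on every novel sequence length, at the cost of some padding waste.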

Real-World Performance Analysis

JAX/XLA Training Performance. The most meaningful Trillium benchmarks come from JAX-based training workloads, which is the ecosystem Google has optimized.

| Workload | TPU v6e Trillium (256 chips) | Equivalent GPU Cluster | Notes |
| --- | --- | --- | --- |
| Gemini-class training | Baseline (Google internal) | N/A | Google does not train Gemini on GPUs |
| PaLM-style 540B training | ~4.7x v5e throughput | Competitive with 256x H100 | Google MLPerf submission |
| ViT-Giant fine-tuning | 4.5x v5e per chip | ~3x A100 per chip | Vision Transformer, per-chip comparison |
| T5-XXL (11B) fine-tuning | 4.2x v5e per chip | ~2.5x A100 per chip | Encoder-decoder model |
| Stable Diffusion training | Limited optimization | Better on GPUs | Diffusion models less optimized for TPU |

The 4.7x improvement over v5e is a peak compute figure; real workload gains depend on where the bottleneck sits. Compute-bound workloads approach the 4.7x peak, memory-bandwidth-bound workloads see closer to the 2x that matches the bandwidth doubling, and the fine-tuning results above land in between at roughly 4.2-4.5x per chip.

Inference Serving Performance. Google uses Trillium for production inference serving of Gemini models. While Google does not publish per-chip inference throughput numbers in the way that GPU benchmarks are reported, Google Cloud customers report the following approximate performance on Trillium instances:

| Model | Configuration | Approximate Throughput | Cost (On-Demand) |
| --- | --- | --- | --- |
| Llama 3.1 70B (BF16) | 8-chip pod slice | ~180-220 tok/s (aggregate) | ~$25.76/hr |
| Llama 3.1 8B (BF16) | 1 chip | ~60-80 tok/s | ~$3.22/hr |
| Gemma 2 27B (BF16) | 2 chips | ~90-120 tok/s | ~$6.44/hr |
| Mixtral 8x7B (BF16) | 4 chips | ~100-140 tok/s | ~$12.88/hr |

These numbers illustrate the memory constraint. Llama 70B in BF16 requires 140GB, which means at least 5 Trillium chips (5 x 32GB = 160GB). The 8-chip configuration provides headroom for KV cache. Compare this to a single MI300X at 192GB serving the same model on one accelerator - the TPU requires more silicon but the pod interconnect keeps cross-chip overhead low.
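
The chip-count arithmetic generalizes to any model. A small sketch, with the 25% KV-cache headroom as an assumed rule of thumb:

```python
import math

def min_chips(params_b: float, bytes_per_param: int = 2,
              hbm_per_chip_gb: int = 32, kv_headroom: float = 0.25) -> int:
    """Chips needed to hold weights plus KV-cache/activation headroom."""
    weights_gb = params_b * bytes_per_param   # e.g. 70B params * 2 bytes = 140 GB
    total_gb = weights_gb * (1 + kv_headroom)
    return math.ceil(total_gb / hbm_per_chip_gb)

print(min_chips(70))   # Llama 70B in BF16, with headroom
print(min_chips(8))    # Llama 8B fits on one chip
```

With zero headroom the 70B model needs 5 chips (the bare 140GB / 32GB figure from the text); the headroom pushes practical deployments toward the 8-chip slice.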

PyTorch/XLA Performance. For teams using PyTorch rather than JAX, the PyTorch/XLA bridge allows running PyTorch workloads on TPUs. Performance through PyTorch/XLA is typically 70-85% of equivalent JAX performance due to the translation overhead and less optimal tensor layouts. This gap has narrowed with each PyTorch/XLA release, but teams migrating from CUDA should expect some performance discount and debugging overhead. Common pain points include dynamic shapes (TPU XLA compilation prefers static shapes), custom CUDA kernels (which must be rewritten in XLA), and batch normalization edge cases.

MLPerf Results. In recent MLPerf Training rounds, Google's Trillium submissions demonstrated competitive or leading performance on several benchmarks. The most relevant comparisons:

| MLPerf Benchmark | Google TPU v6e Result | NVIDIA H100 (comparable scale) | Winner |
| --- | --- | --- | --- |
| BERT Large | Top-tier submission | Top-tier submission | Platform-dependent |
| GPT-3 175B | Competitive | Competitive | Close |
| Stable Diffusion | Slower | Faster | NVIDIA |
| ResNet-50 | Competitive | Competitive | Close |

The pattern is consistent: on Transformer-native workloads, Trillium is competitive or leading. On workloads with irregular compute patterns (diffusion models, some vision architectures), GPUs have an architectural advantage.

Cost Efficiency Analysis. When comparing cloud accelerators, cost-per-useful-computation is the metric that matters. Here is an approximate cost comparison for training a 70B parameter model for 1 trillion tokens:

| Platform | Configuration | Estimated Training Time | Estimated Cost | Cost per 1T Tokens |
| --- | --- | --- | --- | --- |
| TPU v6e (256 chips, committed) | Full pod | ~18-22 days | ~$220,000-$320,000 | Competitive |
| H100 (256 GPUs, cloud) | 32x 8-GPU nodes | ~15-20 days | ~$350,000-$500,000 | Higher |
| MI300X (256 GPUs, cloud) | 32x 8-GPU nodes | ~18-24 days | ~$250,000-$380,000 | Moderate |

The cost advantage of Trillium is most pronounced with committed-use pricing (30-50% discount) and for workloads that achieve high MXU utilization. Spot instances can reduce costs further but introduce preemption risk that requires robust checkpointing.
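
These estimates can be roughly reproduced with the standard ~6ND FLOPs-per-token approximation for dense Transformers. The per-chip peak (assumed FP8 here) and the MFU are assumptions, since Google does not disclose Trillium's peak figure:

```python
def training_cost(params: float, tokens: float, n_chips: int,
                  peak_flops: float, mfu: float, price_per_chip_hour: float):
    """Estimate wall-clock days and dollars for a dense-Transformer run.

    Uses the standard ~6*N*D FLOPs approximation; peak_flops and mfu are
    assumptions, not disclosed Trillium specifications.
    """
    total_flops = 6 * params * tokens
    seconds = total_flops / (n_chips * peak_flops * mfu)
    hours = seconds / 3600
    return hours / 24, hours * n_chips * price_per_chip_hour

days, cost = training_cost(params=70e9, tokens=1e12, n_chips=256,
                           peak_flops=1800e12, mfu=0.5, price_per_chip_hour=2.25)
print(f"~{days:.0f} days, ~${cost:,.0f}")
```

With these assumed inputs the estimate lands around 21 days and roughly $290k, inside the table's committed-use range; halving MFU doubles both numbers, which is why utilization dominates the economics.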

Hardware Utilization Rates. One of Trillium's underappreciated advantages is hardware utilization. Because XLA compiles entire computation graphs with knowledge of the TPU's memory layout and systolic array dimensions, Trillium typically achieves higher MFU (Model FLOPs Utilization) than GPUs on equivalent workloads:

| Platform | Typical MFU (Transformer Training) | Peak MFU (Optimized) | Notes |
| --- | --- | --- | --- |
| TPU v6e Trillium (JAX) | 50-60% | 65-70% | XLA hardware-aware compilation |
| NVIDIA H100 (PyTorch) | 35-45% | 50-55% | Manual optimization needed |
| NVIDIA H100 (Megatron-LM) | 45-55% | 55-60% | Highly optimized framework |
| AMD MI300X (PyTorch) | 30-40% | 45-50% | ROCm optimization still maturing |

Higher MFU means more useful computation per dollar spent on hardware. A Trillium chip running at 55% MFU delivers more effective compute than an H100 running at 40% MFU, even if the H100 has higher peak TFLOPS. This utilization advantage is one of the key reasons that Google's internal training costs on TPU pods are competitive despite the per-chip specs appearing modest.
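
MFU itself is straightforward to compute from observed throughput. The throughput and per-chip peak below are illustrative assumptions:

```python
def model_flops_utilization(params: float, tokens_per_sec: float,
                            n_chips: int, peak_flops_per_chip: float) -> float:
    """Observed useful FLOP/s (~6*N per token) divided by hardware peak."""
    achieved = 6 * params * tokens_per_sec
    return achieved / (n_chips * peak_flops_per_chip)

# Illustrative: 256 chips at an assumed 900 TFLOPS BF16 peak each,
# observing 300k tokens/s aggregate on a 70B model.
mfu = model_flops_utilization(70e9, 3.0e5, 256, 900e12)
print(f"MFU = {mfu:.1%}")
```

Tracking this single number over a training run is the quickest way to catch input-pipeline stalls or communication bottlenecks before they burn budget.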

Checkpoint and Recovery. TPU training requires robust checkpoint/recovery mechanisms because spot instances can be preempted and hardware failures in 256-chip pods are statistically likely. Google's Orbax checkpointing library is designed specifically for TPU workloads and provides:

  • Asynchronous checkpointing that does not block training
  • Distributed checkpoint writing across multiple storage backends
  • Automatic recovery from chip failures with minimal lost computation
  • Integration with Google Cloud Storage for durable checkpoint storage

For production training runs, implementing proper checkpointing is not optional - it is a requirement. Teams that do not invest in robust checkpoint infrastructure will lose more compute time to failures and preemption than they save from spot pricing discounts.
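
How often to checkpoint can be estimated with Young's classic approximation, interval ≈ sqrt(2 × checkpoint cost × MTBF). A sketch under assumed failure and checkpoint costs:

```python
import math

def optimal_checkpoint_interval(checkpoint_secs: float, mtbf_secs: float) -> float:
    """Young's approximation: interval ~= sqrt(2 * checkpoint cost * MTBF)."""
    return math.sqrt(2 * checkpoint_secs * mtbf_secs)

# Assumptions: a checkpoint costs ~60s of wall clock (async writes hide most
# of it), and the whole pod fails or is preempted about once every 12 hours.
interval = optimal_checkpoint_interval(60, 12 * 3600)
print(f"checkpoint every ~{interval / 60:.0f} minutes")
```

Under these assumptions the optimum is roughly every 38 minutes; cheaper (async) checkpoints or flakier hardware both push the interval shorter.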

Generational and Competitive Context

JAX/XLA Migration Guide. For teams considering a move from PyTorch/CUDA to JAX/XLA for Trillium, the migration effort depends on the complexity of your codebase:

| Codebase Complexity | Migration Effort | Timeline | Notes |
| --- | --- | --- | --- |
| Standard Transformer training | Moderate | 2-4 weeks | JAX equivalents exist for most PyTorch patterns |
| Custom attention mechanisms | Significant | 4-8 weeks | Must rewrite as XLA-compatible operations |
| Custom CUDA kernels | Major | 8-16 weeks | No CUDA on TPU - must use Pallas or XLA custom calls |
| Multi-modal pipelines | Significant | 6-12 weeks | Vision + language combinations need careful porting |
| Simple fine-tuning | Low | 1-2 weeks | Libraries like MaxText simplify this |

Google provides several resources to ease migration: the MaxText reference implementation (a JAX-based LLM training codebase), the Pallas kernel language for writing custom TPU operations, and the JAX-Toolbox for common distributed training patterns. The open-source ecosystem around JAX has also grown significantly, with libraries like Flax, Orbax, and Grain providing PyTorch-equivalent functionality.

vs. TPU v5e. The v5e to v6e transition is the largest generational improvement in TPU history. The 4.7x compute, 2x memory, and 67% efficiency gains are all significant. For existing TPU v5e customers, migrating to Trillium is straightforward - JAX code requires no changes, and XLA recompilation handles the hardware differences. The economic case for migration is strong: a Trillium chip at ~$3.22/hr delivers 4.7x the compute of a v5e chip at a roughly comparable per-chip price.

vs. TPU v7 Ironwood. Google's announced successor generation, TPU v7 (Ironwood), is purpose-built for inference, while Trillium handles both training and inference. For customers who need both training and inference capacity, Trillium is the current choice. Once Ironwood launches, the optimal strategy for Google Cloud customers will likely be Trillium for training and Ironwood for inference, each running on silicon optimized for its workload. Organizations that only need inference should evaluate whether waiting for Ironwood makes sense based on their deployment timeline.

vs. NVIDIA H100 (Cloud Instances). The H100 and Trillium compete directly on Google Cloud, where both are available. For JAX-native workloads, Trillium offers better price-performance because the software-hardware co-design eliminates the overhead inherent in running on general-purpose GPU silicon. For PyTorch workloads, H100 instances on Google Cloud are typically more productive because the CUDA ecosystem is more mature. The decision comes down to software stack: JAX teams should default to Trillium, PyTorch teams should default to H100.

vs. AMD MI300X. The MI300X is purchasable hardware with 192GB memory on a single accelerator - a fundamentally different proposition from cloud-only TPUs. Organizations that need on-premises AI compute are not in the TPU market. For cloud-only customers, the MI300X on Azure or OCI competes with Trillium on GCP. The MI300X's memory advantage (192GB per card vs 32GB per TPU chip) matters for single-accelerator inference simplicity, but Trillium's pod interconnect makes multi-chip configurations nearly seamless.

vs. Huawei Ascend 910B/910C. The Ascend 910B and 910C serve the Chinese domestic market exclusively and are not available on any cloud platform accessible to Trillium's target customers. The comparison is relevant only for understanding the global AI accelerator landscape. Architecturally, the Ascend Da Vinci cores and TPU MXUs share the systolic-array design philosophy - both are ASICs optimized for matrix multiplication. The TPU's advantage is the ICI interconnect at pod scale, while the Ascend's advantage is purchasable hardware with larger per-chip memory (64-96GB vs 32GB). CANN and JAX/XLA are both smaller ecosystems compared to CUDA, but JAX/XLA benefits from Google's direct investment and a larger open-source community.

Google Cloud Ecosystem Integration. One advantage that is hard to quantify but real in practice is Trillium's integration with the broader Google Cloud AI platform. Vertex AI pipelines, Cloud Storage for training data, BigQuery for dataset preparation, and Kubernetes-based orchestration (GKE) all work natively with TPU resources. For organizations already on Google Cloud, the operational overhead of adopting Trillium is substantially lower than setting up GPU clusters from scratch on another provider.

Key JAX Libraries for TPU Workloads. The JAX ecosystem has matured significantly and offers robust alternatives to PyTorch libraries:

| PyTorch Equivalent | JAX Library | Maturity | Notes |
| --- | --- | --- | --- |
| torch.nn / PyTorch Lightning | Flax / NNX | Production | Google-maintained, widely used |
| Hugging Face Transformers | MaxText / T5X | Production | Google reference implementations |
| PyTorch DataLoader | Grain / tf.data | Production | TPU-optimized data pipelines |
| torch.save / torch.load | Orbax | Production | Async distributed checkpointing |
| DeepSpeed / FSDP | jax.sharding / pjit | Production | Native JAX parallelism primitives |
| torch.compile | jax.jit | Production | Core JAX feature |
| Custom CUDA Kernels | Pallas | Maturing | TPU kernel language, still evolving |

For teams evaluating the JAX ecosystem, the key insight is that the core training infrastructure is production-ready. The gaps are primarily in niche operators, specialized model architectures not widely used on TPU, and third-party integrations that assume CUDA.

Reliability and Uptime. For production training runs that span days or weeks, hardware reliability matters. Google does not publish TPU failure rates, but anecdotal reports from large-scale TPU users put individual chip failures on the order of once per 25,000-30,000 chip-hours. A 256-chip pod running for 20 days accumulates roughly 123,000 chip-hours, which works out to approximately 4-5 expected chip failures during the training run. Google's TPU VM orchestration layer handles chip failures automatically when enabled, redistributing work to healthy chips. However, this requires the training framework to support elastic training or frequent checkpointing. JAX-based training with Orbax checkpointing handles this well; custom training loops may need additional engineering for fault tolerance.
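
The expected-failure arithmetic is simple enough to sanity-check, with the per-chip MTBF as an assumed figure:

```python
def expected_failures(n_chips: int, days: float, mtbf_chip_hours: float) -> float:
    """Expected chip failures = accumulated chip-hours / per-chip MTBF."""
    chip_hours = n_chips * days * 24
    return chip_hours / mtbf_chip_hours

# 256-chip pod, 20-day run, assumed ~27,500 chip-hours between failures.
print(f"{expected_failures(256, 20, 27_500):.1f} expected failures")
```

Doubling either the pod size or the run length doubles the expected failure count, which is why fault tolerance stops being optional at frontier scale.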

Regional Availability. Trillium availability varies by Google Cloud region:

| Region | Trillium Availability | Notes |
| --- | --- | --- |
| us-central1 | Generally available | Largest TPU capacity |
| us-east1 | Generally available | Good capacity |
| europe-west4 | Generally available | EU data residency |
| asia-northeast1 | Limited | Smaller allocation |

For organizations with data residency requirements, the availability of Trillium in European regions is important. Training on EU-resident data requires compute in EU regions, and Trillium availability in europe-west4 makes this possible without data transfer out of jurisdiction.

Cloud Pricing Tiers in Detail. The cost optimization strategy for Trillium depends heavily on workload predictability:

| Pricing Tier | Cost Multiplier | Best For | Risk |
| --- | --- | --- | --- |
| On-Demand | 1.0x (baseline) | Experimentation, unpredictable workloads | None - always available |
| 1-Year Committed | 0.7x | Sustained training, known capacity needs | Minimum spend commitment |
| 3-Year Committed | 0.5x | Production inference fleets, long-term training | Long commitment, technology may advance |
| Spot/Preemptible | 0.2-0.4x | Fault-tolerant training with checkpointing | Preemption, no availability guarantee |

For training workloads that can handle preemption (with frequent checkpointing), spot Trillium instances at 60-80% discount represent some of the cheapest AI compute available anywhere. The key requirement is that your training framework must handle checkpoint/resume gracefully - JAX-based training loops on TPU typically support this natively.
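
Whether spot pricing actually wins depends on how much redone work preemptions cost. A sketch with assumed preemption rates and checkpoint cadence:

```python
def spot_effective_cost(on_demand: float, spot_discount: float,
                        checkpoint_interval_min: float,
                        preemptions_per_day: float) -> float:
    """Effective $/chip-hour on spot, charging for work lost to preemption.

    On average, half a checkpoint interval of work is redone per preemption.
    """
    spot_price = on_demand * (1 - spot_discount)
    lost_hours_per_day = preemptions_per_day * (checkpoint_interval_min / 60) / 2
    overhead = 24 / (24 - lost_hours_per_day)   # inflate price by redone work
    return spot_price * overhead

# Assumptions: 70% discount, 30-minute checkpoint cadence, 2 preemptions/day.
print(f"${spot_effective_cost(3.22, 0.70, 30, 2):.2f}/chip-hour effective")
```

Even at two preemptions per day, the redone-work overhead stays near 2% here; the spot discount survives almost intact as long as checkpointing is frequent.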

Use Case Recommendations

Strong Fit:

  • Google Cloud-native organizations training large Transformer models. If you are already on GCP and using JAX, Trillium is almost certainly your most cost-effective training option. The software-hardware co-design and committed-use discounts make the economics compelling.
  • Large-scale pre-training runs (100B+ parameters). Trillium's ICI interconnect at 256-chip pod scale provides communication efficiency that GPU clusters struggle to match. If your training run needs 100+ accelerators, the pod architecture starts to show measurable advantages over InfiniBand-connected GPU clusters.
  • Teams willing to commit to JAX/XLA. The investment in learning JAX pays dividends in hardware utilization. If your team is building new training infrastructure and is open to framework choice, JAX + Trillium is a strong combination.
  • Budget-sensitive research with checkpointing. Spot Trillium instances at $0.60-$1.30/chip-hour (60-80% off on-demand) provide research-grade compute at prices that undercut nearly everything else in the market.
  • Fine-tuning and distillation workloads. Trillium's 32GB per chip is sufficient for fine-tuning models up to ~15B parameters on a single chip, and pod slices handle larger models efficiently. The cost-per-experiment is low enough to enable rapid iteration.
  • Organizations training Gemma, PaLM, or T5-based models. These Google-origin architectures have the most mature JAX implementations and run at near-peak efficiency on Trillium. The software-hardware co-optimization for these specific model families is unmatched.

Weak Fit:

  • PyTorch-first teams with existing CUDA codebases. The migration cost from CUDA to TPU is measured in weeks or months of engineering time. Unless the workload is large enough to justify that investment, staying on GPU is more productive.
  • Inference workloads with large single-model memory requirements. Serving a 70B model at BF16 requires 5+ Trillium chips. A single MI300X handles this on one accelerator. For inference simplicity, GPUs with high memory capacity are often easier to operationalize.
  • Workloads with heavy custom kernel requirements. If your model relies on custom CUDA kernels for specialized operations, porting to XLA is non-trivial. TPU XLA compilation requires static shapes and does not support arbitrary pointer arithmetic.
  • Organizations with multi-cloud or on-premises requirements. Trillium is GCP-only. If your infrastructure strategy requires provider flexibility or on-premises deployment, TPUs are not an option.
  • Small-batch, latency-sensitive inference. For serving single requests with minimal latency, GPUs with large per-device memory (like the MI300X at 192GB) can hold models on one device without cross-chip communication. Trillium's 32GB per chip means even moderate models require multi-chip serving with added latency.
  • Teams that need fine-grained hardware control. TPU programming through XLA is declarative - you describe the computation and let the compiler decide execution. This works well for standard patterns but limits control over memory layout, scheduling, and kernel execution order. Teams that need low-level hardware control should prefer GPUs.

Strengths

  • 4.7x compute performance improvement over TPU v5e - the largest generational jump in TPU history
  • Pod-scale ICI interconnect enables efficient scaling to 256 chips with minimal communication overhead
  • JAX/XLA software stack is co-designed with hardware for optimal utilization
  • 67% energy efficiency improvement reduces operational costs and environmental impact
  • Competitive cloud pricing, especially with committed-use discounts
  • Powers Google's own Gemini training - proven at the most demanding scale
  • No capital expenditure - cloud-only model eliminates hardware procurement and maintenance

Weaknesses

  • 32GB HBM per chip is modest - large models require multi-chip configurations that GPUs handle on a single card
  • Cloud-only availability creates vendor lock-in with Google Cloud Platform
  • JAX/XLA ecosystem is smaller than PyTorch/CUDA - migration cost is real for existing GPU workflows
  • PyTorch/XLA bridge exists but is not seamless - expect compatibility gaps and performance overhead
  • Google does not disclose detailed hardware specifications, making independent evaluation difficult
  • Spot instance preemption can interrupt long training runs if not architected for checkpointing
  • No path to on-premises deployment for organizations with data sovereignty or compliance requirements

About the author (AI Benchmarks & Tools Analyst)

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.