Google TPU v6e Trillium

TL;DR

  • Google's 6th-generation TPU - codenamed Trillium - delivering 4.7x compute performance improvement over TPU v5e per chip
  • 32GB HBM per chip with ~1,600 GB/s bandwidth, deployable in pods of up to 256 chips via high-bandwidth ICI interconnect
  • Purpose-built for Transformer workloads with hardware-native support for BF16, INT8, and FP8 data types
  • Cloud-only availability through Google Cloud Platform - no on-premises purchase option
  • Powers internal Google services including Gemini model training and YouTube recommendations

Overview

The TPU v6e Trillium is Google's current-generation Tensor Processing Unit, announced at Google Cloud Next 2024 and made generally available on Google Cloud Platform in Q4 2024. Where NVIDIA and AMD sell discrete accelerators that customers rack in their own data centers, Google's TPU is a vertically integrated play - the silicon, interconnect, software stack, and cloud infrastructure are all Google-designed and Google-operated. You cannot buy a Trillium chip. You rent compute time through Google Cloud.

This architectural approach has trade-offs, but the results speak for themselves. Google claims Trillium delivers a 4.7x improvement in peak compute performance per chip compared to TPU v5e, with a 67% improvement in energy efficiency. Those are substantial generational gains. The 32GB of HBM per chip is modest by GPU standards - the AMD MI300X has 192GB - but TPU pods compensate with scale. A 256-chip Trillium pod provides 8TB of aggregate HBM, connected by Google's proprietary Inter-Chip Interconnect (ICI) at bandwidths that exceed what GPU clusters achieve with InfiniBand or NVLink.

The software ecosystem is simultaneously Trillium's biggest advantage and its most significant limitation. If you are in the JAX/XLA world, TPUs are first-class citizens with years of optimization behind them. Google's own Gemini models are trained on TPU pods. But if your stack is built on PyTorch and CUDA, moving to TPUs requires real engineering work. The JAX ecosystem has grown substantially, and PyTorch/XLA exists as a bridge, but the developer experience is not seamless. For organizations already committed to Google Cloud and JAX-based workflows, Trillium is likely the most cost-effective training accelerator available. For everyone else, the migration cost is a real factor.

Key Specifications

| Specification | Details |
| --- | --- |
| Manufacturer | Google |
| Product Family | Cloud TPU |
| Generation | v6e (Trillium) |
| Process Node | Not disclosed |
| Chip Type | TPU (ASIC) |
| Matrix Units (MXUs) | Not disclosed (improved over v5e) |
| Supported Data Types | BF16, FP8, INT8 |
| Memory per Chip | 32GB HBM |
| Memory Bandwidth | ~1,600 GB/s (estimated) |
| Peak Compute (per chip) | 4.7x TPU v5e (Google's claim) |
| Energy Efficiency | 67% improvement over v5e (Google's claim) |
| Interconnect | ICI (Inter-Chip Interconnect) |
| Max Pod Size | 256 chips |
| Aggregate Pod Memory | Up to 8TB HBM |
| TDP | Not disclosed |
| Availability | Google Cloud Platform only |
| Target Workload | Training and Inference |
| Release Date | Q4 2024 (GA) |
| Pricing | Cloud-only (on-demand and reserved instances) |

Performance Benchmarks

| Benchmark / Metric | TPU v6e Trillium | TPU v5e | NVIDIA H100 SXM | AMD MI300X |
| --- | --- | --- | --- | --- |
| Peak Compute (relative) | 4.7x v5e | 1x (baseline) | ~3.5x v5e (est.) | ~3x v5e (est.) |
| Memory per Chip | 32GB | 16GB | 80GB | 192GB |
| Memory Bandwidth (est.) | ~1,600 GB/s | ~800 GB/s | 3,350 GB/s | 5,300 GB/s |
| Max Pod Aggregate Memory | 8TB (256 chips) | 4TB (256 chips) | Cluster-dependent | Cluster-dependent |
| Gemini Training (relative perf) | Baseline | ~4.7x slower | N/A (not used) | N/A (not used) |
| MLPerf Training (selected tasks) | Competitive | - | Competitive | Competitive |
| Energy Efficiency vs v5e | 67% better | Baseline | N/A | N/A |
| Pricing Model | Cloud on-demand | Cloud on-demand | Purchase or cloud | Purchase or cloud |

Direct chip-to-chip comparisons between TPUs and GPUs are inherently misleading because the architectures are fundamentally different. TPUs are optimized for matrix multiplication throughput on specific data types, while GPUs are more general-purpose. Google's performance claims are relative to their own previous generation, and independent MLPerf results provide the most apples-to-apples comparison available.

On MLPerf Training benchmarks, Google's TPU submissions have historically been competitive with NVIDIA's best GPU-based systems, particularly at large scale where TPU pod interconnect bandwidth provides an advantage. The key insight is that single-chip performance matters less than aggregate pod performance for the workloads Google cares about - pre-training runs that use hundreds or thousands of chips simultaneously.

Key Capabilities

Purpose-Built Transformer Acceleration. Unlike GPUs, which must balance general-purpose compute with AI-specific workloads, TPUs are ASICs designed from the ground up for matrix multiplication and attention operations. Trillium's Matrix Processing Units (MXUs) are optimized for the specific operation patterns found in Transformer models - dense matrix multiplies, softmax, layer normalization. This specialization means that for Transformer training and inference specifically, TPUs can achieve higher utilization rates than GPUs, which carry transistor budget for graphics, ray tracing, and other capabilities that AI workloads never use.

Pod-Scale Interconnect. Google's ICI (Inter-Chip Interconnect) is purpose-built for TPU-to-TPU communication within a pod. Unlike InfiniBand networks that connect discrete servers, ICI provides direct chip-to-chip links with bandwidth and latency characteristics that approach what NVLink provides within a single server, but extended across an entire 256-chip pod. This enables data parallelism and model parallelism strategies that would suffer from communication bottlenecks on GPU clusters. For training runs on large models - 100B+ parameters - the interconnect is often the bottleneck, not the compute, and this is where TPU pods have a structural advantage.
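
A back-of-the-envelope sketch of why interconnect dominates at this scale: a ring all-reduce over p chips moves roughly 2(p-1)/p times the gradient size through each chip per step. The bandwidth figure below is an illustrative assumption, not a disclosed ICI spec:

```python
def allreduce_bytes_per_chip(grad_bytes: float, num_chips: int) -> float:
    """Bytes each chip sends in a ring all-reduce: 2*(p-1)/p * message size."""
    p = num_chips
    return 2 * (p - 1) / p * grad_bytes

# Gradients for a 70B-parameter model in BF16 (2 bytes per parameter).
grad_bytes = 70e9 * 2
per_chip = allreduce_bytes_per_chip(grad_bytes, num_chips=256)

# Lower bound on all-reduce time at an assumed per-chip link bandwidth.
assumed_bw = 100e9  # 100 GB/s, illustrative only
print(f"{per_chip / 1e9:.0f} GB per chip, >= {per_chip / assumed_bw:.1f} s per all-reduce")
```

The takeaway: per-step communication volume barely grows with pod size (the 2(p-1)/p factor saturates near 2), so per-chip link bandwidth, not chip count, sets the communication floor.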

Integrated Software and Hardware Co-Design. The JAX/XLA compiler stack is co-designed with TPU hardware. When you write a JAX program and compile it with XLA for TPU targets, the compiler can make hardware-specific optimizations that are impossible with the more generic GPU compilation pipelines. This includes optimal tensor layout, communication scheduling across pod interconnects, and memory management that is aware of the specific HBM configuration. The result is that well-optimized JAX code on TPUs often achieves higher hardware utilization than equivalent PyTorch code on GPUs - not because the hardware is inherently better, but because the software-hardware integration is tighter.

Pricing and Availability

TPU v6e Trillium is available exclusively through Google Cloud Platform. There is no on-premises purchase option. Pricing follows Google Cloud's standard model with on-demand, committed-use, and spot (preemptible) tiers.

| Configuration | On-Demand (per chip-hour) | 1-Year Committed | 3-Year Committed |
| --- | --- | --- | --- |
| TPU v6e (single chip) | ~$3.22 | ~$2.25 (30% discount) | ~$1.61 (50% discount) |
| TPU v6e pod slice (8 chips) | ~$25.76 | ~$18.03 | ~$12.88 |
| TPU v6e pod (256 chips) | ~$824.32 | ~$577.02 | ~$412.16 |

Prices are approximate and vary by region. Spot pricing can offer 60-80% discounts but with preemption risk. For sustained training workloads, committed-use discounts make TPUs highly cost-competitive with equivalent GPU cloud instances.

| Cloud Accelerator | Approx. On-Demand Cost | Memory | Notes |
| --- | --- | --- | --- |
| TPU v6e (per chip-hour) | ~$3.22 | 32GB | Google Cloud only |
| NVIDIA H100 (per GPU-hour) | ~$3.50-$5.00 | 80GB | AWS, Azure, GCP |
| NVIDIA A100 80GB (per GPU-hour) | ~$2.50-$3.50 | 80GB | AWS, Azure, GCP |
| AMD MI300X (per GPU-hour) | ~$2.50-$3.50 | 192GB | Azure, OCI |

The per-chip cost of Trillium is competitive with H100 GPU instances, but direct cost comparisons require normalizing for actual workload throughput. A single TPU v6e chip with 32GB is not equivalent to a single H100 with 80GB. The fair comparison is at the system level - how much does it cost to train a given model to a given quality in a given time on each platform.
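
One way to frame that system-level comparison is dollars per unit of useful compute rather than per chip. The peak-TFLOPS and utilization numbers below are illustrative assumptions, not official figures:

```python
def cost_per_effective_tflops(hourly_cost: float, peak_tflops: float, mfu: float) -> float:
    """Dollars per hour per TFLOP/s of *useful* compute (peak * utilization)."""
    return hourly_cost / (peak_tflops * mfu)

# Illustrative assumptions only: peak TFLOPS and MFU are not official figures.
platforms = {
    "TPU v6e":  dict(hourly=3.22, peak=900, mfu=0.55),
    "H100 SXM": dict(hourly=4.25, peak=990, mfu=0.40),
}
for name, p in platforms.items():
    rate = cost_per_effective_tflops(p["hourly"], p["peak"], p["mfu"])
    print(f"{name}: ${rate:.4f} per effective TFLOP/s-hour")
```

Under these assumptions the cheaper sticker price is not what decides the ranking; the utilization term moves the result as much as the hourly rate does.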

Architecture Deep Dive

Understanding Trillium's architecture requires looking at it differently from GPU-based accelerators. TPUs are not general-purpose processors - they are matrix multiplication engines with everything else stripped away or minimized.

Matrix Processing Units (MXUs). The core compute element in every TPU is the Matrix Processing Unit - a systolic array that executes dense matrix multiplications at high throughput. Trillium's MXUs operate on BF16 and FP8 inputs, producing BF16 or FP32 outputs. Google has not disclosed the exact MXU count or dimensions for Trillium, but the 4.7x compute improvement over v5e implies either more MXUs per chip, wider systolic arrays, or higher clock frequencies - likely a combination of all three.

| TPU Component | TPU v5e | TPU v6e Trillium | Notes |
| --- | --- | --- | --- |
| Matrix Processing Units | Present | Improved (4.7x compute) | Systolic array architecture |
| Vector Processing Units | Present | Improved | Handles non-matmul operations |
| HBM per Chip | 16GB | 32GB | 2x capacity increase |
| HBM Bandwidth | ~800 GB/s | ~1,600 GB/s | 2x bandwidth increase |
| Data Types | BF16, INT8 | BF16, FP8, INT8 | FP8 added in v6e |
| ICI Links | Present | Improved | Higher bandwidth, lower latency |
| Scalar Core | Present | Improved | Control flow, addressing |

Systolic Array Execution Model. TPU MXUs use a systolic array architecture where data flows through a 2D grid of multiply-accumulate units. Input matrices enter from the edges, and results accumulate as data passes through. This design is extremely efficient for the dense matrix multiplications that dominate Transformer workloads - the ratio of compute operations to memory accesses is very high, which is why TPUs can achieve utilization rates above 50% on Transformer training while GPUs typically achieve 30-45%. The trade-off is that operations that do not map well to systolic arrays (irregular memory access patterns, sparse operations, branching logic) run on the scalar and vector units, which are much slower.
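
The compute-to-memory ratio described above is arithmetic intensity, which can be checked directly. A quick sketch showing why large dense matmuls suit a systolic array while skinny ones go bandwidth-bound:

```python
def matmul_arithmetic_intensity(m: int, n: int, k: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte moved for a dense (m,k) x (k,n) matmul in BF16."""
    flops = 2 * m * n * k                                # multiply-accumulate = 2 FLOPs
    bytes_moved = (m * k + k * n + m * n) * bytes_per_elem
    return flops / bytes_moved

# Large Transformer-style matmul: thousands of FLOPs per byte -> compute-bound.
print(matmul_arithmetic_intensity(8192, 8192, 8192))
# Skinny matmul (batch of 1): about 1 FLOP per byte -> bandwidth-bound.
print(matmul_arithmetic_intensity(1, 8192, 8192))
```

This is why the same chip can sit near peak utilization during pre-training yet be memory-bound during small-batch decoding.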

ICI Interconnect Architecture. Google's Inter-Chip Interconnect is the architectural feature that most differentiates TPUs from GPU clusters. ICI provides direct chip-to-chip links in a 2D or 3D torus topology, with bandwidth and latency characteristics comparable to on-board interconnects like NVLink, but extending across an entire pod of up to 256 chips.

| Interconnect Property | TPU v6e ICI | NVIDIA NVLink 4.0 (H100) | InfiniBand NDR |
| --- | --- | --- | --- |
| Topology | 2D/3D torus | NVSwitch crossbar | Fat tree |
| Scope | Up to 256 chips | 8 GPUs (single node) | Cluster-wide |
| Latency (chip-to-chip) | ~microseconds | ~microseconds (intra-node) | ~5-10 microseconds (inter-node) |
| Aggregate Bandwidth (per chip) | Not disclosed (high) | 900 GB/s | 400 Gb/s |
| Communication Model | All-reduce, all-gather native | NCCL library | NCCL + MPI |

The ICI advantage is most pronounced for workloads that require frequent all-reduce operations across many chips - exactly the communication pattern that dominates distributed Transformer training. On a GPU cluster, inter-node all-reduce goes over InfiniBand with higher latency and lower bandwidth than intra-node NVLink. On a TPU pod, every chip communicates at ICI speed regardless of position in the pod. This eliminates the "network cliff" that GPU clusters experience when scaling beyond a single node.

Memory System Design. Trillium's 32GB of HBM per chip is modest by GPU standards, but the memory architecture is optimized for the access patterns of Transformer workloads. TPU memory controllers are designed for large, contiguous reads and writes rather than the random access patterns that GPUs must handle for general-purpose computing. This specialization allows higher effective memory bandwidth utilization - Google has stated that Trillium achieves above 80% memory bandwidth utilization on Transformer training, compared to the 60-70% typical for GPUs on similar workloads.

Multi-Chip Configuration and Slice Topology. Trillium chips are deployed in "slices" - groups of chips connected by ICI. Common configurations:

| Configuration | Chips | Aggregate Memory | Aggregate Compute | Typical Use Case |
| --- | --- | --- | --- | --- |
| v6e-1 | 1 | 32GB | 1x | Small model inference, fine-tuning |
| v6e-4 | 4 | 128GB | 4x | Mid-size model training, inference |
| v6e-8 | 8 | 256GB | 8x | 70B model training/inference |
| v6e-16 | 16 | 512GB | 16x | 100B+ model training |
| v6e-64 | 64 | 2TB | 64x | Large-scale pre-training |
| v6e-256 | 256 | 8TB | 256x | Frontier model pre-training |

The ICI interconnect within a slice provides much higher bandwidth than inter-slice communication. For workloads that fit within a single slice, the performance characteristics are near-ideal. Cross-slice communication, when needed, goes through Google's data center network and introduces additional latency. Careful workload partitioning to minimize cross-slice communication is a key optimization for TPU users.

XLA Compilation Pipeline. The JAX-to-XLA-to-TPU compilation pipeline is worth understanding because it differs fundamentally from how GPU code executes. When a JAX function is first called with a specific input shape, XLA compiles the entire computation graph into a TPU-native binary. This compilation can take seconds to minutes for complex models, but the resulting binary is highly optimized - all memory allocations are pre-planned, all tensor layouts are hardware-optimal, and all communication operations are scheduled to overlap with compute. Subsequent calls with the same input shapes execute the cached binary with zero compilation overhead. The trade-off is that dynamic input shapes (varying sequence lengths, changing batch sizes) trigger recompilation, which is why TPU workloads perform best with fixed or bucketed input shapes.
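
The bucketing pattern this implies can be sketched in plain Python (the bucket sizes and helper names here are illustrative, not a Google API):

```python
def bucket_length(seq_len: int, buckets=(128, 256, 512, 1024, 2048)) -> int:
    """Round a sequence length up to a fixed bucket so compiled shapes repeat."""
    for b in buckets:
        if seq_len <= b:
            return b
    raise ValueError(f"sequence length {seq_len} exceeds largest bucket")

def pad_to_bucket(tokens: list, pad_id: int = 0) -> list:
    """Pad a token list to its bucket; each bucket then compiles exactly once."""
    target = bucket_length(len(tokens))
    return tokens + [pad_id] * (target - len(tokens))

print(len(pad_to_bucket(list(range(300)))))  # padded up to the 512 bucket
```

With a handful of buckets, XLA compiles a handful of binaries up front instead of recompiling on every novel sequence length, at the cost of some padding waste.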

Real-World Performance Analysis

JAX/XLA Training Performance. The most meaningful Trillium benchmarks come from JAX-based training workloads, which is the ecosystem Google has optimized.

| Workload | TPU v6e Trillium (256 chips) | Equivalent GPU Cluster | Notes |
| --- | --- | --- | --- |
| Gemini-class training | Baseline (Google internal) | N/A | Google does not train Gemini on GPUs |
| PaLM-style 540B training | ~4.7x v5e throughput | Competitive with 256x H100 | Google MLPerf submission |
| ViT-Giant fine-tuning | 4.5x v5e per chip | ~3x A100 per chip | Vision Transformer, per-chip comparison |
| T5-XXL (11B) fine-tuning | 4.2x v5e per chip | ~2.5x A100 per chip | Encoder-decoder model |
| Stable Diffusion training | Limited optimization | Better on GPUs | Diffusion models less optimized for TPU |

The 4.7x improvement over v5e is a peak compute figure; real workload gains depend on where the bottleneck sits. Compute-bound workloads approach the 4.7x peak, memory-bandwidth-bound workloads see closer to the 2x that matches the bandwidth doubling, and the fine-tuning results above land in between at roughly 4.2-4.5x per chip.

Inference Serving Performance. Google uses Trillium for production inference serving of Gemini models. While Google does not publish per-chip inference throughput numbers in the way that GPU benchmarks are reported, Google Cloud customers report the following approximate performance on Trillium instances:

| Model | Configuration | Approximate Throughput | Cost (On-Demand) |
| --- | --- | --- | --- |
| Llama 3.1 70B (BF16) | 8-chip pod slice | ~180-220 tok/s (aggregate) | ~$25.76/hr |
| Llama 3.1 8B (BF16) | 1 chip | ~60-80 tok/s | ~$3.22/hr |
| Gemma 2 27B (BF16) | 2 chips | ~90-120 tok/s | ~$6.44/hr |
| Mixtral 8x7B (BF16) | 4 chips | ~100-140 tok/s | ~$12.88/hr |

These numbers illustrate the memory constraint. Llama 70B in BF16 requires 140GB, which means at least 5 Trillium chips (5 x 32GB = 160GB). The 8-chip configuration provides headroom for KV cache. Compare this to a single MI300X at 192GB serving the same model on one accelerator - the TPU requires more silicon but the pod interconnect keeps cross-chip overhead low.
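
The chip-count arithmetic generalizes to any model. A small sketch, with the 25% KV-cache headroom as an assumed rule of thumb:

```python
import math

def min_chips(params_b: float, bytes_per_param: int = 2,
              hbm_per_chip_gb: int = 32, kv_headroom: float = 0.25) -> int:
    """Chips needed to hold weights plus KV-cache/activation headroom."""
    weights_gb = params_b * bytes_per_param   # e.g. 70B params * 2 bytes = 140 GB
    total_gb = weights_gb * (1 + kv_headroom)
    return math.ceil(total_gb / hbm_per_chip_gb)

print(min_chips(70))   # Llama 70B in BF16, with headroom
print(min_chips(8))    # Llama 8B fits on one chip
```

With zero headroom the 70B model needs 5 chips (the bare 140GB / 32GB figure from the text); the headroom pushes practical deployments toward the 8-chip slice.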

PyTorch/XLA Performance. For teams using PyTorch rather than JAX, the PyTorch/XLA bridge allows running PyTorch workloads on TPUs. Performance through PyTorch/XLA is typically 70-85% of equivalent JAX performance due to the translation overhead and less optimal tensor layouts. This gap has narrowed with each PyTorch/XLA release, but teams migrating from CUDA should expect some performance discount and debugging overhead. Common pain points include dynamic shapes (TPU XLA compilation prefers static shapes), custom CUDA kernels (which must be rewritten in XLA), and batch normalization edge cases.

MLPerf Results. In recent MLPerf Training rounds, Google's Trillium submissions demonstrated competitive or leading performance on several benchmarks. The most relevant comparisons:

| MLPerf Benchmark | Google TPU v6e Result | NVIDIA H100 (comparable scale) | Winner |
| --- | --- | --- | --- |
| BERT Large | Top-tier submission | Top-tier submission | Platform-dependent |
| GPT-3 175B | Competitive | Competitive | Close |
| Stable Diffusion | Slower | Faster | NVIDIA |
| ResNet-50 | Competitive | Competitive | Close |

The pattern is consistent: on Transformer-native workloads, Trillium is competitive or leading. On workloads with irregular compute patterns (diffusion models, some vision architectures), GPUs have an architectural advantage.

Cost Efficiency Analysis. When comparing cloud accelerators, cost-per-useful-computation is the metric that matters. Here is an approximate cost comparison for training a 70B parameter model for 1 trillion tokens:

| Platform | Configuration | Estimated Training Time | Estimated Cost | Cost per 1T Tokens |
| --- | --- | --- | --- | --- |
| TPU v6e (256 chips, committed) | Full pod | ~18-22 days | ~$220,000-$320,000 | Competitive |
| H100 (256 GPUs, cloud) | 32x 8-GPU nodes | ~15-20 days | ~$350,000-$500,000 | Higher |
| MI300X (256 GPUs, cloud) | 32x 8-GPU nodes | ~18-24 days | ~$250,000-$380,000 | Moderate |

The cost advantage of Trillium is most pronounced with committed-use pricing (30-50% discount) and for workloads that achieve high MXU utilization. Spot instances can reduce costs further but introduce preemption risk that requires robust checkpointing.
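
These estimates can be roughly reproduced with the standard ~6ND FLOPs-per-token approximation for dense Transformers. The per-chip peak (assumed FP8 here) and the MFU are assumptions, since Google does not disclose Trillium's peak figure:

```python
def training_cost(params: float, tokens: float, n_chips: int,
                  peak_flops: float, mfu: float, price_per_chip_hour: float):
    """Estimate wall-clock days and dollars for a dense-Transformer run.

    Uses the standard ~6*N*D FLOPs approximation; peak_flops and mfu are
    assumptions, not disclosed Trillium specifications.
    """
    total_flops = 6 * params * tokens
    seconds = total_flops / (n_chips * peak_flops * mfu)
    hours = seconds / 3600
    return hours / 24, hours * n_chips * price_per_chip_hour

days, cost = training_cost(params=70e9, tokens=1e12, n_chips=256,
                           peak_flops=1800e12, mfu=0.5, price_per_chip_hour=2.25)
print(f"~{days:.0f} days, ~${cost:,.0f}")
```

With these assumed inputs the estimate lands around 21 days and roughly $290k, inside the table's committed-use range; halving MFU doubles both numbers, which is why utilization dominates the economics.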

Hardware Utilization Rates. One of Trillium's underappreciated advantages is hardware utilization. Because XLA compiles entire computation graphs with knowledge of the TPU's memory layout and systolic array dimensions, Trillium typically achieves higher MFU (Model FLOPs Utilization) than GPUs on equivalent workloads:

| Platform | Typical MFU (Transformer Training) | Peak MFU (Optimized) | Notes |
| --- | --- | --- | --- |
| TPU v6e Trillium (JAX) | 50-60% | 65-70% | XLA hardware-aware compilation |
| NVIDIA H100 (PyTorch) | 35-45% | 50-55% | Manual optimization needed |
| NVIDIA H100 (Megatron-LM) | 45-55% | 55-60% | Highly optimized framework |
| AMD MI300X (PyTorch) | 30-40% | 45-50% | ROCm optimization still maturing |

Higher MFU means more useful computation per dollar spent on hardware. A Trillium chip running at 55% MFU delivers more effective compute than an H100 running at 40% MFU, even if the H100 has higher peak TFLOPS. This utilization advantage is one of the key reasons that Google's internal training costs on TPU pods are competitive despite the per-chip specs appearing modest.
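
MFU itself is straightforward to compute from observed throughput. The throughput and per-chip peak below are illustrative assumptions:

```python
def model_flops_utilization(params: float, tokens_per_sec: float,
                            n_chips: int, peak_flops_per_chip: float) -> float:
    """Observed useful FLOP/s (~6*N per token) divided by hardware peak."""
    achieved = 6 * params * tokens_per_sec
    return achieved / (n_chips * peak_flops_per_chip)

# Illustrative: 256 chips at an assumed 900 TFLOPS BF16 peak each,
# observing 300k tokens/s aggregate on a 70B model.
mfu = model_flops_utilization(70e9, 3.0e5, 256, 900e12)
print(f"MFU = {mfu:.1%}")
```

Tracking this single number over a training run is the quickest way to catch input-pipeline stalls or communication bottlenecks before they burn budget.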

Checkpoint and Recovery. TPU training requires robust checkpoint/recovery mechanisms because spot instances can be preempted and hardware failures in 256-chip pods are statistically likely. Google's Orbax checkpointing library is designed specifically for TPU workloads and provides:

  • Asynchronous checkpointing that does not block training
  • Distributed checkpoint writing across multiple storage backends
  • Automatic recovery from chip failures with minimal lost computation
  • Integration with Google Cloud Storage for durable checkpoint storage

For production training runs, implementing proper checkpointing is not optional - it is a requirement. Teams that do not invest in robust checkpoint infrastructure will lose more compute time to failures and preemption than they save from spot pricing discounts.
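
How often to checkpoint can be estimated with Young's classic approximation, interval ≈ sqrt(2 × checkpoint cost × MTBF). A sketch under assumed failure and checkpoint costs:

```python
import math

def optimal_checkpoint_interval(checkpoint_secs: float, mtbf_secs: float) -> float:
    """Young's approximation: interval ~= sqrt(2 * checkpoint cost * MTBF)."""
    return math.sqrt(2 * checkpoint_secs * mtbf_secs)

# Assumptions: a checkpoint costs ~60s of wall clock (async writes hide most
# of it), and the whole pod fails or is preempted about once every 12 hours.
interval = optimal_checkpoint_interval(60, 12 * 3600)
print(f"checkpoint every ~{interval / 60:.0f} minutes")
```

Under these assumptions the optimum is roughly every 38 minutes; cheaper (async) checkpoints or flakier hardware both push the interval shorter.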

Generational and Competitive Context

JAX/XLA Migration Guide. For teams considering a move from PyTorch/CUDA to JAX/XLA for Trillium, the migration effort depends on the complexity of your codebase:

| Codebase Complexity | Migration Effort | Timeline | Notes |
| --- | --- | --- | --- |
| Standard Transformer training | Moderate | 2-4 weeks | JAX equivalents exist for most PyTorch patterns |
| Custom attention mechanisms | Significant | 4-8 weeks | Must rewrite as XLA-compatible operations |
| Custom CUDA kernels | Major | 8-16 weeks | No CUDA on TPU - must use Pallas or XLA custom calls |
| Multi-modal pipelines | Significant | 6-12 weeks | Vision + language combinations need careful porting |
| Simple fine-tuning | Low | 1-2 weeks | Libraries like MaxText simplify this |

Google provides several resources to ease migration: the MaxText reference implementation (a JAX-based LLM training codebase), the Pallas kernel language for writing custom TPU operations, and the JAX-Toolbox for common distributed training patterns. The open-source ecosystem around JAX has also grown significantly, with libraries like Flax, Orbax, and Grain providing PyTorch-equivalent functionality.

vs. TPU v5e. The v5e to v6e transition is the largest generational improvement in TPU history. The 4.7x compute, 2x memory, and 67% efficiency gains are all significant. For existing TPU v5e customers, migrating to Trillium is straightforward - JAX code requires no changes, and XLA recompilation handles the hardware differences. The economic case for migration is strong: a Trillium chip at ~$3.22/hr delivers 4.7x the compute of a v5e chip at a roughly comparable per-chip price.

vs. TPU v7 Ironwood. Google's announced successor generation, TPU v7 (Ironwood), is purpose-built for inference, while Trillium handles both training and inference. For customers who need both training and inference capacity, Trillium is the current choice. Once Ironwood launches, the optimal strategy for Google Cloud customers will likely be Trillium for training and Ironwood for inference, each running on silicon optimized for its workload. Organizations that only need inference should evaluate whether waiting for Ironwood makes sense based on their deployment timeline.

vs. NVIDIA H100 (Cloud Instances). The H100 and Trillium compete directly on Google Cloud, where both are available. For JAX-native workloads, Trillium offers better price-performance because the software-hardware co-design eliminates the overhead inherent in running on general-purpose GPU silicon. For PyTorch workloads, H100 instances on Google Cloud are typically more productive because the CUDA ecosystem is more mature. The decision comes down to software stack: JAX teams should default to Trillium, PyTorch teams should default to H100.

vs. AMD MI300X. The MI300X is purchasable hardware with 192GB memory on a single accelerator - a fundamentally different proposition from cloud-only TPUs. Organizations that need on-premises AI compute are not in the TPU market. For cloud-only customers, the MI300X on Azure or OCI competes with Trillium on GCP. The MI300X's memory advantage (192GB per card vs 32GB per TPU chip) matters for single-accelerator inference simplicity, but Trillium's pod interconnect makes multi-chip configurations nearly seamless.

vs. Huawei Ascend 910B/910C. The Ascend 910B and 910C serve the Chinese domestic market exclusively and are not available on any cloud platform accessible to Trillium's target customers. The comparison is relevant only for understanding the global AI accelerator landscape. Architecturally, the Ascend Da Vinci cores and TPU MXUs share the systolic-array design philosophy - both are ASICs optimized for matrix multiplication. The TPU's advantage is the ICI interconnect at pod scale, while the Ascend's advantage is purchasable hardware with larger per-chip memory (64-96GB vs 32GB). CANN and JAX/XLA are both smaller ecosystems compared to CUDA, but JAX/XLA benefits from Google's direct investment and a larger open-source community.

Google Cloud Ecosystem Integration. One advantage that is hard to quantify but real in practice is Trillium's integration with the broader Google Cloud AI platform. Vertex AI pipelines, Cloud Storage for training data, BigQuery for dataset preparation, and Kubernetes-based orchestration (GKE) all work natively with TPU resources. For organizations already on Google Cloud, the operational overhead of adopting Trillium is substantially lower than setting up GPU clusters from scratch on another provider.

Key JAX Libraries for TPU Workloads. The JAX ecosystem has matured significantly and offers robust alternatives to PyTorch libraries:

| PyTorch Equivalent | JAX Library | Maturity | Notes |
| --- | --- | --- | --- |
| torch.nn / PyTorch Lightning | Flax / NNX | Production | Google-maintained, widely used |
| Hugging Face Transformers | MaxText / T5X | Production | Google reference implementations |
| PyTorch DataLoader | Grain / tf.data | Production | TPU-optimized data pipelines |
| torch.save / torch.load | Orbax | Production | Async distributed checkpointing |
| DeepSpeed / FSDP | jax.sharding / pjit | Production | Native JAX parallelism primitives |
| torch.compile | jax.jit | Production | Core JAX feature |
| Custom CUDA Kernels | Pallas | Maturing | TPU kernel language, still evolving |

For teams evaluating the JAX ecosystem, the key insight is that the core training infrastructure is production-ready. The gaps are primarily in niche operators, specialized model architectures not widely used on TPU, and third-party integrations that assume CUDA.

Reliability and Uptime. For production training runs that span days or weeks, hardware reliability matters. Google does not publish TPU failure rates, but anecdotal reports from large-scale TPU users put individual chip failures on the order of once per 25,000-30,000 chip-hours. A 256-chip pod running for 20 days accumulates roughly 123,000 chip-hours, which works out to approximately 4-5 expected chip failures during the training run. Google's TPU VM orchestration layer handles chip failures automatically when enabled, redistributing work to healthy chips. However, this requires the training framework to support elastic training or frequent checkpointing. JAX-based training with Orbax checkpointing handles this well; custom training loops may need additional engineering for fault tolerance.
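
The expected-failure arithmetic is simple enough to sanity-check, with the per-chip MTBF as an assumed figure:

```python
def expected_failures(n_chips: int, days: float, mtbf_chip_hours: float) -> float:
    """Expected chip failures = accumulated chip-hours / per-chip MTBF."""
    chip_hours = n_chips * days * 24
    return chip_hours / mtbf_chip_hours

# 256-chip pod, 20-day run, assumed ~27,500 chip-hours between failures.
print(f"{expected_failures(256, 20, 27_500):.1f} expected failures")
```

Doubling either the pod size or the run length doubles the expected failure count, which is why fault tolerance stops being optional at frontier scale.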

Regional Availability. Trillium availability varies by Google Cloud region:

| Region | Trillium Availability | Notes |
| --- | --- | --- |
| us-central1 | Generally available | Largest TPU capacity |
| us-east1 | Generally available | Good capacity |
| europe-west4 | Generally available | EU data residency |
| asia-northeast1 | Limited | Smaller allocation |

For organizations with data residency requirements, the availability of Trillium in European regions is important. Training on EU-resident data requires compute in EU regions, and Trillium availability in europe-west4 makes this possible without data transfer out of jurisdiction.

Cloud Pricing Tiers in Detail. The cost optimization strategy for Trillium depends heavily on workload predictability:

| Pricing Tier | Cost Multiplier | Best For | Risk |
| --- | --- | --- | --- |
| On-Demand | 1.0x (baseline) | Experimentation, unpredictable workloads | None - always available |
| 1-Year Committed | 0.7x | Sustained training, known capacity needs | Minimum spend commitment |
| 3-Year Committed | 0.5x | Production inference fleets, long-term training | Long commitment, technology may advance |
| Spot/Preemptible | 0.2-0.4x | Fault-tolerant training with checkpointing | Preemption, no availability guarantee |

For training workloads that can handle preemption (with frequent checkpointing), spot Trillium instances at 60-80% discount represent some of the cheapest AI compute available anywhere. The key requirement is that your training framework must handle checkpoint/resume gracefully - JAX-based training loops on TPU typically support this natively.
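
Whether spot pricing actually wins depends on how much redone work preemptions cost. A sketch with assumed preemption rates and checkpoint cadence:

```python
def spot_effective_cost(on_demand: float, spot_discount: float,
                        checkpoint_interval_min: float,
                        preemptions_per_day: float) -> float:
    """Effective $/chip-hour on spot, charging for work lost to preemption.

    On average, half a checkpoint interval of work is redone per preemption.
    """
    spot_price = on_demand * (1 - spot_discount)
    lost_hours_per_day = preemptions_per_day * (checkpoint_interval_min / 60) / 2
    overhead = 24 / (24 - lost_hours_per_day)   # inflate price by redone work
    return spot_price * overhead

# Assumptions: 70% discount, 30-minute checkpoint cadence, 2 preemptions/day.
print(f"${spot_effective_cost(3.22, 0.70, 30, 2):.2f}/chip-hour effective")
```

Even at two preemptions per day, the redone-work overhead stays near 2% here; the spot discount survives almost intact as long as checkpointing is frequent.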

Use Case Recommendations

Strong Fit:

  • Google Cloud-native organizations training large Transformer models. If you are already on GCP and using JAX, Trillium is almost certainly your most cost-effective training option. The software-hardware co-design and committed-use discounts make the economics compelling.
  • Large-scale pre-training runs (100B+ parameters). Trillium's ICI interconnect at 256-chip pod scale provides communication efficiency that GPU clusters struggle to match. If your training run needs 100+ accelerators, the pod architecture starts to show measurable advantages over InfiniBand-connected GPU clusters.
  • Teams willing to commit to JAX/XLA. The investment in learning JAX pays dividends in hardware utilization. If your team is building new training infrastructure and is open to framework choice, JAX + Trillium is a strong combination.
  • Budget-sensitive research with checkpointing. Spot Trillium instances at $0.60-$1.30/chip-hour (60-80% off on-demand) provide research-grade compute at prices that undercut nearly everything else in the market.
  • Fine-tuning and distillation workloads. Trillium's 32GB per chip is sufficient for fine-tuning models up to ~15B parameters on a single chip, and pod slices handle larger models efficiently. The cost-per-experiment is low enough to enable rapid iteration.
  • Organizations training Gemma, PaLM, or T5-based models. These Google-origin architectures have the most mature JAX implementations and run at near-peak efficiency on Trillium. The software-hardware co-optimization for these specific model families is unmatched.

Weak Fit:

  • PyTorch-first teams with existing CUDA codebases. The migration cost from CUDA to TPU is measured in weeks or months of engineering time. Unless the workload is large enough to justify that investment, staying on GPU is more productive.
  • Inference workloads with large single-model memory requirements. Serving a 70B model at BF16 requires 5+ Trillium chips. A single MI300X handles this on one accelerator. For inference simplicity, GPUs with high memory capacity are often easier to operationalize.
  • Workloads with heavy custom kernel requirements. If your model relies on custom CUDA kernels for specialized operations, porting to XLA is non-trivial. TPU XLA compilation requires static shapes and does not support arbitrary pointer arithmetic.
  • Organizations with multi-cloud or on-premises requirements. Trillium is GCP-only. If your infrastructure strategy requires provider flexibility or on-premises deployment, TPUs are not an option.
  • Small-batch, latency-sensitive inference. For serving single requests with minimal latency, GPUs with large per-device memory (like the MI300X at 192GB) can hold models on one device without cross-chip communication. Trillium's 32GB per chip means even moderate models require multi-chip serving with added latency.
  • Teams that need fine-grained hardware control. TPU programming through XLA is declarative - you describe the computation and let the compiler decide execution. This works well for standard patterns but limits control over memory layout, scheduling, and kernel execution order. Teams that need low-level hardware control should prefer GPUs.

Strengths

  • 4.7x compute performance improvement over TPU v5e - the largest generational jump in TPU history
  • Pod-scale ICI interconnect enables efficient scaling to 256 chips with minimal communication overhead
  • JAX/XLA software stack is co-designed with hardware for optimal utilization
  • 67% energy efficiency improvement reduces operational costs and environmental impact
  • Competitive cloud pricing, especially with committed-use discounts
  • Powers Google's own Gemini training - proven at the most demanding scale
  • No capital expenditure - cloud-only model eliminates hardware procurement and maintenance

Weaknesses

  • 32GB HBM per chip is modest - large models require multi-chip configurations that GPUs handle on a single card
  • Cloud-only availability creates vendor lock-in with Google Cloud Platform
  • JAX/XLA ecosystem is smaller than PyTorch/CUDA - migration cost is real for existing GPU workflows
  • PyTorch/XLA bridge exists but is not seamless - expect compatibility gaps and performance overhead
  • Google does not disclose detailed hardware specifications, making independent evaluation difficult
  • Spot instance preemption can interrupt long training runs if not architected for checkpointing
  • No path to on-premises deployment for organizations with data sovereignty or compliance requirements

About the author (AI Benchmarks & Tools Analyst)

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.