Google TPU v7 Ironwood
Google TPU v7 Ironwood specs, architecture, and performance estimates. Google's next-gen inference-optimized TPU with massive memory per chip, announced at Cloud Next 2025.

TL;DR
- Google's 7th-generation TPU - codenamed Ironwood - is purpose-built for inference rather than training
- Massive memory per chip (reportedly 192GB HBM) designed to hold large models without multi-chip sharding
- Scalable to pods of 9,216 chips interconnected via Google's next-gen ICI fabric
- Announced at Google Cloud Next 2025, expected availability in H2 2025
- Represents Google's bet that inference - not training - is the compute bottleneck for the next era of AI
Overview
The TPU v7 Ironwood is a strategic departure for Google's TPU program. Setting aside the original inference-only TPU v1, every TPU generation since has been designed primarily for training, with inference as a secondary consideration. Ironwood flips that priority back. Announced at Google Cloud Next in April 2025, it is Google's first modern TPU built from the ground up for inference workloads - specifically, for serving the massive foundation models that Google and its Cloud customers are deploying at scale.
The logic behind this pivot is straightforward math. Training a frontier model is a one-time cost measured in tens of millions of dollars. Serving that model to millions of users every day is an ongoing cost that, over the model's lifetime, dwarfs the training budget by an order of magnitude or more. As AI shifts from a research exercise to a production infrastructure challenge, the hardware that matters most is the hardware that serves inference efficiently. Google appears to have concluded that training and inference workloads are different enough to justify purpose-built silicon for each.
The headline specification is memory. Ironwood reportedly packs 192GB of HBM per chip - a 6x increase over Trillium's 32GB. This is a direct response to the defining constraint of LLM inference: you need to fit the entire model (or a substantial shard of it) in memory, and the KV cache for long-context serving can consume hundreds of gigabytes across a batch. With 192GB per chip, Ironwood can hold a 70B-parameter model in BF16 on a single chip, or serve 405B-class models across just a few chips rather than dozens. Google has also disclosed that Ironwood pods can scale to 9,216 chips, which would provide approximately 1.7 petabytes of aggregate HBM - enough to serve the largest models at extreme throughput.
Key Specifications
| Specification | Details |
|---|---|
| Manufacturer | Google |
| Product Family | Cloud TPU |
| Generation | v7 (Ironwood) |
| Process Node | Not disclosed |
| Chip Type | TPU (ASIC) |
| Primary Workload | Inference |
| Memory per Chip | ~192GB HBM (reported) |
| Memory Bandwidth | Not disclosed |
| Peak Compute | Not disclosed (significant improvement over Trillium expected) |
| Supported Data Types | BF16, FP8, INT8 (expected, plus potential INT4) |
| Interconnect | Next-gen ICI (Inter-Chip Interconnect) |
| Max Pod Size | 9,216 chips |
| Aggregate Pod Memory | ~1.7 PB HBM |
| TDP | Not disclosed |
| Availability | Google Cloud Platform only (expected H2 2025) |
| Target Workload | Inference |
| Release Date | H2 2025 (expected) |
| Pricing | Cloud-only (not yet announced) |
Note: Google has disclosed limited technical specifications for Ironwood. Many details in this page are based on the Cloud Next 2025 announcement, analyst reports, and informed estimates. This page will be updated as Google releases official specifications and pricing.
Performance Benchmarks (Estimated)
| Benchmark / Metric | TPU v7 Ironwood (est.) | TPU v6e Trillium | NVIDIA H100 SXM | AMD MI300X |
|---|---|---|---|---|
| Memory per Chip | ~192GB | 32GB | 80GB | 192GB |
| Max Pod Size | 9,216 chips | 256 chips | Cluster-dependent | Cluster-dependent |
| Aggregate Pod Memory | ~1.7 PB | 8TB | Cluster-dependent | Cluster-dependent |
| Inference Throughput (relative) | Significant improvement | Baseline | Competitive | Competitive |
| Llama 70B per chip | Yes (BF16) | No (requires multi-chip) | Yes (FP8/INT8) | Yes (BF16) |
| Target Workload | Inference-optimized | Training + Inference | Training + Inference | Training + Inference |
| Availability | Cloud-only | Cloud-only | Purchase + Cloud | Purchase + Cloud |
Direct performance comparisons are not yet possible because Google has not disclosed Ironwood's peak compute numbers or independent inference benchmarks. What we can reason about is the architectural intent. Ironwood's 192GB memory per chip matches the AMD MI300X and exceeds the H100's 80GB. Combined with Google's ICI interconnect at pod scale, this suggests Ironwood is designed to serve the largest models with minimal tensor parallelism overhead.
The pod scale is the real differentiator. A 9,216-chip Ironwood pod with 1.7PB of aggregate memory could theoretically serve a single trillion-parameter model with massive batch sizes, or run thousands of smaller model replicas for maximum throughput. No GPU deployment integrates this many accelerators in a single fabric domain: GPU clusters combine NVLink within a node (or rack, in NVL72 systems) with InfiniBand or Ethernet between nodes, introducing latency and bandwidth tiers that ICI's flat pod fabric avoids by design.
Key Capabilities
Inference-First Architecture. Ironwood's most significant design decision is the explicit prioritization of inference over training. Inference workloads have different characteristics than training - they are latency-sensitive, memory-bandwidth-bound (rather than compute-bound), and require efficient handling of variable-length requests with unpredictable arrival patterns. By designing Ironwood specifically for these characteristics, Google can make silicon trade-offs that a general-purpose training chip cannot. This includes optimizing for low-latency single-token generation, efficient KV cache management, and high-throughput batching across many concurrent requests.
Massive Memory for Large Model Serving. The ~192GB of HBM per chip directly addresses the biggest infrastructure challenge in LLM inference: fitting the model in memory. A Llama 3.1 70B model in BF16 requires approximately 140GB. On Trillium with 32GB per chip, you need at least 5 chips with tensor parallelism. On Ironwood, it fits on a single chip with room for KV cache. This reduction in parallelism degree directly translates to lower latency (no cross-chip communication for each token), simpler deployment, and better cost efficiency. For 405B-class models, Ironwood would require roughly 3 chips instead of the 13+ needed on Trillium.
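The chip-count arithmetic above can be sketched directly. This is a minimal sketch using the article's reported capacities (192GB Ironwood, 32GB Trillium) and counting weights only - real deployments need headroom beyond these minimums for KV cache and activations.

```python
# Minimum chips needed to hold model weights (weights only; KV cache
# and activations need additional headroom). Illustrative arithmetic,
# not a deployment calculator.
import math

BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}

def min_chips(params_b: float, dtype: str, hbm_gb: int) -> int:
    """Chips required so the weight shard per chip fits in HBM."""
    weights_gb = params_b * BYTES_PER_PARAM[dtype]  # billions of params -> GB
    return math.ceil(weights_gb / hbm_gb)

IRONWOOD_GB, TRILLIUM_GB = 192, 32

for params, dtype in [(70, "bf16"), (405, "fp8")]:
    print(f"{params}B {dtype}: "
          f"Ironwood {min_chips(params, dtype, IRONWOOD_GB)} chip(s), "
          f"Trillium {min_chips(params, dtype, TRILLIUM_GB)} chip(s)")
```

This reproduces the figures in the text: 1 vs 5 chips for a 70B BF16 model, and 3 vs 13 chips for a 405B model in FP8.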
9,216-Chip Pod Scale. Google disclosed that Ironwood pods can scale to 9,216 chips - a 36x increase over Trillium's 256-chip maximum. At this scale, a single Ironwood pod could absorb a substantial share of Gemini API traffic across Google's customer base. The ICI interconnect must sustain petabits per second of aggregate bandwidth with consistently low latency. Google's ability to co-design custom interconnect silicon alongside custom compute silicon is a structural advantage: AMD relies on third-party Ethernet fabrics for scale-out, and even NVIDIA - which builds NVLink and, through Mellanox, InfiniBand - stitches together two interconnect tiers rather than a single flat fabric.
Pricing and Availability
Google has not announced pricing for Ironwood. As a cloud-only product, pricing will follow Google Cloud's standard per-chip-hour model with on-demand, committed-use, and spot tiers.
Based on Trillium pricing and the significant memory increase, analyst estimates suggest Ironwood per-chip-hour costs could be 2-4x higher than Trillium's ~$3.22/chip-hour, reflecting the 6x memory increase and compute improvements. However, Google may price aggressively to drive adoption, particularly for inference workloads where it competes directly with NVIDIA GPU cloud instances.
| Accelerator | Estimated Cost | Memory | Notes |
|---|---|---|---|
| TPU v7 Ironwood (est.) | ~$6-$12/chip-hour | ~192GB | Not yet available |
| TPU v6e Trillium | ~$3.22/chip-hour | 32GB | Available now |
| NVIDIA H100 (cloud) | ~$3.50-$5.00/GPU-hour | 80GB | Multiple cloud providers |
| AMD MI300X (cloud) | ~$2.50-$3.50/GPU-hour | 192GB | Azure, OCI |
The cost comparison that matters is not per-chip-hour but per-million-tokens-served. Ironwood's inference optimization could deliver significantly more tokens per chip-hour than a general-purpose H100, making the higher per-chip cost irrelevant if throughput scales proportionally. Google has not disclosed these numbers, but they will be the key metric for evaluating Ironwood's value proposition once it launches.
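As a sketch of the per-million-tokens framing, with entirely hypothetical inputs - neither the $8/chip-hour price nor the 5,000 tok/s sustained throughput is a published figure:

```python
# Cost per million output tokens from chip-hour price and sustained
# decode throughput. Both inputs below are illustrative assumptions,
# not published figures.
def cost_per_million_tokens(price_per_chip_hour: float,
                            tokens_per_second_per_chip: float) -> float:
    tokens_per_hour = tokens_per_second_per_chip * 3600
    return price_per_chip_hour / tokens_per_hour * 1_000_000

# Hypothetical: $8/chip-hour at 5,000 aggregate tok/s across a batch.
print(round(cost_per_million_tokens(8.0, 5000), 3))  # → 0.444
```

The takeaway is the sensitivity: doubling sustained throughput halves the per-token cost, which is why a higher per-chip-hour price can still win on serving economics.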
Architecture Deep Dive
Ironwood represents a fundamental rethinking of what a TPU should be. Recent TPU generations balanced training and inference. Ironwood drops training and optimizes everything - memory, compute, data movement, and scheduling - for the demands of serving large language models in production.
Inference-Optimized Compute Design. Training and inference have different compute profiles, and Ironwood's silicon reflects this:
| Property | Training Workload | Inference Workload | Ironwood Optimization |
|---|---|---|---|
| Dominant Operation | Forward + backward pass | Forward pass only | No backward pass hardware |
| Batch Characteristics | Large, fixed batches | Variable, continuous batching | Dynamic batch scheduler |
| Latency Sensitivity | Low (throughput-oriented) | High (user-facing) | Low-latency execution paths |
| Memory Access Pattern | Sequential, predictable | Random KV cache reads | KV cache-aware memory controller |
| Precision Requirements | BF16/FP32 (gradient accuracy) | FP8/INT8 (throughput over precision) | Aggressive quantization support |
| Compute-to-Memory Ratio | High compute demand | High memory bandwidth demand | Memory-bandwidth optimized |
By shedding training support, Ironwood can dedicate more die area and memory capacity to inference-specific features: larger caches for KV storage, wider memory buses for parameter streaming, and specialized scheduling logic for continuous batching. Google has not disclosed the savings, and in practice the forward and backward passes largely share the same matrix units - what a dedicated inference chip can drop are the training-specific provisions: optimizer-state memory management, higher-precision accumulation paths, and gradient-synchronization bandwidth.
Memory Architecture. The jump from Trillium's 32GB to Ironwood's ~192GB per chip is a 6x increase - the largest memory expansion in TPU history. This is not just more of the same HBM - Ironwood likely uses HBM3 or HBM3e stacks with higher per-stack bandwidth, providing both the capacity and bandwidth that inference workloads demand.
| Memory Property | TPU v6e Trillium | TPU v7 Ironwood (est.) | Improvement |
|---|---|---|---|
| Capacity per Chip | 32GB | ~192GB | 6x |
| Memory Bandwidth | ~1,600 GB/s | Not disclosed (est. 3,000-5,000 GB/s) | ~2-3x (estimated) |
| HBM Generation | HBM (generation not disclosed) | HBM3 or HBM3e (estimated) | Newer generation |
| Aggregate Pod Memory | 8TB (256 chips) | ~1.7PB (9,216 chips) | ~212x |
| KV Cache Budget per Chip | ~10-15GB (after model) | ~50-120GB (after model) | Massive increase |
The KV cache budget is a critical metric for inference that is often overlooked. When serving a 70B model in BF16 (140GB weights), Trillium has essentially zero room for KV cache on a single chip - the model itself exceeds the 32GB capacity. On Ironwood with 192GB, a 70B BF16 model leaves ~52GB for KV cache, enabling long-context serving (128K+ token contexts) or high batch sizes without cross-chip communication for cache lookups.
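The KV cache arithmetic can be made concrete. This sketch assumes Llama 3.1 70B's published shape (80 layers, 8 KV heads under GQA, head dimension 128); the article's quoted ranges are deliberately more conservative than the exact GQA figure.

```python
# KV cache sizing for a GQA model, using Llama 3.1 70B's published
# shape (80 layers, 8 KV heads, head dim 128). Headroom assumes the
# ~192GB per-chip figure reported for Ironwood.
def kv_gb(context_tokens, batch=1, layers=80, kv_heads=8, head_dim=128,
          dtype_bytes=2):
    # K and V tensors, per layer, per KV head, per head-dim element
    per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
    return context_tokens * batch * per_token / 1e9

print(round(kv_gb(128_000), 1))   # ~41.9 GB for one 128K-token request
print(192 - 70 * 2)               # 52 GB KV headroom after BF16 weights
```

A single 128K-context request fits within the ~52GB headroom on one Ironwood chip; on Trillium the model weights alone exceed per-chip capacity, so the cache must be sharded as well.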
9,216-Chip Pod Architecture. Ironwood's pod scale represents a 36x increase over Trillium's 256-chip maximum. The ICI interconnect at this scale requires a different topology than the 2D/3D torus used in smaller pods.
| Pod Property | Trillium (256 chips) | Ironwood (9,216 chips) |
|---|---|---|
| Max Chips per Pod | 256 | 9,216 |
| Aggregate Memory | 8TB | ~1.7PB |
| Aggregate Compute | High | ~36x Trillium (est.) |
| ICI Topology | 2D/3D torus | Multi-level hierarchy (est.) |
| Bisection Bandwidth | High | Very high (est.) |
| Use Case | Training + inference | Inference farms |
At 9,216 chips with 192GB each, an Ironwood pod provides approximately 1.7 petabytes of aggregate HBM. To put this in perspective: a Llama 405B model in BF16 requires ~810GB. A single Ironwood pod could serve over 2,000 independent replicas of this model simultaneously, or a single model with massive tensor parallelism for ultra-low-latency serving. This scale is designed for Google's internal Gemini serving needs - millions of API calls per minute across Google Search, Workspace, and Cloud API customers.
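A quick sanity check on the pod arithmetic, counting weights only - KV cache, activations, and redundancy would reduce the real number:

```python
# Pod-scale replica arithmetic: independent 405B BF16 replicas that
# fit in a 9,216-chip pod's aggregate HBM, weights only. Illustrative
# upper bound, not a deployment plan.
chips, hbm_gb_per_chip = 9216, 192
pod_gb = chips * hbm_gb_per_chip       # 1,769,472 GB ≈ 1.77 PB
model_gb = 405 * 2                     # 405B parameters in BF16 = 810 GB
print(pod_gb // model_gb)              # → 2184 replicas (upper bound)
```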
Inference-Specific Hardware Features. Based on Google's announcements and industry analysis, Ironwood likely includes several hardware features specifically targeting inference efficiency:
- Continuous batching support - Hardware-level scheduling that can add new requests to an in-flight batch without waiting for the longest sequence to complete
- KV cache compression - Hardware-assisted compression of attention key-value caches to reduce memory footprint for long-context serving
- Speculative decoding support - Hardware paths optimized for running a small draft model and a large verification model in tandem
- INT4/INT8 native support - Low-precision integer operations for maximizing throughput on quantized models
- Prefill/decode separation - Ability to run prefill (compute-bound) and decode (memory-bound) phases on different chip configurations within the same pod
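Of these, continuous batching is the easiest to illustrate in isolation. The sketch below is pure scheduling logic - it makes no claim about Ironwood's actual scheduler - and shows why letting finished sequences leave a batch mid-flight beats draining fixed batches:

```python
# Conceptual sketch of continuous (in-flight) batching: a finished
# sequence frees its slot immediately and a queued request joins the
# next decode step, instead of waiting for the whole batch to drain.
from collections import deque

def continuous_batching_steps(request_lengths, max_batch):
    queue = deque(request_lengths)   # tokens left to generate, per request
    active, steps = [], 0
    while queue or active:
        # Fill free slots from the queue before each decode step.
        while queue and len(active) < max_batch:
            active.append(queue.popleft())
        active = [t - 1 for t in active]       # one decode step for all
        active = [t for t in active if t > 0]  # finished requests leave
        steps += 1
    return steps

# Static batches of 2 would run (5,1), (5,1), (1,1): 5 + 5 + 1 = 11
# steps, since each batch waits for its longest member.
print(continuous_batching_steps([5, 1, 5, 1, 1, 1], max_batch=2))  # → 7
```

With continuous batching the short requests slot in behind each other, so the 14 total tokens finish in the minimum 7 steps a 2-wide engine allows.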
Comparison to Trillium Architecture. Understanding what changed between generations clarifies Ironwood's design philosophy:
| Architectural Decision | Trillium (Training-focused) | Ironwood (Inference-focused) |
|---|---|---|
| Backward Pass Support | Full gradient computation hardware | Minimal or none |
| Memory Priority | Balanced (weights + activations + gradients) | Weights + KV cache optimized |
| Batch Scheduling | Static batch, fixed size | Dynamic continuous batching |
| Precision Focus | BF16 primary (gradient accuracy) | FP8/INT8 primary (throughput) |
| Pod Scale | 256 chips (training parallelism) | 9,216 chips (inference throughput) |
| ICI Optimization | All-reduce for gradient sync | Point-to-point for model sharding |
| Power Profile | Sustained high compute | Burst compute with idle periods |
The architectural differences reflect a fundamental insight about inference workloads: they are bursty, latency-sensitive, and memory-dominated. A token generation step reads the entire model from memory but performs relatively little computation per byte read. This makes memory bandwidth, not compute throughput, the primary bottleneck. Ironwood's design appears to prioritize bandwidth and capacity over raw TFLOPS.
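This bandwidth-bound regime can be quantified with a simple roofline bound. The sketch below uses the article's estimated bandwidth range (not disclosed specs) and gives a batch-1 ceiling; batched serving exceeds it because the weights are streamed once per decode step for the whole batch.

```python
# Decode is memory-bandwidth bound: each generated token streams the
# full weight set (plus KV cache) from HBM. An upper bound on batch-1
# tokens/sec is bandwidth / bytes_read. Bandwidth figures below are
# this page's estimates, not disclosed specs.
def decode_tok_s_upper_bound(weights_gb, bandwidth_gb_s, kv_gb=0.0):
    return bandwidth_gb_s / (weights_gb + kv_gb)

# 70B BF16 (140GB of weights) at the estimated 3,000-5,000 GB/s range:
for bw in (3000, 5000):
    print(round(decode_tok_s_upper_bound(140, bw), 1))  # 21.4 .. 35.7
```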
Real-World Performance Analysis
Since Ironwood has not launched at the time of writing, no independent benchmarks exist. However, we can construct a performance framework based on Ironwood's known specifications and the inference workload characteristics it targets.
Projected Single-Chip Inference Performance. Based on the 192GB memory per chip and estimated memory bandwidth improvements over Trillium:
| Model | Precision | Ironwood Single Chip (est.) | Trillium (min chips needed) | Notes |
|---|---|---|---|---|
| Llama 3.1 70B | BF16 | 40-60 tok/s (est.) | 5+ chips | Ironwood: single chip, memory-bandwidth bound |
| Llama 3.1 70B | FP8 | 70-100 tok/s (est.) | 3+ chips | FP8 halves memory, doubles effective bandwidth |
| Llama 3.1 8B | BF16 | 200-300 tok/s (est.) | 1 chip | Compute-bound on both |
| Llama 3.1 405B | FP8 | N/A (requires 3 chips) | 13+ chips | 405GB FP8 > 192GB per chip |
| Gemini Ultra | Google-optimized | Baseline (internal) | Slower (internal) | Google's primary use case |
These are rough estimates based on architectural reasoning, not measured data. Actual performance will depend on Ironwood's memory bandwidth (undisclosed), compute throughput (undisclosed), and XLA compiler optimizations at launch.
Pod-Scale Inference Economics. The real value proposition of Ironwood is not single-chip performance but pod-scale inference cost efficiency. Google's goal is to reduce the cost-per-token for serving Gemini and other large models to its Cloud customers. The relevant metric is cost per million output tokens:
| Platform | Estimated Cost per 1M Output Tokens (70B model) | Notes |
|---|---|---|
| Ironwood (committed pricing, est.) | $0.20-$0.50 (est.) | Aggressive estimate based on efficiency claims |
| Trillium (committed pricing) | $0.50-$1.00 (est.) | Multi-chip overhead increases cost |
| H100 Cloud (on-demand) | $0.80-$1.50 | Varies by provider, model optimization |
| MI300X Cloud (on-demand) | $0.60-$1.20 | Memory advantage helps, fewer providers |
If Google can achieve the lower end of these estimates, Ironwood could make Google Cloud the most cost-effective platform for LLM inference serving - which is precisely the strategic position Google wants.
Long-Context Inference Considerations. One area where Ironwood's architecture could provide significant advantages is long-context inference (128K+ token contexts). Long contexts require enormous KV caches - a 70B model with 128K context and batch size 1 can consume 40-60GB of KV cache memory. On Ironwood with 192GB per chip and ~52GB available after model weights, a single chip could handle this. On Trillium with 32GB per chip, the KV cache alone would require multiple chips, with cross-chip latency on every attention computation.
| Context Length | KV Cache Size (70B model, BF16) | Ironwood Chips Needed | Trillium Chips Needed |
|---|---|---|---|
| 4K tokens | ~2-3GB | 1 (fits easily) | 5+ (model-limited) |
| 32K tokens | ~10-15GB | 1 (fits easily) | 5+ (model-limited) |
| 128K tokens | ~40-60GB | 1 (tight fit) | 7-8 (cache-limited) |
| 1M tokens | ~300-500GB | 2-3 | 15-20 |
For applications like document analysis, code generation with large codebases, and multi-turn conversations with long histories, Ironwood's per-chip memory budget is a significant advantage. Google's Gemini models already support 1M+ token contexts, and serving these efficiently requires the kind of memory density that Ironwood provides.
Comparison with GPU-Based Inference. The main architectural differences between Ironwood and GPU-based inference platforms, and where each side holds the advantage:
| Advantage Area | Ironwood (TPU) | GPU (H100/B200) | Impact |
|---|---|---|---|
| Software-hardware co-design | XLA compiles directly for Ironwood silicon | CUDA generic + TensorRT optimization layer | Higher utilization on TPU |
| Interconnect at scale | ICI across 9,216 chips | NVLink (8 GPUs) + InfiniBand (inter-node) | Lower latency at scale on TPU |
| Inference specialization | Purpose-built, no training overhead | General-purpose, carries training hardware | Better efficiency per watt on TPU |
| Memory per chip | ~192GB (matches MI300X) | 80GB (H100) / 192GB (B200/MI300X) | Competitive |
| Software ecosystem | JAX/XLA only | CUDA, PyTorch, TensorRT, vLLM, etc. | Much broader on GPU |
Generational and Competitive Context
vs. TPU v6e Trillium. Ironwood is not a direct successor to Trillium - it is a parallel product line optimized for a different workload. Google's emerging strategy is Trillium for training, Ironwood for inference. Customers who currently use Trillium for both training and inference should plan to migrate inference workloads to Ironwood once it becomes available, while keeping Trillium for training. JAX code should port between the two with minimal changes - XLA compilation handles the hardware differences.
vs. NVIDIA B200/GB300. NVIDIA's Blackwell generation is Ironwood's closest competitor in the inference market. The B200 offers 192GB HBM3e and ~4,500 TFLOPS FP8, but it is a general-purpose accelerator that handles both training and inference. Ironwood's advantage is inference specialization and pod-scale ICI interconnect. NVIDIA's advantage is software ecosystem breadth - TensorRT-LLM, vLLM with CUDA, and the entire PyTorch inference toolchain. For organizations running on Google Cloud, Ironwood will likely be more cost-effective for Transformer inference. For organizations that need hardware flexibility or multi-cloud deployment, NVIDIA remains the default choice.
vs. AMD MI300X/MI350X. The MI300X matches Ironwood's 192GB memory per chip and is available for purchase - not cloud-only. The MI350X goes further with 288GB. For organizations that want to own their inference hardware, AMD's Instinct line is the relevant comparison. Ironwood targets a different buyer - cloud-native organizations that prefer operational expenditure over capital expenditure and are willing to accept Google Cloud lock-in in exchange for purpose-built inference efficiency.
vs. Huawei Ascend 910C. The Ascend 910C is not a direct competitor in any practical sense - Ironwood is a cloud-only product on Google Cloud, and the 910C is available only in China. However, the comparison is instructive for understanding the state of AI accelerator diversity. Ironwood with 192GB HBM and purpose-built inference optimization is at the high end of what inference silicon can deliver. The 910C with 96GB HBM2e and ~1,800 GB/s bandwidth represents what is achievable under export control constraints. The performance gap between the two reflects the manufacturing technology gap between TSMC-class and SMIC-class foundries.
Inference Architecture Market Trend. Ironwood is not the only inference-specialized chip entering the market. The broader trend toward inference-specific silicon reflects the industry's recognition that training and inference are fundamentally different workloads:
| Inference-Focused Product | Manufacturer | Memory | Availability | Approach |
|---|---|---|---|---|
| TPU v7 Ironwood | Google | ~192GB | Cloud-only (GCP) | Custom ASIC, ICI interconnect |
| NVIDIA GB300 NVL72 | NVIDIA | 1,664GB (aggregate 72 GPUs) | Purchase + cloud | GPU rack-scale inference |
| Groq LPU | Groq | 230MB SRAM (no HBM) | Cloud API | Ultra-low-latency ASIC |
| AWS Inferentia2 | Amazon | 32GB HBM2e per chip | Cloud-only (AWS) | Custom inference ASIC |
| Intel Gaudi 3 | Intel | 128GB HBM2e | Purchase + cloud | Inference-optimized accelerator |
Ironwood's distinguishing factors in this landscape are the memory capacity per chip (192GB matches or exceeds most competitors), the pod scale (9,216 chips far exceeds any competitor's interconnected scale), and the JAX/XLA software co-design. The trade-off is complete vendor lock-in to Google Cloud.
Deployment Model Comparison. For organizations evaluating their inference infrastructure strategy, the choice between Ironwood, GPU cloud instances, and on-premises hardware involves multiple dimensions:
| Decision Factor | Ironwood (GCP) | GPU Cloud (Multi-provider) | On-Premises GPU |
|---|---|---|---|
| Capital Expenditure | None | None | High |
| Operational Flexibility | GCP-only | Multi-cloud possible | Full control |
| Scaling Speed | Minutes (if capacity available) | Minutes-hours | Weeks-months |
| Data Sovereignty | GCP data residency | Provider-dependent | Full control |
| Vendor Lock-in | High (Google Cloud + JAX) | Moderate (CUDA portable) | None |
| Cost Predictability | Committed-use = predictable | On-demand = variable | Fixed + power |
| Hardware Refresh | Automatic (Google manages) | Automatic | Manual procurement |
What We Still Do Not Know. Several critical details about Ironwood remain undisclosed as of this writing:
- Exact memory bandwidth - This determines single-chip inference throughput for memory-bound LLM workloads
- Peak compute TFLOPS - Without this, performance projections are speculative
- Power consumption - Affects operational costs and data center planning
- Pricing model - Per-chip-hour, per-token, or a new pricing structure
- JAX/XLA feature parity - Whether Ironwood supports all existing XLA operations or introduces new capabilities
- Availability timeline - H2 2025 is broad; the difference between July and December matters for planning
- Multi-tenancy support - Whether a single Ironwood chip can be shared across multiple users/models
Google's Strategic Position. Ironwood's announcement signals that Google views inference cost as the next battleground in cloud AI competition. By building purpose-built inference silicon, Google is betting that the marginal cost of serving a token will determine which cloud platform wins the AI API market. If Ironwood delivers on its architectural promise, Google Cloud could offer meaningfully lower per-token pricing than AWS or Azure - both of which depend on NVIDIA GPUs for inference. This would be a structural advantage that software optimization alone cannot replicate.
Cloud Pricing Context. While Ironwood pricing has not been announced, we can reason about where it might land based on Google's strategic incentives and the competitive landscape:
| Pricing Scenario (per-chip-hour, est.) | Margin Profile | Strategic Logic |
|---|---|---|
| Aggressive ($6-$8) | Low margin | Drive adoption, win inference market share from NVIDIA GPU clouds |
| Moderate ($8-$10) | Sustainable margin | Balance adoption with profitability |
| Premium ($10-$12) | High margin | Capture premium from 192GB memory advantage |
| Memory-proportional ($12-$15) | Pro-rata vs Trillium | Price scaled toward the 6x memory increase (a strict 6x of $3.22 would be ~$19; unlikely) |
Google's history suggests aggressive pricing to drive adoption - they priced Trillium competitively with H100 instances and could follow the same playbook with Ironwood. The most likely scenario is moderate pricing ($8-$10/chip-hour on-demand) with committed-use discounts of 30-50%, targeting a per-token cost that undercuts GPU-based inference on any cloud.
Use Case Recommendations
Strong Fit:
- Google Cloud customers serving large LLMs at scale. If you are running inference for 70B+ parameter models on GCP and serving millions of requests per day, Ironwood is designed specifically for your workload. The combination of 192GB memory, inference-optimized silicon, and ICI pod scale should deliver the lowest cost-per-token on any cloud platform.
- Organizations building Gemini API-dependent applications. If your product relies on Google's Gemini API, you are already an Ironwood customer indirectly - Google serves Gemini on TPU infrastructure. As Ironwood ramps up, your API costs may decrease.
- Research teams studying inference optimization. Ironwood's inference-first architecture provides a unique platform for studying attention optimization, KV cache management, speculative decoding, and other inference-specific techniques.
- Multi-model serving with diverse model sizes. The 192GB per chip means a single Ironwood chip can serve multiple smaller models simultaneously (for example, a 7B model and a 13B model sharing one chip), simplifying multi-model deployments.
Weak Fit:
- Organizations that need training capacity. Ironwood is inference-only. If you need to train models, you need Trillium or GPU instances separately. This means two hardware platforms if you do both training and inference on GCP.
- Teams committed to PyTorch and CUDA. Ironwood runs JAX/XLA. If your entire inference stack is built on PyTorch with TensorRT-LLM or vLLM with CUDA backends, migrating to Ironwood requires significant re-engineering.
- Multi-cloud or on-premises requirements. Ironwood is GCP-only. Organizations with regulatory requirements for multi-cloud redundancy or on-premises data processing cannot use Ironwood as their primary inference platform.
- Small-scale inference deployments. If you are serving inference for a single model at modest throughput (thousands of requests per day rather than millions), the pod-scale advantages of Ironwood are irrelevant. A single GPU instance is simpler and sufficient.
- Organizations that need pricing certainty now. Ironwood pricing has not been announced. Procurement decisions that require firm pricing commitments cannot include Ironwood until Google publishes rates.
- Workloads requiring fine-grained GPU-level control. TPU programming through XLA is high-level and declarative. Teams that need precise control over memory layout, execution scheduling, or custom kernel operations will find GPU programming more flexible. The Pallas kernel language for TPUs exists but is less mature than CUDA.
Migration Path from Trillium to Ironwood. For existing Trillium users planning to adopt Ironwood for inference:
| Migration Step | Effort | Notes |
|---|---|---|
| Code compatibility | Low | JAX/XLA code should compile for both targets |
| Model checkpoint conversion | Low | Same format expected |
| Performance tuning | Medium | Ironwood may benefit from different batch sizes, sequence bucketing |
| Infrastructure orchestration | Medium | New instance types, different slice configurations |
| Cost optimization | Medium | New pricing tiers, different committed-use thresholds |
| Workload routing | Medium | Split training (Trillium) from inference (Ironwood) |
The transition from Trillium to Ironwood for inference should be smoother than any GPU-to-TPU migration because both products share the same XLA compilation infrastructure. The main optimization work will be tuning batch sizes and serving configurations for Ironwood's different memory profile (192GB per chip vs 32GB).
Strengths
- First TPU purpose-built for inference - silicon design decisions optimized for serving workloads
- ~192GB HBM per chip enables single-chip deployment of 70B-class models
- 9,216-chip pod scale with ICI interconnect - unmatched aggregate memory and bandwidth for inference farms
- ~1.7PB aggregate pod memory can serve trillion-parameter models at extreme throughput
- Google's vertical integration (silicon + interconnect + software + cloud) eliminates integration friction
- Builds on Google's decade of TPU production experience and JAX/XLA compiler optimization
- Powers Google's own Gemini inference - proven motivation to make the economics work
Weaknesses
- Very limited specifications disclosed - most performance claims are based on Google's marketing, not independent data
- Cloud-only availability means complete dependency on Google Cloud Platform
- Inference-only design means you cannot use Ironwood for training workloads (need separate Trillium capacity)
- JAX/XLA ecosystem lock-in - PyTorch workloads require non-trivial porting effort
- Pricing unknown - could be expensive if Google prices to reflect the massive memory increase
- No on-premises option for organizations with data sovereignty, compliance, or latency requirements
- Pod scale of 9,216 chips is impressive but only available to Google's largest customers
Related Coverage
- Google TPU v6e Trillium - Current-generation TPU for training and inference
- AMD Instinct MI300X - AMD's 192GB GPU, the closest memory-capacity competitor
- AMD Instinct MI350X - AMD's upcoming 288GB CDNA 4 GPU
- Huawei Ascend 910C - China's flagship AI accelerator
- Huawei Ascend 910B - Huawei's workhorse training chip
