Google TPU v7 Ironwood

TL;DR

  • Google's 7th-generation TPU - codenamed Ironwood - is purpose-built for inference rather than training
  • Massive memory per chip (reportedly 192GB HBM) designed to hold large models without multi-chip sharding
  • Scalable to pods of 9,216 chips interconnected via Google's next-gen ICI fabric
  • Announced at Google Cloud Next 2025, expected availability in H2 2025
  • Represents Google's bet that inference - not training - is the compute bottleneck for the next era of AI

Overview

The TPU v7 Ironwood is a strategic departure for Google's TPU program. Every previous TPU generation was designed primarily for training, with inference as a secondary consideration. Ironwood flips that priority. Announced at Google Cloud Next in April 2025, it is Google's first TPU built from the ground up for inference workloads - specifically, for serving the massive foundation models that Google and its Cloud customers are deploying at scale.

The logic behind this pivot is straightforward math. Training a frontier model is a one-time cost measured in tens to hundreds of millions of dollars. Serving that model to millions of users every day is an ongoing cost that, over the model's lifetime, dwarfs the training budget by an order of magnitude or more. As AI shifts from a research exercise to a production infrastructure challenge, the hardware that matters most is the hardware that serves inference efficiently. Google appears to have concluded that training and inference workloads are different enough to justify purpose-built silicon for each.

The headline specification is memory. Ironwood reportedly packs 192GB of HBM per chip - a 6x increase over Trillium's 32GB. This is a direct response to the defining constraint of LLM inference: you need to fit the entire model (or a substantial shard of it) in memory, and the KV cache for long-context serving can consume hundreds of gigabytes across a batch. With 192GB per chip, Ironwood can hold a 70B-parameter model in BF16 on a single chip, or serve 405B-class models across just a few chips rather than dozens. Google has also disclosed that Ironwood pods can scale to 9,216 chips, which would provide approximately 1.7 petabytes of aggregate HBM - enough to serve the largest models at extreme throughput.
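
These headline numbers are easy to sanity-check. The short Python sketch below reproduces the aggregate-memory and single-chip claims from the reported (not official) figures:

```python
# Sanity-check of the reported figures: 192GB HBM per chip, 9,216 chips
# per pod, and BF16 weights at 2 bytes per parameter. All inputs are
# reported/estimated values, not official Google specifications.
HBM_PER_CHIP_GB = 192
CHIPS_PER_POD = 9_216

def bf16_weight_gb(params_billions: float) -> float:
    """Approximate BF16 weight footprint: 2 bytes per parameter."""
    return params_billions * 2

pod_hbm_pb = HBM_PER_CHIP_GB * CHIPS_PER_POD / 1_000_000  # GB -> PB
print(round(pod_hbm_pb, 2))     # 1.77 -> ~1.7 PB of aggregate HBM
print(bf16_weight_gb(70))       # 140.0 GB -> a 70B model fits in 192GB
```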

Key Specifications

| Specification | Details |
| --- | --- |
| Manufacturer | Google |
| Product Family | Cloud TPU |
| Generation | v7 (Ironwood) |
| Process Node | Not disclosed |
| Chip Type | TPU (ASIC) |
| Primary Workload | Inference |
| Memory per Chip | ~192GB HBM (reported) |
| Memory Bandwidth | Not disclosed |
| Peak Compute | Not disclosed (significant improvement over Trillium expected) |
| Supported Data Types | BF16, FP8, INT8 (expected, plus potential INT4) |
| Interconnect | Next-gen ICI (Inter-Chip Interconnect) |
| Max Pod Size | 9,216 chips |
| Aggregate Pod Memory | ~1.7 PB HBM |
| TDP | Not disclosed |
| Availability | Google Cloud Platform only (expected H2 2025) |
| Target Workload | Inference |
| Release Date | H2 2025 (expected) |
| Pricing | Cloud-only (not yet announced) |

Note: Google has disclosed limited technical specifications for Ironwood. Many details in this page are based on the Cloud Next 2025 announcement, analyst reports, and informed estimates. This page will be updated as Google releases official specifications and pricing.

Performance Benchmarks (Estimated)

| Benchmark / Metric | TPU v7 Ironwood (est.) | TPU v6e Trillium | NVIDIA H100 SXM | AMD MI300X |
| --- | --- | --- | --- | --- |
| Memory per Chip | ~192GB | 32GB | 80GB | 192GB |
| Max Pod Size | 9,216 chips | 256 chips | Cluster-dependent | Cluster-dependent |
| Aggregate Pod Memory | ~1.7 PB | 8TB | Cluster-dependent | Cluster-dependent |
| Inference Throughput (relative) | Significant improvement | Baseline | Competitive | Competitive |
| Llama 70B per Chip | Yes (BF16) | No (requires multi-chip) | Yes (FP8/INT8) | Yes (BF16) |
| Target Workload | Inference-optimized | Training + Inference | Training + Inference | Training + Inference |
| Availability | Cloud-only | Cloud-only | Purchase + Cloud | Purchase + Cloud |

Direct performance comparisons are not yet possible because Google has not disclosed Ironwood's peak compute numbers or independent inference benchmarks. What we can reason about is the architectural intent. Ironwood's 192GB memory per chip matches the AMD MI300X and exceeds the H100's 80GB. Combined with Google's ICI interconnect at pod scale, this suggests Ironwood is designed to serve the largest models with minimal tensor parallelism overhead.

The pod scale is the real differentiator. A 9,216-chip Ironwood pod with 1.7PB of aggregate memory could theoretically serve a single trillion-parameter model with massive batch sizes, or run thousands of smaller model replicas for maximum throughput. No GPU cluster matches this level of integration: GPU deployments combine fast NVLink domains within a node with InfiniBand or Ethernet between nodes, and that boundary introduces latency and bandwidth discontinuities that a single pod-wide ICI fabric is designed to avoid.

Key Capabilities

Inference-First Architecture. Ironwood's most significant design decision is the explicit prioritization of inference over training. Inference workloads have different characteristics than training - they are latency-sensitive, memory-bandwidth-bound (rather than compute-bound), and require efficient handling of variable-length requests with unpredictable arrival patterns. By designing Ironwood specifically for these characteristics, Google can make silicon trade-offs that a general-purpose training chip cannot. This includes optimizing for low-latency single-token generation, efficient KV cache management, and high-throughput batching across many concurrent requests.

Massive Memory for Large Model Serving. The ~192GB of HBM per chip directly addresses the biggest infrastructure challenge in LLM inference: fitting the model in memory. A Llama 3.1 70B model in BF16 requires approximately 140GB. On Trillium with 32GB per chip, you need at least 5 chips with tensor parallelism. On Ironwood, it fits on a single chip with room for KV cache. This reduction in parallelism degree directly translates to lower latency (no cross-chip communication for each token), simpler deployment, and better cost efficiency. For 405B-class models in FP8 (~405GB of weights), Ironwood would require roughly 3 chips instead of the 13+ needed on Trillium.
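
These chip counts are just ceiling divisions of model size by per-chip HBM. A minimal sketch (weights only; real deployments also need KV cache and activation headroom, so treat these as lower bounds):

```python
import math

def min_chips(model_gb: float, hbm_per_chip_gb: float) -> int:
    """Smallest chip count whose pooled HBM holds the model weights.
    Ignores KV cache, activations, and parallelism overheads, so real
    deployments may need more chips than this lower bound."""
    return math.ceil(model_gb / hbm_per_chip_gb)

LLAMA_70B_BF16_GB = 140    # ~2 bytes/param
LLAMA_405B_FP8_GB = 405    # ~1 byte/param

print(min_chips(LLAMA_70B_BF16_GB, 32))    # Trillium: 5 chips
print(min_chips(LLAMA_70B_BF16_GB, 192))   # Ironwood: 1 chip
print(min_chips(LLAMA_405B_FP8_GB, 32))    # Trillium: 13 chips
print(min_chips(LLAMA_405B_FP8_GB, 192))   # Ironwood: 3 chips
```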

9,216-Chip Pod Scale. Google disclosed that Ironwood pods can scale to 9,216 chips - a 36x increase over Trillium's 256-chip maximum. At this scale, a single Ironwood pod could plausibly absorb a large share of Gemini API traffic across Google's customer base. The ICI interconnect must handle petabits per second of aggregate bandwidth with consistently low latency. Google's ability to co-design custom interconnect silicon alongside custom compute silicon is a structural advantage: NVIDIA and AMD keep their fastest links (NVLink, Infinity Fabric) inside a node and fall back to InfiniBand or Ethernet fabrics between nodes, whereas ICI presents one fabric across the entire pod.

Pricing and Availability

Google has not announced pricing for Ironwood. As a cloud-only product, pricing will follow Google Cloud's standard per-chip-hour model with on-demand, committed-use, and spot tiers.

Based on Trillium pricing and the significant memory increase, analyst estimates suggest Ironwood per-chip-hour costs could be 2-4x higher than Trillium's ~$3.22/chip-hour, reflecting the 6x memory increase and compute improvements. However, Google may price aggressively to drive adoption, particularly for inference workloads where it competes directly with NVIDIA GPU cloud instances.

| Accelerator | Estimated Cost | Memory | Notes |
| --- | --- | --- | --- |
| TPU v7 Ironwood (est.) | ~$6-$12/chip-hour | ~192GB | Not yet available |
| TPU v6e Trillium | ~$3.22/chip-hour | 32GB | Available now |
| NVIDIA H100 (cloud) | ~$3.50-$5.00/GPU-hour | 80GB | Multiple cloud providers |
| AMD MI300X (cloud) | ~$2.50-$3.50/GPU-hour | 192GB | Azure, OCI |

The cost comparison that matters is not per-chip-hour but per-million-tokens-served. Ironwood's inference optimization could deliver significantly more tokens per chip-hour than a general-purpose H100, making the higher per-chip cost irrelevant if throughput scales proportionally. Google has not disclosed these numbers, but they will be the key metric for evaluating Ironwood's value proposition once it launches.
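
The per-token metric is simple to compute once throughput is known. A hedged sketch, with entirely hypothetical prices and throughputs chosen only to show how a pricier chip can still win on cost-per-token:

```python
def cost_per_million_tokens(chip_hour_usd: float, tokens_per_sec: float,
                            chips: int = 1) -> float:
    """Serving cost per 1M output tokens at a sustained throughput.
    All inputs are assumptions until real pricing and benchmarks exist."""
    tokens_per_hour = tokens_per_sec * 3_600
    return (chip_hour_usd * chips) / tokens_per_hour * 1_000_000

# Hypothetical: a $4/hr accelerator at 1,200 tok/s versus an $8/hr
# accelerator at 4,800 tok/s - the more expensive chip is cheaper per token.
print(round(cost_per_million_tokens(4.0, 1_200), 2))   # 0.93 -> ~$0.93/1M tokens
print(round(cost_per_million_tokens(8.0, 4_800), 2))   # 0.46 -> ~$0.46/1M tokens
```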

Architecture Deep Dive

Ironwood represents a fundamental rethinking of what a TPU should be. Every previous TPU generation balanced training and inference. Ironwood drops training and optimizes everything - memory, compute, data movement, and scheduling - for the demands of serving large language models in production.

Inference-Optimized Compute Design. Training and inference have different compute profiles, and Ironwood's silicon reflects this:

| Property | Training Workload | Inference Workload | Ironwood Optimization |
| --- | --- | --- | --- |
| Dominant Operation | Forward + backward pass | Forward pass only | No backward pass hardware |
| Batch Characteristics | Large, fixed batches | Variable, continuous batching | Dynamic batch scheduler |
| Latency Sensitivity | Low (throughput-oriented) | High (user-facing) | Low-latency execution paths |
| Memory Access Pattern | Sequential, predictable | Random KV cache reads | KV cache-aware memory controller |
| Precision Requirements | BF16/FP32 (gradient accuracy) | FP8/INT8 (throughput over precision) | Aggressive quantization support |
| Compute-to-Memory Ratio | High compute demand | High memory bandwidth demand | Memory-bandwidth optimized |

By eliminating the backward pass hardware (gradient computation circuits, optimizer state memory management), Ironwood can dedicate more die area to inference-specific features: larger caches for KV storage, wider memory buses for parameter streaming, and specialized scheduling logic for continuous batching. Google has not disclosed the exact area savings, but by rough industry estimates, backward-pass support can account for 20-30% of a training accelerator's transistor budget.

Memory Architecture. The jump from Trillium's 32GB to Ironwood's ~192GB per chip is a 6x increase - the largest memory expansion in TPU history. This is not just more of the same HBM - Ironwood likely uses HBM3 or HBM3e stacks with higher per-stack bandwidth, providing both the capacity and bandwidth that inference workloads demand.

| Memory Property | TPU v6e Trillium | TPU v7 Ironwood (est.) | Improvement |
| --- | --- | --- | --- |
| Capacity per Chip | 32GB | ~192GB | 6x |
| Memory Bandwidth | ~1,600 GB/s | Not disclosed (est. 3,000-5,000 GB/s) | ~2-3x (estimated) |
| HBM Generation | HBM (generation not disclosed) | HBM3 or HBM3e (estimated) | Newer generation |
| Aggregate Pod Memory | 8TB (256 chips) | ~1.7PB (9,216 chips) | ~212x |
| KV Cache Budget per Chip | ~10-15GB (after model) | ~50-120GB (after model) | Massive increase |

The KV cache budget is a critical metric for inference that is often overlooked. When serving a 70B model in BF16 (140GB weights), Trillium has essentially zero room for KV cache on a single chip - the model itself exceeds the 32GB capacity. On Ironwood with 192GB, a 70B BF16 model leaves ~52GB for KV cache, enabling long-context serving (128K+ token contexts) or high batch sizes without cross-chip communication for cache lookups.
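
The KV cache arithmetic can be made concrete. A hedged sketch using a Llama-3.1-70B-like shape (80 layers, 8 KV heads via GQA, head dimension 128 - public model-card figures, not Ironwood specifics):

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Per-token KV cache: keys and values (factor of 2) for every layer,
    across the KV heads, at BF16 (2 bytes per element)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

# Llama-3.1-70B-like shape: 80 layers, 8 KV heads (GQA), head_dim 128.
per_token = kv_bytes_per_token(80, 8, 128)       # 327,680 bytes (~0.33 MB)
budget_gb = 192 - 140                            # HBM left after BF16 weights
max_cached_tokens = budget_gb * 10**9 // per_token

print(per_token)            # 327680
print(budget_gb)            # 52
print(max_cached_tokens)    # ~158,000 cached tokens at batch size 1
```

At roughly a third of a megabyte per token, the ~52GB budget covers a 128K-token context with room to spare - consistent with the single-chip long-context claim above.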

9,216-Chip Pod Architecture. Ironwood's pod scale represents a 36x increase over Trillium's 256-chip maximum. The ICI interconnect at this scale requires a different topology than the 2D/3D torus used in smaller pods.

| Pod Property | Trillium (256 chips) | Ironwood (9,216 chips) |
| --- | --- | --- |
| Max Chips per Pod | 256 | 9,216 |
| Aggregate Memory | 8TB | ~1.7PB |
| Aggregate Compute | High | ~36x Trillium (est.) |
| ICI Topology | 2D/3D torus | Multi-level hierarchy (est.) |
| Bisection Bandwidth | High | Very high (est.) |
| Use Case | Training + inference | Inference farms |

At 9,216 chips with 192GB each, an Ironwood pod provides approximately 1.7 petabytes of aggregate HBM. To put this in perspective: a Llama 405B model in BF16 requires ~810GB. A single Ironwood pod could serve over 2,000 independent replicas of this model simultaneously, or a single model with massive tensor parallelism for ultra-low-latency serving. This scale is designed for Google's internal Gemini serving needs - millions of API calls per minute across Google Search, Workspace, and Cloud API customers.
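
The replica figure is a simple capacity ratio, again using the reported numbers (and ignoring KV cache headroom, so it is an upper bound):

```python
# Aggregate pod HBM divided by one replica's BF16 weight footprint.
# Reported figures; ignores KV cache and activations, so an upper bound.
POD_HBM_GB = 9_216 * 192          # ~1.77M GB of aggregate HBM
LLAMA_405B_BF16_GB = 810          # ~2 bytes/param

max_replicas = POD_HBM_GB // LLAMA_405B_BF16_GB
print(max_replicas)               # 2184 -> "over 2,000 replicas"
```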

Inference-Specific Hardware Features. Based on Google's announcements and industry analysis, Ironwood likely includes several hardware features specifically targeting inference efficiency:

  • Continuous batching support - Hardware-level scheduling that can add new requests to an in-flight batch without waiting for the longest sequence to complete
  • KV cache compression - Hardware-assisted compression of attention key-value caches to reduce memory footprint for long-context serving
  • Speculative decoding support - Hardware paths optimized for running a small draft model and a large verification model in tandem
  • INT4/INT8 native support - Low-precision integer operations for maximizing throughput on quantized models
  • Prefill/decode separation - Ability to run prefill (compute-bound) and decode (memory-bound) phases on different chip configurations within the same pod
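
Of these, continuous batching is the easiest to illustrate in software. The toy scheduler below is not Ironwood's actual hardware scheduler - it simply contrasts backfilling freed batch slots against the static approach of draining a whole batch before admitting new requests:

```python
from collections import deque

def static_batching_steps(lengths, max_batch):
    """Baseline: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), max_batch):
        steps += max(lengths[i:i + max_batch])   # stragglers stall the batch
    return steps

def continuous_batching_steps(lengths, max_batch):
    """Backfill freed slots every step instead of draining the batch."""
    pending, active, steps = deque(lengths), [], 0
    while pending or active:
        while pending and len(active) < max_batch:
            active.append(pending.popleft())       # admit requests mid-flight
        active = [r - 1 for r in active if r > 1]  # decode one token each
        steps += 1
    return steps

# Seven short requests plus one long straggler, batch size 4.
reqs = [8, 8, 128, 8, 8, 8, 8, 8]
print(static_batching_steps(reqs, 4))       # 136 total decode steps
print(continuous_batching_steps(reqs, 4))   # 128 steps, and slots stay busier
```

The straggler dominates either way, but continuous batching finishes sooner and, more importantly, keeps the short requests from idling behind it - which is where the throughput win comes from at production request rates.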

Comparison to Trillium Architecture. Understanding what changed between generations clarifies Ironwood's design philosophy:

| Architectural Decision | Trillium (Training-focused) | Ironwood (Inference-focused) |
| --- | --- | --- |
| Backward Pass Support | Full gradient computation hardware | Minimal or none |
| Memory Priority | Balanced (weights + activations + gradients) | Weights + KV cache optimized |
| Batch Scheduling | Static batch, fixed size | Dynamic continuous batching |
| Precision Focus | BF16 primary (gradient accuracy) | FP8/INT8 primary (throughput) |
| Pod Scale | 256 chips (training parallelism) | 9,216 chips (inference throughput) |
| ICI Optimization | All-reduce for gradient sync | Point-to-point for model sharding |
| Power Profile | Sustained high compute | Burst compute with idle periods |

The architectural differences reflect a fundamental insight about inference workloads: they are bursty, latency-sensitive, and memory-dominated. A token generation step reads the entire model from memory but performs relatively little computation per byte read. This makes memory bandwidth, not compute throughput, the primary bottleneck. Ironwood's design appears to prioritize bandwidth and capacity over raw TFLOPS.
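
This memory-bound behavior yields a simple upper bound: at batch size 1, a decode step cannot run faster than the time to stream all the weights from HBM. The bandwidth figures below are illustrative assumptions, not disclosed Ironwood specs:

```python
def decode_ceiling_tok_s(model_gb: float, hbm_bandwidth_gb_s: float) -> float:
    """Batch-1 decode ceiling: every generated token reads all weights once,
    so tokens/sec <= bandwidth / model size. Batching raises aggregate
    throughput by amortizing the same weight reads across many sequences."""
    return hbm_bandwidth_gb_s / model_gb

# 70B BF16 (~140 GB of weights) under three assumed HBM bandwidths (GB/s):
for bw in (1_600, 3_000, 5_000):
    print(bw, round(decode_ceiling_tok_s(140, bw), 1))
```

Quantizing to FP8 halves `model_gb` and therefore doubles the ceiling, which is why low-precision support matters so much for inference silicon.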

Real-World Performance Analysis

Since Ironwood has not launched at the time of writing, no independent benchmarks exist. However, we can construct a performance framework based on Ironwood's known specifications and the inference workload characteristics it targets.

Projected Single-Chip Inference Performance. Based on the 192GB memory per chip and estimated memory bandwidth improvements over Trillium:

| Model | Precision | Ironwood Single Chip (est.) | Trillium (min chips needed) | Notes |
| --- | --- | --- | --- | --- |
| Llama 3.1 70B | BF16 | 40-60 tok/s (est.) | 5+ chips | Ironwood: single chip, memory-bandwidth bound |
| Llama 3.1 70B | FP8 | 70-100 tok/s (est.) | 3+ chips | FP8 halves memory, doubles effective bandwidth |
| Llama 3.1 8B | BF16 | 200-300 tok/s (est.) | 1 chip | Compute-bound on both |
| Llama 3.1 405B | FP8 | N/A (requires 3+ chips) | 13+ chips | 405GB FP8 > 192GB |
| Gemini Ultra | Google-optimized | Baseline (internal) | Slower (internal) | Google's primary use case |

These are rough estimates based on architectural reasoning, not measured data. Actual performance will depend on Ironwood's memory bandwidth (undisclosed), compute throughput (undisclosed), and XLA compiler optimizations at launch.

Pod-Scale Inference Economics. The real value proposition of Ironwood is not single-chip performance but pod-scale inference cost efficiency. Google's goal is to reduce the cost-per-token for serving Gemini and other large models to its Cloud customers. The relevant metric is cost per million output tokens:

| Platform | Estimated Cost per 1M Output Tokens (70B model) | Notes |
| --- | --- | --- |
| Ironwood (committed pricing, est.) | $0.20-$0.50 (est.) | Aggressive estimate based on efficiency claims |
| Trillium (committed pricing) | $0.50-$1.00 (est.) | Multi-chip overhead increases cost |
| H100 Cloud (on-demand) | $0.80-$1.50 | Varies by provider, model optimization |
| MI300X Cloud (on-demand) | $0.60-$1.20 | Memory advantage helps, fewer providers |

If Google can achieve the lower end of these estimates, Ironwood could make Google Cloud the most cost-effective platform for LLM inference serving - which is precisely the strategic position Google wants.

Long-Context Inference Considerations. One area where Ironwood's architecture could provide significant advantages is long-context inference (128K+ token contexts). Long contexts require enormous KV caches - a 70B model with 128K context and batch size 1 can consume 40-60GB of KV cache memory. On Ironwood with 192GB per chip and ~52GB available after model weights, a single chip could handle this. On Trillium with 32GB per chip, the KV cache alone would require multiple chips, with cross-chip latency on every attention computation.

| Context Length | KV Cache Size (70B model, BF16) | Ironwood Chips Needed | Trillium Chips Needed |
| --- | --- | --- | --- |
| 4K tokens | ~2-3GB | 1 (fits easily) | 5+ (model-limited) |
| 32K tokens | ~10-15GB | 1 (fits easily) | 5+ (model-limited) |
| 128K tokens | ~40-60GB | 1 (tight fit) | 7-8 (cache-limited) |
| 1M tokens | ~300-500GB | 2-3 | 15-20 |

For applications like document analysis, code generation with large codebases, and multi-turn conversations with long histories, Ironwood's per-chip memory budget is a significant advantage. Google's Gemini models already support 1M+ token contexts, and serving these efficiently requires the kind of memory density that Ironwood provides.

Comparison with GPU-Based Inference. The key architectural advantages Ironwood holds for inference versus GPU-based solutions:

| Advantage Area | Ironwood (TPU) | GPU (H100/B200) | Impact |
| --- | --- | --- | --- |
| Software-hardware co-design | XLA compiles directly for Ironwood silicon | CUDA generic + TensorRT optimization layer | Higher utilization on TPU |
| Interconnect at scale | ICI across 9,216 chips | NVLink (8 GPUs) + InfiniBand (inter-node) | Lower latency at scale on TPU |
| Inference specialization | Purpose-built, no training overhead | General-purpose, carries training hardware | Better efficiency per watt on TPU |
| Memory per chip | ~192GB (matches MI300X) | 80GB (H100) / 192GB (B200/MI300X) | Competitive |
| Software ecosystem | JAX/XLA only | CUDA, PyTorch, TensorRT, vLLM, etc. | Much broader on GPU |

Generational and Competitive Context

vs. TPU v6e Trillium. Ironwood is not a direct successor to Trillium - it is a parallel product line optimized for a different workload. Google's emerging strategy is Trillium for training, Ironwood for inference. Customers who currently use Trillium for both training and inference should plan to migrate inference workloads to Ironwood once it becomes available, while keeping Trillium for training. JAX code should port between the two with minimal changes - XLA compilation handles the hardware differences.

vs. NVIDIA B200/GB300. NVIDIA's Blackwell generation is Ironwood's closest competitor in the inference market. The B200 offers 192GB HBM3e and ~4,500 TFLOPS FP8, but it is a general-purpose accelerator that handles both training and inference. Ironwood's advantage is inference specialization and pod-scale ICI interconnect. NVIDIA's advantage is software ecosystem breadth - TensorRT-LLM, vLLM with CUDA, and the entire PyTorch inference toolchain. For organizations running on Google Cloud, Ironwood will likely be more cost-effective for Transformer inference. For organizations that need hardware flexibility or multi-cloud deployment, NVIDIA remains the default choice.

vs. AMD MI300X/MI350X. The MI300X matches Ironwood's 192GB memory per chip and is available for purchase - not cloud-only. The MI350X goes further with 288GB. For organizations that want to own their inference hardware, AMD's Instinct line is the relevant comparison. Ironwood targets a different buyer - cloud-native organizations that prefer operational expenditure over capital expenditure and are willing to accept Google Cloud lock-in in exchange for purpose-built inference efficiency.

vs. Huawei Ascend 910C. The Ascend 910C is not a direct competitor in any practical sense - Ironwood is a cloud-only product on Google Cloud, and the 910C is available only in China. However, the comparison is instructive for understanding the state of AI accelerator diversity. Ironwood with 192GB HBM and purpose-built inference optimization is at the high end of what inference silicon can deliver. The 910C with 96GB HBM2e and ~1,800 GB/s bandwidth represents what is achievable under export control constraints. The performance gap between the two reflects the manufacturing technology gap between TSMC-class and SMIC-class foundries.

Inference Architecture Market Trend. Ironwood is not the only inference-specialized chip entering the market. The broader trend toward inference-specific silicon reflects the industry's recognition that training and inference are fundamentally different workloads:

| Inference-Focused Product | Manufacturer | Memory | Availability | Approach |
| --- | --- | --- | --- | --- |
| TPU v7 Ironwood | Google | ~192GB | Cloud-only (GCP) | Custom ASIC, ICI interconnect |
| NVIDIA GB300 NVL72 | NVIDIA | ~20.7TB aggregate (72 GPUs × 288GB) | Purchase + cloud | GPU rack-scale inference |
| Groq LPU | Groq | 230MB SRAM (no HBM) | Cloud API | Ultra-low-latency ASIC |
| AWS Inferentia2 | Amazon | 32GB HBM2e per chip | Cloud-only (AWS) | Custom inference ASIC |
| Intel Gaudi 3 | Intel | 128GB HBM2e | Purchase + cloud | Inference-optimized accelerator |

Ironwood's distinguishing factors in this landscape are the memory capacity per chip (192GB matches or exceeds most competitors), the pod scale (9,216 chips far exceeds any competitor's interconnected scale), and the JAX/XLA software co-design. The trade-off is complete vendor lock-in to Google Cloud.

Deployment Model Comparison. For organizations evaluating their inference infrastructure strategy, the choice between Ironwood, GPU cloud instances, and on-premises hardware involves multiple dimensions:

| Decision Factor | Ironwood (GCP) | GPU Cloud (Multi-provider) | On-Premises GPU |
| --- | --- | --- | --- |
| Capital Expenditure | None | None | High |
| Operational Flexibility | GCP-only | Multi-cloud possible | Full control |
| Scaling Speed | Minutes (if capacity available) | Minutes-hours | Weeks-months |
| Data Sovereignty | GCP data residency | Provider-dependent | Full control |
| Vendor Lock-in | High (Google Cloud + JAX) | Moderate (CUDA portable) | None |
| Cost Predictability | Committed-use = predictable | On-demand = variable | Fixed + power |
| Hardware Refresh | Automatic (Google manages) | Automatic | Manual procurement |

What We Still Do Not Know. Several critical details about Ironwood remain undisclosed as of this writing:

  • Exact memory bandwidth - This determines single-chip inference throughput for memory-bound LLM workloads
  • Peak compute TFLOPS - Without this, performance projections are speculative
  • Power consumption - Affects operational costs and data center planning
  • Pricing model - Per-chip-hour, per-token, or a new pricing structure
  • JAX/XLA feature parity - Whether Ironwood supports all existing XLA operations or introduces new capabilities
  • Availability timeline - H2 2025 is broad; the difference between July and December matters for planning
  • Multi-tenancy support - Whether a single Ironwood chip can be shared across multiple users/models

Google's Strategic Position. Ironwood's announcement signals that Google views inference cost as the next battleground in cloud AI competition. By building purpose-built inference silicon, Google is betting that the marginal cost of serving a token will determine which cloud platform wins the AI API market. If Ironwood delivers on its architectural promise, Google Cloud could offer meaningfully lower per-token pricing than AWS or Azure - both of which depend on NVIDIA GPUs for inference. This would be a structural advantage that software optimization alone cannot replicate.

Cloud Pricing Context. While Ironwood pricing has not been announced, we can reason about where it might land based on Google's strategic incentives and the competitive landscape:

| Pricing Scenario | Per-Chip-Hour (est.) | Strategic Logic |
| --- | --- | --- |
| Aggressive | $6-$8 | Low margin; drive adoption, win inference market share from NVIDIA GPU clouds |
| Moderate | $8-$10 | Sustainable margin; balance adoption with profitability |
| Premium | $10-$12 | High margin; capture premium from 192GB memory advantage |
| Memory-proportional | $12-$15 | Pro-rata vs Trillium; 6x memory = 6x price (unlikely but possible) |

Google's history suggests aggressive pricing to drive adoption - they priced Trillium competitively with H100 instances and could follow the same playbook with Ironwood. The most likely scenario is moderate pricing ($8-$10/chip-hour on-demand) with committed-use discounts of 30-50%, targeting a per-token cost that undercuts GPU-based inference on any cloud.
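
Crossing those on-demand scenarios with typical committed-use discounts gives a plausible committed-rate band. All numbers here are hypothetical until Google publishes pricing:

```python
# Hypothetical on-demand midpoints for the three scenarios above, crossed
# with the 30-50% committed-use discounts Google Cloud typically offers.
scenarios = {"aggressive": 7.0, "moderate": 9.0, "premium": 11.0}  # $/chip-hr

for name, on_demand in scenarios.items():
    lo = on_demand * (1 - 0.50)   # 50% committed-use discount
    hi = on_demand * (1 - 0.30)   # 30% committed-use discount
    print(f"{name}: ${lo:.2f}-${hi:.2f}/chip-hour committed")
```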

Use Case Recommendations

Strong Fit:

  • Google Cloud customers serving large LLMs at scale. If you are running inference for 70B+ parameter models on GCP and serving millions of requests per day, Ironwood is designed specifically for your workload. The combination of 192GB memory, inference-optimized silicon, and ICI pod scale should deliver the lowest cost-per-token on any cloud platform.
  • Organizations building Gemini API-dependent applications. If your product relies on Google's Gemini API, you are already an Ironwood customer indirectly - Google serves Gemini on TPU infrastructure. As Ironwood ramps up, your API costs may decrease.
  • Research teams studying inference optimization. Ironwood's inference-first architecture provides a unique platform for studying attention optimization, KV cache management, speculative decoding, and other inference-specific techniques.
  • Multi-model serving with diverse model sizes. The 192GB per chip means a single Ironwood chip can serve multiple smaller models simultaneously (for example, a 7B model and a 13B model sharing one chip), simplifying multi-model deployments.

Weak Fit:

  • Organizations that need training capacity. Ironwood is inference-only. If you need to train models, you need Trillium or GPU instances separately. This means two hardware platforms if you do both training and inference on GCP.
  • Teams committed to PyTorch and CUDA. Ironwood runs JAX/XLA. If your entire inference stack is built on PyTorch with TensorRT-LLM or vLLM with CUDA backends, migrating to Ironwood requires significant re-engineering.
  • Multi-cloud or on-premises requirements. Ironwood is GCP-only. Organizations with regulatory requirements for multi-cloud redundancy or on-premises data processing cannot use Ironwood as their primary inference platform.
  • Small-scale inference deployments. If you are serving inference for a single model at modest throughput (thousands of requests per day rather than millions), the pod-scale advantages of Ironwood are irrelevant. A single GPU instance is simpler and sufficient.
  • Organizations that need pricing certainty now. Ironwood pricing has not been announced. Procurement decisions that require firm pricing commitments cannot include Ironwood until Google publishes rates.
  • Workloads requiring fine-grained GPU-level control. TPU programming through XLA is high-level and declarative. Teams that need precise control over memory layout, execution scheduling, or custom kernel operations will find GPU programming more flexible. The Pallas kernel language for TPUs exists but is less mature than CUDA.

Migration Path from Trillium to Ironwood. For existing Trillium users planning to adopt Ironwood for inference:

| Migration Step | Effort | Notes |
| --- | --- | --- |
| Code compatibility | Low | JAX/XLA code should compile for both targets |
| Model checkpoint conversion | Low | Same format expected |
| Performance tuning | Medium | Ironwood may benefit from different batch sizes, sequence bucketing |
| Infrastructure orchestration | Medium | New instance types, different slice configurations |
| Cost optimization | Medium | New pricing tiers, different committed-use thresholds |
| Workload routing | Medium | Split training (Trillium) from inference (Ironwood) |

The transition from Trillium to Ironwood for inference should be smoother than any GPU-to-TPU migration because both products share the same XLA compilation infrastructure. The main optimization work will be tuning batch sizes and serving configurations for Ironwood's different memory profile (192GB per chip vs 32GB).

Strengths

  • First TPU purpose-built for inference - silicon design decisions optimized for serving workloads
  • ~192GB HBM per chip enables single-chip deployment of 70B-class models
  • 9,216-chip pod scale with ICI interconnect - unmatched aggregate memory and bandwidth for inference farms
  • ~1.7PB aggregate pod memory can serve trillion-parameter models at extreme throughput
  • Google's vertical integration (silicon + interconnect + software + cloud) eliminates integration friction
  • Builds on Google's decade of TPU production experience and JAX/XLA compiler optimization
  • Powers Google's own Gemini inference - proven motivation to make the economics work

Weaknesses

  • Very limited specifications disclosed - most performance claims are based on Google's marketing, not independent data
  • Cloud-only availability means complete dependency on Google Cloud Platform
  • Inference-only design means you cannot use Ironwood for training workloads (need separate Trillium capacity)
  • JAX/XLA ecosystem lock-in - PyTorch workloads require non-trivial porting effort
  • Pricing unknown - could be expensive if Google prices to reflect the massive memory increase
  • No on-premises option for organizations with data sovereignty, compliance, or latency requirements
  • Pod scale of 9,216 chips is impressive but only available to Google's largest customers
