Google TPU v7 Ironwood

TL;DR

  • Google's 7th-generation TPU - codenamed Ironwood - is purpose-built for inference rather than training
  • Massive memory per chip (reportedly 192GB HBM) designed to hold large models without multi-chip sharding
  • Scalable to pods of 9,216 chips interconnected via Google's next-gen ICI fabric
  • Announced at Google Cloud Next 2025, expected availability in H2 2025
  • Represents Google's bet that inference - not training - is the compute bottleneck for the next era of AI

Overview

The TPU v7 Ironwood is a strategic departure for Google's TPU program. Every previous TPU generation was designed primarily for training, with inference as a secondary consideration. Ironwood flips that priority. Announced at Google Cloud Next in April 2025, it is Google's first TPU built from the ground up for inference workloads - specifically, for serving the massive foundation models that Google and its Cloud customers are deploying at scale.

The logic behind this pivot is straightforward math. Training a frontier model is a one-time cost measured in tens to hundreds of millions of dollars. Serving that model to millions of users every day is an ongoing cost that, over the model's lifetime, dwarfs the training budget by an order of magnitude or more. As AI shifts from a research exercise to a production infrastructure challenge, the hardware that matters most is the hardware that serves inference efficiently. Google appears to have concluded that training and inference workloads are different enough to justify purpose-built silicon for each.

The headline specification is memory. Ironwood reportedly packs 192GB of HBM per chip - a 6x increase over Trillium's 32GB. This is a direct response to the defining constraint of LLM inference: you need to fit the entire model (or a substantial shard of it) in memory, and the KV cache for long-context serving can consume hundreds of gigabytes across a batch. With 192GB per chip, Ironwood can hold a 70B-parameter model in BF16 on a single chip, or serve 405B-class models across just a few chips rather than dozens. Google has also disclosed that Ironwood pods can scale to 9,216 chips, which would provide approximately 1.7 petabytes of aggregate HBM - enough to serve the largest models at extreme throughput.
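
These headline numbers are easy to sanity-check. The short Python sketch below reproduces the aggregate-memory and single-chip claims from the reported (not official) figures:

```python
# Sanity-check of the reported figures: 192GB HBM per chip, 9,216 chips
# per pod, and BF16 weights at 2 bytes per parameter. All inputs are
# reported/estimated values, not official Google specifications.
HBM_PER_CHIP_GB = 192
CHIPS_PER_POD = 9_216

def bf16_weight_gb(params_billions: float) -> float:
    """Approximate BF16 weight footprint: 2 bytes per parameter."""
    return params_billions * 2

pod_hbm_pb = HBM_PER_CHIP_GB * CHIPS_PER_POD / 1_000_000  # GB -> PB
print(round(pod_hbm_pb, 2))     # 1.77 -> ~1.7 PB of aggregate HBM
print(bf16_weight_gb(70))       # 140.0 GB -> a 70B model fits in 192GB
```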

Key Specifications

| Specification | Details |
| --- | --- |
| Manufacturer | Google |
| Product Family | Cloud TPU |
| Generation | v7 (Ironwood) |
| Process Node | Not disclosed |
| Chip Type | TPU (ASIC) |
| Primary Workload | Inference |
| Memory per Chip | ~192GB HBM (reported) |
| Memory Bandwidth | Not disclosed |
| Peak Compute | Not disclosed (significant improvement over Trillium expected) |
| Supported Data Types | BF16, FP8, INT8 (expected, plus potential INT4) |
| Interconnect | Next-gen ICI (Inter-Chip Interconnect) |
| Max Pod Size | 9,216 chips |
| Aggregate Pod Memory | ~1.7 PB HBM |
| TDP | Not disclosed |
| Availability | Google Cloud Platform only (expected H2 2025) |
| Target Workload | Inference |
| Release Date | H2 2025 (expected) |
| Pricing | Cloud-only (not yet announced) |

Note: Google has disclosed limited technical specifications for Ironwood. Many details in this page are based on the Cloud Next 2025 announcement, analyst reports, and informed estimates. This page will be updated as Google releases official specifications and pricing.

Performance Benchmarks (Estimated)

| Benchmark / Metric | TPU v7 Ironwood (est.) | TPU v6e Trillium | NVIDIA H100 SXM | AMD MI300X |
| --- | --- | --- | --- | --- |
| Memory per Chip | ~192GB | 32GB | 80GB | 192GB |
| Max Pod Size | 9,216 chips | 256 chips | Cluster-dependent | Cluster-dependent |
| Aggregate Pod Memory | ~1.7 PB | 8TB | Cluster-dependent | Cluster-dependent |
| Inference Throughput (relative) | Significant improvement | Baseline | Competitive | Competitive |
| Llama 70B per Chip | Yes (BF16) | No (requires multi-chip) | Yes (FP8/INT8) | Yes (BF16) |
| Target Workload | Inference-optimized | Training + Inference | Training + Inference | Training + Inference |
| Availability | Cloud-only | Cloud-only | Purchase + Cloud | Purchase + Cloud |

Direct performance comparisons are not yet possible because Google has not disclosed Ironwood's peak compute numbers or independent inference benchmarks. What we can reason about is the architectural intent. Ironwood's 192GB memory per chip matches the AMD MI300X and exceeds the H100's 80GB. Combined with Google's ICI interconnect at pod scale, this suggests Ironwood is designed to serve the largest models with minimal tensor parallelism overhead.

The pod scale is the real differentiator. A 9,216-chip Ironwood pod with 1.7PB of aggregate memory could theoretically serve a single trillion-parameter model with massive batch sizes, or run thousands of smaller model replicas for maximum throughput. No GPU cluster matches this level of integration: GPU deployments combine fast NVLink domains within a node with InfiniBand or Ethernet between nodes, and that boundary introduces latency and bandwidth discontinuities that a single pod-wide ICI fabric is designed to avoid.

Key Capabilities

Inference-First Architecture. Ironwood's most significant design decision is the explicit prioritization of inference over training. Inference workloads have different characteristics than training - they are latency-sensitive, memory-bandwidth-bound (rather than compute-bound), and require efficient handling of variable-length requests with unpredictable arrival patterns. By designing Ironwood specifically for these characteristics, Google can make silicon trade-offs that a general-purpose training chip cannot. This includes optimizing for low-latency single-token generation, efficient KV cache management, and high-throughput batching across many concurrent requests.

Massive Memory for Large Model Serving. The ~192GB of HBM per chip directly addresses the biggest infrastructure challenge in LLM inference: fitting the model in memory. A Llama 3.1 70B model in BF16 requires approximately 140GB. On Trillium with 32GB per chip, you need at least 5 chips with tensor parallelism. On Ironwood, it fits on a single chip with room for KV cache. This reduction in parallelism degree directly translates to lower latency (no cross-chip communication for each token), simpler deployment, and better cost efficiency. For 405B-class models in FP8 (~405GB of weights), Ironwood would require roughly 3 chips instead of the 13+ needed on Trillium.
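
These chip counts are just ceiling divisions of model size by per-chip HBM. A minimal sketch (weights only; real deployments also need KV cache and activation headroom, so treat these as lower bounds):

```python
import math

def min_chips(model_gb: float, hbm_per_chip_gb: float) -> int:
    """Smallest chip count whose pooled HBM holds the model weights.
    Ignores KV cache, activations, and parallelism overheads, so real
    deployments may need more chips than this lower bound."""
    return math.ceil(model_gb / hbm_per_chip_gb)

LLAMA_70B_BF16_GB = 140    # ~2 bytes/param
LLAMA_405B_FP8_GB = 405    # ~1 byte/param

print(min_chips(LLAMA_70B_BF16_GB, 32))    # Trillium: 5 chips
print(min_chips(LLAMA_70B_BF16_GB, 192))   # Ironwood: 1 chip
print(min_chips(LLAMA_405B_FP8_GB, 32))    # Trillium: 13 chips
print(min_chips(LLAMA_405B_FP8_GB, 192))   # Ironwood: 3 chips
```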

9,216-Chip Pod Scale. Google disclosed that Ironwood pods can scale to 9,216 chips - a 36x increase over Trillium's 256-chip maximum. At this scale, a single Ironwood pod could plausibly absorb a large share of Gemini API traffic across Google's customer base. The ICI interconnect must handle petabits per second of aggregate bandwidth with consistently low latency. Google's ability to co-design custom interconnect silicon alongside custom compute silicon is a structural advantage: NVIDIA and AMD keep their fastest links (NVLink, Infinity Fabric) inside a node and fall back to InfiniBand or Ethernet fabrics between nodes, whereas ICI presents one fabric across the entire pod.

Pricing and Availability

Google has not announced pricing for Ironwood. As a cloud-only product, pricing will follow Google Cloud's standard per-chip-hour model with on-demand, committed-use, and spot tiers.

Based on Trillium pricing and the significant memory increase, analyst estimates suggest Ironwood per-chip-hour costs could be 2-4x higher than Trillium's ~$3.22/chip-hour, reflecting the 6x memory increase and compute improvements. However, Google may price aggressively to drive adoption, particularly for inference workloads where it competes directly with NVIDIA GPU cloud instances.

| Accelerator | Estimated Cost | Memory | Notes |
| --- | --- | --- | --- |
| TPU v7 Ironwood (est.) | ~$6-$12/chip-hour | ~192GB | Not yet available |
| TPU v6e Trillium | ~$3.22/chip-hour | 32GB | Available now |
| NVIDIA H100 (cloud) | ~$3.50-$5.00/GPU-hour | 80GB | Multiple cloud providers |
| AMD MI300X (cloud) | ~$2.50-$3.50/GPU-hour | 192GB | Azure, OCI |

The cost comparison that matters is not per-chip-hour but per-million-tokens-served. Ironwood's inference optimization could deliver significantly more tokens per chip-hour than a general-purpose H100, making the higher per-chip cost irrelevant if throughput scales proportionally. Google has not disclosed these numbers, but they will be the key metric for evaluating Ironwood's value proposition once it launches.
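
The per-token metric is simple to compute once throughput is known. A hedged sketch, with entirely hypothetical prices and throughputs chosen only to show how a pricier chip can still win on cost-per-token:

```python
def cost_per_million_tokens(chip_hour_usd: float, tokens_per_sec: float,
                            chips: int = 1) -> float:
    """Serving cost per 1M output tokens at a sustained throughput.
    All inputs are assumptions until real pricing and benchmarks exist."""
    tokens_per_hour = tokens_per_sec * 3_600
    return (chip_hour_usd * chips) / tokens_per_hour * 1_000_000

# Hypothetical: a $4/hr accelerator at 1,200 tok/s versus an $8/hr
# accelerator at 4,800 tok/s - the more expensive chip is cheaper per token.
print(round(cost_per_million_tokens(4.0, 1_200), 2))   # 0.93 -> ~$0.93/1M tokens
print(round(cost_per_million_tokens(8.0, 4_800), 2))   # 0.46 -> ~$0.46/1M tokens
```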

Architecture Deep Dive

Ironwood represents a fundamental rethinking of what a TPU should be. Every previous TPU generation balanced training and inference. Ironwood drops training and optimizes everything - memory, compute, data movement, and scheduling - for the demands of serving large language models in production.

Inference-Optimized Compute Design. Training and inference have different compute profiles, and Ironwood's silicon reflects this:

| Property | Training Workload | Inference Workload | Ironwood Optimization |
| --- | --- | --- | --- |
| Dominant Operation | Forward + backward pass | Forward pass only | No backward pass hardware |
| Batch Characteristics | Large, fixed batches | Variable, continuous batching | Dynamic batch scheduler |
| Latency Sensitivity | Low (throughput-oriented) | High (user-facing) | Low-latency execution paths |
| Memory Access Pattern | Sequential, predictable | Random KV cache reads | KV cache-aware memory controller |
| Precision Requirements | BF16/FP32 (gradient accuracy) | FP8/INT8 (throughput over precision) | Aggressive quantization support |
| Compute-to-Memory Ratio | High compute demand | High memory bandwidth demand | Memory-bandwidth optimized |

By eliminating the backward pass hardware (gradient computation circuits, optimizer state memory management), Ironwood can dedicate more die area to inference-specific features: larger caches for KV storage, wider memory buses for parameter streaming, and specialized scheduling logic for continuous batching. Google has not disclosed the exact area savings, but by rough industry estimates, backward-pass support can account for 20-30% of a training accelerator's transistor budget.

Memory Architecture. The jump from Trillium's 32GB to Ironwood's ~192GB per chip is a 6x increase - the largest memory expansion in TPU history. This is not just more of the same HBM - Ironwood likely uses HBM3 or HBM3e stacks with higher per-stack bandwidth, providing both the capacity and bandwidth that inference workloads demand.

| Memory Property | TPU v6e Trillium | TPU v7 Ironwood (est.) | Improvement |
| --- | --- | --- | --- |
| Capacity per Chip | 32GB | ~192GB | 6x |
| Memory Bandwidth | ~1,600 GB/s | Not disclosed (est. 3,000-5,000 GB/s) | ~2-3x (estimated) |
| HBM Generation | HBM (generation not disclosed) | HBM3 or HBM3e (estimated) | Newer generation |
| Aggregate Pod Memory | 8TB (256 chips) | ~1.7PB (9,216 chips) | ~212x |
| KV Cache Budget per Chip | ~10-15GB (after model) | ~50-120GB (after model) | Massive increase |

The KV cache budget is a critical metric for inference that is often overlooked. When serving a 70B model in BF16 (140GB weights), Trillium has essentially zero room for KV cache on a single chip - the model itself exceeds the 32GB capacity. On Ironwood with 192GB, a 70B BF16 model leaves ~52GB for KV cache, enabling long-context serving (128K+ token contexts) or high batch sizes without cross-chip communication for cache lookups.
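
The KV cache arithmetic can be made concrete. A hedged sketch using a Llama-3.1-70B-like shape (80 layers, 8 KV heads via GQA, head dimension 128 - public model-card figures, not Ironwood specifics):

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Per-token KV cache: keys and values (factor of 2) for every layer,
    across the KV heads, at BF16 (2 bytes per element)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

# Llama-3.1-70B-like shape: 80 layers, 8 KV heads (GQA), head_dim 128.
per_token = kv_bytes_per_token(80, 8, 128)       # 327,680 bytes (~0.33 MB)
budget_gb = 192 - 140                            # HBM left after BF16 weights
max_cached_tokens = budget_gb * 10**9 // per_token

print(per_token)            # 327680
print(budget_gb)            # 52
print(max_cached_tokens)    # ~158,000 cached tokens at batch size 1
```

At roughly a third of a megabyte per token, the ~52GB budget covers a 128K-token context with room to spare - consistent with the single-chip long-context claim above.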

9,216-Chip Pod Architecture. Ironwood's pod scale represents a 36x increase over Trillium's 256-chip maximum. The ICI interconnect at this scale requires a different topology than the 2D/3D torus used in smaller pods.

| Pod Property | Trillium (256 chips) | Ironwood (9,216 chips) |
| --- | --- | --- |
| Max Chips per Pod | 256 | 9,216 |
| Aggregate Memory | 8TB | ~1.7PB |
| Aggregate Compute | High | ~36x Trillium (est.) |
| ICI Topology | 2D/3D torus | Multi-level hierarchy (est.) |
| Bisection Bandwidth | High | Very high (est.) |
| Use Case | Training + inference | Inference farms |

At 9,216 chips with 192GB each, an Ironwood pod provides approximately 1.7 petabytes of aggregate HBM. To put this in perspective: a Llama 405B model in BF16 requires ~810GB. A single Ironwood pod could serve over 2,000 independent replicas of this model simultaneously, or a single model with massive tensor parallelism for ultra-low-latency serving. This scale is designed for Google's internal Gemini serving needs - millions of API calls per minute across Google Search, Workspace, and Cloud API customers.
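
The replica figure is a simple capacity ratio, again using the reported numbers (and ignoring KV cache headroom, so it is an upper bound):

```python
# Aggregate pod HBM divided by one replica's BF16 weight footprint.
# Reported figures; ignores KV cache and activations, so an upper bound.
POD_HBM_GB = 9_216 * 192          # ~1.77M GB of aggregate HBM
LLAMA_405B_BF16_GB = 810          # ~2 bytes/param

max_replicas = POD_HBM_GB // LLAMA_405B_BF16_GB
print(max_replicas)               # 2184 -> "over 2,000 replicas"
```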

Inference-Specific Hardware Features. Based on Google's announcements and industry analysis, Ironwood likely includes several hardware features specifically targeting inference efficiency:

  • Continuous batching support - Hardware-level scheduling that can add new requests to an in-flight batch without waiting for the longest sequence to complete
  • KV cache compression - Hardware-assisted compression of attention key-value caches to reduce memory footprint for long-context serving
  • Speculative decoding support - Hardware paths optimized for running a small draft model and a large verification model in tandem
  • INT4/INT8 native support - Low-precision integer operations for maximizing throughput on quantized models
  • Prefill/decode separation - Ability to run prefill (compute-bound) and decode (memory-bound) phases on different chip configurations within the same pod
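
Of these, continuous batching is the easiest to illustrate in software. The toy scheduler below is not Ironwood's actual hardware scheduler - it simply contrasts backfilling freed batch slots against the static approach of draining a whole batch before admitting new requests:

```python
from collections import deque

def static_batching_steps(lengths, max_batch):
    """Baseline: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), max_batch):
        steps += max(lengths[i:i + max_batch])   # stragglers stall the batch
    return steps

def continuous_batching_steps(lengths, max_batch):
    """Backfill freed slots every step instead of draining the batch."""
    pending, active, steps = deque(lengths), [], 0
    while pending or active:
        while pending and len(active) < max_batch:
            active.append(pending.popleft())       # admit requests mid-flight
        active = [r - 1 for r in active if r > 1]  # decode one token each
        steps += 1
    return steps

# Seven short requests plus one long straggler, batch size 4.
reqs = [8, 8, 128, 8, 8, 8, 8, 8]
print(static_batching_steps(reqs, 4))       # 136 total decode steps
print(continuous_batching_steps(reqs, 4))   # 128 steps, and slots stay busier
```

The straggler dominates either way, but continuous batching finishes sooner and, more importantly, keeps the short requests from idling behind it - which is where the throughput win comes from at production request rates.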

Comparison to Trillium Architecture. Understanding what changed between generations clarifies Ironwood's design philosophy:

| Architectural Decision | Trillium (Training-focused) | Ironwood (Inference-focused) |
| --- | --- | --- |
| Backward Pass Support | Full gradient computation hardware | Minimal or none |
| Memory Priority | Balanced (weights + activations + gradients) | Weights + KV cache optimized |
| Batch Scheduling | Static batch, fixed size | Dynamic continuous batching |
| Precision Focus | BF16 primary (gradient accuracy) | FP8/INT8 primary (throughput) |
| Pod Scale | 256 chips (training parallelism) | 9,216 chips (inference throughput) |
| ICI Optimization | All-reduce for gradient sync | Point-to-point for model sharding |
| Power Profile | Sustained high compute | Burst compute with idle periods |

The architectural differences reflect a fundamental insight about inference workloads: they are bursty, latency-sensitive, and memory-dominated. A token generation step reads the entire model from memory but performs relatively little computation per byte read. This makes memory bandwidth, not compute throughput, the primary bottleneck. Ironwood's design appears to prioritize bandwidth and capacity over raw TFLOPS.
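
This memory-bound behavior yields a simple upper bound: at batch size 1, a decode step cannot run faster than the time to stream all the weights from HBM. The bandwidth figures below are illustrative assumptions, not disclosed Ironwood specs:

```python
def decode_ceiling_tok_s(model_gb: float, hbm_bandwidth_gb_s: float) -> float:
    """Batch-1 decode ceiling: every generated token reads all weights once,
    so tokens/sec <= bandwidth / model size. Batching raises aggregate
    throughput by amortizing the same weight reads across many sequences."""
    return hbm_bandwidth_gb_s / model_gb

# 70B BF16 (~140 GB of weights) under three assumed HBM bandwidths (GB/s):
for bw in (1_600, 3_000, 5_000):
    print(bw, round(decode_ceiling_tok_s(140, bw), 1))
```

Quantizing to FP8 halves `model_gb` and therefore doubles the ceiling, which is why low-precision support matters so much for inference silicon.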

Real-World Performance Analysis

Since Ironwood has not launched at the time of writing, no independent benchmarks exist. However, we can construct a performance framework based on Ironwood's known specifications and the inference workload characteristics it targets.

Projected Single-Chip Inference Performance. Based on the 192GB memory per chip and estimated memory bandwidth improvements over Trillium:

| Model | Precision | Ironwood Single Chip (est.) | Trillium (min chips needed) | Notes |
| --- | --- | --- | --- | --- |
| Llama 3.1 70B | BF16 | 40-60 tok/s (est.) | 5+ chips | Ironwood: single chip, memory-bandwidth bound |
| Llama 3.1 70B | FP8 | 70-100 tok/s (est.) | 3+ chips | FP8 halves memory, doubles effective bandwidth |
| Llama 3.1 8B | BF16 | 200-300 tok/s (est.) | 1 chip | Compute-bound on both |
| Llama 3.1 405B | FP8 | N/A (requires 3+ chips) | 13+ chips | 405GB FP8 > 192GB |
| Gemini Ultra | Google-optimized | Baseline (internal) | Slower (internal) | Google's primary use case |

These are rough estimates based on architectural reasoning, not measured data. Actual performance will depend on Ironwood's memory bandwidth (undisclosed), compute throughput (undisclosed), and XLA compiler optimizations at launch.

Pod-Scale Inference Economics. The real value proposition of Ironwood is not single-chip performance but pod-scale inference cost efficiency. Google's goal is to reduce the cost-per-token for serving Gemini and other large models to its Cloud customers. The relevant metric is cost per million output tokens:

| Platform | Estimated Cost per 1M Output Tokens (70B model) | Notes |
| --- | --- | --- |
| Ironwood (committed pricing, est.) | $0.20-$0.50 (est.) | Aggressive estimate based on efficiency claims |
| Trillium (committed pricing) | $0.50-$1.00 (est.) | Multi-chip overhead increases cost |
| H100 Cloud (on-demand) | $0.80-$1.50 | Varies by provider, model optimization |
| MI300X Cloud (on-demand) | $0.60-$1.20 | Memory advantage helps, fewer providers |

If Google can achieve the lower end of these estimates, Ironwood could make Google Cloud the most cost-effective platform for LLM inference serving - which is precisely the strategic position Google wants.

Long-Context Inference Considerations. One area where Ironwood's architecture could provide significant advantages is long-context inference (128K+ token contexts). Long contexts require enormous KV caches - a 70B model with 128K context and batch size 1 can consume 40-60GB of KV cache memory. On Ironwood with 192GB per chip and ~52GB available after model weights, a single chip could handle this. On Trillium with 32GB per chip, the KV cache alone would require multiple chips, with cross-chip latency on every attention computation.

| Context Length | KV Cache Size (70B model, BF16) | Ironwood Chips Needed | Trillium Chips Needed |
| --- | --- | --- | --- |
| 4K tokens | ~2-3GB | 1 (fits easily) | 5+ (model-limited) |
| 32K tokens | ~10-15GB | 1 (fits easily) | 5+ (model-limited) |
| 128K tokens | ~40-60GB | 1 (tight fit) | 7-8 (cache-limited) |
| 1M tokens | ~300-500GB | 2-3 | 15-20 |

For applications like document analysis, code generation with large codebases, and multi-turn conversations with long histories, Ironwood's per-chip memory budget is a significant advantage. Google's Gemini models already support 1M+ token contexts, and serving these efficiently requires the kind of memory density that Ironwood provides.

Comparison with GPU-Based Inference. The key architectural advantages Ironwood holds for inference versus GPU-based solutions:

| Advantage Area | Ironwood (TPU) | GPU (H100/B200) | Impact |
| --- | --- | --- | --- |
| Software-hardware co-design | XLA compiles directly for Ironwood silicon | CUDA generic + TensorRT optimization layer | Higher utilization on TPU |
| Interconnect at scale | ICI across 9,216 chips | NVLink (8 GPUs) + InfiniBand (inter-node) | Lower latency at scale on TPU |
| Inference specialization | Purpose-built, no training overhead | General-purpose, carries training hardware | Better efficiency per watt on TPU |
| Memory per chip | ~192GB (matches MI300X) | 80GB (H100) / 192GB (B200/MI300X) | Competitive |
| Software ecosystem | JAX/XLA only | CUDA, PyTorch, TensorRT, vLLM, etc. | Much broader on GPU |

Generational and Competitive Context

vs. TPU v6e Trillium. Ironwood is not a direct successor to Trillium - it is a parallel product line optimized for a different workload. Google's emerging strategy is Trillium for training, Ironwood for inference. Customers who currently use Trillium for both training and inference should plan to migrate inference workloads to Ironwood once it becomes available, while keeping Trillium for training. JAX code should port between the two with minimal changes - XLA compilation handles the hardware differences.

vs. NVIDIA B200/GB300. NVIDIA's Blackwell generation is Ironwood's closest competitor in the inference market. The B200 offers 192GB HBM3e and ~4,500 TFLOPS FP8, but it is a general-purpose accelerator that handles both training and inference. Ironwood's advantage is inference specialization and pod-scale ICI interconnect. NVIDIA's advantage is software ecosystem breadth - TensorRT-LLM, vLLM with CUDA, and the entire PyTorch inference toolchain. For organizations running on Google Cloud, Ironwood will likely be more cost-effective for Transformer inference. For organizations that need hardware flexibility or multi-cloud deployment, NVIDIA remains the default choice.

vs. AMD MI300X/MI350X. The MI300X matches Ironwood's 192GB memory per chip and is available for purchase - not cloud-only. The MI350X goes further with 288GB. For organizations that want to own their inference hardware, AMD's Instinct line is the relevant comparison. Ironwood targets a different buyer - cloud-native organizations that prefer operational expenditure over capital expenditure and are willing to accept Google Cloud lock-in in exchange for purpose-built inference efficiency.

vs. Huawei Ascend 910C. The Ascend 910C is not a direct competitor in any practical sense - Ironwood is a cloud-only product on Google Cloud, and the 910C is available only in China. However, the comparison is instructive for understanding the state of AI accelerator diversity. Ironwood with 192GB HBM and purpose-built inference optimization is at the high end of what inference silicon can deliver. The 910C with 96GB HBM2e and ~1,800 GB/s bandwidth represents what is achievable under export control constraints. The performance gap between the two reflects the manufacturing technology gap between TSMC-class and SMIC-class foundries.

Inference Architecture Market Trend. Ironwood is not the only inference-specialized chip entering the market. The broader trend toward inference-specific silicon reflects the industry's recognition that training and inference are fundamentally different workloads:

| Inference-Focused Product | Manufacturer | Memory | Availability | Approach |
| --- | --- | --- | --- | --- |
| TPU v7 Ironwood | Google | ~192GB | Cloud-only (GCP) | Custom ASIC, ICI interconnect |
| NVIDIA GB300 NVL72 | NVIDIA | ~20.7TB aggregate (72 GPUs × 288GB) | Purchase + cloud | GPU rack-scale inference |
| Groq LPU | Groq | 230MB SRAM (no HBM) | Cloud API | Ultra-low-latency ASIC |
| AWS Inferentia2 | Amazon | 32GB HBM2e per chip | Cloud-only (AWS) | Custom inference ASIC |
| Intel Gaudi 3 | Intel | 128GB HBM2e | Purchase + cloud | Inference-optimized accelerator |

Ironwood's distinguishing factors in this landscape are the memory capacity per chip (192GB matches or exceeds most competitors), the pod scale (9,216 chips far exceeds any competitor's interconnected scale), and the JAX/XLA software co-design. The trade-off is complete vendor lock-in to Google Cloud.

Deployment Model Comparison. For organizations evaluating their inference infrastructure strategy, the choice between Ironwood, GPU cloud instances, and on-premises hardware involves multiple dimensions:

| Decision Factor | Ironwood (GCP) | GPU Cloud (Multi-provider) | On-Premises GPU |
| --- | --- | --- | --- |
| Capital Expenditure | None | None | High |
| Operational Flexibility | GCP-only | Multi-cloud possible | Full control |
| Scaling Speed | Minutes (if capacity available) | Minutes-hours | Weeks-months |
| Data Sovereignty | GCP data residency | Provider-dependent | Full control |
| Vendor Lock-in | High (Google Cloud + JAX) | Moderate (CUDA portable) | None |
| Cost Predictability | Committed-use = predictable | On-demand = variable | Fixed + power |
| Hardware Refresh | Automatic (Google manages) | Automatic | Manual procurement |

What We Still Do Not Know. Several critical details about Ironwood remain undisclosed as of this writing:

  • Exact memory bandwidth - This determines single-chip inference throughput for memory-bound LLM workloads
  • Peak compute TFLOPS - Without this, performance projections are speculative
  • Power consumption - Affects operational costs and data center planning
  • Pricing model - Per-chip-hour, per-token, or a new pricing structure
  • JAX/XLA feature parity - Whether Ironwood supports all existing XLA operations or introduces new capabilities
  • Availability timeline - H2 2025 is broad; the difference between July and December matters for planning
  • Multi-tenancy support - Whether a single Ironwood chip can be shared across multiple users/models

Google's Strategic Position. Ironwood's announcement signals that Google views inference cost as the next battleground in cloud AI competition. By building purpose-built inference silicon, Google is betting that the marginal cost of serving a token will determine which cloud platform wins the AI API market. If Ironwood delivers on its architectural promise, Google Cloud could offer meaningfully lower per-token pricing than AWS or Azure - both of which depend on NVIDIA GPUs for inference. This would be a structural advantage that software optimization alone cannot replicate.

Cloud Pricing Context. While Ironwood pricing has not been announced, we can reason about where it might land based on Google's strategic incentives and the competitive landscape:

| Pricing Scenario | Per-Chip-Hour (est.) | Strategic Logic |
| --- | --- | --- |
| Aggressive | $6-$8 | Low margin; drive adoption, win inference market share from NVIDIA GPU clouds |
| Moderate | $8-$10 | Sustainable margin; balance adoption with profitability |
| Premium | $10-$12 | High margin; capture premium from 192GB memory advantage |
| Memory-proportional | $12-$15 | Pro-rata vs Trillium; 6x memory = 6x price (unlikely but possible) |

Google's history suggests aggressive pricing to drive adoption - they priced Trillium competitively with H100 instances and could follow the same playbook with Ironwood. The most likely scenario is moderate pricing ($8-$10/chip-hour on-demand) with committed-use discounts of 30-50%, targeting a per-token cost that undercuts GPU-based inference on any cloud.
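
Crossing those on-demand scenarios with typical committed-use discounts gives a plausible committed-rate band. All numbers here are hypothetical until Google publishes pricing:

```python
# Hypothetical on-demand midpoints for the three scenarios above, crossed
# with the 30-50% committed-use discounts Google Cloud typically offers.
scenarios = {"aggressive": 7.0, "moderate": 9.0, "premium": 11.0}  # $/chip-hr

for name, on_demand in scenarios.items():
    lo = on_demand * (1 - 0.50)   # 50% committed-use discount
    hi = on_demand * (1 - 0.30)   # 30% committed-use discount
    print(f"{name}: ${lo:.2f}-${hi:.2f}/chip-hour committed")
```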

Use Case Recommendations

Strong Fit:

  • Google Cloud customers serving large LLMs at scale. If you are running inference for 70B+ parameter models on GCP and serving millions of requests per day, Ironwood is designed specifically for your workload. The combination of 192GB memory, inference-optimized silicon, and ICI pod scale should deliver the lowest cost-per-token on any cloud platform.
  • Organizations building Gemini API-dependent applications. If your product relies on Google's Gemini API, you are already an Ironwood customer indirectly - Google serves Gemini on TPU infrastructure. As Ironwood ramps up, your API costs may decrease.
  • Research teams studying inference optimization. Ironwood's inference-first architecture provides a unique platform for studying attention optimization, KV cache management, speculative decoding, and other inference-specific techniques.
  • Multi-model serving with diverse model sizes. The 192GB per chip means a single Ironwood chip can serve multiple smaller models simultaneously (for example, a 7B model and a 13B model sharing one chip), simplifying multi-model deployments.

Weak Fit:

  • Organizations that need training capacity. Ironwood is inference-only. If you need to train models, you need Trillium or GPU instances separately. This means two hardware platforms if you do both training and inference on GCP.
  • Teams committed to PyTorch and CUDA. Ironwood runs JAX/XLA. If your entire inference stack is built on PyTorch with TensorRT-LLM or vLLM with CUDA backends, migrating to Ironwood requires significant re-engineering.
  • Multi-cloud or on-premises requirements. Ironwood is GCP-only. Organizations with regulatory requirements for multi-cloud redundancy or on-premises data processing cannot use Ironwood as their primary inference platform.
  • Small-scale inference deployments. If you are serving inference for a single model at modest throughput (thousands of requests per day rather than millions), the pod-scale advantages of Ironwood are irrelevant. A single GPU instance is simpler and sufficient.
  • Organizations that need pricing certainty now. Ironwood pricing has not been announced. Procurement decisions that require firm pricing commitments cannot include Ironwood until Google publishes rates.
  • Workloads requiring fine-grained GPU-level control. TPU programming through XLA is high-level and declarative. Teams that need precise control over memory layout, execution scheduling, or custom kernel operations will find GPU programming more flexible. The Pallas kernel language for TPUs exists but is less mature than CUDA.

Migration Path from Trillium to Ironwood. For existing Trillium users planning to adopt Ironwood for inference:

| Migration Step | Effort | Notes |
| --- | --- | --- |
| Code compatibility | Low | JAX/XLA code should compile for both targets |
| Model checkpoint conversion | Low | Same format expected |
| Performance tuning | Medium | Ironwood may benefit from different batch sizes, sequence bucketing |
| Infrastructure orchestration | Medium | New instance types, different slice configurations |
| Cost optimization | Medium | New pricing tiers, different committed-use thresholds |
| Workload routing | Medium | Split training (Trillium) from inference (Ironwood) |

The transition from Trillium to Ironwood for inference should be smoother than any GPU-to-TPU migration because both products share the same XLA compilation infrastructure. The main optimization work will be tuning batch sizes and serving configurations for Ironwood's different memory profile (192GB per chip vs 32GB).

Strengths

  • First TPU purpose-built for inference - silicon design decisions optimized for serving workloads
  • ~192GB HBM per chip enables single-chip deployment of 70B-class models
  • 9,216-chip pod scale with ICI interconnect - unmatched aggregate memory and bandwidth for inference farms
  • ~1.7PB aggregate pod memory can serve trillion-parameter models at extreme throughput
  • Google's vertical integration (silicon + interconnect + software + cloud) eliminates integration friction
  • Builds on Google's decade of TPU production experience and JAX/XLA compiler optimization
  • Powers Google's own Gemini inference - proven motivation to make the economics work

Weaknesses

  • Very limited specifications disclosed - most performance claims are based on Google's marketing, not independent data
  • Cloud-only availability means complete dependency on Google Cloud Platform
  • Inference-only design means you cannot use Ironwood for training workloads (need separate Trillium capacity)
  • JAX/XLA ecosystem lock-in - PyTorch workloads require non-trivial porting effort
  • Pricing unknown - could be expensive if Google prices to reflect the massive memory increase
  • No on-premises option for organizations with data sovereignty, compliance, or latency requirements
  • Pod scale of 9,216 chips is impressive but only available to Google's largest customers
