Google TPU 8i - Low-Latency Inference for Agent Era

Google's TPU 8i is a purpose-built inference chip with 10.1 FP4 PFLOPs, 288GB HBM3e at 8,601 GB/s, and a Boardfly topology that cuts collective latency 5x for agentic AI workloads.

TL;DR

  • Google's eighth-generation inference chip: 10.1 FP4 PFLOPs and 288GB HBM3e at 8,601 GB/s
  • 384MB on-chip SRAM - 3x the TPU v7 Ironwood's - enables larger KV caches for long-context reasoning
  • Boardfly topology reduces maximum network diameter by 56% vs standard 3D torus for lower all-to-all latency
  • 80% better performance-per-dollar for inference vs previous generation; 5x lower collective operation latency

Overview

The Google TPU 8i is the inference half of Google's eighth-generation TPU program, announced at Google Cloud Next on April 22, 2026 alongside the training-focused TPU 8t. Splitting training and inference into two dedicated chips is new for the TPU line, and the 8i shows exactly what Google prioritized when it could make inference-specific design choices: more on-chip memory, a topology built for low-latency collective operations, and host CPU integration that reduces orchestration overhead.

Per chip, the 8i delivers 10.1 FP4 petaFLOPS with 288GB of HBM3e running at 8,601 GB/s - more memory bandwidth per chip than the 8t's 6,528 GB/s, because inference is memory-bandwidth-bound in a way that training isn't. The 384MB of on-chip SRAM is especially significant: three times what Ironwood carried, and enough to hold much larger KV caches entirely in SRAM during multi-step reasoning and agent chains.
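A rough roofline sketch makes the memory-bandwidth-bound claim concrete: in single-stream decode, every generated token has to stream the model weights from HBM once, so bandwidth sets the ceiling on tokens per second. The bandwidth figure below is from the spec table; the 70B-parameter model and FP4 quantization are illustrative assumptions, not anything Google has published.

```python
# Back-of-envelope decode throughput for a memory-bandwidth-bound chip.
# Each generated token must stream all model weights from HBM once
# (ignoring KV-cache traffic and batching), so the ceiling is BW / bytes.

HBM_BANDWIDTH_GBS = 8_601  # TPU 8i HBM3e bandwidth, GB/s (from spec table)

def decode_tokens_per_sec(params_billions: float, bytes_per_param: float) -> float:
    """Upper bound on single-stream decode rate: bandwidth / weight bytes."""
    weight_bytes_gb = params_billions * bytes_per_param
    return HBM_BANDWIDTH_GBS / weight_bytes_gb

# Hypothetical 70B-parameter model quantized to FP4 (0.5 bytes/param):
print(round(decode_tokens_per_sec(70, 0.5)))  # ~246 tokens/s ceiling
```

The same model at FP8 (1 byte/param) halves the ceiling, which is why aggressive quantization and bandwidth headroom matter more than raw FLOPS for decode.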

At the pod level, 1,152 TPU 8i chips form a single system image delivering 11.6 FP8 ExaFLOPS. Google uses this configuration for what it calls the "agentic era" - not batch throughput maximization, but fast response times for multi-step agent workflows where latency compounds across each reasoning hop.

Key Specifications

| Specification | Details |
| --- | --- |
| Manufacturer | Google |
| Product Family | 8th Generation TPU |
| Chip Type | TPU (ASIC) |
| Process Node | TSMC N3 |
| Memory | 288GB HBM3e |
| Memory Bandwidth | 8,601 GB/s |
| FP4 Performance | 10.1 PFLOPs per chip |
| FP8 Performance | Not disclosed |
| On-Chip SRAM | 384 MB (3x previous gen) |
| TDP | Not disclosed |
| Inter-Chip Interconnect (ICI) | 19.2 Tb/s (2x vs prior gen) |
| Network Topology | Boardfly (high-radix, max 7 hops) |
| Collective Latency | 5x reduction vs prior topology |
| Pod Scale | 1,152 chips, 11.6 FP8 ExaFLOPS |
| Host CPU | Dual Axion ARM (NUMA per server) |
| Cooling | 4th-generation liquid cooling |
| Release Date | 2026-H2 |

Performance Benchmarks

Google positions the 8i as an 80% improvement in price-performance over the Ironwood TPU for inference at low-latency targets. The comparison points that matter most for inference are memory bandwidth, on-chip SRAM, and collective latency - not raw TFLOPS.

| Metric | Google TPU 8i | Google TPU v7 Ironwood | Groq LPU | Notes |
| --- | --- | --- | --- | --- |
| FP4 TFLOPS per chip | 10,100 | Not disclosed | N/A | Ironwood specs not public |
| FP8 TFLOPS per chip | Not disclosed | Not disclosed | ~429,000 | Groq TFLOPS not directly comparable |
| HBM Capacity | 288 GB HBM3e | 192 GB (est.) | None (SRAM only) | |
| Memory Bandwidth | 8,601 GB/s | Not disclosed | ~80 TB/s | Groq uses SRAM, different regime |
| On-chip SRAM | 384 MB | ~128 MB (est.) | 230 MB | |
| Inference Price-Performance | 1.8x vs Ironwood | (baseline) | Competitive | Google stated 80% improvement |
| Network Latency | 5x lower (CAE) | (baseline) | Near-zero | Different architectures |

The Groq LPU comparison deserves a note: Groq's numbers aren't directly comparable because it uses SRAM rather than HBM, operates in a different throughput/latency tradeoff space, and serves different model size ranges. For very large models (100B+), the TPU 8i's 288GB HBM gives it a clear capacity advantage. For fast single-stream latency on smaller models, Groq remains the benchmark.
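The capacity gap behind that "100B+" point is simple arithmetic: divide per-chip memory by bytes per parameter. The memory figures below come from the comparison table; the FP4 quantization width is an illustrative assumption.

```python
# Weight-capacity check behind the large-model claim: how big a model's
# weights fit entirely in one chip's memory at a given quantization width.

TPU_8I_HBM_GB = 288   # per-chip HBM3e (from spec table)
GROQ_SRAM_GB = 0.230  # per-chip SRAM, 230 MB (from comparison table)

def max_params_billions(memory_gb: float, bytes_per_param: float) -> float:
    """Largest parameter count (billions) whose weights fit in memory_gb."""
    return memory_gb / bytes_per_param

# At FP4 (0.5 bytes/param), a single 8i can hold the weights of a model
# up to ~576B parameters; a single Groq chip holds well under 1B, which
# is why Groq shards models across hundreds of chips instead.
print(max_params_billions(TPU_8I_HBM_GB, 0.5))  # 576.0
print(max_params_billions(GROQ_SRAM_GB, 0.5))   # 0.46
```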


TPU 8i chip block diagram with Collectives Acceleration Engine (CAE), expanded SRAM, and HBM3e stacks. Source: cloud.google.com

Key Capabilities

Boardfly Topology. The TPU 8i uses a different network topology than the 8t. Instead of a 3D torus (which the training chip keeps for its different communication patterns), the 8i uses "Boardfly" - a high-radix topology that caps the maximum number of hops between any two chips at seven. That's a 56% reduction in network diameter versus the previous generation's torus topology. For inference workloads, all-to-all communication (KV cache sharing across chips, attention head distribution, prefill/decode coordination) dominates the communication profile. Fewer hops means lower latency at every step of an agentic chain.
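The 56% figure is consistent with a wrap-around torus baseline. As a sanity check, assume (hypothetically - Google hasn't published the prior pod's torus dimensions) that a 1,152-chip pod was arranged as an 8 x 12 x 12 3D torus; its diameter is the sum of half of each ring's length, since traffic can wrap around in either direction.

```python
# Network-diameter sanity check for the stated 56% reduction, assuming a
# hypothetical 8 x 12 x 12 arrangement of a 1,152-chip pod as a 3D torus.

def torus_diameter(dims):
    """Max hop count between any two nodes in a wrap-around torus:
    sum of floor(d/2) over each dimension's ring."""
    return sum(d // 2 for d in dims)

torus_hops = torus_diameter((8, 12, 12))  # 4 + 6 + 6 = 16 hops
boardfly_hops = 7                         # stated Boardfly maximum

reduction = 1 - boardfly_hops / torus_hops
print(torus_hops, f"{reduction:.0%}")  # 16 56%
```

Under that assumption, 7 hops versus 16 is a 56% smaller diameter, matching the stated claim.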

Collectives Acceleration Engine. The CAE is an on-chip hardware block dedicated to collective operations - reduce, all-reduce, broadcast, scatter, gather. These are the communication primitives that inference serving frameworks use constantly to coordinate across chips. On Ironwood, these operations ran on the main compute pipeline. On the 8i, they run on dedicated hardware with a claimed 5x reduction in collective operation latency. For a multi-step reasoning chain with dozens of communication rounds, those savings compound.
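The compounding effect is easy to model: an agent chain pays the collective latency on every communication round of every reasoning step. The step and round counts below are illustrative, and the 25 µs baseline latency is a hypothetical figure - only the 5x reduction comes from the article.

```python
# Why a 5x collective-latency cut compounds across an agentic chain:
# total collective time scales with steps x rounds x per-op latency.
# All counts below are illustrative, not measured.

def chain_comm_time_us(steps: int, rounds_per_step: int, latency_us: float) -> float:
    """Total time spent in collectives across a multi-step agent chain."""
    return steps * rounds_per_step * latency_us

BASELINE_US = 25.0  # hypothetical per-collective latency on the old path
before = chain_comm_time_us(steps=20, rounds_per_step=40, latency_us=BASELINE_US)
after = chain_comm_time_us(steps=20, rounds_per_step=40, latency_us=BASELINE_US / 5)

print(before / 1000, "->", after / 1000)  # 20.0 -> 4.0 (milliseconds)
```

A fixed 16 ms saved per chain is invisible in batch throughput numbers but directly shortens the tail latency a user sees on every agent turn.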

384MB SRAM and KV Cache. The three-fold increase in on-chip SRAM directly enables larger KV caches in-SRAM. KV cache is the memory that stores attention keys and values during autoregressive token generation - it grows with context length and number of concurrent requests. Keeping more of it in SRAM rather than HBM reduces the memory access latency for each token generation step. For long-context reasoning (64K+ tokens) or batch serving with many concurrent sessions, this matters.
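The standard KV-cache footprint formula shows what the 3x SRAM increase buys. The model shape below (80 layers, 8 grouped-query KV heads, head dimension 128, FP8 cache) is a hypothetical example, and the ~128 MB Ironwood figure is the estimate from the comparison table.

```python
# KV-cache footprint per token of context:
#   2 (K and V) x layers x kv_heads x head_dim x bytes per element.
# Model shape is hypothetical (GQA with an FP8 KV cache).

def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128,
                               bytes_per_elem=1)  # 163,840 bytes (~160 KB)

SRAM_8I = 384 * 1024 * 1024  # 384 MB (from spec table)
SRAM_V7 = 128 * 1024 * 1024  # ~128 MB estimated for Ironwood

# Tokens of context that fit entirely in SRAM on each chip:
print(SRAM_8I // per_token, "vs", SRAM_V7 // per_token)
```

Neither chip holds a 64K context in SRAM for a model this size, but a 3x larger SRAM working set proportionally cuts how often the hot tail of the cache spills to HBM on each decode step.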


The Boardfly topology used by the TPU 8i, showing the maximum seven-hop ICI network diameter. Source: cloud.google.com

Pricing and Availability

No pricing has been published. The TPU 8i is a Google Cloud service and won't be available for on-premises deployment or through other cloud providers. Availability is scheduled for later in 2026.

For inference specifically, Google is competing against NVIDIA's inference offerings (including via the NVIDIA Groq 3 LPU partnership), Groq directly, and cloud providers reselling GPU capacity. The 80% price-performance improvement claim over Ironwood means little without an absolute price anchor - organizations assessing the 8i will need to wait for Google to publish per-chip-hour pricing before building a TCO model.

Strengths and Weaknesses

Strengths

  • 288GB HBM3e at 8,601 GB/s per chip - more memory than AMD MI300X at higher bandwidth
  • 384MB on-chip SRAM (3x Ironwood) enables much larger in-SRAM KV caches
  • Boardfly topology: 56% reduction in max network diameter, 5x lower collective latency
  • 80% better inference price-performance vs Ironwood
  • Dual Axion ARM hosts per server for low-overhead orchestration
  • Native support for vLLM, SGLang, JAX, and PyTorch - no custom framework required

Weaknesses

  • Cloud-only: no on-premises deployment option
  • FP8 TFLOPS not disclosed - limits direct comparison with GPU-based inference
  • TDP not disclosed
  • Training workloads require the separate TPU 8t
  • Locked to Google Cloud availability schedule and pricing

Sources

✓ Last verified May 1, 2026

James Kowalski
About the author AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.