Google TPU 8i - Low-Latency Inference for Agent Era

Google's TPU 8i is a purpose-built inference chip with 10.1 FP4 PFLOPs, 288GB HBM3e at 8,601 GB/s, and a Boardfly topology that cuts collective latency 5x for agentic AI workloads.

TL;DR

  • Google's eighth-generation inference chip: 10.1 FP4 PFLOPs and 288GB HBM3e at 8,601 GB/s
  • 384MB on-chip SRAM - 3x the TPU v7 Ironwood's - enables larger KV caches for long-context reasoning
  • Boardfly topology reduces maximum network diameter by 56% vs standard 3D torus for lower all-to-all latency
  • 80% better performance-per-dollar for inference vs previous generation; 5x lower collective operation latency

Overview

The Google TPU 8i is the inference half of Google's eighth-generation TPU program, announced at Google Cloud Next on April 22, 2026 alongside the training-focused TPU 8t. Splitting training and inference into two dedicated chips is new for the TPU line, and the 8i shows exactly what Google prioritized when it could make inference-specific design choices: more on-chip memory, a topology built for low-latency collective operations, and host CPU integration that reduces orchestration overhead.

Per chip, the 8i delivers 10.1 FP4 petaFLOPS with 288GB of HBM3e running at 8,601 GB/s - more memory bandwidth per chip than the 8t's 6,528 GB/s, because inference is memory-bandwidth-bound in a way that training isn't. The 384MB of on-chip SRAM is especially significant: three times what Ironwood carried, and enough to hold much larger KV caches entirely in SRAM during multi-step reasoning and agent chains.
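A rough roofline sketch makes the memory-bandwidth-bound claim concrete: in single-stream decode, every generated token has to stream the model weights from HBM once, so bandwidth sets the ceiling on tokens per second. The bandwidth figure below is from the spec table; the 70B-parameter model and FP4 quantization are illustrative assumptions, not anything Google has published.

```python
# Back-of-envelope decode throughput for a memory-bandwidth-bound chip.
# Each generated token must stream all model weights from HBM once
# (ignoring KV-cache traffic and batching), so the ceiling is BW / bytes.

HBM_BANDWIDTH_GBS = 8_601  # TPU 8i HBM3e bandwidth, GB/s (from spec table)

def decode_tokens_per_sec(params_billions: float, bytes_per_param: float) -> float:
    """Upper bound on single-stream decode rate: bandwidth / weight bytes."""
    weight_bytes_gb = params_billions * bytes_per_param
    return HBM_BANDWIDTH_GBS / weight_bytes_gb

# Hypothetical 70B-parameter model quantized to FP4 (0.5 bytes/param):
print(round(decode_tokens_per_sec(70, 0.5)))  # ~246 tokens/s ceiling
```

The same model at FP8 (1 byte/param) halves the ceiling, which is why aggressive quantization and bandwidth headroom matter more than raw FLOPS for decode.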

At the pod level, 1,152 TPU 8i chips form a single system image delivering 11.6 FP8 ExaFLOPS. Google uses this configuration for what it calls the "agentic era" - not batch throughput maximization, but fast response times for multi-step agent workflows where latency compounds across each reasoning hop.

Key Specifications

| Specification | Details |
| --- | --- |
| Manufacturer | Google |
| Product Family | 8th Generation TPU |
| Chip Type | TPU (ASIC) |
| Process Node | TSMC N3 |
| Memory | 288GB HBM3e |
| Memory Bandwidth | 8,601 GB/s |
| FP4 Performance | 10.1 PFLOPs per chip |
| FP8 Performance | Not disclosed |
| On-Chip SRAM | 384 MB (3x previous gen) |
| TDP | Not disclosed |
| Inter-Chip Interconnect (ICI) | 19.2 Tb/s (2x vs prior gen) |
| Network Topology | Boardfly (high-radix, max 7 hops) |
| Collective Latency | 5x reduction vs prior topology |
| Pod Scale | 1,152 chips, 11.6 FP8 ExaFLOPS |
| Host CPU | Dual Axion ARM (NUMA per server) |
| Cooling | 4th-generation liquid cooling |
| Release Date | 2026-H2 |

Performance Benchmarks

Google positions the 8i as an 80% improvement in price-performance over the Ironwood TPU for inference at low-latency targets. The comparison points that matter most for inference are memory bandwidth, on-chip SRAM, and collective latency - not raw TFLOPS.

| Metric | Google TPU 8i | Google TPU v7 Ironwood | Groq LPU | Notes |
| --- | --- | --- | --- | --- |
| FP4 TFLOPS per chip | 10,100 | Not disclosed | N/A | Ironwood specs not public |
| FP8 TFLOPS per chip | Not disclosed | Not disclosed | ~429,000 | Groq TFLOPS not directly comparable |
| HBM Capacity | 288 GB HBM3e | 192 GB (est.) | None (SRAM only) | |
| Memory Bandwidth | 8,601 GB/s | Not disclosed | ~80 TB/s | Groq uses SRAM, different regime |
| On-chip SRAM | 384 MB | ~128 MB (est.) | 230 MB | |
| Inference Price-Performance | 1.8x vs Ironwood | (baseline) | Competitive | Google stated 80% improvement |
| Network Latency | 5x lower (CAE) | (baseline) | Near-zero | Different architectures |

The Groq LPU comparison deserves a note: Groq's numbers aren't directly comparable because it uses SRAM rather than HBM, operates in a different throughput/latency tradeoff space, and serves different model size ranges. For very large models (100B+), the TPU 8i's 288GB HBM gives it a clear capacity advantage. For fast single-stream latency on smaller models, Groq remains the benchmark.
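The capacity gap behind that "100B+" point is simple arithmetic: divide per-chip memory by bytes per parameter. The memory figures below come from the comparison table; the FP4 quantization width is an illustrative assumption.

```python
# Weight-capacity check behind the large-model claim: how big a model's
# weights fit entirely in one chip's memory at a given quantization width.

TPU_8I_HBM_GB = 288   # per-chip HBM3e (from spec table)
GROQ_SRAM_GB = 0.230  # per-chip SRAM, 230 MB (from comparison table)

def max_params_billions(memory_gb: float, bytes_per_param: float) -> float:
    """Largest parameter count (billions) whose weights fit in memory_gb."""
    return memory_gb / bytes_per_param

# At FP4 (0.5 bytes/param), a single 8i can hold the weights of a model
# up to ~576B parameters; a single Groq chip holds well under 1B, which
# is why Groq shards models across hundreds of chips instead.
print(max_params_billions(TPU_8I_HBM_GB, 0.5))  # 576.0
print(max_params_billions(GROQ_SRAM_GB, 0.5))   # 0.46
```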


TPU 8i chip block diagram with Collectives Acceleration Engine (CAE), expanded SRAM, and HBM3e stacks. Source: cloud.google.com

Key Capabilities

Boardfly Topology. The TPU 8i uses a different network topology than the 8t. Instead of a 3D torus (which the training chip keeps for its different communication patterns), the 8i uses "Boardfly" - a high-radix topology that caps the maximum number of hops between any two chips at seven. That's a 56% reduction in network diameter versus the previous generation's torus topology. For inference workloads, all-to-all communication (KV cache sharing across chips, attention head distribution, prefill/decode coordination) dominates the communication profile. Fewer hops means lower latency at every step of an agentic chain.
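The 56% figure is consistent with a wrap-around torus baseline. As a sanity check, assume (hypothetically - Google hasn't published the prior pod's torus dimensions) that a 1,152-chip pod was arranged as an 8 x 12 x 12 3D torus; its diameter is the sum of half of each ring's length, since traffic can wrap around in either direction.

```python
# Network-diameter sanity check for the stated 56% reduction, assuming a
# hypothetical 8 x 12 x 12 arrangement of a 1,152-chip pod as a 3D torus.

def torus_diameter(dims):
    """Max hop count between any two nodes in a wrap-around torus:
    sum of floor(d/2) over each dimension's ring."""
    return sum(d // 2 for d in dims)

torus_hops = torus_diameter((8, 12, 12))  # 4 + 6 + 6 = 16 hops
boardfly_hops = 7                         # stated Boardfly maximum

reduction = 1 - boardfly_hops / torus_hops
print(torus_hops, f"{reduction:.0%}")  # 16 56%
```

Under that assumption, 7 hops versus 16 is a 56% smaller diameter, matching the stated claim.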

Collectives Acceleration Engine. The CAE is an on-chip hardware block dedicated to collective operations - reduce, all-reduce, broadcast, scatter, gather. These are the communication primitives that inference serving frameworks use constantly to coordinate across chips. On Ironwood, these operations ran on the main compute pipeline. On the 8i, they run on dedicated hardware with a claimed 5x reduction in collective operation latency. For a multi-step reasoning chain with dozens of communication rounds, those savings compound.
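The compounding effect is easy to model: an agent chain pays the collective latency on every communication round of every reasoning step. The step and round counts below are illustrative, and the 25 µs baseline latency is a hypothetical figure - only the 5x reduction comes from the article.

```python
# Why a 5x collective-latency cut compounds across an agentic chain:
# total collective time scales with steps x rounds x per-op latency.
# All counts below are illustrative, not measured.

def chain_comm_time_us(steps: int, rounds_per_step: int, latency_us: float) -> float:
    """Total time spent in collectives across a multi-step agent chain."""
    return steps * rounds_per_step * latency_us

BASELINE_US = 25.0  # hypothetical per-collective latency on the old path
before = chain_comm_time_us(steps=20, rounds_per_step=40, latency_us=BASELINE_US)
after = chain_comm_time_us(steps=20, rounds_per_step=40, latency_us=BASELINE_US / 5)

print(before / 1000, "->", after / 1000)  # 20.0 -> 4.0 (milliseconds)
```

A fixed 16 ms saved per chain is invisible in batch throughput numbers but directly shortens the tail latency a user sees on every agent turn.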

384MB SRAM and KV Cache. The three-fold increase in on-chip SRAM directly enables larger KV caches in-SRAM. KV cache is the memory that stores attention keys and values during autoregressive token generation - it grows with context length and number of concurrent requests. Keeping more of it in SRAM rather than HBM reduces the memory access latency for each token generation step. For long-context reasoning (64K+ tokens) or batch serving with many concurrent sessions, this matters.
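The standard KV-cache footprint formula shows what the 3x SRAM increase buys. The model shape below (80 layers, 8 grouped-query KV heads, head dimension 128, FP8 cache) is a hypothetical example, and the ~128 MB Ironwood figure is the estimate from the comparison table.

```python
# KV-cache footprint per token of context:
#   2 (K and V) x layers x kv_heads x head_dim x bytes per element.
# Model shape is hypothetical (GQA with an FP8 KV cache).

def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128,
                               bytes_per_elem=1)  # 163,840 bytes (~160 KB)

SRAM_8I = 384 * 1024 * 1024  # 384 MB (from spec table)
SRAM_V7 = 128 * 1024 * 1024  # ~128 MB estimated for Ironwood

# Tokens of context that fit entirely in SRAM on each chip:
print(SRAM_8I // per_token, "vs", SRAM_V7 // per_token)
```

Neither chip holds a 64K context in SRAM for a model this size, but a 3x larger SRAM working set proportionally cuts how often the hot tail of the cache spills to HBM on each decode step.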


The Boardfly topology used by the TPU 8i, showing the maximum seven-hop ICI network diameter. Source: cloud.google.com

Pricing and Availability

No pricing has been published. The TPU 8i is a Google Cloud service and won't be available for on-premises deployment or through other cloud providers. Availability is scheduled for later in 2026.

For inference specifically, Google is competing against NVIDIA's inference offerings (including via the NVIDIA Groq 3 LPU partnership), Groq directly, and cloud providers reselling GPU capacity. The 80% price-performance improvement claim over Ironwood means little without an absolute price anchor - organizations assessing the 8i will need to wait for Google to publish per-chip-hour pricing before building a TCO model.

Strengths and Weaknesses

Strengths

  • 288GB HBM3e at 8,601 GB/s per chip - more memory than AMD MI300X at higher bandwidth
  • 384MB on-chip SRAM (3x Ironwood) enables much larger in-SRAM KV caches
  • Boardfly topology: 56% reduction in max network diameter, 5x lower collective latency
  • 80% better inference price-performance vs Ironwood
  • Dual Axion ARM hosts per server for low-overhead orchestration
  • Native support for vLLM, SGLang, JAX, and PyTorch - no custom framework required

Weaknesses

  • Cloud-only: no on-premises deployment option
  • FP8 TFLOPS not disclosed - limits direct comparison with GPU-based inference
  • TDP not disclosed
  • Training workloads require the separate TPU 8t
  • Locked to Google Cloud availability schedule and pricing

Sources

✓ Last verified May 1, 2026

James Kowalski
About the author AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.