
AWS Trainium3 - Amazon's 3nm AI Accelerator

TL;DR

  • AWS's fourth-generation custom AI chip on TSMC 3nm (N3P) - 2.52 PFLOPS MXFP8 per chip (2x Trainium2) and up to 4.4x more compute per UltraServer
  • 8 NeuronCore-v4 engines per chip, each with 128x128 BF16 and 512x128 MXFP8 systolic arrays plus 32 MiB of dedicated SRAM
  • 144GB HBM3e at 4.9 TB/s bandwidth with 2 TB/s NeuronLink-v4 interconnect between chips
  • Ships as EC2 Trn3 UltraServers - Gen1 (64 chips, air-cooled) and Gen2 (144 chips, liquid-cooled, ~362 PFLOPS FP8)
  • Anthropic is the anchor customer via Project Rainier - nearly 1 million Trainium chips training and serving Claude

Overview

AWS Trainium3 is Amazon's statement that it does not need NVIDIA to build the infrastructure for frontier AI. Announced at re:Invent 2025 and generally available since December 2025, Trainium3 is the fourth generation of AWS custom AI silicon (following Inferentia, Trainium, and Trainium2). It is fabricated on TSMC's 3nm N3P process - the most advanced node used in any shipping AI accelerator - and delivers 2.52 PFLOPS of MXFP8 compute per chip with 144GB of HBM3e memory.

The chip matters because of who's using it. Anthropic - the AI safety company behind Claude - is the anchor customer for AWS's custom silicon program. Through Project Rainier, Anthropic has launched nearly 1 million Trainium chips (primarily Trainium2, now transitioning to Trainium3) across AWS data centers in Indiana. Anthropic reportedly provided direct input into Trainium3's design, focusing on training speed, inference latency, and energy efficiency. When the company building one of the top-3 frontier models designs its training infrastructure around your chip, that's the strongest possible validation of the silicon.

The Trainium3 architecture centers on the NeuronCore-v4 - AWS's fourth-generation compute engine. Each chip contains 8 NeuronCore-v4 units, each packing a BF16 systolic array (128x128), an MXFP8/MXFP4 systolic array (512x128), a vector engine with accelerated exponential computation, and 32 MiB of dedicated SRAM. The total per-chip SRAM is 256 MiB - enough to hold significant attention intermediates on-chip without round-tripping to HBM.
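These array dimensions let you sanity-check the published throughput figures. A systolic array's peak is rows x cols x 2 FLOPs per MAC per cycle, so working backwards from the per-core numbers gives the implied clock. The clock itself is not published, so this is back-of-envelope inference, not a spec:

```python
# Back-of-envelope: infer the clock implied by the published per-core
# peak throughput of each NeuronCore-v4 systolic array.
# Peak FLOPS = rows * cols * 2 (multiply + accumulate) * clock_hz.

def implied_clock_ghz(peak_tflops: float, rows: int, cols: int) -> float:
    ops_per_cycle = rows * cols * 2          # one MAC = 2 FLOPs
    return peak_tflops * 1e12 / ops_per_cycle / 1e9

mxfp8 = implied_clock_ghz(315.0, 512, 128)   # MXFP8/MXFP4 array
bf16 = implied_clock_ghz(79.0, 128, 128)     # BF16 array

print(f"implied clock (MXFP8 array): {mxfp8:.2f} GHz")  # 2.40 GHz
print(f"implied clock (BF16 array):  {bf16:.2f} GHz")   # 2.41 GHz

# Per-chip total from 8 NeuronCores matches the headline number:
print(8 * 315 / 1000)  # 2.52 PFLOPS MXFP8
```

Both arrays back out to roughly the same ~2.4 GHz, which suggests the per-core figures are internally consistent.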

The chip connects to its neighbors via NeuronLink-v4, a proprietary interconnect delivering 2 TB/s per device. Within an UltraServer, all chips connect through NeuronSwitch-v1 fabric switches in an all-to-all topology - a significant upgrade over Trainium2's 4x4x4 3D torus. This switch-based topology is specifically optimized for Mixture-of-Experts (MoE) communication patterns, where any-to-any routing between expert shards is critical for performance.

Key Specifications

| Specification | Details |
| --- | --- |
| Manufacturer | AWS (Amazon) |
| Architecture | Trainium3 (4th generation) |
| Process Node | TSMC 3nm (N3P) |
| NeuronCores per Chip | 8 (NeuronCore-v4) |
| MXFP8/MXFP4 Compute | 2.52 PFLOPS per chip |
| BF16/FP16/TF32 Compute | ~671 TFLOPS per chip |
| FP32 Compute | ~183 TFLOPS per chip |
| Per-NeuronCore MXFP8 | 315 TFLOPS |
| Per-NeuronCore BF16 | 79 TFLOPS |
| Memory | 144 GB HBM3e |
| Memory Bandwidth | 4.9 TB/s |
| On-chip SRAM | 32 MiB per NeuronCore (256 MiB total) |
| NeuronLink-v4 | 2 TB/s per device |
| Supported Data Types | FP32, TF32, BF16, FP16, MXFP8, MXFP4, INT8, INT16, INT32 |
| Sparsity Support | Structured: 4:16, 4:12, 4:8, 2:8, 2:4, 1:4, 1:2 |
| TDP | ~500W per chip |
| Cooling | Air (Gen1) or liquid (Gen2) |
| Availability | Cloud-only (AWS EC2) |
| GA Date | December 2, 2025 |
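One way to read the compute-to-bandwidth balance above is a simple roofline check: a kernel is compute-bound only when its arithmetic intensity (FLOPs per byte of HBM traffic) exceeds peak compute divided by memory bandwidth. A minimal sketch using only figures from the table:

```python
# Roofline breakeven: arithmetic intensity (FLOPs per byte of HBM
# traffic) above which a kernel is compute-bound rather than
# bandwidth-bound on a single Trainium3 chip.

HBM_BW = 4.9e12            # bytes/s
PEAK_MXFP8 = 2.52e15       # FLOPS per chip
PEAK_BF16 = 671e12         # FLOPS per chip (~671 TFLOPS)

print(f"MXFP8 breakeven: {PEAK_MXFP8 / HBM_BW:.0f} FLOPs/byte")  # 514
print(f"BF16 breakeven:  {PEAK_BF16 / HBM_BW:.0f} FLOPs/byte")   # 137
```

Large-batch matmuls clear ~514 FLOPs/byte easily; small-batch decode-time inference typically does not, which is where the 4.9 TB/s of bandwidth and the 256 MiB of on-chip SRAM matter more than peak FLOPS.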

NeuronCore-v4 Architecture

Each NeuronCore-v4 contains:

  • Tensor Engine: Dual systolic arrays - a 128x128 BF16 array and a 512x128 MXFP8/MXFP4 array. Peak throughput is 315 MXFP8 TFLOPS and 79 BF16 TFLOPS per core.
  • Vector Engine: 1.2 TFLOPS FP32 with 4x faster exponential computation (critical for softmax in attention). Handles MXFP8 quantization from BF16/FP16 on-chip.
  • Scalar Engine: 1.2 TFLOPS FP32 for scalar operations.
  • GPSIMD Engine: 8 fully-programmable 512-bit vector processors for custom C/C++ kernels with direct SRAM access.
  • SRAM: 32 MiB SBUF (scratch buffer) plus 2 MiB PSUM (partial sum) per core, with near-memory accumulation via DMA.
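To make the SBUF capacity concrete, here is a sizing sketch. The tile shape - a 128-row query block scored against an 8192-token context in BF16 - is an illustrative assumption, not a documented operating point:

```python
# Illustrative SBUF sizing: does a 128x8192 BF16 attention-score tile
# (one 128-row query block against an 8192-token context) fit in the
# 32 MiB SBUF of a single NeuronCore-v4?

SBUF_BYTES = 32 * 2**20   # 32 MiB per core
rows, ctx, bf16 = 128, 8192, 2

tile = rows * ctx * bf16                                 # bytes
print(tile / 2**20, "MiB per score tile")                # 2.0 MiB
print(SBUF_BYTES // tile, "such tiles fit in one SBUF")  # 16
```

At 2 MiB per tile, one core's SBUF holds 16 such tiles, which is the sense in which attention intermediates can stay on-chip instead of round-tripping to HBM.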

The GPSIMD engine is worth calling out. It's a set of fully programmable processors that allow developers to write custom C/C++ kernels that execute directly on the NeuronCore with access to the chip's SRAM. This is exposed through the Neuron Kernel Interface (NKI), which AWS has open-sourced under Apache 2.0. The GPSIMD effectively gives Trainium3 the kernel programmability of a GPU while maintaining the efficiency of a fixed-function accelerator for standard operations.

Performance Benchmarks

Trainium3 vs Trainium2

| Metric | Trainium2 | Trainium3 | Improvement |
| --- | --- | --- | --- |
| Compute per chip (FP8) | ~1.26 PFLOPS | 2.52 PFLOPS | 2x |
| HBM Capacity | 96 GB HBM3 | 144 GB HBM3e | 1.5x |
| Memory Bandwidth | ~2.9 TB/s | 4.9 TB/s | 1.7x |
| On-chip SRAM per NeuronCore | 28 MiB | 32 MiB | +14% |
| NeuronLink Bandwidth | ~1 TB/s | 2 TB/s | 2x |
| UltraServer Compute | ~81 PFLOPS | ~362 PFLOPS (Gen2) | 4.4x |
| UltraServer Memory | ~6.1 TB | ~20.7 TB (Gen2) | 3.4x |
| Performance per Watt | 1x | ~4x | 4x |

Trainium3 vs NVIDIA and Google

| Metric | Trainium3 | GB300 NVL72 (per GPU) | TPU v7 Ironwood |
| --- | --- | --- | --- |
| Process Node | TSMC 3nm (N3P) | TSMC 4NP | Not disclosed |
| FP8 per Chip | 2.52 PFLOPS | ~5.0 PFLOPS | 4.6 PFLOPS |
| HBM per Chip | 144 GB | 288 GB | 192 GB |
| HBM Bandwidth | 4.9 TB/s | 8 TB/s | 7.4 TB/s |
| Interconnect BW | 2 TB/s (NeuronLink) | 1.8 TB/s (NVLink 5) | ~1.2 TB/s (ICI) |
| Rack FP8 | ~362 PFLOPS (Gen2) | ~360 PFLOPS | 42.5 EXAFLOPS (Pod) |
| Rack HBM | ~20.7 TB (Gen2) | ~20.7 TB | ~1.77 PB (Pod) |
| Software | Neuron SDK | CUDA | XLA + JAX/TF |

Rack-Scale Parity with GB300 NVL72

At the UltraServer level, Trainium3 Gen2 hits rough FP8 parity with the GB300 NVL72 through a different strategy: more chips at lower per-chip performance. The Gen2 UltraServer packs 144 Trainium3 chips (at 2.52 PFLOPS each = ~362 PFLOPS total) versus the GB300's 72 GPUs (at ~5 PFLOPS each = ~360 PFLOPS total). Both configurations deliver about 20.7 TB of HBM. The Trainium3 UltraServer actually leads in aggregate memory bandwidth (705.6 TB/s vs 576 TB/s) and interconnect bandwidth (288 TB/s aggregate NeuronLink vs 130 TB/s NVLink).
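The parity arithmetic is easy to reproduce from the per-chip figures quoted above:

```python
# Rack-scale comparison rebuilt from per-chip numbers (as quoted above).
trn3 = {"chips": 144, "pflops": 2.52, "hbm_gb": 144, "bw_tbs": 4.9}
gb300 = {"chips": 72, "pflops": 5.0, "hbm_gb": 288, "bw_tbs": 8.0}

for name, sys in (("Trn3 Gen2", trn3), ("GB300 NVL72", gb300)):
    n = sys["chips"]
    print(name,
          f"{n * sys['pflops']:.1f} PFLOPS,",
          f"{n * sys['hbm_gb'] / 1000:.1f} TB HBM,",
          f"{n * sys['bw_tbs']:.1f} TB/s aggregate HBM BW")
# Trn3 Gen2   362.9 PFLOPS, 20.7 TB HBM, 705.6 TB/s aggregate HBM BW
# GB300 NVL72 360.0 PFLOPS, 20.7 TB HBM, 576.0 TB/s aggregate HBM BW
```

Note that the HBM totals are not merely close - 144 x 144 GB and 72 x 288 GB are exactly equal at 20,736 GB.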

The trade-off is per-chip performance. Each NVIDIA B300 GPU has roughly 2x the compute and bandwidth of a single Trainium3 chip, which means workloads that benefit from strong single-chip performance (small-batch inference, single-device debugging) run faster on NVIDIA hardware. Trainium3 compensates at scale by having 2x more chips per rack, but this requires good distributed scaling efficiency to fully utilize.

According to SemiAnalysis, the Trainium3 UltraServer delivers around 30% better TCO per marketed FP8 PFLOPS than the GB300 NVL72. This cost advantage is AWS's primary competitive argument.

Key Capabilities

NeuronSwitch-v1 All-to-All Topology. The most architecturally significant improvement over Trainium2 is the interconnect topology. Trainium2 used a 4x4x4 3D torus, where each chip could only communicate directly with its immediate neighbors. This created latency imbalances for communication patterns that do not map neatly to a torus topology.

Trainium3 UltraServers use NeuronSwitch-v1 fabric switches to provide all-to-all connectivity between all chips. In the Gen2 UltraServer (144 chips), any chip can communicate with any other chip at NeuronLink speeds without multi-hop routing. This is critical for Mixture-of-Experts models (like Mixtral and DeepSeek-V3), where the all-to-all communication pattern during expert routing creates communication hot spots on torus topologies. With NeuronSwitch, every expert-to-expert transfer takes the same amount of time regardless of which chips hold the expert shards.

The sub-10-microsecond inter-chip latency enables tight synchronization between chips for tensor-parallel training, where each chip holds a shard of every layer and must synchronize at every layer boundary.
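The torus-vs-switch difference can be made concrete with a hop-count model. This treats each Trainium2 chip as a coordinate in a 4x4x4 torus and counts per-dimension wraparound hops; it is a simplified routing model, not a measurement:

```python
# Hop-count comparison: Trainium2's 4x4x4 3D torus vs a switched
# all-to-all fabric. In a torus, per-dimension distance wraps around;
# through a NeuronSwitch-style fabric, every pair is equidistant.
from itertools import product

def torus_hops(a, b, dims=(4, 4, 4)):
    return sum(min(abs(x - y), d - abs(x - y))
               for x, y, d in zip(a, b, dims))

nodes = list(product(range(4), repeat=3))        # 64 chips
dists = [torus_hops(a, b) for a in nodes for b in nodes if a != b]

print("torus worst-case hops:", max(dists))                # 6
print(f"torus average hops: {sum(dists) / len(dists):.2f}")  # 3.05
# Switched all-to-all: every pair takes one uniform traversal, so there
# are no multi-hop hot spots regardless of expert placement.
```

The 6-hop worst case (versus a ~3-hop average) is exactly the kind of latency imbalance that penalizes all-to-all MoE traffic on a torus and that the switched fabric removes.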

UltraCluster 3.0 - Million-Chip Scale. AWS connects thousands of UltraServers into UltraCluster 3.0, which can contain up to 1 million Trainium3 chips in a single flat network domain via petabit-scale non-blocking Elastic Fabric Adapter (EFA) networking. This is the largest announced AI cluster scale from any provider. For context, Google's TPU v7 Ironwood Pods top out at 9,216 chips per Pod.

The million-chip scale is designed for training frontier models with trillions of parameters, where the data parallelism dimension requires tens of thousands of chips. Anthropic's Project Rainier is the reference deployment, with reportedly close to 1 million Trainium chips (mostly Trainium2 today, transitioning to Trainium3).
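For a sense of what the headline scale implies, some illustrative arithmetic, assuming a fleet built entirely from Gen2 UltraServers (which real deployments, mixing generations and SKUs, are not):

```python
# Illustrative scale-out arithmetic for a 1M-chip UltraCluster 3.0
# built purely from Gen2 UltraServers (an assumption for illustration).
CHIPS = 1_000_000
GEN2 = 144                        # chips per Gen2 UltraServer

print(CHIPS // GEN2, "Gen2 UltraServers (rounded down)")   # 6944
print(f"{CHIPS * 2.52 / 1e6:.2f} ZFLOPS aggregate MXFP8 (nominal peak)")
```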

Neuron SDK and Open-Source NKI Compiler. The Neuron SDK (version 2.27.0+) provides the software stack for Trainium3:

  • PyTorch Integration: Native backend via TorchNeuron with eager mode for debugging and distributed APIs (FSDP, DTensor). Full PyTorch 2.9 support with native integration planned for PyTorch 2.10.
  • JAX Support: Full JAX 0.7 support for JAX-native training workloads.
  • NKI (Neuron Kernel Interface): The NKI compiler is open-sourced under Apache 2.0, built on MLIR. It exposes the GPSIMD engines for custom kernel development with pre-optimized kernels for attention, MLP, and normalization.
  • vLLM Integration: Available through the vLLM-Neuron Plugin for inference serving.

The open-source NKI compiler is a competitive response to NVIDIA's proprietary CUDA ecosystem. By publishing the compiler and kernel interface, AWS enables the community to optimize Trainium performance and reduces the vendor lock-in concern that has historically favored CUDA.

Pricing and Availability

Trainium3 is cloud-only - AWS does not sell chips or servers. The hardware is available as EC2 Trn3 UltraServer instances.

| UltraServer SKU | Chips | HBM | FP8 Compute | Cooling | Status |
| --- | --- | --- | --- | --- | --- |
| Gen1 (NL32x2 Switched) | 64 | ~9.2 TB | ~161 PFLOPS | Air-cooled | GA (Dec 2025) |
| Gen2 (NL72x2 Switched) | 144 | ~20.7 TB | ~362 PFLOPS | Liquid-cooled | GA (Dec 2025) |

Cost Claims

AWS hasn't published standard on-demand hourly pricing for Trn3 instances. The UltraServers are mostly offered via Capacity Blocks and long-term contracts rather than standard per-instance pricing. AWS and its customers have published the following cost comparisons:

| Claim | Source |
| --- | --- |
| ~30% better TCO per FP8 PFLOPS vs GB300 NVL72 | SemiAnalysis |
| Up to 50% cost reduction vs GPU alternatives | AWS customers (Anthropic, Karakuri, others) |
| 30-40% better price-performance vs P5e/P5en instances | AWS |
| 4x faster inference at half the cost | Decart (video generation) |
| Over 50% reduction in LLM training costs | Karakuri (Japanese LLM) |

For reference, Trainium2 long-term contracts reportedly brought effective pricing to approximately $0.50/hour per chip. Trainium3 pricing with capacity commitments is expected to land in a similar per-chip range, in which case the 2x per-chip compute improvement translates to roughly 2x better price-performance.
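Under those figures, the price-performance arithmetic works out as follows. The Trainium2 rate is the reported figure; the Trainium3 per-chip rate is an assumption for illustration, not published pricing:

```python
# Effective $/PFLOPS-hour under reported and assumed committed pricing.
# Trainium2's ~$0.50/hr per chip is reported; the Trainium3 per-chip
# rate below is an illustrative assumption, not a published price.
trn2_price, trn2_pflops = 0.50, 1.26
trn3_price, trn3_pflops = 0.50, 2.52   # assumed similar per-chip rate

trn2_rate = trn2_price / trn2_pflops
trn3_rate = trn3_price / trn3_pflops
print(f"Trainium2: ${trn2_rate:.3f}/PFLOPS-hr")   # $0.397
print(f"Trainium3: ${trn3_rate:.3f}/PFLOPS-hr")   # $0.198
print(f"price-performance gain: {trn2_rate / trn3_rate:.1f}x")  # 2.0x
```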

Customer Adoption

Anthropic leads the Trainium customer base. Through Project Rainier, Anthropic has launched nearly 1 million Trainium chips - the largest single customer deployment of non-NVIDIA AI silicon. Amazon has invested $8 billion in Anthropic, and the Trainium partnership is central to that investment. Project Rainier provides Anthropic about 5x the compute used to train earlier Claude versions.

Other confirmed customers include Karakuri (Japan's most accurate Japanese language model), Decart (real-time video generation), Metagenomi (genomics AI), Poolside (AI applications), and partnerships with Hugging Face (Optimum Neuron integration) and Red Hat.

Strengths

  • 2.52 PFLOPS MXFP8 per chip on TSMC 3nm - the most advanced process node in any shipping AI accelerator
  • Gen2 UltraServer hits rack-scale FP8 parity with GB300 NVL72 at reported 30% better TCO
  • NeuronSwitch-v1 all-to-all topology optimized for MoE communication patterns - major improvement over Trn2's 3D torus
  • UltraCluster 3.0 scales to 1 million chips in a single flat network - the largest announced AI cluster architecture
  • Anthropic's near-million-chip deployment validates the silicon for frontier model training and inference
  • GPSIMD engines with open-source NKI compiler provide GPU-like kernel programmability
  • 4x better performance-per-watt than Trainium2 addresses datacenter power constraints
  • Native PyTorch and JAX support with active framework integration roadmap
  • 144GB HBM3e with 4.9 TB/s bandwidth matches the capacity and bandwidth tier of leading GPUs

Weaknesses

  • Cloud-only availability - cannot purchase hardware for on-premises deployment
  • Per-chip compute (2.52 PFLOPS FP8) is roughly half of NVIDIA B300 (~5 PFLOPS) and Google TPU v7 (4.6 PFLOPS)
  • Neuron SDK ecosystem is notably less mature than CUDA - fewer optimized models and kernels available
  • No FP4 speedup - MXFP4 runs at the same TFLOPS as MXFP8, while NVIDIA holds a roughly 3x advantage at FP4
  • Capacity Blocks and long-term contracts required for most access - limited on-demand availability
  • Heavily dependent on Anthropic as anchor customer - concentrated customer risk
  • AWS lock-in - workloads optimized for Neuron SDK cannot easily migrate to other cloud providers
  • NeuronSwitch-v1 is first-generation - real-world reliability and performance at million-chip scale are unproven

About the author - AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.