AWS Trainium3 - Amazon's 3nm AI Accelerator
Complete specs, benchmarks, and analysis of AWS Trainium3 - Amazon's TSMC 3nm AI chip with 2.52 PFLOPS FP8, 144GB HBM3e, and NeuronLink-v4, powering Anthropic's Claude through Project Rainier.

TL;DR
- AWS's fourth-generation custom AI chip on TSMC 3nm (N3P) - 2.52 PFLOPS MXFP8 per chip, 2x Trainium2's per-chip compute and 4.4x at the UltraServer level
- 8 NeuronCore-v4 engines per chip, each with 128x128 BF16 and 512x128 MXFP8 systolic arrays plus 32 MiB of dedicated SRAM
- 144GB HBM3e at 4.9 TB/s bandwidth with 2 TB/s NeuronLink-v4 interconnect between chips
- Ships as EC2 Trn3 UltraServers - Gen1 (64 chips, air-cooled) and Gen2 (144 chips, liquid-cooled, ~362 PFLOPS FP8)
- Anthropic is the anchor customer via Project Rainier - nearly 1 million Trainium chips training and serving Claude
Overview
AWS Trainium3 is Amazon's statement that it does not need NVIDIA to build the infrastructure for frontier AI. Announced at re:Invent 2025 and generally available since December 2025, Trainium3 is the fourth generation of AWS custom AI silicon (following Inferentia, Trainium, and Trainium2). It is fabricated on TSMC's 3nm N3P process - the most advanced node used in any shipping AI accelerator - and delivers 2.52 PFLOPS of MXFP8 compute per chip with 144GB of HBM3e memory.
The chip matters because of who's using it. Anthropic - the AI safety company behind Claude - is the anchor customer for AWS's custom silicon program. Through Project Rainier, Anthropic has launched nearly 1 million Trainium chips (primarily Trainium2, now transitioning to Trainium3) across AWS data centers in Indiana. Anthropic reportedly provided direct input into Trainium3's design, focusing on training speed, inference latency, and energy efficiency. When the company building one of the top-3 frontier models designs its training infrastructure around your chip, that's the strongest possible validation of the silicon.
The Trainium3 architecture centers on the NeuronCore-v4 - AWS's fourth-generation compute engine. Each chip contains 8 NeuronCore-v4 units, each packing a BF16 systolic array (128x128), an MXFP8/MXFP4 systolic array (512x128), a vector engine with accelerated exponential computation, and 32 MiB of dedicated SRAM. The total per-chip SRAM is 256 MiB - enough to hold significant attention intermediates on-chip without round-tripping to HBM.
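The per-chip headline numbers follow directly from the per-core figures; a quick arithmetic check (spec values as quoted above):

```python
# Sanity-check the per-chip Trainium3 figures from the per-NeuronCore
# numbers (8 NeuronCore-v4 units per chip, values as quoted in the text).
CORES_PER_CHIP = 8
MXFP8_TFLOPS_PER_CORE = 315   # peak of the 512x128 MXFP8 systolic array
SRAM_MIB_PER_CORE = 32        # dedicated SBUF per core

chip_mxfp8_pflops = CORES_PER_CHIP * MXFP8_TFLOPS_PER_CORE / 1000
chip_sram_mib = CORES_PER_CHIP * SRAM_MIB_PER_CORE

print(f"MXFP8 per chip: {chip_mxfp8_pflops:.2f} PFLOPS")  # 2.52 PFLOPS
print(f"SRAM per chip:  {chip_sram_mib} MiB")             # 256 MiB
```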
The chip connects to its neighbors via NeuronLink-v4, a proprietary interconnect delivering 2 TB/s per device. Within an UltraServer, all chips connect through NeuronSwitch-v1 fabric switches in an all-to-all topology - a significant upgrade over Trainium2's 4x4x4 3D torus. This switch-based topology is specifically optimized for Mixture-of-Experts (MoE) communication patterns, where any-to-any routing between expert shards is critical for performance.
Key Specifications
| Specification | Details |
|---|---|
| Manufacturer | AWS (Amazon) |
| Architecture | Trainium3 (4th generation) |
| Process Node | TSMC 3nm (N3P) |
| NeuronCores per Chip | 8 (NeuronCore-v4) |
| MXFP8/MXFP4 Compute | 2.52 PFLOPS per chip |
| BF16/FP16/TF32 Compute | ~671 TFLOPS per chip |
| FP32 Compute | ~183 TFLOPS per chip |
| Per-NeuronCore MXFP8 | 315 TFLOPS |
| Per-NeuronCore BF16 | 79 TFLOPS |
| Memory | 144 GB HBM3e |
| Memory Bandwidth | 4.9 TB/s |
| On-chip SRAM | 32 MiB per NeuronCore (256 MiB total) |
| NeuronLink-v4 | 2 TB/s per device |
| Supported Data Types | FP32, TF32, BF16, FP16, MXFP8, MXFP4, INT8, INT16, INT32 |
| Sparsity Support | Structured: 4:16, 4:12, 4:8, 2:8, 2:4, 1:4, 1:2 |
| TDP | ~500W per chip |
| Cooling | Air (Gen1) or liquid (Gen2) |
| Availability | Cloud-only (AWS EC2) |
| GA Date | December 2, 2025 |
NeuronCore-v4 Architecture
Each NeuronCore-v4 contains:
- Tensor Engine: Dual systolic arrays - a 128x128 BF16 array and a 512x128 MXFP8/MXFP4 array. Peak throughput is 315 MXFP8 TFLOPS and 79 BF16 TFLOPS per core.
- Vector Engine: 1.2 TFLOPS FP32 with 4x faster exponential computation (critical for softmax in attention). Handles MXFP8 quantization from BF16/FP16 on-chip.
- Scalar Engine: 1.2 TFLOPS FP32 for scalar operations.
- GPSIMD Engine: 8 fully-programmable 512-bit vector processors for custom C/C++ kernels with direct SRAM access.
- SRAM: 32 MiB SBUF (scratch buffer) plus 2 MiB PSUM (partial sum) per core, with near-memory accumulation via DMA.
The GPSIMD engine is worth calling out. It's a set of fully programmable processors that allow developers to write custom C/C++ kernels that execute directly on the NeuronCore with access to the chip's SRAM. This is exposed through the Neuron Kernel Interface (NKI), which AWS has open-sourced under Apache 2.0. The GPSIMD effectively gives Trainium3 the kernel programmability of a GPU while maintaining the efficiency of a fixed-function accelerator for standard operations.
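The programming model NKI kernels target - staging operand tiles into per-core SBUF, accumulating partial products in PSUM, then writing results back to HBM - can be sketched in plain NumPy. This is a conceptual illustration only: NumPy stands in for the NKI API, and the 128-wide tile size mirrors the 128-lane systolic-array dimension but is otherwise arbitrary.

```python
import numpy as np

# Conceptual sketch of the tiled execution pattern NKI kernels target.
# Plain NumPy stands in for the NKI API; tile size is illustrative.
TILE = 128

def tiled_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    m, k = a.shape
    k2, n = b.shape
    assert k == k2
    out = np.zeros((m, n), dtype=np.float32)
    for i in range(0, m, TILE):
        for j in range(0, n, TILE):
            # "PSUM": accumulator for one output tile
            acc = np.zeros((min(TILE, m - i), min(TILE, n - j)),
                           dtype=np.float32)
            for p in range(0, k, TILE):
                # "SBUF loads": stage one tile of each operand, then
                # feed the (stand-in) tensor engine
                acc += a[i:i+TILE, p:p+TILE] @ b[p:p+TILE, j:j+TILE]
            # write the finished tile back to "HBM"
            out[i:i+TILE, j:j+TILE] = acc
    return out

rng = np.random.default_rng(0)
a = rng.standard_normal((256, 384), dtype=np.float32)
b = rng.standard_normal((384, 256), dtype=np.float32)
assert np.allclose(tiled_matmul(a, b), a @ b, rtol=1e-3, atol=1e-3)
```

The point of the sketch is the memory hierarchy, not the math: real NKI kernels get their speed from keeping each working tile resident in the 32 MiB SBUF while the systolic array streams over it.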
Performance Benchmarks
Trainium3 vs Trainium2
| Metric | Trainium2 | Trainium3 | Improvement |
|---|---|---|---|
| Compute per chip (FP8) | ~1.26 PFLOPS | 2.52 PFLOPS | 2x |
| HBM Capacity | 96 GB HBM3 | 144 GB HBM3e | 1.5x |
| Memory Bandwidth | ~2.9 TB/s | 4.9 TB/s | 1.7x |
| On-chip SRAM per NeuronCore | 28 MiB | 32 MiB | +14% |
| NeuronLink Bandwidth | ~1 TB/s | 2 TB/s | 2x |
| UltraServer Compute | ~81 PFLOPS | ~362 PFLOPS (Gen2) | 4.4x |
| UltraServer Memory | ~6.1 TB | ~20.7 TB (Gen2) | 3.4x |
| Performance per Watt | 1x | ~4x | 4x |
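The improvement column can be recomputed from the two spec columns; the quoted figures round these ratios, and the table entries are themselves rounded, so the last digit can differ slightly:

```python
# Recompute the generational ratios from the (rounded) spec columns
# of the Trainium2 -> Trainium3 table above.
specs = {
    "FP8 PFLOPS per chip": (1.26, 2.52),
    "HBM capacity (GB)":   (96, 144),
    "HBM BW (TB/s)":       (2.9, 4.9),
    "NeuronLink (TB/s)":   (1.0, 2.0),
    "UltraServer PFLOPS":  (81, 362),
    "UltraServer HBM (TB)": (6.1, 20.7),
}
for name, (trn2, trn3) in specs.items():
    print(f"{name:22s} {trn3 / trn2:.2f}x")
```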
Trainium3 vs NVIDIA and Google
| Metric | Trainium3 | GB300 NVL72 (per GPU) | TPU v7 Ironwood |
|---|---|---|---|
| Process Node | TSMC 3nm (N3P) | TSMC 4NP | Not disclosed |
| FP8 per Chip | 2.52 PFLOPS | ~5.0 PFLOPS | 4.6 PFLOPS |
| HBM per Chip | 144 GB | 288 GB | 192 GB |
| HBM Bandwidth | 4.9 TB/s | 8 TB/s | 7.4 TB/s |
| Interconnect BW | 2 TB/s (NeuronLink) | 1.8 TB/s (NVLink 5) | ~1.2 TB/s (ICI) |
| Rack FP8 | ~362 PFLOPS (Gen2) | ~360 PFLOPS | 42.5 EXAFLOPS (Pod) |
| Rack HBM | ~20.7 TB (Gen2) | ~20.7 TB | ~1.77 PB (Pod) |
| Software | Neuron SDK | CUDA | XLA + JAX/TF |
Rack-Scale Parity with GB300 NVL72
At the UltraServer level, Trainium3 Gen2 hits rough FP8 parity with the GB300 NVL72 through a different strategy: more chips at lower per-chip performance. The Gen2 UltraServer packs 144 Trainium3 chips (at 2.52 PFLOPS each = ~362 PFLOPS total) versus the GB300's 72 GPUs (at ~5 PFLOPS each = ~360 PFLOPS total). Both configurations deliver about 20.7 TB of HBM. The Trainium3 UltraServer actually leads in aggregate memory bandwidth (705.6 TB/s vs 576 TB/s) and interconnect bandwidth (288 TB/s aggregate NeuronLink vs 130 TB/s NVLink).
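The rack-level aggregates behind the parity claim are straightforward products of chip count and per-chip specs (per-chip figures as quoted above):

```python
# Rack-level aggregates: many mid-size chips (Trainium3 Gen2)
# vs fewer large ones (GB300 NVL72). Per-chip specs as quoted above.
def rack(chips, fp8_pflops, hbm_gb, hbm_tbps):
    return {
        "FP8 PFLOPS": chips * fp8_pflops,
        "HBM TB": chips * hbm_gb / 1000,
        "HBM BW TB/s": chips * hbm_tbps,
    }

trn3_gen2 = rack(chips=144, fp8_pflops=2.52, hbm_gb=144, hbm_tbps=4.9)
gb300 = rack(chips=72, fp8_pflops=5.0, hbm_gb=288, hbm_tbps=8.0)
for name, r in (("Trn3 Gen2", trn3_gen2), ("GB300 NVL72", gb300)):
    print(name, {k: round(v, 1) for k, v in r.items()})
```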
The trade-off is per-chip performance. Each NVIDIA B300 GPU has roughly 2x the compute and bandwidth of a single Trainium3 chip, which means workloads that benefit from strong single-chip performance (small-batch inference, single-device debugging) run faster on NVIDIA hardware. Trainium3 compensates at scale by packing twice as many chips per rack, but fully utilizing them requires good distributed scaling efficiency.
Per SemiAnalysis, the Trainium3 UltraServer delivers around 30% better TCO per marketed FP8 PFLOPS than the GB300 NVL72. This cost advantage is AWS's primary competitive argument.
Key Capabilities
NeuronSwitch-v1 All-to-All Topology. The most architecturally significant improvement over Trainium2 is the interconnect topology. Trainium2 used a 4x4x4 3D torus, where each chip could only communicate directly with its immediate neighbors. This created latency imbalances for communication patterns that do not map neatly to a torus topology.
Trainium3 UltraServers use NeuronSwitch-v1 fabric switches to provide all-to-all connectivity between all chips. In the Gen2 UltraServer (144 chips), any chip can communicate with any other chip at NeuronLink speeds without multi-hop routing. This is critical for Mixture-of-Experts models (like Mixtral and DeepSeek-V3), where the all-to-all communication pattern during expert routing creates communication hot spots on torus topologies. With NeuronSwitch, every expert-to-expert transfer takes the same amount of time regardless of which chips hold the expert shards.
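The difference between the two topologies is easy to quantify in hop counts. On a torus, distance along each axis wraps around; on the switched fabric, every pair is one hop apart:

```python
# Hop counts: Trainium2's 4x4x4 3D torus vs a switched all-to-all fabric.
# On a torus, per-axis distance wraps: min(d, dim - d).
from itertools import product

def torus_hops(a, b, dims=(4, 4, 4)):
    return sum(min(abs(x - y), d - abs(x - y))
               for x, y, d in zip(a, b, dims))

nodes = list(product(range(4), repeat=3))
pairs = [(a, b) for a in nodes for b in nodes if a != b]
avg = sum(torus_hops(a, b) for a, b in pairs) / len(pairs)
worst = max(torus_hops(a, b) for a, b in pairs)
print(f"4x4x4 torus: avg {avg:.2f} hops, worst {worst} hops")
print("switched all-to-all: 1 hop for every pair")
```

The worst-case pair on the torus is six hops away while the average is about three, which is exactly the kind of distance-dependent variance that creates MoE routing hot spots; the switch collapses every pair to a uniform one hop.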
The sub-10-microsecond inter-chip latency enables tight synchronization between chips for tensor-parallel training, where each chip holds a shard of every layer and must synchronize at every layer boundary.
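Why latency matters here can be seen with a standard alpha-beta cost model for a ring all-reduce. Only the ~10 us latency bound and the 2 TB/s NeuronLink figure come from the text; the message size and group size below are illustrative:

```python
# Alpha-beta cost model for a ring all-reduce across n chips:
#   time ~= 2*(n-1)*alpha + 2*(n-1)/n * (bytes / bandwidth)
# (reduce-scatter + all-gather, n-1 steps each). The hop latency and
# message size are illustrative assumptions, not published figures.
def ring_allreduce_us(n_chips, msg_bytes, link_tbps=2.0, hop_latency_us=10.0):
    latency = 2 * (n_chips - 1) * hop_latency_us
    transfer = (2 * (n_chips - 1) / n_chips
                * msg_bytes / (link_tbps * 1e12) * 1e6)
    return latency, transfer

# e.g. 64 MiB of activations across an 8-way tensor-parallel group:
lat, xfer = ring_allreduce_us(8, 64 * 2**20)
print(f"latency term: {lat:.0f} us, transfer term: {xfer:.0f} us")
```

Under these assumptions the latency term dominates the wire-transfer term, which is why sub-10-microsecond hops, not just link bandwidth, gate tensor-parallel step time.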
UltraCluster 3.0 - Million-Chip Scale. AWS connects thousands of UltraServers into UltraCluster 3.0, which can contain up to 1 million Trainium3 chips in a single flat network domain via petabit-scale non-blocking Elastic Fabric Adapter (EFA) networking. This is the largest announced AI cluster scale from any provider. For context, Google's TPU v7 Ironwood Pods top out at 9,216 chips per Pod.
The million-chip scale is designed for training frontier models with trillions of parameters, where the data parallelism dimension requires tens of thousands of chips. Anthropic's Project Rainier is the reference deployment, with reportedly close to 1 million Trainium chips (mostly Trainium2 today, transitioning to Trainium3).
Neuron SDK and Open-Source NKI Compiler. The Neuron SDK (version 2.27.0+) provides the software stack for Trainium3:
- PyTorch Integration: Native backend via TorchNeuron with eager mode for debugging and distributed APIs (FSDP, DTensor). Full PyTorch 2.9 support with native integration planned for PyTorch 2.10.
- JAX Support: Full JAX 0.7 support for JAX-native training workloads.
- NKI (Neuron Kernel Interface): The NKI compiler is open-sourced under Apache 2.0, built on MLIR. It exposes the GPSIMD engines for custom kernel development with pre-optimized kernels for attention, MLP, and normalization.
- vLLM Integration: Available through the vLLM-Neuron Plugin for inference serving.
The open-source NKI compiler is a competitive response to NVIDIA's proprietary CUDA ecosystem. By publishing the compiler and kernel interface, AWS enables the community to optimize Trainium performance and reduces the vendor lock-in concern that has historically favored CUDA.
Pricing and Availability
Trainium3 is cloud-only - AWS does not sell chips or servers. The hardware is available as EC2 Trn3 UltraServer instances.
| UltraServer SKU | Chips | HBM | FP8 Compute | Cooling | Status |
|---|---|---|---|---|---|
| Gen1 (NL32x2 Switched) | 64 | ~9.2 TB | ~161 PFLOPS | Air-cooled | GA (Dec 2025) |
| Gen2 (NL72x2 Switched) | 144 | ~20.7 TB | ~362 PFLOPS | Liquid-cooled | GA (Dec 2025) |
Cost Claims
AWS hasn't published standard on-demand hourly pricing for Trn3 instances. The UltraServers are mostly offered via Capacity Blocks and long-term contracts rather than standard per-instance pricing. AWS and its customers have published the following cost comparisons:
| Claim | Source |
|---|---|
| ~30% better TCO per FP8 PFLOPS vs GB300 NVL72 | SemiAnalysis |
| Up to 50% cost reduction vs GPU alternatives | AWS customers (Anthropic, Karakuri, others) |
| 30-40% better price-performance vs P5e/P5en instances | AWS |
| 4x faster inference at half the cost | Decart (video generation) |
| Over 50% reduction in LLM training costs | Karakuri (Japanese LLM) |
For reference, Trainium2 long-term contracts reportedly brought effective pricing to approximately $0.50/hour per chip. If Trainium3 capacity commitments land in a similar per-chip range, the 2x compute improvement would translate into roughly 2x better price-performance.
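A back-of-envelope comparison using the reported ~$0.50/hour Trainium2 committed rate shows how the price-performance math works out. The Trainium3 hourly rate below is an assumption for illustration, not published AWS pricing:

```python
# Back-of-envelope $/PFLOPS-hour. The Trainium3 hourly rate is an
# illustrative assumption (same per-chip rate as Trainium2 committed
# pricing), not a published AWS figure.
trn2 = {"price_hr": 0.50, "fp8_pflops": 1.26}
trn3 = {"price_hr": 0.50, "fp8_pflops": 2.52}  # assumed rate

for name, c in (("Trainium2", trn2), ("Trainium3", trn3)):
    print(f"{name}: ${c['price_hr'] / c['fp8_pflops']:.3f} per PFLOPS-hour")
```

If the per-chip rate holds, cost per PFLOPS-hour halves - the "roughly 2x better price-performance" framing above.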
Customer Adoption
Anthropic leads the Trainium customer base. Through Project Rainier, Anthropic has launched nearly 1 million Trainium chips - the largest single customer deployment of non-NVIDIA AI silicon. Amazon has invested $8 billion in Anthropic, and the Trainium partnership is central to that investment. Project Rainier provides Anthropic about 5x the compute used to train earlier Claude versions.
Other confirmed customers include Karakuri (Japan's most accurate Japanese language model), Decart (real-time video generation), Metagenomi (genomics AI), Poolside (AI applications), and partnerships with Hugging Face (Optimum Neuron integration) and Red Hat.
Strengths
- 2.52 PFLOPS MXFP8 per chip on TSMC 3nm - the most advanced process node in any shipping AI accelerator
- Gen2 UltraServer hits rack-scale FP8 parity with GB300 NVL72 at reported 30% better TCO
- NeuronSwitch-v1 all-to-all topology optimized for MoE communication patterns - major improvement over Trn2's 3D torus
- UltraCluster 3.0 scales to 1 million chips in a single flat network - the largest announced AI cluster architecture
- Anthropic's near-million-chip deployment verifies the silicon for frontier model training and inference
- GPSIMD engines with open-source NKI compiler provide GPU-like kernel programmability
- 4x better performance-per-watt than Trainium2 addresses datacenter power constraints
- Native PyTorch and JAX support with active framework integration roadmap
- 144GB HBM3e with 4.9 TB/s bandwidth matches the capacity and bandwidth tier of leading GPUs
Weaknesses
- Cloud-only availability - cannot purchase hardware for on-premises deployment
- Per-chip compute (2.52 PFLOPS FP8) is roughly half of NVIDIA B300 (~5 PFLOPS) and Google TPU v7 (4.6 PFLOPS)
- Neuron SDK ecosystem is notably less mature than CUDA - fewer optimized models and kernels available
- No FP4 throughput uplift - MXFP4 runs at the same TFLOPS as MXFP8, while NVIDIA's native FP4 support gives it a roughly 3x advantage at that precision
- Capacity Blocks and long-term contracts required for most access - limited on-demand availability
- Heavily dependent on Anthropic as anchor customer - concentrated customer risk
- AWS lock-in - workloads optimized for Neuron SDK cannot easily migrate to other cloud providers
- NeuronSwitch-v1 is first-generation - real-world reliability and performance at million-chip scale are unproven
Related Coverage
- AWS Trainium2 - Amazon's Custom AI Chip - The previous generation that Trainium3 succeeds
- NVIDIA GB300 NVL72 - Blackwell Ultra Rack - The primary competitive system at rack scale
- Google TPU v7 Ironwood - Google's competing custom accelerator
- Intel Gaudi 3 - Intel's AI accelerator competing in the cloud AI space
- Cerebras WSE-3 - Wafer-Scale AI Engine - Another non-GPU approach to AI training at scale
Sources
- AWS EC2 Trn3 UltraServers - Official Product Page
- AWS Trainium3 UltraServer Delivers Faster AI Training at Lower Cost - About Amazon
- AWS Trainium3 Deep Dive - A Potential Inflection Point - SemiAnalysis
- NeuronCore-v4 Architecture - AWS Neuron Documentation
- Trn3 Architecture - AWS Neuron Documentation
- With Trainium4, AWS Will Crank Up Everything But The Clocks - The Next Platform
- Amazon Launches Trainium3 AI Accelerator - Tom's Hardware
- Amazon Releases an Impressive New AI Chip - TechCrunch
- AWS Activates Project Rainier - About Amazon
- AWS Makes Trainium3 UltraServers Generally Available - DataCenterDynamics
