
AWS Trainium3 - Amazon's 3nm AI Accelerator

TL;DR

  • AWS's fourth-generation custom AI chip on TSMC 3nm (N3P) - 2.52 PFLOPS MXFP8 per chip (2x Trainium2) and up to 4.4x more compute per UltraServer
  • 8 NeuronCore-v4 engines per chip, each with 128x128 BF16 and 512x128 MXFP8 systolic arrays plus 32 MiB of dedicated SRAM
  • 144GB HBM3e at 4.9 TB/s bandwidth with 2 TB/s NeuronLink-v4 interconnect between chips
  • Ships as EC2 Trn3 UltraServers - Gen1 (64 chips, air-cooled) and Gen2 (144 chips, liquid-cooled, ~362 PFLOPS FP8)
  • Anthropic is the anchor customer via Project Rainier - nearly 1 million Trainium chips training and serving Claude

Overview

AWS Trainium3 is Amazon's statement that it does not need NVIDIA to build the infrastructure for frontier AI. Announced at re:Invent 2025 and generally available since December 2025, Trainium3 is the fourth generation of AWS custom AI silicon (following Inferentia, Trainium, and Trainium2). It is fabricated on TSMC's 3nm N3P process - the most advanced node used in any shipping AI accelerator - and delivers 2.52 PFLOPS of MXFP8 compute per chip with 144GB of HBM3e memory.

The chip matters because of who's using it. Anthropic - the AI safety company behind Claude - is the anchor customer for AWS's custom silicon program. Through Project Rainier, Anthropic has launched nearly 1 million Trainium chips (primarily Trainium2, now transitioning to Trainium3) across AWS data centers in Indiana. Anthropic reportedly provided direct input into Trainium3's design, focusing on training speed, inference latency, and energy efficiency. When the company building one of the top-3 frontier models designs its training infrastructure around your chip, that's the strongest possible validation of the silicon.

The Trainium3 architecture centers on the NeuronCore-v4 - AWS's fourth-generation compute engine. Each chip contains 8 NeuronCore-v4 units, each packing a BF16 systolic array (128x128), an MXFP8/MXFP4 systolic array (512x128), a vector engine with accelerated exponential computation, and 32 MiB of dedicated SRAM. The total per-chip SRAM is 256 MiB - enough to hold significant attention intermediates on-chip without round-tripping to HBM.
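These array dimensions let you sanity-check the published throughput figures. A systolic array's peak is rows x cols x 2 FLOPs per MAC per cycle, so working backwards from the per-core numbers gives the implied clock. The clock itself is not published, so this is back-of-envelope inference, not a spec:

```python
# Back-of-envelope: infer the clock implied by the published per-core
# peak throughput of each NeuronCore-v4 systolic array.
# Peak FLOPS = rows * cols * 2 (multiply + accumulate) * clock_hz.

def implied_clock_ghz(peak_tflops: float, rows: int, cols: int) -> float:
    ops_per_cycle = rows * cols * 2          # one MAC = 2 FLOPs
    return peak_tflops * 1e12 / ops_per_cycle / 1e9

mxfp8 = implied_clock_ghz(315.0, 512, 128)   # MXFP8/MXFP4 array
bf16 = implied_clock_ghz(79.0, 128, 128)     # BF16 array

print(f"implied clock (MXFP8 array): {mxfp8:.2f} GHz")  # 2.40 GHz
print(f"implied clock (BF16 array):  {bf16:.2f} GHz")   # 2.41 GHz

# Per-chip total from 8 NeuronCores matches the headline number:
print(8 * 315 / 1000)  # 2.52 PFLOPS MXFP8
```

Both arrays back out to roughly the same ~2.4 GHz, which suggests the per-core figures are internally consistent.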

The chip connects to its neighbors via NeuronLink-v4, a proprietary interconnect delivering 2 TB/s per device. Within an UltraServer, all chips connect through NeuronSwitch-v1 fabric switches in an all-to-all topology - a significant upgrade over Trainium2's 4x4x4 3D torus. This switch-based topology is specifically optimized for Mixture-of-Experts (MoE) communication patterns, where any-to-any routing between expert shards is critical for performance.

Key Specifications

| Specification | Details |
| --- | --- |
| Manufacturer | AWS (Amazon) |
| Architecture | Trainium3 (4th generation) |
| Process Node | TSMC 3nm (N3P) |
| NeuronCores per Chip | 8 (NeuronCore-v4) |
| MXFP8/MXFP4 Compute | 2.52 PFLOPS per chip |
| BF16/FP16/TF32 Compute | ~671 TFLOPS per chip |
| FP32 Compute | ~183 TFLOPS per chip |
| Per-NeuronCore MXFP8 | 315 TFLOPS |
| Per-NeuronCore BF16 | 79 TFLOPS |
| Memory | 144 GB HBM3e |
| Memory Bandwidth | 4.9 TB/s |
| On-chip SRAM | 32 MiB per NeuronCore (256 MiB total) |
| NeuronLink-v4 | 2 TB/s per device |
| Supported Data Types | FP32, TF32, BF16, FP16, MXFP8, MXFP4, INT8, INT16, INT32 |
| Sparsity Support | Structured: 4:16, 4:12, 4:8, 2:8, 2:4, 1:4, 1:2 |
| TDP | ~500W per chip |
| Cooling | Air (Gen1) or liquid (Gen2) |
| Availability | Cloud-only (AWS EC2) |
| GA Date | December 2, 2025 |
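One way to read the compute-to-bandwidth balance above is a simple roofline check: a kernel is compute-bound only when its arithmetic intensity (FLOPs per byte of HBM traffic) exceeds peak compute divided by memory bandwidth. A minimal sketch using only figures from the table:

```python
# Roofline breakeven: arithmetic intensity (FLOPs per byte of HBM
# traffic) above which a kernel is compute-bound rather than
# bandwidth-bound on a single Trainium3 chip.

HBM_BW = 4.9e12            # bytes/s
PEAK_MXFP8 = 2.52e15       # FLOPS per chip
PEAK_BF16 = 671e12         # FLOPS per chip (~671 TFLOPS)

print(f"MXFP8 breakeven: {PEAK_MXFP8 / HBM_BW:.0f} FLOPs/byte")  # 514
print(f"BF16 breakeven:  {PEAK_BF16 / HBM_BW:.0f} FLOPs/byte")   # 137
```

Large-batch matmuls clear ~514 FLOPs/byte easily; small-batch decode-time inference typically does not, which is where the 4.9 TB/s of bandwidth and the 256 MiB of on-chip SRAM matter more than peak FLOPS.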

NeuronCore-v4 Architecture

Each NeuronCore-v4 contains:

  • Tensor Engine: Dual systolic arrays - a 128x128 BF16 array and a 512x128 MXFP8/MXFP4 array. Peak throughput is 315 MXFP8 TFLOPS and 79 BF16 TFLOPS per core.
  • Vector Engine: 1.2 TFLOPS FP32 with 4x faster exponential computation (critical for softmax in attention). Handles MXFP8 quantization from BF16/FP16 on-chip.
  • Scalar Engine: 1.2 TFLOPS FP32 for scalar operations.
  • GPSIMD Engine: 8 fully-programmable 512-bit vector processors for custom C/C++ kernels with direct SRAM access.
  • SRAM: 32 MiB SBUF (scratch buffer) plus 2 MiB PSUM (partial sum) per core, with near-memory accumulation via DMA.
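To make the SBUF capacity concrete, here is a sizing sketch. The tile shape - a 128-row query block scored against an 8192-token context in BF16 - is an illustrative assumption, not a documented operating point:

```python
# Illustrative SBUF sizing: does a 128x8192 BF16 attention-score tile
# (one 128-row query block against an 8192-token context) fit in the
# 32 MiB SBUF of a single NeuronCore-v4?

SBUF_BYTES = 32 * 2**20   # 32 MiB per core
rows, ctx, bf16 = 128, 8192, 2

tile = rows * ctx * bf16                                 # bytes
print(tile / 2**20, "MiB per score tile")                # 2.0 MiB
print(SBUF_BYTES // tile, "such tiles fit in one SBUF")  # 16
```

At 2 MiB per tile, one core's SBUF holds 16 such tiles, which is the sense in which attention intermediates can stay on-chip instead of round-tripping to HBM.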

The GPSIMD engine is worth calling out. It's a set of fully programmable processors that allow developers to write custom C/C++ kernels that execute directly on the NeuronCore with access to the chip's SRAM. This is exposed through the Neuron Kernel Interface (NKI), which AWS has open-sourced under Apache 2.0. The GPSIMD effectively gives Trainium3 the kernel programmability of a GPU while maintaining the efficiency of a fixed-function accelerator for standard operations.

Performance Benchmarks

Trainium3 vs Trainium2

| Metric | Trainium2 | Trainium3 | Improvement |
| --- | --- | --- | --- |
| Compute per chip (FP8) | ~1.26 PFLOPS | 2.52 PFLOPS | 2x |
| HBM Capacity | 96 GB HBM3 | 144 GB HBM3e | 1.5x |
| Memory Bandwidth | ~2.9 TB/s | 4.9 TB/s | 1.7x |
| On-chip SRAM per NeuronCore | 28 MiB | 32 MiB | +14% |
| NeuronLink Bandwidth | ~1 TB/s | 2 TB/s | 2x |
| UltraServer Compute | ~81 PFLOPS | ~362 PFLOPS (Gen2) | 4.4x |
| UltraServer Memory | ~6.1 TB | ~20.7 TB (Gen2) | 3.4x |
| Performance per Watt | 1x | ~4x | 4x |

Trainium3 vs NVIDIA and Google

| Metric | Trainium3 | GB300 NVL72 (per GPU) | TPU v7 Ironwood |
| --- | --- | --- | --- |
| Process Node | TSMC 3nm (N3P) | TSMC 4NP | Not disclosed |
| FP8 per Chip | 2.52 PFLOPS | ~5.0 PFLOPS | 4.6 PFLOPS |
| HBM per Chip | 144 GB | 288 GB | 192 GB |
| HBM Bandwidth | 4.9 TB/s | 8 TB/s | 7.4 TB/s |
| Interconnect BW | 2 TB/s (NeuronLink) | 1.8 TB/s (NVLink 5) | ~1.2 TB/s (ICI) |
| Rack FP8 | ~362 PFLOPS (Gen2) | ~360 PFLOPS | 42.5 EXAFLOPS (Pod) |
| Rack HBM | ~20.7 TB (Gen2) | ~20.7 TB | ~1.77 PB (Pod) |
| Software | Neuron SDK | CUDA | XLA + JAX/TF |

Rack-Scale Parity with GB300 NVL72

At the UltraServer level, Trainium3 Gen2 hits rough FP8 parity with the GB300 NVL72 through a different strategy: more chips at lower per-chip performance. The Gen2 UltraServer packs 144 Trainium3 chips (at 2.52 PFLOPS each = ~362 PFLOPS total) versus the GB300's 72 GPUs (at ~5 PFLOPS each = ~360 PFLOPS total). Both configurations deliver about 20.7 TB of HBM. The Trainium3 UltraServer actually leads in aggregate memory bandwidth (705.6 TB/s vs 576 TB/s) and interconnect bandwidth (288 TB/s aggregate NeuronLink vs 130 TB/s NVLink).
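The parity arithmetic is easy to reproduce from the per-chip figures quoted above:

```python
# Rack-scale comparison rebuilt from per-chip numbers (as quoted above).
trn3 = {"chips": 144, "pflops": 2.52, "hbm_gb": 144, "bw_tbs": 4.9}
gb300 = {"chips": 72, "pflops": 5.0, "hbm_gb": 288, "bw_tbs": 8.0}

for name, sys in (("Trn3 Gen2", trn3), ("GB300 NVL72", gb300)):
    n = sys["chips"]
    print(name,
          f"{n * sys['pflops']:.1f} PFLOPS,",
          f"{n * sys['hbm_gb'] / 1000:.1f} TB HBM,",
          f"{n * sys['bw_tbs']:.1f} TB/s aggregate HBM BW")
# Trn3 Gen2   362.9 PFLOPS, 20.7 TB HBM, 705.6 TB/s aggregate HBM BW
# GB300 NVL72 360.0 PFLOPS, 20.7 TB HBM, 576.0 TB/s aggregate HBM BW
```

Note that the HBM totals are not merely close - 144 x 144 GB and 72 x 288 GB are exactly equal at 20,736 GB.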

The trade-off is per-chip performance. Each NVIDIA B300 GPU has roughly 2x the compute and bandwidth of a single Trainium3 chip, which means workloads that benefit from strong single-chip performance (small-batch inference, single-device debugging) run faster on NVIDIA hardware. Trainium3 compensates at scale by having 2x more chips per rack, but this requires good distributed scaling efficiency to fully utilize.

According to SemiAnalysis, the Trainium3 UltraServer delivers around 30% better TCO per marketed FP8 PFLOPS than the GB300 NVL72. This cost advantage is AWS's primary competitive argument.

Key Capabilities

NeuronSwitch-v1 All-to-All Topology. The most architecturally significant improvement over Trainium2 is the interconnect topology. Trainium2 used a 4x4x4 3D torus, where each chip could only communicate directly with its immediate neighbors. This created latency imbalances for communication patterns that do not map neatly to a torus topology.

Trainium3 UltraServers use NeuronSwitch-v1 fabric switches to provide all-to-all connectivity between all chips. In the Gen2 UltraServer (144 chips), any chip can communicate with any other chip at NeuronLink speeds without multi-hop routing. This is critical for Mixture-of-Experts models (like Mixtral and DeepSeek-V3), where the all-to-all communication pattern during expert routing creates communication hot spots on torus topologies. With NeuronSwitch, every expert-to-expert transfer takes the same amount of time regardless of which chips hold the expert shards.

The sub-10-microsecond inter-chip latency enables tight synchronization between chips for tensor-parallel training, where each chip holds a shard of every layer and must synchronize at every layer boundary.
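The torus-vs-switch difference can be made concrete with a hop-count model. This treats each Trainium2 chip as a coordinate in a 4x4x4 torus and counts per-dimension wraparound hops; it is a simplified routing model, not a measurement:

```python
# Hop-count comparison: Trainium2's 4x4x4 3D torus vs a switched
# all-to-all fabric. In a torus, per-dimension distance wraps around;
# through a NeuronSwitch-style fabric, every pair is equidistant.
from itertools import product

def torus_hops(a, b, dims=(4, 4, 4)):
    return sum(min(abs(x - y), d - abs(x - y))
               for x, y, d in zip(a, b, dims))

nodes = list(product(range(4), repeat=3))        # 64 chips
dists = [torus_hops(a, b) for a in nodes for b in nodes if a != b]

print("torus worst-case hops:", max(dists))                # 6
print(f"torus average hops: {sum(dists) / len(dists):.2f}")  # 3.05
# Switched all-to-all: every pair takes one uniform traversal, so there
# are no multi-hop hot spots regardless of expert placement.
```

The 6-hop worst case (versus a ~3-hop average) is exactly the kind of latency imbalance that penalizes all-to-all MoE traffic on a torus and that the switched fabric removes.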

UltraCluster 3.0 - Million-Chip Scale. AWS connects thousands of UltraServers into UltraCluster 3.0, which can contain up to 1 million Trainium3 chips in a single flat network domain via petabit-scale non-blocking Elastic Fabric Adapter (EFA) networking. This is the largest announced AI cluster scale from any provider. For context, Google's TPU v7 Ironwood Pods top out at 9,216 chips per Pod.

The million-chip scale is designed for training frontier models with trillions of parameters, where the data parallelism dimension requires tens of thousands of chips. Anthropic's Project Rainier is the reference deployment, with reportedly close to 1 million Trainium chips (mostly Trainium2 today, transitioning to Trainium3).
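For a sense of what the headline scale implies, some illustrative arithmetic, assuming a fleet built entirely from Gen2 UltraServers (which real deployments, mixing generations and SKUs, are not):

```python
# Illustrative scale-out arithmetic for a 1M-chip UltraCluster 3.0
# built purely from Gen2 UltraServers (an assumption for illustration).
CHIPS = 1_000_000
GEN2 = 144                        # chips per Gen2 UltraServer

print(CHIPS // GEN2, "Gen2 UltraServers (rounded down)")   # 6944
print(f"{CHIPS * 2.52 / 1e6:.2f} ZFLOPS aggregate MXFP8 (nominal peak)")
```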

Neuron SDK and Open-Source NKI Compiler. The Neuron SDK (version 2.27.0+) provides the software stack for Trainium3:

  • PyTorch Integration: Native backend via TorchNeuron with eager mode for debugging and distributed APIs (FSDP, DTensor). Full PyTorch 2.9 support with native integration planned for PyTorch 2.10.
  • JAX Support: Full JAX 0.7 support for JAX-native training workloads.
  • NKI (Neuron Kernel Interface): The NKI compiler is open-sourced under Apache 2.0, built on MLIR. It exposes the GPSIMD engines for custom kernel development with pre-optimized kernels for attention, MLP, and normalization.
  • vLLM Integration: Available through the vLLM-Neuron Plugin for inference serving.

The open-source NKI compiler is a competitive response to NVIDIA's proprietary CUDA ecosystem. By publishing the compiler and kernel interface, AWS enables the community to optimize Trainium performance and reduces the vendor lock-in concern that has historically favored CUDA.

Pricing and Availability

Trainium3 is cloud-only - AWS does not sell chips or servers. The hardware is available as EC2 Trn3 UltraServer instances.

| UltraServer SKU | Chips | HBM | FP8 Compute | Cooling | Status |
| --- | --- | --- | --- | --- | --- |
| Gen1 (NL32x2 Switched) | 64 | ~9.2 TB | ~161 PFLOPS | Air-cooled | GA (Dec 2025) |
| Gen2 (NL72x2 Switched) | 144 | ~20.7 TB | ~362 PFLOPS | Liquid-cooled | GA (Dec 2025) |

Cost Claims

AWS hasn't published standard on-demand hourly pricing for Trn3 instances. The UltraServers are mostly offered via Capacity Blocks and long-term contracts rather than standard per-instance pricing. AWS and its customers have published the following cost comparisons:

| Claim | Source |
| --- | --- |
| ~30% better TCO per FP8 PFLOPS vs GB300 NVL72 | SemiAnalysis |
| Up to 50% cost reduction vs GPU alternatives | AWS customers (Anthropic, Karakuri, others) |
| 30-40% better price-performance vs P5e/P5en instances | AWS |
| 4x faster inference at half the cost | Decart (video generation) |
| Over 50% reduction in LLM training costs | Karakuri (Japanese LLM) |

For reference, Trainium2 long-term contracts reportedly brought effective pricing to approximately $0.50/hour per chip. Trainium3 pricing with capacity commitments is expected to land in a similar per-chip range, in which case the 2x per-chip compute improvement translates to roughly 2x better price-performance.
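Under those figures, the price-performance arithmetic works out as follows. The Trainium2 rate is the reported figure; the Trainium3 per-chip rate is an assumption for illustration, not published pricing:

```python
# Effective $/PFLOPS-hour under reported and assumed committed pricing.
# Trainium2's ~$0.50/hr per chip is reported; the Trainium3 per-chip
# rate below is an illustrative assumption, not a published price.
trn2_price, trn2_pflops = 0.50, 1.26
trn3_price, trn3_pflops = 0.50, 2.52   # assumed similar per-chip rate

trn2_rate = trn2_price / trn2_pflops
trn3_rate = trn3_price / trn3_pflops
print(f"Trainium2: ${trn2_rate:.3f}/PFLOPS-hr")   # $0.397
print(f"Trainium3: ${trn3_rate:.3f}/PFLOPS-hr")   # $0.198
print(f"price-performance gain: {trn2_rate / trn3_rate:.1f}x")  # 2.0x
```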

Customer Adoption

Anthropic leads the Trainium customer base. Through Project Rainier, Anthropic has launched nearly 1 million Trainium chips - the largest single customer deployment of non-NVIDIA AI silicon. Amazon has invested $8 billion in Anthropic, and the Trainium partnership is central to that investment. Project Rainier provides Anthropic about 5x the compute used to train earlier Claude versions.

Other confirmed customers include Karakuri (Japan's most accurate Japanese language model), Decart (real-time video generation), Metagenomi (genomics AI), Poolside (AI applications), and partnerships with Hugging Face (Optimum Neuron integration) and Red Hat.

Strengths

  • 2.52 PFLOPS MXFP8 per chip on TSMC 3nm - the most advanced process node in any shipping AI accelerator
  • Gen2 UltraServer hits rack-scale FP8 parity with GB300 NVL72 at reported 30% better TCO
  • NeuronSwitch-v1 all-to-all topology optimized for MoE communication patterns - major improvement over Trn2's 3D torus
  • UltraCluster 3.0 scales to 1 million chips in a single flat network - the largest announced AI cluster architecture
  • Anthropic's near-million-chip deployment validates the silicon for frontier model training and inference
  • GPSIMD engines with open-source NKI compiler provide GPU-like kernel programmability
  • 4x better performance-per-watt than Trainium2 addresses datacenter power constraints
  • Native PyTorch and JAX support with active framework integration roadmap
  • 144GB HBM3e with 4.9 TB/s bandwidth matches the capacity and bandwidth tier of leading GPUs

Weaknesses

  • Cloud-only availability - cannot purchase hardware for on-premises deployment
  • Per-chip compute (2.52 PFLOPS FP8) is roughly half of NVIDIA B300 (~5 PFLOPS) and Google TPU v7 (4.6 PFLOPS)
  • Neuron SDK ecosystem is notably less mature than CUDA - fewer optimized models and kernels available
  • No FP4 speedup - MXFP4 runs at the same TFLOPS as MXFP8, while NVIDIA holds a roughly 3x advantage at FP4
  • Capacity Blocks and long-term contracts required for most access - limited on-demand availability
  • Heavily dependent on Anthropic as anchor customer - concentrated customer risk
  • AWS lock-in - workloads optimized for Neuron SDK cannot easily migrate to other cloud providers
  • NeuronSwitch-v1 is first-generation - real-world reliability and performance at million-chip scale are unproven

About the author - AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.