Cerebras WSE-3 - The Wafer-Scale AI Engine

The Cerebras Wafer-Scale Engine 3 is the largest chip ever built - an entire TSMC 5nm wafer with 900,000 AI cores, 44GB of on-chip SRAM, and 21 PB/s of memory bandwidth powering the CS-3 AI supercomputer.

TL;DR

  • The largest chip ever manufactured - a single TSMC 5nm wafer with 4 trillion transistors and 900,000 AI-optimized cores
  • 44GB of on-chip SRAM with 21 PB/s aggregate bandwidth - no HBM, no off-chip memory bottleneck for on-wafer data
  • Approximately 125 PFLOPS of FP8 performance per CS-3 system - roughly 2x the WSE-2
  • Powers the CS-3 system at ~23kW, with Condor Galaxy clusters aggregating hundreds of nodes
  • CS-3 systems priced at approximately $2-3M, with cloud inference available through Cerebras Inference

Overview

The Cerebras Wafer-Scale Engine 3 is the most extreme piece of silicon in the AI hardware landscape. While every other accelerator in production - from NVIDIA's H100 to Groq's LPU to Intel's Gaudi 3 - starts with a standard die cut from a wafer, Cerebras skips the cutting step entirely. The WSE-3 is the wafer. A single 300mm TSMC 5nm wafer containing 4 trillion transistors, 900,000 AI-optimized compute cores, and 44GB of on-chip SRAM connected by a fabric running at 21 petabytes per second. There is nothing else like it in the semiconductor industry.

The architectural bet is straightforward: memory bandwidth is the bottleneck for AI workloads, and the fastest memory is the memory that lives on the same die as the compute. By making the die as large as physically possible - the entire wafer - Cerebras can pack enough SRAM next to enough compute cores to keep everything fed without ever touching external memory for on-wafer computations.

The WSE-3 delivers roughly 125 PFLOPS of FP8 performance per CS-3 system, which puts it in the same performance class as large GPU clusters but with a fundamentally different memory hierarchy. Where a GPU cluster moves data across PCIe lanes, NVLink bridges, and InfiniBand networks, the WSE-3 moves data across on-chip wires measured in millimeters.

The third generation WSE improves on the WSE-2 by moving from TSMC 7nm to 5nm, approximately doubling compute density while maintaining the wafer-scale form factor. The CS-3 system that houses the WSE-3 draws roughly 23kW and is designed to be rack-mounted in standard data center environments. Cerebras also offers inference as a service through Cerebras Inference, and has deployed the Condor Galaxy series of AI supercomputers built from clusters of CS-3 systems for partners including G42.

Key Specifications

| Specification | Details |
|---|---|
| Manufacturer | Cerebras |
| Product Family | Wafer-Scale Engine (3rd generation) |
| Chip Type | Wafer-Scale Processor |
| Process Node | TSMC 5nm |
| Die Size | Full 300mm wafer (~46,225 mm²) |
| Transistors | 4 trillion |
| AI Cores | 900,000 |
| On-Chip Memory | 44GB SRAM |
| On-Chip Bandwidth | 21 PB/s |
| External Memory | None on-wafer (MemoryX off-wafer for model weights) |
| FP16 Performance | ~62 PFLOPS (per CS-3 system, estimated) |
| FP8 Performance | ~125 PFLOPS (per CS-3 system) |
| System TDP | ~23kW (CS-3 system) |
| Intra-Wafer Interconnect | On-wafer 2D mesh fabric |
| Inter-System Interconnect | SwarmX |
| External Storage | MemoryX (for model weights exceeding 44GB) |
| Target Workload | Training and Inference |
| System Price | ~$2-3M (CS-3) |
| Release | Q1 2024 |

Performance Benchmarks

| Metric | Cerebras CS-3 (WSE-3) | NVIDIA DGX H100 (8x H100) | NVIDIA DGX B200 (8x B200) | AMD MI300X (8-GPU) |
|---|---|---|---|---|
| FP8 PFLOPS (system) | ~125 | 31.6 | 72 | 20.9 |
| On-chip/HBM Memory | 44GB SRAM | 640GB HBM3 | 1,440GB HBM3e | 1,536GB HBM3 |
| Memory Bandwidth | 21 PB/s | 26.4 TB/s | 64 TB/s | 41.6 TB/s |
| System Power | ~23kW | ~10.2kW | ~14.3kW | ~10.4kW |
| Approx. System Price | ~$2-3M | ~$300-400K | ~$400-500K (est.) | ~$250-350K |
| LLM Training (relative) | High (sparse/dense) | Baseline | ~2x baseline | ~0.8x baseline |
| Inference Latency (single stream) | Very low | Low | Low | Low |
| PFLOPS per kW | ~5.4 | ~3.1 | ~5.0 | ~2.0 |

The comparison requires nuance. On raw FP8 FLOPS, the CS-3 has a 4x advantage over a DGX H100 and roughly 1.7x over a DGX B200. But the CS-3 costs 5-8x more than a DGX H100, and the 44GB of SRAM is a fraction of the 640GB-1,536GB HBM available on GPU systems. This means the WSE-3 excels at compute-bound workloads where data can be efficiently streamed from MemoryX, but the memory capacity limitation requires careful system design for very large models.

Bandwidth - The Real Story

The bandwidth numbers are where the WSE-3 is truly in a different league. 21 PB/s of on-chip bandwidth versus 26-64 TB/s of HBM bandwidth on GPU systems represents a 300-800x advantage.

This is not a perfectly apples-to-apples comparison - SRAM bandwidth and HBM bandwidth serve different roles in the memory hierarchy - but the practical impact is significant.

In LLM inference, the dominant cost is reading model weights from memory once per token generated. This is the "memory bandwidth bound" regime that limits GPU inference speed. On an H100, reading 70GB of model weights at 3,350 GB/s takes approximately 21ms per token - setting a hard floor on generation speed regardless of how many FLOPS the GPU has.
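That floor is simple arithmetic. A minimal sketch using the figures from this section (the model size and bandwidth numbers below are the ones quoted above):

```python
def min_ms_per_token(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Lower bound on per-token latency when decoding is memory-bandwidth bound:
    every generated token requires one full read of the model weights."""
    return weight_bytes / bandwidth_bytes_per_s * 1e3

weights = 70e9          # ~70 GB of model weights
h100_hbm = 3.35e12      # H100 HBM3 bandwidth, ~3,350 GB/s
wse3_sram = 21e15       # WSE-3 aggregate on-chip bandwidth, 21 PB/s

print(f"H100 floor:  {min_ms_per_token(weights, h100_hbm):.1f} ms/token")   # ~20.9 ms
print(f"WSE-3 floor: {min_ms_per_token(weights, wse3_sram):.4f} ms/token")  # ~0.0033 ms
```

The same function shows why faster memory, not more FLOPS, raises the token-rate ceiling in this regime.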

The WSE-3's 21 PB/s on-chip bandwidth reduces this floor dramatically for weights that reside in SRAM. For a model that fits entirely in the 44GB on-chip memory (roughly a 20B parameter model at FP16, or a 40B model at FP8), the bandwidth advantage translates directly into faster token generation.
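The capacity arithmetic behind those model-size figures can be checked directly (weights only; real deployments also need SRAM headroom for activations and KV cache):

```python
SRAM_BYTES = 44e9  # WSE-3 on-chip SRAM capacity

def weights_fit_on_chip(params: float, bytes_per_param: float) -> bool:
    """True if the model's weights alone fit in on-chip SRAM."""
    return params * bytes_per_param <= SRAM_BYTES

print(weights_fit_on_chip(20e9, 2))  # 20B @ FP16 (40 GB)  -> True
print(weights_fit_on_chip(40e9, 1))  # 40B @ FP8  (40 GB)  -> True
print(weights_fit_on_chip(70e9, 2))  # 70B @ FP16 (140 GB) -> False
```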

For models that exceed 44GB - which includes essentially all production LLMs at FP16 precision - the MemoryX external storage system becomes the bandwidth bottleneck. Cerebras has not published MemoryX bandwidth specifications, which means the actual inference speed for large models depends on a bandwidth figure the company has not disclosed. This is a meaningful gap in the publicly available data.

Training Performance

For training workloads, the WSE-3's advantage is most pronounced on models that exhibit high compute-to-memory ratios. Dense matrix multiplications, convolutions, and attention computations that can be decomposed across the 900,000 cores benefit from the massive parallelism and on-chip bandwidth.

Cerebras has published training benchmarks showing competitive or superior performance to GPU clusters for models including GPT-3 (175B parameters), Llama 2 (70B), and various scientific computing workloads. The company reports near-linear scaling efficiency for data-parallel training, which suggests that the on-chip interconnect fabric provides sufficient bandwidth for gradient synchronization across the 900,000 cores.

The sparsity advantage is also significant. The WSE-3's fine-grained parallel architecture can exploit structured and unstructured sparsity more efficiently than GPUs, because each of the 900,000 cores can independently skip zero-valued computations without the warp-synchronization overhead that limits GPU sparsity performance. Cerebras claims 2-4x speedups on sparse workloads compared to dense equivalents.

Key Capabilities

Wafer-Scale Integration. The defining feature of the WSE-3 is that it is manufactured as a single piece of silicon. Standard chips are cut from a wafer, packaged individually, and then connected on circuit boards through PCIe slots, socket mounts, or interposers. The WSE-3 skips all of that - the entire wafer is the chip.

This eliminates the packaging overhead, inter-chip communication latency, and bandwidth constraints that limit traditional multi-chip systems. Data moving between two cores on the WSE-3 travels across on-wafer wires, not through PCIe traces, bridge chips, or network cables. The latency and bandwidth of on-wafer communication is orders of magnitude better than any multi-chip interconnect.

The engineering challenge is immense. Even at a 99.99% per-transistor yield rate, a chip with 4 trillion transistors would contain roughly 400 million defective transistors. Cerebras builds redundancy into the design with spare cores that can be activated to replace any defective units - similar in concept to how SSDs over-provision flash cells to replace worn-out blocks.

The manufacturing process also requires custom wafer-scale packaging, power delivery, and cooling solutions that do not exist in conventional semiconductor production. A single wafer-scale chip needs uniform power delivery across a 300mm diameter surface and uniform cooling to prevent thermal hotspots from throttling performance.

The fact that Cerebras ships production CS-3 systems proves that wafer-scale integration is viable as a commercial technology. No other company has attempted this approach at production scale, which means Cerebras has a multi-year head start on the manufacturing learning curve.

MemoryX and SwarmX. The 44GB of on-chip SRAM is not enough to hold the parameters of large language models at full precision. A 70B parameter model in FP16 requires 140GB - more than 3x the on-chip capacity. A 405B model requires 810GB. Cerebras addresses this with MemoryX, an external memory subsystem that stores model weights and streams them to the WSE-3 during computation.

MemoryX uses high-bandwidth memory modules located outside the WSE-3 wafer but connected via a dedicated high-bandwidth link. The weights are streamed to the WSE-3 on demand, layer by layer, as the computation progresses through the model. The 44GB of on-chip SRAM serves as a working buffer for the current layer's activations and intermediate results, while MemoryX provides the backing store for the full model.
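A schematic of this streaming execution model - the class and function names here are illustrative, not the Cerebras SDK - is a prefetch loop that overlaps the next layer's weight fetch with the current layer's compute:

```python
class MemoryX:
    """Stand-in for the external weight store (illustrative only)."""
    def __init__(self, weights_by_layer):
        self.weights_by_layer = weights_by_layer

    def fetch(self, layer_id):
        # In the real system this is a streaming transfer over a dedicated link.
        return self.weights_by_layer[layer_id]

def run_streaming(num_layers, activations, memoryx):
    next_weights = memoryx.fetch(0)              # prefetch the first layer
    for i in range(num_layers):
        weights = next_weights
        if i + 1 < num_layers:
            next_weights = memoryx.fetch(i + 1)  # overlap fetch with compute
        activations = weights * activations      # placeholder for on-chip compute
    return activations

mx = MemoryX({0: 2, 1: 3, 2: 5})
print(run_streaming(3, 1, mx))  # 2 * 3 * 5 = 30
```

The point of the overlap is that the on-chip SRAM only ever holds the active layer's weights, while the full model lives off-wafer.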

SwarmX is the interconnect fabric that enables multiple CS-3 systems to work together on a single training job. It handles gradient synchronization for data-parallel training and activation passing for pipeline-parallel training across CS-3 nodes.

Together, MemoryX and SwarmX allow the WSE-3 to train models with trillions of parameters despite the limited on-chip memory. The weights live in MemoryX, the activations and intermediate computations happen on the 44GB SRAM, and the 21 PB/s on-chip bandwidth ensures compute cores are never starved for data during the computation phase.

The MemoryX architecture makes the WSE-3 function as a massive compute engine with a streaming memory interface, rather than a self-contained processor. Model weights flow into the chip, computations happen entirely on-chip at 21 PB/s, and results flow back out. This is a fundamentally different execution model than GPUs, where weights, activations, and intermediate results all compete for the same HBM bandwidth.

Cerebras Inference. Cerebras offers inference as a service through their cloud platform, where the WSE-3's massive on-chip bandwidth translates to extremely fast token generation for LLM serving.

For models that fit in the 44GB SRAM (approximately a 20B parameter model at FP16, or a 40B model at FP8), inference runs entirely on-chip with no external memory bottleneck. Cerebras has demonstrated inference speeds exceeding 1,000 tokens per second on Llama-class models, competitive with Groq's LPU on single-stream latency.

For larger models that require MemoryX streaming, inference performance depends on the MemoryX-to-WSE bandwidth, which Cerebras has not published. This is the key unknown for evaluating Cerebras Inference on production-scale models.

The inference service is available through the Cerebras API with per-token pricing, providing cloud access to WSE-3 performance without requiring a $2-3M hardware purchase.

Condor Galaxy Supercomputers. Cerebras has deployed large-scale supercomputer installations built from clusters of CS-3 systems. The Condor Galaxy series, built in partnership with G42 (an Abu Dhabi-based technology company), combines hundreds of CS-3 nodes into multi-exaflop AI supercomputers.

Condor Galaxy 1 and 2 - built on the previous-generation CS-2 - together provide 8 exaFLOPS of combined AI compute, with Condor Galaxy 3 bringing CS-3 nodes to the series. These installations demonstrate that Cerebras's architecture scales beyond individual systems into cluster-scale infrastructure.

For organizations that need supercomputer-class training capability but want an alternative to NVIDIA-dominated GPU clusters, the Condor Galaxy model provides a turnkey solution. G42, Mayo Clinic, and other Cerebras partners have used these clusters for large-scale model training and scientific computing workloads.

Architecture Deep Dive

The Scale of the WSE-3

To appreciate the WSE-3, consider the numbers in context:

| Metric | WSE-3 | NVIDIA H100 | Ratio |
|---|---|---|---|
| Die area | ~46,225 mm² | 814 mm² | 57x |
| Transistors | 4 trillion | 80 billion | 50x |
| Compute cores | 900,000 | 132 SMs | ~6,800x |
| On-chip SRAM | 44GB | ~50MB (L2) | ~880x |
| On-chip bandwidth | 21 PB/s | ~12 TB/s (L2) | ~1,750x |

The H100 is considered a large chip by industry standards. The WSE-3 is 57x larger. The number of independent compute cores - 900,000 - is more than the total number of CUDA cores across an entire 8-GPU DGX H100 node (8 x 16,896 = 135,168 CUDA cores).

The 2D Mesh Fabric

The 900,000 cores on the WSE-3 are connected by a 2D mesh interconnect fabric that spans the entire wafer. Each core can communicate directly with its neighbors in all four cardinal directions (north, south, east, west), and the mesh routing protocol handles multi-hop communication for cores that are not directly adjacent.

The mesh topology is well-suited for the communication patterns of neural network training:

  • Data-parallel gradient reduction: Reduce operations can be mapped to efficient tree-based communication patterns on the 2D mesh
  • Tensor parallelism: Adjacent cores can share activation data with minimal latency
  • Pipeline parallelism: Layers assigned to contiguous regions of the mesh can stream activations through the pipeline with near-zero communication overhead

The 21 PB/s aggregate bandwidth represents the total capacity of all mesh links on the wafer operating simultaneously. Individual core-to-core bandwidth is lower, but the aggregate throughput matters because neural network training involves distributed computation across all 900,000 cores simultaneously.
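As a toy model of multi-hop communication on such a fabric, dimension-ordered (XY) routing - a common mesh routing scheme, assumed here for illustration since Cerebras does not publicly document its routing protocol at this level - moves a message along the X axis first, then Y:

```python
def xy_route(src, dst):
    """Hops from src to dst on a 2D mesh using X-then-Y routing."""
    (x, y), (dx, dy) = src, dst
    path = []
    while x != dx:                    # traverse the X dimension first
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:                    # then the Y dimension
        y += 1 if dy > y else -1
        path.append((x, y))
    return path

hops = xy_route((0, 0), (3, 2))
print(len(hops))  # 5 -> equals the Manhattan distance |3-0| + |2-0|
```

Hop count grows with Manhattan distance, which is why mapping communicating layers to adjacent regions of the wafer keeps latency low.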

Yield and Redundancy

Semiconductor manufacturing is inherently imperfect. At TSMC's 5nm node, even with excellent process control, a wafer will have defective transistors. For a standard chip with a few hundred mm² of die area, the probability of a defect is low enough that the majority of dies on a wafer are functional.

For the WSE-3, which uses the entire wafer, every defect that would have killed a standard die is present on the chip. Cerebras handles this through redundancy - the wafer includes more cores and more SRAM than the nominal specifications. Defective cores are identified during testing and replaced with spare cores from the redundancy pool. The mesh interconnect is also designed to route around defective links.
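The standard Poisson yield model makes clear why redundancy is unavoidable at wafer scale: the probability of a defect-free die falls exponentially with area (the defect density below is an assumed, illustrative figure, not a TSMC number):

```python
import math

D0 = 0.001  # assumed defect density, defects per mm^2 (illustrative)

def defect_free_prob(area_mm2: float) -> float:
    """Poisson yield model: P(zero defects) = exp(-D0 * A)."""
    return math.exp(-D0 * area_mm2)

print(f"{defect_free_prob(814):.2f}")    # H100-sized die (~814 mm^2): ~0.44
print(f"{defect_free_prob(46225):.2e}")  # full wafer (~46,225 mm^2): ~0, spares required
```

At any realistic defect density, the chance of a flawless full wafer is effectively zero, so the design must tolerate defects rather than avoid them.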

This redundancy approach has a cost: some fraction of the wafer's area is "wasted" on spare cores that are not used on defect-free wafers. But it enables the fundamental breakthrough of wafer-scale computing by making yield management tractable.

Pricing and Availability

The CS-3 system is priced at approximately $2-3 million per unit, making it a capital expenditure that only large enterprises, research institutions, and well-funded AI labs can justify.

| System | Approx. Price | FP8 Performance | Price per PFLOPS |
|---|---|---|---|
| Cerebras CS-3 (WSE-3) | ~$2.5M | ~125 PFLOPS | ~$20K/PFLOPS |
| NVIDIA DGX H100 (8x H100) | ~$350K | 31.6 PFLOPS | ~$11K/PFLOPS |
| NVIDIA DGX B200 (8x B200) | ~$450K | 72 PFLOPS | ~$6.3K/PFLOPS |

On a pure price-per-PFLOPS basis, the CS-3 is more expensive than GPU alternatives. The NVIDIA DGX B200 is roughly 3x more cost-efficient per FLOP. The value proposition for the CS-3 lies elsewhere.
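The price-per-PFLOPS figures are straightforward to recompute from the approximate prices quoted in this section:

```python
systems = {
    "Cerebras CS-3":   (2.5e6, 125),   # (approx. price in USD, FP8 PFLOPS)
    "NVIDIA DGX H100": (350e3, 31.6),
    "NVIDIA DGX B200": (450e3, 72),
}

for name, (price, pflops) in systems.items():
    # Dollars per PFLOPS, printed in thousands
    print(f"{name}: ${price / pflops / 1e3:.1f}K per PFLOPS")
```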

The 21 PB/s on-chip bandwidth advantage means that for bandwidth-bound workloads - which include most LLM inference and many training scenarios - the effective performance per dollar can be significantly better than the raw FLOPS comparison suggests. A DGX H100 has roughly 4x fewer FLOPS and nearly 800x less bandwidth than the CS-3, so on bandwidth-bound workloads it falls behind by a far wider margin than the FLOPS gap alone implies.

For organizations evaluating the CS-3, the decision framework is:

  • If your workloads are compute-bound (large-batch training with high arithmetic intensity), GPUs offer better price-performance
  • If your workloads are bandwidth-bound (inference, small-batch training, sparse models), the WSE-3's architectural advantage closes the price-performance gap or tips it in Cerebras's favor
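This split is the classic roofline test: a workload is compute-bound on a given system when its arithmetic intensity (FLOPs per byte moved) exceeds the system's ratio of peak FLOPS to peak bandwidth. A sketch using the system figures from this page (the arithmetic intensity value is an assumed example):

```python
def bound(ai_flops_per_byte: float, peak_flops: float, peak_bw: float) -> str:
    """Roofline classification: compare arithmetic intensity to the ridge point."""
    ridge = peak_flops / peak_bw
    return "compute-bound" if ai_flops_per_byte > ridge else "bandwidth-bound"

ai = 300  # assumed arithmetic intensity of an example workload, FLOPs/byte

# DGX H100: 31.6 PFLOPS FP8 vs 26.4 TB/s HBM -> ridge ~1,200 FLOPs/byte
print(bound(ai, 31.6e15, 26.4e12))  # bandwidth-bound
# CS-3: 125 PFLOPS FP8 vs 21 PB/s SRAM -> ridge ~6 FLOPs/byte
print(bound(ai, 125e15, 21e15))     # compute-bound
```

The same workload can sit on opposite sides of the ridge on the two systems, which is exactly the asymmetry the decision framework above captures.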

Cloud Access

Cerebras offers cloud access through multiple channels:

  • Cerebras Inference: Per-token API access for inference workloads
  • Condor Galaxy: Managed supercomputer access through partnerships (G42, others)
  • On-premises CS-3: Direct hardware purchase for organizations with the budget and data center infrastructure

For organizations that cannot justify a $2-3M hardware purchase, cloud access provides an alternative path to WSE-3 performance without the capital expenditure.

Who Should Consider the WSE-3

Strong fit:

  • Organizations training models at 100B+ parameter scale where bandwidth-bound performance matters
  • Research institutions working on sparse model architectures that benefit from fine-grained parallelism
  • AI labs that need an alternative to NVIDIA for supply chain diversification
  • Scientific computing workloads (drug discovery, climate modeling, molecular dynamics) that map well to the 900,000-core architecture
  • Organizations with the budget and data center infrastructure for $2-3M systems at 23kW power draw

Weak fit:

  • Startups or small teams that cannot justify multi-million dollar hardware purchases
  • Inference-only deployments where Groq's LPU offers better latency at lower cost
  • Teams that depend on the CUDA ecosystem and cannot invest in porting to Cerebras's SDK
  • Organizations that need broad cloud availability or multi-provider deployment flexibility
  • Workloads where the model fits comfortably in GPU HBM and bandwidth is not the bottleneck

Strengths

  • 21 PB/s on-chip bandwidth is 300-800x higher than any HBM-based solution
  • 900,000 cores on a single wafer eliminates inter-chip communication overhead
  • ~125 PFLOPS FP8 per system is 4x a DGX H100 in a single unit
  • Wafer-scale redundancy engineering proves full-wafer integration is a viable production technology
  • MemoryX and SwarmX allow scaling to trillion-parameter models despite limited on-chip SRAM
  • Supports both training and inference - not limited to one or the other
  • Cerebras Inference service provides cloud access without the $2-3M hardware purchase
  • ~5.4 PFLOPS per kW is among the best energy-efficiency ratios in the market

Weaknesses

  • $2-3M per CS-3 system limits the buyer pool to well-funded organizations
  • 44GB on-chip SRAM is a fraction of the 640GB-1,536GB HBM available on GPU systems
  • Price-per-PFLOPS is roughly 2-3x worse than current-generation NVIDIA GPU systems
  • MemoryX external storage adds latency and complexity for models exceeding 44GB
  • Software ecosystem is far less mature than CUDA - smaller developer community and fewer examples
  • Single-vendor dependency with no second-source option for the wafer-scale form factor
  • 23kW system power requires significant data center cooling and power infrastructure

Alternatives

  • Groq LPU - Another SRAM-based architecture, optimized purely for inference with deterministic latency
  • Intel Gaudi 3 - A more conventional ASIC challenger to NVIDIA's GPU dominance
  • AWS Trainium2 - Amazon's cloud-native alternative for large-scale training workloads

About the author

James is a software engineer turned tech writer, now working as an AI benchmarks and tools analyst. He spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.