Cerebras WSE-3 - The Wafer-Scale AI Engine

The Cerebras Wafer-Scale Engine 3 is the largest chip ever built - an entire TSMC 5nm wafer with 900,000 AI cores, 44GB of on-chip SRAM, and 21 PB/s of memory bandwidth powering the CS-3 AI supercomputer.

TL;DR

  • The largest chip ever manufactured - a single TSMC 5nm wafer with 4 trillion transistors and 900,000 AI-optimized cores
  • 44GB of on-chip SRAM with 21 PB/s aggregate bandwidth - no HBM, no off-chip memory bottleneck for on-wafer data
  • Approximately 125 PFLOPS of FP8 performance per CS-3 system - roughly 2x the WSE-2
  • Powers the CS-3 system at ~23kW, with Condor Galaxy clusters aggregating hundreds of nodes
  • CS-3 systems priced at approximately $2-3M, with cloud inference available through Cerebras Inference

Overview

The Cerebras Wafer-Scale Engine 3 is the most extreme piece of silicon in the AI hardware landscape. While every other accelerator in production - from NVIDIA's H100 to Groq's LPU to Intel's Gaudi 3 - starts with a standard die cut from a wafer, Cerebras skips the cutting step entirely. The WSE-3 is the wafer. A single 300mm TSMC 5nm wafer containing 4 trillion transistors, 900,000 AI-optimized compute cores, and 44GB of on-chip SRAM connected by a fabric running at 21 petabytes per second. There is nothing else like it in the semiconductor industry.

The architectural bet is straightforward: memory bandwidth is the bottleneck for AI workloads, and the fastest memory is the memory that lives on the same die as the compute. By making the die as large as physically possible - the entire wafer - Cerebras can pack enough SRAM next to enough compute cores to keep everything fed without ever touching external memory for on-wafer computations.

The WSE-3 delivers roughly 125 PFLOPS of FP8 performance per CS-3 system, which puts it in the same performance class as large GPU clusters but with a fundamentally different memory hierarchy. Where a GPU cluster moves data across PCIe lanes, NVLink bridges, and InfiniBand networks, the WSE-3 moves data across on-chip wires measured in millimeters.

The third generation WSE improves on the WSE-2 by moving from TSMC 7nm to 5nm, approximately doubling compute density while maintaining the wafer-scale form factor. The CS-3 system that houses the WSE-3 draws roughly 23kW and is designed to be rack-mounted in standard data center environments. Cerebras also offers inference as a service through Cerebras Inference, and has deployed the Condor Galaxy series of AI supercomputers built from clusters of CS-3 systems for partners including G42.

Key Specifications

| Specification | Details |
|---|---|
| Manufacturer | Cerebras |
| Product Family | Wafer-Scale Engine (3rd generation) |
| Chip Type | Wafer-Scale Processor |
| Process Node | TSMC 5nm |
| Die Size | Full 300mm wafer (~46,225 mm²) |
| Transistors | 4 trillion |
| AI Cores | 900,000 |
| On-Chip Memory | 44GB SRAM |
| On-Chip Bandwidth | 21 PB/s |
| External Memory | None on-wafer (MemoryX off-wafer for model weights) |
| FP16 Performance | ~62 PFLOPS (per CS-3 system, estimated) |
| FP8 Performance | ~125 PFLOPS (per CS-3 system) |
| System TDP | ~23kW (CS-3 system) |
| Intra-Wafer Interconnect | On-wafer 2D mesh fabric |
| Inter-System Interconnect | SwarmX |
| External Storage | MemoryX (for model weights exceeding 44GB) |
| Target Workload | Training and Inference |
| System Price | ~$2-3M (CS-3) |
| Release | Q1 2024 |

Performance Benchmarks

| Metric | Cerebras CS-3 (WSE-3) | NVIDIA DGX H100 (8x H100) | NVIDIA DGX B200 (8x B200) | AMD MI300X (8-GPU) |
|---|---|---|---|---|
| FP8 PFLOPS (system) | ~125 | 31.6 | 72 | 20.9 |
| On-chip/HBM Memory | 44GB SRAM | 640GB HBM3 | 1,440GB HBM3e | 1,536GB HBM3 |
| Memory Bandwidth | 21 PB/s | 26.4 TB/s | 64 TB/s | 41.6 TB/s |
| System Power | ~23kW | ~10.2kW | ~14.3kW | ~10.4kW |
| Approx. System Price | ~$2-3M | ~$300-400K | ~$400-500K (est.) | ~$250-350K |
| LLM Training (relative) | High (sparse/dense) | Baseline | ~2x baseline | ~0.8x baseline |
| Inference Latency (single stream) | Very low | Low | Low | Low |
| PFLOPS per kW | ~5.4 | ~3.1 | ~5.0 | ~2.0 |

The comparison requires nuance. On raw FP8 FLOPS, the CS-3 has a 4x advantage over a DGX H100 and roughly 1.7x over a DGX B200. But the CS-3 costs 5-8x more than a DGX H100, and the 44GB of SRAM is a fraction of the 640GB-1,536GB HBM available on GPU systems. This means the WSE-3 excels at compute-bound workloads where data can be efficiently streamed from MemoryX, but the memory capacity limitation requires careful system design for very large models.

Bandwidth - The Real Story

The bandwidth numbers are where the WSE-3 is truly in a different league. 21 PB/s of on-chip bandwidth versus 26-64 TB/s of HBM bandwidth on GPU systems represents a 300-800x advantage.

This is not a perfectly apples-to-apples comparison - SRAM bandwidth and HBM bandwidth serve different roles in the memory hierarchy - but the practical impact is significant.

In LLM inference, the dominant cost is reading model weights from memory once per token generated. This is the "memory bandwidth bound" regime that limits GPU inference speed. On an H100, reading 70GB of model weights at 3,350 GB/s takes approximately 21ms per token - setting a hard floor on generation speed regardless of how many FLOPS the GPU has.
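That floor is simple arithmetic. A minimal sketch using the figures from this section (the model size and bandwidth numbers below are the ones quoted above):

```python
def min_ms_per_token(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Lower bound on per-token latency when decoding is memory-bandwidth bound:
    every generated token requires one full read of the model weights."""
    return weight_bytes / bandwidth_bytes_per_s * 1e3

weights = 70e9          # ~70 GB of model weights
h100_hbm = 3.35e12      # H100 HBM3 bandwidth, ~3,350 GB/s
wse3_sram = 21e15       # WSE-3 aggregate on-chip bandwidth, 21 PB/s

print(f"H100 floor:  {min_ms_per_token(weights, h100_hbm):.1f} ms/token")   # ~20.9 ms
print(f"WSE-3 floor: {min_ms_per_token(weights, wse3_sram):.4f} ms/token")  # ~0.0033 ms
```

The same function shows why faster memory, not more FLOPS, raises the token-rate ceiling in this regime.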

The WSE-3's 21 PB/s on-chip bandwidth reduces this floor dramatically for weights that reside in SRAM. For a model that fits entirely in the 44GB on-chip memory (roughly a 20B parameter model at FP16, or a 40B model at FP8), the bandwidth advantage translates directly into faster token generation.
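The capacity arithmetic behind those model-size figures can be checked directly (weights only; real deployments also need SRAM headroom for activations and KV cache):

```python
SRAM_BYTES = 44e9  # WSE-3 on-chip SRAM capacity

def weights_fit_on_chip(params: float, bytes_per_param: float) -> bool:
    """True if the model's weights alone fit in on-chip SRAM."""
    return params * bytes_per_param <= SRAM_BYTES

print(weights_fit_on_chip(20e9, 2))  # 20B @ FP16 (40 GB)  -> True
print(weights_fit_on_chip(40e9, 1))  # 40B @ FP8  (40 GB)  -> True
print(weights_fit_on_chip(70e9, 2))  # 70B @ FP16 (140 GB) -> False
```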

For models that exceed 44GB - which includes essentially all production LLMs at FP16 precision - the MemoryX external storage system becomes the bandwidth bottleneck. Cerebras has not published MemoryX bandwidth specifications, which means the actual inference speed for large models depends on a bandwidth figure the company has not disclosed. This is a meaningful gap in the publicly available data.

Training Performance

For training workloads, the WSE-3's advantage is most pronounced on models that exhibit high compute-to-memory ratios. Dense matrix multiplications, convolutions, and attention computations that can be decomposed across the 900,000 cores benefit from the massive parallelism and on-chip bandwidth.

Cerebras has published training benchmarks showing competitive or superior performance to GPU clusters for models including GPT-3 (175B parameters), Llama 2 (70B), and various scientific computing workloads. The company reports near-linear scaling efficiency for data-parallel training, which suggests that the on-chip interconnect fabric provides sufficient bandwidth for gradient synchronization across the 900,000 cores.

The sparsity advantage is also significant. The WSE-3's fine-grained parallel architecture can exploit structured and unstructured sparsity more efficiently than GPUs, because each of the 900,000 cores can independently skip zero-valued computations without the warp-synchronization overhead that limits GPU sparsity performance. Cerebras claims 2-4x speedups on sparse workloads compared to dense equivalents.

Key Capabilities

Wafer-Scale Integration. The defining feature of the WSE-3 is that it is manufactured as a single piece of silicon. Standard chips are cut from a wafer, packaged individually, and then connected on circuit boards through PCIe slots, socket mounts, or interposers. The WSE-3 skips all of that - the entire wafer is the chip.

This eliminates the packaging overhead, inter-chip communication latency, and bandwidth constraints that limit traditional multi-chip systems. Data moving between two cores on the WSE-3 travels across on-wafer wires, not through PCIe traces, bridge chips, or network cables. The latency and bandwidth of on-wafer communication is orders of magnitude better than any multi-chip interconnect.

The engineering challenge is immense. Even at a 99.99% per-transistor yield rate, a chip with 4 trillion transistors would contain roughly 400 million defective transistors. Cerebras builds redundancy into the design with spare cores that can be activated to replace any defective units - similar in concept to how SSDs over-provision flash cells to replace worn-out blocks.

The manufacturing process also requires custom wafer-scale packaging, power delivery, and cooling solutions that do not exist in conventional semiconductor production. A single wafer-scale chip needs uniform power delivery across a 300mm diameter surface and uniform cooling to prevent thermal hotspots from throttling performance.

The fact that Cerebras ships production CS-3 systems proves that wafer-scale integration is viable as a commercial technology. No other company has attempted this approach at production scale, which means Cerebras has a multi-year head start on the manufacturing learning curve.

MemoryX and SwarmX. The 44GB of on-chip SRAM is not enough to hold the parameters of large language models at full precision. A 70B parameter model in FP16 requires 140GB - more than 3x the on-chip capacity. A 405B model requires 810GB. Cerebras addresses this with MemoryX, an external memory subsystem that stores model weights and streams them to the WSE-3 during computation.

MemoryX uses high-bandwidth memory modules located outside the WSE-3 wafer but connected via a dedicated high-bandwidth link. The weights are streamed to the WSE-3 on demand, layer by layer, as the computation progresses through the model. The 44GB of on-chip SRAM serves as a working buffer for the current layer's activations and intermediate results, while MemoryX provides the backing store for the full model.
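A schematic of this streaming execution model - the class and function names here are illustrative, not the Cerebras SDK - is a prefetch loop that overlaps the next layer's weight fetch with the current layer's compute:

```python
class MemoryX:
    """Stand-in for the external weight store (illustrative only)."""
    def __init__(self, weights_by_layer):
        self.weights_by_layer = weights_by_layer

    def fetch(self, layer_id):
        # In the real system this is a streaming transfer over a dedicated link.
        return self.weights_by_layer[layer_id]

def run_streaming(num_layers, activations, memoryx):
    next_weights = memoryx.fetch(0)              # prefetch the first layer
    for i in range(num_layers):
        weights = next_weights
        if i + 1 < num_layers:
            next_weights = memoryx.fetch(i + 1)  # overlap fetch with compute
        activations = weights * activations      # placeholder for on-chip compute
    return activations

mx = MemoryX({0: 2, 1: 3, 2: 5})
print(run_streaming(3, 1, mx))  # 2 * 3 * 5 = 30
```

The point of the overlap is that the on-chip SRAM only ever holds the active layer's weights, while the full model lives off-wafer.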

SwarmX is the interconnect fabric that enables multiple CS-3 systems to work together on a single training job. It handles gradient synchronization for data-parallel training and activation passing for pipeline-parallel training across CS-3 nodes.

Together, MemoryX and SwarmX allow the WSE-3 to train models with trillions of parameters despite the limited on-chip memory. The weights live in MemoryX, the activations and intermediate computations happen on the 44GB SRAM, and the 21 PB/s on-chip bandwidth ensures compute cores are never starved for data during the computation phase.

The MemoryX architecture makes the WSE-3 function as a massive compute engine with a streaming memory interface, rather than a self-contained processor. Model weights flow into the chip, computations happen entirely on-chip at 21 PB/s, and results flow back out. This is a fundamentally different execution model than GPUs, where weights, activations, and intermediate results all compete for the same HBM bandwidth.

Cerebras Inference. Cerebras offers inference as a service through their cloud platform, where the WSE-3's massive on-chip bandwidth translates to extremely fast token generation for LLM serving.

For models that fit in the 44GB SRAM (approximately a 20B parameter model at FP16, or a 40B model at FP8), inference runs entirely on-chip with no external memory bottleneck. Cerebras has demonstrated inference speeds exceeding 1,000 tokens per second on Llama-class models, competitive with Groq's LPU on single-stream latency.

For larger models that require MemoryX streaming, inference performance depends on the MemoryX-to-WSE bandwidth, which Cerebras has not published. This is the key unknown for evaluating Cerebras Inference on production-scale models.

The inference service is available through the Cerebras API with per-token pricing, providing cloud access to WSE-3 performance without requiring a $2-3M hardware purchase.

Condor Galaxy Supercomputers. Cerebras has deployed large-scale supercomputer installations built from clusters of CS-3 systems. The Condor Galaxy series, built in partnership with G42 (an Abu Dhabi-based technology company), combines hundreds of CS-3 nodes into multi-exaflop AI supercomputers.

Condor Galaxy 1 and 2 - built on the previous-generation CS-2 - together provide 8 exaFLOPS of combined AI compute, with Condor Galaxy 3 bringing CS-3 nodes to the series. These installations demonstrate that Cerebras's architecture scales beyond individual systems into cluster-scale infrastructure.

For organizations that need supercomputer-class training capability but want an alternative to NVIDIA-dominated GPU clusters, the Condor Galaxy model provides a turnkey solution. G42, Mayo Clinic, and other Cerebras partners have used these clusters for large-scale model training and scientific computing workloads.

Architecture Deep Dive

The Scale of the WSE-3

To appreciate the WSE-3, consider the numbers in context:

| Metric | WSE-3 | NVIDIA H100 | Ratio |
|---|---|---|---|
| Die area | ~46,225 mm² | 814 mm² | 57x |
| Transistors | 4 trillion | 80 billion | 50x |
| Compute cores | 900,000 | 132 SMs | ~6,800x |
| On-chip SRAM | 44GB | ~50MB (L2) | ~880x |
| On-chip bandwidth | 21 PB/s | ~12 TB/s (L2) | ~1,750x |

The H100 is considered a large chip by industry standards. The WSE-3 is 57x larger. The number of independent compute cores - 900,000 - is more than the total number of CUDA cores across an entire 8-GPU DGX H100 node (8 x 16,896 = 135,168 CUDA cores).

The 2D Mesh Fabric

The 900,000 cores on the WSE-3 are connected by a 2D mesh interconnect fabric that spans the entire wafer. Each core can communicate directly with its neighbors in all four cardinal directions (north, south, east, west), and the mesh routing protocol handles multi-hop communication for cores that are not directly adjacent.

The mesh topology is well-suited for the communication patterns of neural network training:

  • Data-parallel gradient reduction: Reduce operations can be mapped to efficient tree-based communication patterns on the 2D mesh
  • Tensor parallelism: Adjacent cores can share activation data with minimal latency
  • Pipeline parallelism: Layers assigned to contiguous regions of the mesh can stream activations through the pipeline with near-zero communication overhead

The 21 PB/s aggregate bandwidth represents the total capacity of all mesh links on the wafer operating simultaneously. Individual core-to-core bandwidth is lower, but the aggregate throughput matters because neural network training involves distributed computation across all 900,000 cores simultaneously.
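As a toy model of multi-hop communication on such a fabric, dimension-ordered (XY) routing - a common mesh routing scheme, assumed here for illustration since Cerebras does not publicly document its routing protocol at this level - moves a message along the X axis first, then Y:

```python
def xy_route(src, dst):
    """Hops from src to dst on a 2D mesh using X-then-Y routing."""
    (x, y), (dx, dy) = src, dst
    path = []
    while x != dx:                    # traverse the X dimension first
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:                    # then the Y dimension
        y += 1 if dy > y else -1
        path.append((x, y))
    return path

hops = xy_route((0, 0), (3, 2))
print(len(hops))  # 5 -> equals the Manhattan distance |3-0| + |2-0|
```

Hop count grows with Manhattan distance, which is why mapping communicating layers to adjacent regions of the wafer keeps latency low.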

Yield and Redundancy

Semiconductor manufacturing is inherently imperfect. At TSMC's 5nm node, even with excellent process control, a wafer will have defective transistors. For a standard chip with a few hundred mm² of die area, the probability of a defect is low enough that the majority of dies on a wafer are functional.

For the WSE-3, which uses the entire wafer, every defect that would have killed a standard die is present on the chip. Cerebras handles this through redundancy - the wafer includes more cores and more SRAM than the nominal specifications. Defective cores are identified during testing and replaced with spare cores from the redundancy pool. The mesh interconnect is also designed to route around defective links.
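The standard Poisson yield model makes clear why redundancy is unavoidable at wafer scale: the probability of a defect-free die falls exponentially with area (the defect density below is an assumed, illustrative figure, not a TSMC number):

```python
import math

D0 = 0.001  # assumed defect density, defects per mm^2 (illustrative)

def defect_free_prob(area_mm2: float) -> float:
    """Poisson yield model: P(zero defects) = exp(-D0 * A)."""
    return math.exp(-D0 * area_mm2)

print(f"{defect_free_prob(814):.2f}")    # H100-sized die (~814 mm^2): ~0.44
print(f"{defect_free_prob(46225):.2e}")  # full wafer (~46,225 mm^2): ~0, spares required
```

At any realistic defect density, the chance of a flawless full wafer is effectively zero, so the design must tolerate defects rather than avoid them.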

This redundancy approach has a cost: some fraction of the wafer's area is "wasted" on spare cores that are not used on defect-free wafers. But it enables the fundamental breakthrough of wafer-scale computing by making yield management tractable.

Pricing and Availability

The CS-3 system is priced at approximately $2-3 million per unit, making it a capital expenditure that only large enterprises, research institutions, and well-funded AI labs can justify.

| System | Approx. Price | FP8 Performance | Price per PFLOPS |
|---|---|---|---|
| Cerebras CS-3 (WSE-3) | ~$2.5M | ~125 PFLOPS | ~$20K/PFLOPS |
| NVIDIA DGX H100 (8x H100) | ~$350K | 31.6 PFLOPS | ~$11K/PFLOPS |
| NVIDIA DGX B200 (8x B200) | ~$450K | 72 PFLOPS | ~$6.3K/PFLOPS |

On a pure price-per-PFLOPS basis, the CS-3 is more expensive than GPU alternatives. The NVIDIA DGX B200 is roughly 3x more cost-efficient per FLOP. The value proposition for the CS-3 lies elsewhere.
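The price-per-PFLOPS figures are straightforward to recompute from the approximate prices quoted in this section:

```python
systems = {
    "Cerebras CS-3":   (2.5e6, 125),   # (approx. price in USD, FP8 PFLOPS)
    "NVIDIA DGX H100": (350e3, 31.6),
    "NVIDIA DGX B200": (450e3, 72),
}

for name, (price, pflops) in systems.items():
    # Dollars per PFLOPS, printed in thousands
    print(f"{name}: ${price / pflops / 1e3:.1f}K per PFLOPS")
```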

The 21 PB/s on-chip bandwidth advantage means that for bandwidth-bound workloads - which include most LLM inference and many training scenarios - the effective performance per dollar can be significantly better than the raw FLOPS comparison suggests. A DGX H100 has roughly 4x fewer FLOPS and nearly 800x less bandwidth than the CS-3, so on bandwidth-bound workloads it falls behind by a far wider margin than the FLOPS gap alone implies.

For organizations evaluating the CS-3, the decision framework is:

  • If your workloads are compute-bound (large-batch training with high arithmetic intensity), GPUs offer better price-performance
  • If your workloads are bandwidth-bound (inference, small-batch training, sparse models), the WSE-3's architectural advantage closes the price-performance gap or tips it in Cerebras's favor
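This split is the classic roofline test: a workload is compute-bound on a given system when its arithmetic intensity (FLOPs per byte moved) exceeds the system's ratio of peak FLOPS to peak bandwidth. A sketch using the system figures from this page (the arithmetic intensity value is an assumed example):

```python
def bound(ai_flops_per_byte: float, peak_flops: float, peak_bw: float) -> str:
    """Roofline classification: compare arithmetic intensity to the ridge point."""
    ridge = peak_flops / peak_bw
    return "compute-bound" if ai_flops_per_byte > ridge else "bandwidth-bound"

ai = 300  # assumed arithmetic intensity of an example workload, FLOPs/byte

# DGX H100: 31.6 PFLOPS FP8 vs 26.4 TB/s HBM -> ridge ~1,200 FLOPs/byte
print(bound(ai, 31.6e15, 26.4e12))  # bandwidth-bound
# CS-3: 125 PFLOPS FP8 vs 21 PB/s SRAM -> ridge ~6 FLOPs/byte
print(bound(ai, 125e15, 21e15))     # compute-bound
```

The same workload can sit on opposite sides of the ridge on the two systems, which is exactly the asymmetry the decision framework above captures.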

Cloud Access

Cerebras offers cloud access through multiple channels:

  • Cerebras Inference: Per-token API access for inference workloads
  • Condor Galaxy: Managed supercomputer access through partnerships (G42, others)
  • On-premises CS-3: Direct hardware purchase for organizations with the budget and data center infrastructure

For organizations that cannot justify a $2-3M hardware purchase, cloud access provides an alternative path to WSE-3 performance without the capital expenditure.

Who Should Consider the WSE-3

Strong fit:

  • Organizations training models at 100B+ parameter scale where bandwidth-bound performance matters
  • Research institutions working on sparse model architectures that benefit from fine-grained parallelism
  • AI labs that need an alternative to NVIDIA for supply chain diversification
  • Scientific computing workloads (drug discovery, climate modeling, molecular dynamics) that map well to the 900,000-core architecture
  • Organizations with the budget and data center infrastructure for $2-3M systems at 23kW power draw

Weak fit:

  • Startups or small teams that cannot justify multi-million dollar hardware purchases
  • Inference-only deployments where Groq's LPU offers better latency at lower cost
  • Teams that depend on the CUDA ecosystem and cannot invest in porting to Cerebras's SDK
  • Organizations that need broad cloud availability or multi-provider deployment flexibility
  • Workloads where the model fits comfortably in GPU HBM and bandwidth is not the bottleneck

Strengths

  • 21 PB/s on-chip bandwidth is 300-800x higher than any HBM-based solution
  • 900,000 cores on a single wafer eliminates inter-chip communication overhead
  • ~125 PFLOPS FP8 per system is 4x a DGX H100 in a single unit
  • Wafer-scale redundancy engineering proves full-wafer integration is a viable production technology
  • MemoryX and SwarmX allow scaling to trillion-parameter models despite limited on-chip SRAM
  • Supports both training and inference - not limited to one or the other
  • Cerebras Inference service provides cloud access without the $2-3M hardware purchase
  • ~5.4 PFLOPS per kW is among the best energy-efficiency ratios in the market

Weaknesses

  • $2-3M per CS-3 system limits the buyer pool to well-funded organizations
  • 44GB on-chip SRAM is a fraction of the 640GB-1,536GB HBM available on GPU systems
  • Price-per-PFLOPS is roughly 2-3x worse than current-generation NVIDIA GPU systems
  • MemoryX external storage adds latency and complexity for models exceeding 44GB
  • Software ecosystem is far less mature than CUDA - smaller developer community and fewer examples
  • Single-vendor dependency with no second-source option for the wafer-scale form factor
  • 23kW system power requires significant data center cooling and power infrastructure

Alternatives

  • Groq LPU - Another SRAM-based architecture, optimized purely for inference with deterministic latency
  • Intel Gaudi 3 - A more conventional ASIC challenger to NVIDIA's GPU dominance
  • AWS Trainium2 - Amazon's cloud-native alternative for large-scale training workloads

About the author

James is a software engineer turned tech writer, now working as an AI benchmarks and tools analyst. He spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.