SambaNova SN50 RDU - Agentic Inference Chip
Complete specs and analysis of SambaNova's SN50 RDU - a TSMC 3nm dataflow chip with 3.2 PFLOPS FP8, three-tier memory, and claimed 5x speed over NVIDIA B200.

TL;DR
- 5th-generation Reconfigurable Dataflow Unit on TSMC 3nm - dual-chiplet design with ~2,080 PCUs and ~2,080 PMUs
- 3.2 PFLOPS FP8 and 1.6 PFLOPS BF16 per chip - SambaNova claims 5x max speed and 3x efficiency over NVIDIA B200, but these numbers are unverified by independent benchmarks
- Three-tier memory hierarchy: 432 MB on-chip SRAM at hundreds of TB/s, 64 GB HBM2E at 1.8 TB/s, and up to 2 TB DDR5
- Scales to 256 RDUs per inference worker and 32,768 in scaleout - targets 10 trillion parameter models with 10 million token context
- First deployment at SoftBank data centers in Japan, backed by $350M Series E with Intel as investor
Overview
SambaNova's SN50 is the fifth-generation Reconfigurable Dataflow Unit, and it represents the company's most aggressive swing at NVIDIA's dominance in AI inference hardware. Built on TSMC's 3nm (N3) process with a dual-chiplet design, the SN50 roughly doubles the compute unit count of the previous SN40L while adding native FP8 support and a three-tier memory system designed specifically for the long-context, multi-step reasoning workloads that agentic AI demands. SambaNova claims the SN50 delivers 5x maximum throughput and 3x better energy efficiency over the NVIDIA B200, with 8x total cost of ownership savings.
Those are bold claims, and I need to state this upfront: none of them have been independently verified. Every performance number in this article comes from SambaNova's own benchmarks and marketing materials. There are no MLPerf submissions, no third-party inference benchmark results, and no independent lab validations. The SN50 isn't shipping yet - it's scheduled for the second half of 2026 - so outside verification is not possible now. I'm presenting SambaNova's claims as precisely that: claims. The underlying architecture is interesting enough to warrant serious attention, but the performance numbers should be treated with appropriate skepticism until independent data arrives.
What makes the SN50 worth watching is the dataflow architecture itself. Unlike GPUs, which use a general-purpose SIMT execution model, the RDU maps computation graphs directly to hardware at compile time. Data flows through the chip in a predetermined pattern, with each Pattern Compute Unit (PCU) and Pattern Memory Unit (PMU) executing its assigned portion of the workload without the scheduling overhead that comes with GPU thread management. This is a fundamentally different approach to inference, and the architectural bet is that removing runtime scheduling overhead pays off at scale - especially for the multi-model, long-context orchestration patterns that agentic AI workloads require.
Key Specifications
| Specification | Details |
|---|---|
| Manufacturer | SambaNova Systems |
| Architecture | RDU (Reconfigurable Dataflow Unit), 5th generation |
| Process Node | TSMC 3nm (N3) |
| Design | Dual-chiplet |
| Pattern Compute Units (PCUs) | ~2,080 (doubled from SN40L) |
| Pattern Memory Units (PMUs) | ~2,080 (doubled from SN40L) |
| Clock Speed | ~2.35 GHz (25% increase over SN40L) |
| On-Chip SRAM | 432 MB |
| On-Chip SRAM Bandwidth | Hundreds of TB/s |
| HBM | 64 GB HBM2E |
| HBM Bandwidth | 1,800 GB/s (1.8 TB/s) |
| DDR5 | 256 GB to 2 TB |
| FP8 Performance | ~3,200 TFLOPS (3.2 PFLOPS) |
| BF16/FP16 Performance | ~1,600 TFLOPS (1.6 PFLOPS) |
| Chip-to-Chip Interconnect | 2.2 TB/s bidirectional (switched fabric) |
| Max RDUs per Inference Worker | 256 |
| Max RDUs per Domain | 2,048 |
| Max RDUs in Scaleout | 32,768 |
| System Form Factor | SambaRack (16 SN50 chips) |
| System Power | 15-30 kW per SambaRack |
| Cooling | Air-cooled |
| TDP (per chip) | Not officially disclosed |
| Target Workload | Inference (agentic AI focus) |
| Release Date | 2026-H2 |
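The claimed figures in the table can be cross-checked against each other. Here is a quick back-of-envelope calculation, assuming the 3.2 PFLOPS FP8 figure covers both chiplets and counting each multiply-accumulate (MAC) as two FLOPs; none of these inputs are measured values.

```python
# How many FP8 operations must each PCU complete per clock cycle for the
# claimed specs to be internally consistent? All inputs are SambaNova's
# claimed figures, not measurements.

fp8_flops = 3.2e15   # claimed peak FP8 FLOPS per chip
num_pcus = 2080      # claimed Pattern Compute Units (both chiplets)
clock_hz = 2.35e9    # claimed clock speed

flops_per_pcu_per_cycle = fp8_flops / (num_pcus * clock_hz)
macs_per_pcu_per_cycle = flops_per_pcu_per_cycle / 2  # 1 MAC = 2 FLOPs

print(f"{flops_per_pcu_per_cycle:.0f} FLOPs/PCU/cycle")  # ~655
print(f"{macs_per_pcu_per_cycle:.0f} MACs/PCU/cycle")    # ~327
```

Roughly 327 MACs per PCU per cycle is a plausible size for a tiled systolic-style compute unit, so the headline FLOPS, unit count, and clock speed are at least mutually consistent.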
Performance Benchmarks (SambaNova Claims)
Important: All performance figures in this table are SambaNova's own benchmarks and haven't been independently verified.
| Metric | SambaNova SN50 (claimed) | NVIDIA B200 | Groq LPU |
|---|---|---|---|
| FP8 TFLOPS (per chip) | 3,200 | 4,500 (dense) / 9,000 (sparse) | ~750 |
| BF16/FP16 TFLOPS (per chip) | 1,600 | 2,250 (dense) / 4,500 (sparse) | ~188 |
| On-Chip SRAM | 432 MB | 96 MB L2 cache | 230 MB |
| HBM Capacity | 64 GB HBM2E | 192 GB HBM3e | None |
| HBM Bandwidth | 1,800 GB/s | 8,000 GB/s | N/A |
| Max Model Size (claimed) | 10T parameters | ~380B (single GPU, FP4) | Limited by SRAM |
| Max Context Length (claimed) | 10M tokens | Depends on KV cache | Limited by SRAM |
| Llama 3.3 70B (tok/s/user) | 895 (claimed) | 184 | ~500 |
| DeepSeek R1 671B (tok/s/user) | ~250 (claimed) | ~19 (avg. GPU provider) | N/A |
| Process Node | TSMC 3nm | TSMC 4NP | 14nm (GF) |
| TDP | Not disclosed | 1,000W | ~300W |
Several things stand out in this comparison. On raw FP8 TFLOPS, the B200 actually leads the SN50 - 4,500 dense (9,000 sparse) versus 3,200. SambaNova's throughput advantage, if real, comes not from raw FLOPS but from architectural efficiency: the dataflow execution model, the three-tier memory hierarchy, and the compiler's ability to keep data on-chip and minimize memory round-trips. The claim is that the RDU wastes fewer cycles on scheduling, memory management, and data movement than a GPU does, and that this architectural efficiency compounds into higher real-world throughput despite lower peak FLOPS.
The Llama 3.3 70B benchmark - 895 tok/s/user versus 184 tok/s/user on B200 - is a 4.9x advantage. If accurate, that is a significant result. But there are open questions: what batch size was used? What quantization precision? Was the B200 running TensorRT-LLM with the latest optimizations? SambaNova has not published the full methodology, which makes it impossible to reproduce or verify these numbers.
The DeepSeek R1 671B comparison is even harder to assess. The ~250 tok/s/user claim is compared against ~19 tok/s/user from an unnamed "average GPU provider" - not against a specific B200 configuration with a defined software stack. That is a marketing comparison, not a benchmark.
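For reference, the per-user decode speedups implied by the table work out as follows. The SN50 numbers are SambaNova's claims and the baselines are as quoted in their comparison, so these ratios inherit every caveat discussed above; note the DeepSeek ratio exceeds the headline 5x claim precisely because its baseline is the vague "average GPU provider" rather than a tuned B200.

```python
# Speedup ratios implied by the claimed per-user throughput table.
# All figures are SambaNova's claims, not independently verified numbers.

llama_70b_sn50, llama_70b_b200 = 895, 184
r1_671b_sn50, r1_671b_avg_gpu = 250, 19

print(f"Llama 3.3 70B: {llama_70b_sn50 / llama_70b_b200:.1f}x vs B200")       # 4.9x
print(f"DeepSeek R1 671B: {r1_671b_sn50 / r1_671b_avg_gpu:.1f}x vs avg GPU")  # 13.2x
```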
Key Capabilities
Dataflow Architecture. The RDU's defining characteristic is its dataflow execution model. In a GPU, the hardware dynamically schedules warps of threads across streaming multiprocessors, managing memory access patterns, cache coherency, and thread synchronization at runtime. The RDU removes this runtime overhead by mapping the entire computation graph to hardware at compile time through SambaNova's SambaFlow compiler.
Each PCU handles computation (matrix multiplications, activations, normalization) and each PMU manages local data storage and movement. The compiler determines exactly which data flows between which units at which clock cycle before execution begins. There's no dynamic scheduling, no cache miss penalties, and no thread divergence. This is similar in philosophy to Groq's LPU, which also uses static scheduling, but the RDU differs in a key way: it's reconfigurable. The dataflow graph can be reprogrammed for different models without hardware changes - compilation simply determines how the fixed pool of PCUs and PMUs is allocated to each model.
The trade-off is the same one every non-GPU architecture faces: the software stack must compensate for the loss of CUDA's enormous ecosystem. SambaFlow provides the compiler, runtime, and Python SDK, but it isn't CUDA, and the developer community and tooling ecosystem are orders of magnitude smaller.
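The compile-time mapping idea can be sketched in miniature. The toy below is purely illustrative and reflects nothing of SambaFlow's real internals: it topologically orders a dependency graph of ops and assigns each op to a unit once, up front, so that execution is just a replay of a fixed schedule with no runtime scheduler.

```python
# Toy illustration of compile-time dataflow mapping (not SambaFlow's actual
# algorithm). Ops are ordered and placed onto units before any data flows;
# execution then replays the fixed schedule, mirroring the RDU's
# static-mapping philosophy.

def compile_graph(graph, num_units):
    """Topologically order ops, then round-robin them onto units."""
    order, seen = [], set()

    def visit(op):
        if op in seen:
            return
        seen.add(op)
        for dep in graph[op]:   # place all inputs before the op itself
            visit(dep)
        order.append(op)

    for op in graph:
        visit(op)
    # Static placement: op -> unit is decided entirely at "compile time".
    return [(op, i % num_units) for i, op in enumerate(order)]

# Dependencies of a tiny transformer-ish block: op -> list of its inputs.
graph = {
    "embed": [],
    "attn": ["embed"],
    "mlp": ["attn"],
    "norm": ["mlp"],
}
schedule = compile_graph(graph, num_units=2)
print(schedule)  # [('embed', 0), ('attn', 1), ('mlp', 0), ('norm', 1)]
```

The real compiler must additionally solve placement, routing, and buffer sizing across thousands of PCUs and PMUs, which is where the engineering difficulty (and the software-maturity risk) actually lives.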
Three-Tier Memory Hierarchy. The SN50's most distinctive hardware feature is its three-level memory system, and this is where the agentic AI pitch gets concrete.
The first tier is 432 MB of on-chip SRAM distributed across the PMUs, running at hundreds of TB/s of aggregate bandwidth. This is where hot data lives - attention KV caches, intermediate activations, and frequently accessed model weights. For context, this is 4.5x the B200's 96 MB L2 cache and roughly double the Groq LPU's 230 MB SRAM, though the architectural roles aren't directly comparable.
The second tier is 64 GB of HBM2E at 1.8 TB/s. This holds model weights and larger working sets. Worth noting: SambaNova chose HBM2E rather than HBM3e - a generation behind what the B200 uses. The 1.8 TB/s bandwidth is less than a quarter of the B200's 8 TB/s. SambaNova's argument is that the 432 MB SRAM tier absorbs enough memory traffic to reduce pressure on HBM, making the slower HBM acceptable in practice. Whether that holds up across varied workloads remains to be proven.
The third tier is 256 GB to 2 TB of DDR5 attached to each RDU. This is rare among AI accelerators and is specifically designed for agentic workloads where multi-million-token context windows and persistent agent state need to stay local to the accelerator rather than requiring a round trip to host system memory or storage. No other inference chip vendor currently offers this kind of local DRAM capacity per accelerator.
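To see why the DDR5 tier matters, consider the KV cache for a 10-million-token context. The model configuration below is a hypothetical 70B-class transformer (80 layers, 8 KV heads via grouped-query attention, head dimension 128, FP8 cache) chosen for illustration; it is not a SambaNova-published configuration.

```python
# Rough KV-cache sizing for a hypothetical 70B-class model at 10M tokens.
# Model shape (layers, heads, head_dim) is an illustrative assumption.

layers, kv_heads, head_dim = 80, 8, 128
bytes_per_elem = 1              # FP8 cache = 1 byte per element
tokens = 10_000_000

bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
total_tb = bytes_per_token * tokens / 1e12

print(f"{bytes_per_token / 1024:.0f} KiB per token")  # 160 KiB
print(f"{total_tb:.2f} TB for 10M tokens")            # ~1.64 TB
```

A cache of roughly 1.6 TB overflows the 64 GB HBM tier by more than 25x but fits within the 2 TB DDR5 tier, which is the concrete capacity argument behind the three-level design.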
Agentic AI Focus. SambaNova has positioned the SN50 specifically for agentic inference - workloads where multiple models collaborate, maintain long-running context, and call external tools in multi-step reasoning chains. The combination of the three-tier memory (keeping agent state local to the chip), the 10-million-token context window claim, and the scaleout to 10-trillion-parameter models is aimed squarely at the emerging pattern of coordinating multiple specialized models within a single inference pipeline.
The 256-RDU inference worker configuration enables serving very large models without the bandwidth penalties of cross-node communication that plague GPU clusters at similar scale. SambaNova's switched fabric interconnect runs at 2.2 TB/s bidirectional between chips, which is higher than the B200's 1,800 GB/s NVLink per GPU, though the topologies and use patterns differ enough that direct comparison isn't straightforward.
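On capacity alone, the 10-trillion-parameter claim is at least arithmetically plausible for a 256-RDU worker. The sketch below assumes FP8 weights (1 byte per parameter) and the per-RDU memory figures from the spec table, and ignores activations, KV cache, and any weight replication.

```python
# Capacity sanity check for a 256-RDU inference worker serving a
# 10T-parameter model. Assumes FP8 weights; ignores activations, KV cache,
# and replication overhead.

params = 10e12
bytes_per_param = 1   # FP8
rdus = 256

weights_tb = params * bytes_per_param / 1e12   # 10 TB of weights
hbm_tb = rdus * 64e9 / 1e12                    # ~16.4 TB aggregate HBM
ddr_tb = rdus * 2e12 / 1e12                    # 512 TB max aggregate DDR5

print(f"weights: {weights_tb:.0f} TB, HBM: {hbm_tb:.2f} TB, DDR5: {ddr_tb:.0f} TB")
```

The weights fit in aggregate HBM with headroom, and the DDR5 tier dwarfs them. Whether the interconnect and HBM bandwidth sustain useful throughput at that scale is the open question the capacity math cannot answer.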
Software Stack
The SN50 runs on SambaNova's SambaFlow software stack, which includes a Python SDK, a dataflow compiler that maps models to the RDU hardware, and a runtime for execution management. SambaTune provides profiling and optimization tools.
SambaFlow supports major model architectures including transformer-based LLMs, mixture-of-experts models, and multi-modal architectures. The compiler handles the translation from standard PyTorch model definitions to dataflow graphs mapped onto the PCU/PMU fabric. SambaNova claims this compilation is automatic and doesn't require manual hardware-level optimization from the user.
The practical question is ecosystem maturity. CUDA has decades of tooling, debugging infrastructure, and community knowledge. SambaFlow is a proprietary stack with a much smaller user base. For organizations evaluating the SN50, the software stack's readiness for production workloads is as important as the hardware specifications.
Pricing and Availability
SambaNova hasn't disclosed per-chip or per-system pricing for the SN50. The company raised $350 million in a Series E round, with Intel participating as an investor. The partnership with Intel involves deploying Xeon CPUs with SN50 RDUs for host processing in inference and agentic workloads.
The first confirmed deployment is at SoftBank's data centers in Japan, where SambaRack systems (16 SN50 chips per rack, 15-30 kW, air-cooled) will be installed. General availability is expected in the second half of 2026.
The air-cooling design is worth noting. At 15-30 kW for a 16-chip rack, the per-chip power draw falls in the roughly 940W to 1,875W range - an upper bound that includes non-RDU components such as host CPUs, memory, and networking - but SambaNova hasn't broken out the per-chip TDP. The fact that these systems are air-cooled at this power envelope is a deployment advantage over the B200, which requires liquid cooling at 1,000W per GPU. Air-cooled systems can be installed in existing data center infrastructure without plumbing modifications.
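The per-chip band follows directly from dividing the rack figures by the chip count; it is an upper bound on average RDU draw, since the rack budget also covers host CPUs, DDR5, fans, and networking.

```python
# Per-chip power bound derived from the published SambaRack figures.
# This is an upper bound: non-RDU rack components share the same budget.

rack_power_w = (15_000, 30_000)  # published rack envelope
chips_per_rack = 16

low, high = (p / chips_per_rack for p in rack_power_w)
print(f"{low:.0f}-{high:.0f} W per chip, upper bound")  # 938-1875 W
```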
Until pricing is public, rigorous cost comparisons with NVIDIA and other alternatives aren't possible, and SambaNova's claim of 8x TCO savings over the B200 remains unverifiable.
Strengths
- Dataflow architecture removes GPU scheduling overhead - computation graphs are mapped to hardware at compile time
- Three-tier memory with 432 MB SRAM, 64 GB HBM2E, and up to 2 TB DDR5 is uniquely suited for long-context agentic workloads
- TSMC 3nm process is the most advanced node used by any AI inference ASIC currently announced
- Air-cooled SambaRack design (15-30 kW) avoids the liquid cooling infrastructure required by B200 deployments
- Massive scaleout to 32,768 RDUs enables serving models up to 10 trillion parameters (claimed)
- 10 million token context window support (claimed) addresses a genuine gap in current inference infrastructure
- 2.2 TB/s chip-to-chip switched fabric provides high-bandwidth inter-chip communication
- Strong backing with $350M Series E and SoftBank as first deployment partner
Weaknesses
- Zero independent benchmark validation - all performance claims come from SambaNova's own testing
- Not shipping until H2 2026 - the B200 is available now, and next-generation NVIDIA hardware will likely be shipping by then
- 64 GB HBM2E at 1.8 TB/s is a generation behind the B200's 192 GB HBM3e at 8 TB/s
- SambaFlow is a proprietary software stack with a small ecosystem compared to CUDA
- No disclosed pricing makes TCO analysis impossible for potential buyers
- Per-chip TDP not disclosed - total system power suggests it may be comparable to or higher than the B200's 1,000W
- Limited deployment track record - SambaNova's previous-generation chips have a much smaller install base than NVIDIA or even Groq
- Comparison benchmarks use vague baselines ("average GPU provider") rather than specific, reproducible configurations
Related Coverage
- NVIDIA B200 - Blackwell Flagship GPU - The incumbent SambaNova is benchmarking against, with 9,000 TFLOPS sparse FP8 and 192GB HBM3e
- Groq LPU - Deterministic Inference at Scale - Another non-GPU inference ASIC using static scheduling and on-chip SRAM
- NVIDIA H100 SXM - The AI Training Benchmark - The previous-generation GPU still widely used for inference
