
SambaNova SN50 RDU - Agentic Inference Chip

Complete specs and analysis of SambaNova's SN50 RDU - a TSMC 3nm dataflow chip with 3.2 PFLOPS FP8, three-tier memory, and claimed 5x speed over NVIDIA B200.

TL;DR

  • 5th-generation Reconfigurable Dataflow Unit on TSMC 3nm - dual-chiplet design with ~2,080 PCUs and ~2,080 PMUs
  • 3.2 PFLOPS FP8 and 1.6 PFLOPS BF16 per chip - SambaNova claims 5x max speed and 3x efficiency over NVIDIA B200, but these numbers are unverified by independent benchmarks
  • Three-tier memory hierarchy: 432 MB on-chip SRAM at hundreds of TB/s, 64 GB HBM2E at 1.8 TB/s, and up to 2 TB DDR5
  • Scales to 256 RDUs per inference worker and 32,768 in scaleout - targets 10 trillion parameter models with 10 million token context
  • First deployment at SoftBank data centers in Japan, backed by $350M Series E with Intel as investor

Overview

SambaNova's SN50 is the fifth-generation Reconfigurable Dataflow Unit, and it represents the company's most aggressive swing at NVIDIA's dominance in AI inference hardware. Built on TSMC's 3nm (N3) process with a dual-chiplet design, the SN50 roughly doubles the compute unit count of the previous SN40L while adding native FP8 support and a three-tier memory system designed specifically for the long-context, multi-step reasoning workloads that agentic AI demands. SambaNova claims the SN50 delivers 5x maximum throughput and 3x better energy efficiency over the NVIDIA B200, with 8x total cost of ownership savings.

Those are bold claims, and I need to state this upfront: none of them have been independently verified. Every performance number in this article comes from SambaNova's own benchmarks and marketing materials. There are no MLPerf submissions, no third-party inference benchmark results, and no independent lab validations. The SN50 isn't shipping yet - it's scheduled for the second half of 2026 - so outside verification is not possible now. I'm presenting SambaNova's claims as precisely that: claims. The underlying architecture is interesting enough to warrant serious attention, but the performance numbers should be treated with appropriate skepticism until independent data arrives.

What makes the SN50 worth watching is the dataflow architecture itself. Unlike GPUs, which use a general-purpose SIMT execution model, the RDU maps computation graphs directly to hardware at compile time. Data flows through the chip in a predetermined pattern, with each Pattern Compute Unit (PCU) and Pattern Memory Unit (PMU) executing its assigned portion of the workload without the scheduling overhead that comes with GPU thread management. This is a fundamentally different approach to inference, and the architectural bet is that removing runtime scheduling overhead pays off at scale - especially for the multi-model, long-context orchestration patterns that agentic AI workloads require.

Key Specifications

| Specification | Details |
|---|---|
| Manufacturer | SambaNova Systems |
| Architecture | RDU (Reconfigurable Dataflow Unit), 5th generation |
| Process Node | TSMC 3nm (N3) |
| Design | Dual-chiplet |
| Pattern Compute Units (PCUs) | ~2,080 (doubled from SN40L) |
| Pattern Memory Units (PMUs) | ~2,080 (doubled from SN40L) |
| Clock Speed | ~2.35 GHz (25% increase over SN40L) |
| On-Chip SRAM | 432 MB |
| On-Chip SRAM Bandwidth | Hundreds of TB/s |
| HBM | 64 GB HBM2E |
| HBM Bandwidth | 1,800 GB/s (1.8 TB/s) |
| DDR5 | 256 GB to 2 TB |
| FP8 Performance | ~3,200 TFLOPS (3.2 PFLOPS) |
| BF16/FP16 Performance | ~1,600 TFLOPS (1.6 PFLOPS) |
| Chip-to-Chip Interconnect | 2.2 TB/s bidirectional (switched fabric) |
| Max RDUs per Inference Worker | 256 |
| Max RDUs per Domain | 2,048 |
| Max RDUs in Scaleout | 32,768 |
| System Form Factor | SambaRack (16 SN50 chips) |
| System Power | 15-30 kW per SambaRack |
| Cooling | Air-cooled |
| TDP (per chip) | Not officially disclosed |
| Target Workload | Inference (agentic AI focus) |
| Release Date | H2 2026 |

Performance Benchmarks (SambaNova Claims)

Important: All performance figures in this table are SambaNova's own benchmarks and haven't been independently verified.

| Metric | SambaNova SN50 (claimed) | NVIDIA B200 | Groq LPU |
|---|---|---|---|
| FP8 TFLOPS (per chip) | 3,200 | 4,500 (dense) / 9,000 (sparse) | ~750 |
| BF16/FP16 TFLOPS (per chip) | 1,600 | 2,250 (dense) / 4,500 (sparse) | ~188 |
| On-Chip SRAM | 432 MB | 96 MB L2 cache | 230 MB |
| HBM Capacity | 64 GB HBM2E | 192 GB HBM3e | None |
| HBM Bandwidth | 1,800 GB/s | 8,000 GB/s | N/A |
| Max Model Size (claimed) | 10T parameters | ~380B (single GPU, FP4) | Limited by SRAM |
| Max Context Length (claimed) | 10M tokens | Depends on KV cache | Limited by SRAM |
| Llama 3.3 70B (tok/s/user) | 895 (claimed) | 184 | ~500 |
| DeepSeek R1 671B (tok/s/user) | ~250 (claimed) | ~19 (avg. GPU provider) | N/A |
| Process Node | TSMC 3nm | TSMC 4NP | 14nm (GF) |
| TDP | Not disclosed | 1,000 W | ~300 W |

Several things stand out in this comparison. On raw FP8 TFLOPS, the B200 actually leads the SN50 - 4,500 dense (9,000 sparse) versus 3,200. SambaNova's throughput advantage, if real, comes not from raw FLOPS but from architectural efficiency: the dataflow execution model, the three-tier memory hierarchy, and the compiler's ability to keep data on-chip and minimize memory round-trips. The claim is that the RDU wastes fewer cycles on scheduling, memory management, and data movement than a GPU does, and that this architectural efficiency compounds into higher real-world throughput despite lower peak FLOPS.

The Llama 3.3 70B benchmark - 895 tok/s/user versus 184 tok/s/user on B200 - is a 4.9x advantage. If accurate, that is a significant result. But there are open questions: what batch size was used? What quantization precision? Was the B200 running TensorRT-LLM with the latest optimizations? SambaNova has not published the full methodology, which makes it impossible to reproduce or verify these numbers.
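
The headline ratio is simple arithmetic, using the two figures SambaNova cites:

```python
# SambaNova's claimed Llama 3.3 70B per-user throughput vs. the B200 figure
# it cites -- both numbers come from SambaNova's own materials.
sn50_claimed_tok_s = 895
b200_tok_s = 184

speedup = sn50_claimed_tok_s / b200_tok_s
print(f"Claimed per-user speedup: {speedup:.1f}x")  # ~4.9x
```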

The DeepSeek R1 671B comparison is even harder to assess. The ~250 tok/s/user claim is compared against ~19 tok/s/user from an unnamed "average GPU provider" - not against a specific B200 configuration with a defined software stack. That is a marketing comparison, not a benchmark.

Key Capabilities

Dataflow Architecture. The RDU's defining characteristic is its dataflow execution model. In a GPU, the hardware dynamically schedules warps of threads across streaming multiprocessors, managing memory access patterns, cache coherency, and thread synchronization at runtime. The RDU removes this runtime overhead by mapping the entire computation graph to hardware at compile time through SambaNova's SambaFlow compiler.

Each PCU handles computation (matrix multiplications, activations, normalization) and each PMU manages local data storage and movement. The compiler determines exactly which data flows between which units at which clock cycle before execution begins. There's no dynamic scheduling, no cache miss penalties, and no thread divergence. This is philosophically similar to Groq's LPU, which also relies on static, compile-time scheduling, but the RDU differs in a key way: it's reconfigurable. Where the LPU's compiler schedules work onto a fixed pipeline, the RDU's fabric itself - the dataflow graph and the roles assigned to its PCUs and PMUs - can be reprogrammed for different models without hardware changes.
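
As a loose illustration of the compile-time idea - a toy Python sketch, emphatically not SambaFlow - a "compiler" fixes the execution order of a small dataflow graph once, and the "runtime" then streams through that fixed schedule with no dynamic decisions:

```python
# Toy analogy for static dataflow scheduling: the execution order of a
# matmul -> bias -> relu fragment is decided once, "at compile time",
# so the run loop makes no scheduling choices of its own.
from graphlib import TopologicalSorter

def compile_graph(deps):
    """Turn an op -> dependencies map into a fixed execution schedule."""
    return tuple(TopologicalSorter(deps).static_order())

def run(schedule, ops, inputs):
    vals = dict(inputs)
    for name in schedule:          # fixed order, no runtime decisions
        vals[name] = ops[name](vals)
    return vals

deps = {"matmul": [], "bias": ["matmul"], "relu": ["bias"]}
ops = {
    "matmul": lambda v: v["x"] * v["w"],
    "bias":   lambda v: v["matmul"] + v["b"],
    "relu":   lambda v: max(0.0, v["bias"]),
}
schedule = compile_graph(deps)  # ('matmul', 'bias', 'relu')
out = run(schedule, ops, {"x": 2.0, "w": 3.0, "b": -1.0})["relu"]
print(out)  # 5.0
```

The real compiler also assigns ops to physical PCUs/PMUs and routes data between them; the point here is only that all of that is resolved before execution begins.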

The trade-off is the same one every non-GPU architecture faces: the software stack must compensate for the loss of CUDA's enormous ecosystem. SambaFlow provides the compiler, runtime, and Python SDK, but it isn't CUDA, and the developer community and tooling ecosystem are orders of magnitude smaller.

Three-Tier Memory Hierarchy. The SN50's most distinctive hardware feature is its three-level memory system, and this is where the agentic AI pitch gets concrete.

The first tier is 432 MB of on-chip SRAM distributed across the PMUs, running at hundreds of TB/s of aggregate bandwidth. This is where hot data lives - attention KV caches, intermediate activations, and frequently accessed model weights. For context, this is 4.5x the B200's 96 MB L2 cache and roughly double the Groq LPU's 230 MB SRAM, though the architectural roles aren't directly comparable.

The second tier is 64 GB of HBM2E at 1.8 TB/s. This holds model weights and larger working sets. Worth noting: SambaNova chose HBM2E rather than HBM3e - a generation behind what the B200 uses. The 1.8 TB/s bandwidth is less than a quarter of the B200's 8 TB/s. SambaNova's argument is that the 432 MB SRAM tier absorbs enough memory traffic to reduce pressure on HBM, making the slower HBM acceptable in practice. Whether that holds up across varied workloads remains to be proven.
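
A back-of-envelope roofline helps frame that argument. Batch-1 decode is typically memory-bandwidth-bound because every generated token re-reads the full weight set; if a single SN50 had to stream a 70B FP8 model from HBM alone, the ceiling would sit far below the claimed 895 tok/s. The figures below are illustrative only - single chip, ignoring KV-cache traffic and multi-chip sharding:

```python
def max_decode_tok_s(params_billions: float, bytes_per_param: float,
                     bw_tb_s: float) -> float:
    """Upper bound on batch-1 decode rate if every token re-reads all weights."""
    weight_bytes = params_billions * 1e9 * bytes_per_param
    return (bw_tb_s * 1e12) / weight_bytes

# A 70B-class model at FP8 (1 byte/param):
sn50_hbm_only = max_decode_tok_s(70, 1, 1.8)  # ~26 tok/s from 1.8 TB/s HBM2E
b200_hbm_only = max_decode_tok_s(70, 1, 8.0)  # ~114 tok/s from 8 TB/s HBM3e
print(f"SN50 HBM-only ceiling: {sn50_hbm_only:.0f} tok/s")
print(f"B200 HBM-only ceiling: {b200_hbm_only:.0f} tok/s")
```

The gap between that ~26 tok/s ceiling and the claimed 895 tok/s is the whole architectural bet: weight reuse out of the 432 MB SRAM tier plus sharding across many RDUs, not raw HBM bandwidth.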

The third tier is 256 GB to 2 TB of DDR5 attached to each RDU. This is unique among AI accelerators and is specifically designed for agentic workloads where multi-million-token context windows and persistent agent state need to be accessible without going off-chip to system memory or storage. No other inference chip offers this kind of local DRAM capacity per accelerator.
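
To see why a DRAM tier matters for the 10-million-token claim, consider KV-cache size. For a hypothetical 70B-class model with grouped-query attention - 80 layers, 8 KV heads, head dimension 128, FP8 cache, all assumed figures rather than anything SambaNova has published - the cache for a single long session is:

```python
def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_elem: int = 1) -> float:
    """KV-cache size in GB: keys + values for every layer, head, and token."""
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_elem / 1e9

# Assumed 70B-class GQA config; FP8 cache (1 byte per element).
ctx_10m = kv_cache_gb(10_000_000, layers=80, kv_heads=8, head_dim=128)
print(f"10M-token KV cache: {ctx_10m:.0f} GB")  # ~1638 GB
```

Roughly 1.6 TB for one 10M-token session - more than 25x the SN50's 64 GB of HBM, but within reach of the 2 TB DDR5 tier. That is the concrete case for tier three.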

Agentic AI Focus. SambaNova has positioned the SN50 specifically for agentic inference - workloads where multiple models collaborate, maintain long-running context, and call external tools in multi-step reasoning chains. The combination of the three-tier memory (keeping agent state local to the chip), the 10-million-token context window claim, and the scaleout to 10-trillion-parameter models is aimed squarely at the emerging pattern of coordinating multiple specialized models within a single inference pipeline.

The 256-RDU inference worker configuration enables serving very large models without the bandwidth penalties of cross-node communication that plague GPU clusters at similar scale. SambaNova's switched fabric interconnect runs at 2.2 TB/s bidirectional between chips, which is higher than the B200's 1,800 GB/s NVLink per GPU, though the topologies and use patterns differ enough that direct comparison isn't straightforward.

Software Stack

The SN50 runs on SambaNova's SambaFlow software stack, which includes a Python SDK, a dataflow compiler that maps models to the RDU hardware, and a runtime for execution management. SambaTune provides profiling and optimization tools.

SambaFlow supports major model architectures including transformer-based LLMs, mixture-of-experts models, and multi-modal architectures. The compiler handles the translation from standard PyTorch model definitions to dataflow graphs mapped onto the PCU/PMU fabric. SambaNova claims this compilation is automatic and doesn't require manual hardware-level optimization from the user.

The practical question is ecosystem maturity. CUDA has decades of tooling, debugging infrastructure, and community knowledge. SambaFlow is a proprietary stack with a much smaller user base. For organizations evaluating the SN50, the software stack's readiness for production workloads is as important as the hardware specifications.

Pricing and Availability

SambaNova hasn't disclosed per-chip or per-system pricing for the SN50. The company raised $350 million in a Series E round, with Intel participating as an investor. The partnership with Intel involves deploying Xeon CPUs with SN50 RDUs for host processing in inference and agentic workloads.

The first confirmed deployment is at SoftBank's data centers in Japan, where SambaRack systems (16 SN50 chips per rack, 15-30 kW, air-cooled) will be installed. General availability is expected in the second half of 2026.

The air-cooling design is worth noting. At 15-30 kW for a 16-chip rack, the per-chip share of rack power falls in the roughly 940 W to 1,875 W range - a figure that bundles host CPUs, networking, and other non-RDU components, so the chip's actual TDP is lower - but SambaNova hasn't broken out the per-chip number. That these systems are air-cooled at this power envelope is a deployment advantage over dense B200 racks, which at 1,000 W per GPU typically rely on liquid cooling in rack-scale configurations. Air-cooled systems can be installed in existing data center infrastructure without plumbing modifications.
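
The per-chip range above is simple division of the rack envelope:

```python
# SambaRack envelope divided evenly across chips. This overstates chip TDP,
# since the rack figure also covers host CPUs, networking, fans, etc.
chips_per_rack = 16
rack_w_low, rack_w_high = 15_000, 30_000  # watts

per_chip_low = rack_w_low / chips_per_rack    # 937.5 W
per_chip_high = rack_w_high / chips_per_rack  # 1875 W
print(f"Per-chip share: {per_chip_low:.0f}-{per_chip_high:.0f} W")
```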

Until pricing is public, rigorous cost comparisons with NVIDIA and other alternatives aren't possible, and SambaNova's claim of 8x TCO savings over the B200 remains unverifiable.

Strengths

  • Dataflow architecture removes GPU scheduling overhead - computation graphs are mapped to hardware at compile time
  • Three-tier memory with 432 MB SRAM, 64 GB HBM2E, and up to 2 TB DDR5 is uniquely suited for long-context agentic workloads
  • TSMC 3nm process is the most advanced node used by any AI inference ASIC currently announced
  • Air-cooled SambaRack design (15-30 kW) avoids the liquid cooling infrastructure required by B200 deployments
  • Massive scaleout to 32,768 RDUs enables serving models up to 10 trillion parameters (claimed)
  • 10 million token context window support (claimed) addresses a genuine gap in current inference infrastructure
  • 2.2 TB/s chip-to-chip switched fabric provides high-bandwidth inter-chip communication
  • Strong backing with $350M Series E and SoftBank as first deployment partner

Weaknesses

  • Zero independent benchmark validation - all performance claims come from SambaNova's own testing
  • Not shipping until H2 2026 - the B200 is available today, and next-generation NVIDIA hardware will likely be arriving by then
  • 64 GB HBM2E at 1.8 TB/s is a generation behind the B200's 192 GB HBM3e at 8 TB/s
  • SambaFlow is a proprietary software stack with a small ecosystem compared to CUDA
  • No disclosed pricing makes TCO analysis impossible for potential buyers
  • Per-chip TDP not disclosed - total system power suggests it may be comparable to or higher than B200
  • Limited deployment track record - SambaNova's previous-generation chips have a much smaller install base than NVIDIA or even Groq
  • Comparison benchmarks use vague baselines ("average GPU provider") rather than specific, reproducible configurations

About the author

James - AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.