SambaNova SN50 RDU - Agentic Inference Chip
Complete specs and analysis of SambaNova's SN50 RDU - a TSMC 3nm dataflow chip with 3.2 PFLOPS FP8, three-tier memory, and claimed 5x speed over NVIDIA B200.

TL;DR
- 5th-generation Reconfigurable Dataflow Unit on TSMC 3nm - dual-chiplet design with ~2,080 PCUs and ~2,080 PMUs
- 3.2 PFLOPS FP8 and 1.6 PFLOPS BF16 per chip - SambaNova claims 5x max speed and 3x efficiency over NVIDIA B200, but these numbers are unverified by independent benchmarks
- Three-tier memory hierarchy: 432 MB on-chip SRAM at hundreds of TB/s, 64 GB HBM2E at 1.8 TB/s, and up to 2 TB DDR5
- Scales to 256 RDUs per inference worker and 32,768 in scaleout - targets 10 trillion parameter models with 10 million token context
- First deployment at SoftBank data centers in Japan, backed by $350M Series E with Intel as investor
Overview
SambaNova's SN50 is the fifth-generation Reconfigurable Dataflow Unit, and it represents the company's most aggressive swing at NVIDIA's dominance in AI inference hardware. Built on TSMC's 3nm (N3) process with a dual-chiplet design, the SN50 roughly doubles the compute unit count of the previous SN40L while adding native FP8 support and a three-tier memory system designed specifically for the long-context, multi-step reasoning workloads that agentic AI demands. SambaNova claims the SN50 delivers 5x maximum throughput and 3x better energy efficiency over the NVIDIA B200, with 8x total cost of ownership savings.
Those are bold claims, and I need to state this upfront: none of them have been independently verified. Every performance number in this article comes from SambaNova's own benchmarks and marketing materials. There are no MLPerf submissions, no third-party inference benchmark results, and no independent lab validations. The SN50 isn't shipping yet - it's scheduled for the second half of 2026 - so outside verification is not possible now. I'm presenting SambaNova's claims as precisely that: claims. The underlying architecture is interesting enough to warrant serious attention, but the performance numbers should be treated with appropriate skepticism until independent data arrives.
What makes the SN50 worth watching is the dataflow architecture itself. Unlike GPUs, which use a general-purpose SIMT execution model, the RDU maps computation graphs directly to hardware at compile time. Data flows through the chip in a predetermined pattern, with each Pattern Compute Unit (PCU) and Pattern Memory Unit (PMU) executing its assigned portion of the workload without the scheduling overhead that comes with GPU thread management. This is a fundamentally different approach to inference, and the architectural bet is that removing runtime scheduling overhead pays off at scale - especially for the multi-model, long-context orchestration patterns that agentic AI workloads require.
Key Specifications
| Specification | Details |
|---|---|
| Manufacturer | SambaNova Systems |
| Architecture | RDU (Reconfigurable Dataflow Unit), 5th generation |
| Process Node | TSMC 3nm (N3) |
| Design | Dual-chiplet |
| Pattern Compute Units (PCUs) | ~2,080 (doubled from SN40L) |
| Pattern Memory Units (PMUs) | ~2,080 (doubled from SN40L) |
| Clock Speed | ~2.35 GHz (25% increase over SN40L) |
| On-Chip SRAM | 432 MB |
| On-Chip SRAM Bandwidth | Hundreds of TB/s |
| HBM | 64 GB HBM2E |
| HBM Bandwidth | 1,800 GB/s (1.8 TB/s) |
| DDR5 | 256 GB to 2 TB |
| FP8 Performance | ~3,200 TFLOPS (3.2 PFLOPS) |
| BF16/FP16 Performance | ~1,600 TFLOPS (1.6 PFLOPS) |
| Chip-to-Chip Interconnect | 2.2 TB/s bidirectional (switched fabric) |
| Max RDUs per Inference Worker | 256 |
| Max RDUs per Domain | 2,048 |
| Max RDUs in Scaleout | 32,768 |
| System Form Factor | SambaRack (16 SN50 chips) |
| System Power | 15-30 kW per SambaRack |
| Cooling | Air-cooled |
| TDP (per chip) | Not officially disclosed |
| Target Workload | Inference (agentic AI focus) |
| Release Date | 2026-H2 |
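The claimed figures in the table can be cross-checked against each other. Here is a quick back-of-envelope calculation, assuming the 3.2 PFLOPS FP8 figure covers both chiplets and counting each multiply-accumulate (MAC) as two FLOPs; none of these inputs are measured values.

```python
# How many FP8 operations must each PCU complete per clock cycle for the
# claimed specs to be internally consistent? All inputs are SambaNova's
# claimed figures, not measurements.

fp8_flops = 3.2e15   # claimed peak FP8 FLOPS per chip
num_pcus = 2080      # claimed Pattern Compute Units (both chiplets)
clock_hz = 2.35e9    # claimed clock speed

flops_per_pcu_per_cycle = fp8_flops / (num_pcus * clock_hz)
macs_per_pcu_per_cycle = flops_per_pcu_per_cycle / 2  # 1 MAC = 2 FLOPs

print(f"{flops_per_pcu_per_cycle:.0f} FLOPs/PCU/cycle")  # ~655
print(f"{macs_per_pcu_per_cycle:.0f} MACs/PCU/cycle")    # ~327
```

Roughly 327 MACs per PCU per cycle is a plausible size for a tiled systolic-style compute unit, so the headline FLOPS, unit count, and clock speed are at least mutually consistent.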
Performance Benchmarks (SambaNova Claims)
Important: All performance figures in this table are SambaNova's own benchmarks and haven't been independently verified.
| Metric | SambaNova SN50 (claimed) | NVIDIA B200 | Groq LPU |
|---|---|---|---|
| FP8 TFLOPS (per chip) | 3,200 | 4,500 (dense) / 9,000 (sparse) | ~750 |
| BF16/FP16 TFLOPS (per chip) | 1,600 | 2,250 (dense) / 4,500 (sparse) | ~188 |
| On-Chip SRAM | 432 MB | 96 MB L2 cache | 230 MB |
| HBM Capacity | 64 GB HBM2E | 192 GB HBM3e | None |
| HBM Bandwidth | 1,800 GB/s | 8,000 GB/s | N/A |
| Max Model Size (claimed) | 10T parameters | ~380B (single GPU, FP4) | Limited by SRAM |
| Max Context Length (claimed) | 10M tokens | Depends on KV cache | Limited by SRAM |
| Llama 3.3 70B (tok/s/user) | 895 (claimed) | 184 | ~500 |
| DeepSeek R1 671B (tok/s/user) | ~250 (claimed) | ~19 (avg. GPU provider) | N/A |
| Process Node | TSMC 3nm | TSMC 4NP | 14nm (GF) |
| TDP | Not disclosed | 1,000W | ~300W |
Several things stand out in this comparison. On raw FP8 TFLOPS, the B200 actually leads the SN50 - 4,500 dense (9,000 sparse) versus 3,200. SambaNova's throughput advantage, if real, comes not from raw FLOPS but from architectural efficiency: the dataflow execution model, the three-tier memory hierarchy, and the compiler's ability to keep data on-chip and minimize memory round-trips. The claim is that the RDU wastes fewer cycles on scheduling, memory management, and data movement than a GPU does, and that this architectural efficiency compounds into higher real-world throughput despite lower peak FLOPS.
The Llama 3.3 70B benchmark - 895 tok/s/user versus 184 tok/s/user on B200 - is a 4.9x advantage. If accurate, that is a significant result. But there are open questions: what batch size was used? What quantization precision? Was the B200 running TensorRT-LLM with the latest optimizations? SambaNova has not published the full methodology, which makes it impossible to reproduce or verify these numbers.
The DeepSeek R1 671B comparison is even harder to assess. The ~250 tok/s/user claim is compared against ~19 tok/s/user from an unnamed "average GPU provider" - not against a specific B200 configuration with a defined software stack. That is a marketing comparison, not a benchmark.
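For reference, the per-user decode speedups implied by the table work out as follows. The SN50 numbers are SambaNova's claims and the baselines are as quoted in their comparison, so these ratios inherit every caveat discussed above; note the DeepSeek ratio exceeds the headline 5x claim precisely because its baseline is the vague "average GPU provider" rather than a tuned B200.

```python
# Speedup ratios implied by the claimed per-user throughput table.
# All figures are SambaNova's claims, not independently verified numbers.

llama_70b_sn50, llama_70b_b200 = 895, 184
r1_671b_sn50, r1_671b_avg_gpu = 250, 19

print(f"Llama 3.3 70B: {llama_70b_sn50 / llama_70b_b200:.1f}x vs B200")       # 4.9x
print(f"DeepSeek R1 671B: {r1_671b_sn50 / r1_671b_avg_gpu:.1f}x vs avg GPU")  # 13.2x
```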
Key Capabilities
Dataflow Architecture. The RDU's defining characteristic is its dataflow execution model. In a GPU, the hardware dynamically schedules warps of threads across streaming multiprocessors, managing memory access patterns, cache coherency, and thread synchronization at runtime. The RDU removes this runtime overhead by mapping the entire computation graph to hardware at compile time through SambaNova's SambaFlow compiler.
Each PCU handles computation (matrix multiplications, activations, normalization) and each PMU manages local data storage and movement. The compiler determines exactly which data flows between which units at which clock cycle before execution begins. There's no dynamic scheduling, no cache miss penalties, and no thread divergence. This is similar in philosophy to Groq's LPU, which also uses static scheduling, but the RDU differs in a key way: it's reconfigurable. The dataflow graph can be reprogrammed for different models without hardware changes - compilation simply determines how the fixed pool of PCUs and PMUs is allocated to each model.
The trade-off is the same one every non-GPU architecture faces: the software stack must compensate for the loss of CUDA's enormous ecosystem. SambaFlow provides the compiler, runtime, and Python SDK, but it isn't CUDA, and the developer community and tooling ecosystem are orders of magnitude smaller.
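The compile-time mapping idea can be sketched in miniature. The toy below is purely illustrative and reflects nothing of SambaFlow's real internals: it topologically orders a dependency graph of ops and assigns each op to a unit once, up front, so that execution is just a replay of a fixed schedule with no runtime scheduler.

```python
# Toy illustration of compile-time dataflow mapping (not SambaFlow's actual
# algorithm). Ops are ordered and placed onto units before any data flows;
# execution then replays the fixed schedule, mirroring the RDU's
# static-mapping philosophy.

def compile_graph(graph, num_units):
    """Topologically order ops, then round-robin them onto units."""
    order, seen = [], set()

    def visit(op):
        if op in seen:
            return
        seen.add(op)
        for dep in graph[op]:   # place all inputs before the op itself
            visit(dep)
        order.append(op)

    for op in graph:
        visit(op)
    # Static placement: op -> unit is decided entirely at "compile time".
    return [(op, i % num_units) for i, op in enumerate(order)]

# Dependencies of a tiny transformer-ish block: op -> list of its inputs.
graph = {
    "embed": [],
    "attn": ["embed"],
    "mlp": ["attn"],
    "norm": ["mlp"],
}
schedule = compile_graph(graph, num_units=2)
print(schedule)  # [('embed', 0), ('attn', 1), ('mlp', 0), ('norm', 1)]
```

The real compiler must additionally solve placement, routing, and buffer sizing across thousands of PCUs and PMUs, which is where the engineering difficulty (and the software-maturity risk) actually lives.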
Three-Tier Memory Hierarchy. The SN50's most distinctive hardware feature is its three-level memory system, and this is where the agentic AI pitch gets concrete.
The first tier is 432 MB of on-chip SRAM distributed across the PMUs, running at hundreds of TB/s of aggregate bandwidth. This is where hot data lives - attention KV caches, intermediate activations, and frequently accessed model weights. For context, this is 4.5x the B200's 96 MB L2 cache and roughly double the Groq LPU's 230 MB SRAM, though the architectural roles aren't directly comparable.
The second tier is 64 GB of HBM2E at 1.8 TB/s. This holds model weights and larger working sets. Worth noting: SambaNova chose HBM2E rather than HBM3e - a generation behind what the B200 uses. The 1.8 TB/s bandwidth is less than a quarter of the B200's 8 TB/s. SambaNova's argument is that the 432 MB SRAM tier absorbs enough memory traffic to reduce pressure on HBM, making the slower HBM acceptable in practice. Whether that holds up across varied workloads remains to be proven.
The third tier is 256 GB to 2 TB of DDR5 attached to each RDU. This is rare among AI accelerators and is specifically designed for agentic workloads where multi-million-token context windows and persistent agent state need to stay local to the accelerator rather than requiring a round trip to host system memory or storage. No other inference chip vendor currently offers this kind of local DRAM capacity per accelerator.
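To see why the DDR5 tier matters, consider the KV cache for a 10-million-token context. The model configuration below is a hypothetical 70B-class transformer (80 layers, 8 KV heads via grouped-query attention, head dimension 128, FP8 cache) chosen for illustration; it is not a SambaNova-published configuration.

```python
# Rough KV-cache sizing for a hypothetical 70B-class model at 10M tokens.
# Model shape (layers, heads, head_dim) is an illustrative assumption.

layers, kv_heads, head_dim = 80, 8, 128
bytes_per_elem = 1              # FP8 cache = 1 byte per element
tokens = 10_000_000

bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
total_tb = bytes_per_token * tokens / 1e12

print(f"{bytes_per_token / 1024:.0f} KiB per token")  # 160 KiB
print(f"{total_tb:.2f} TB for 10M tokens")            # ~1.64 TB
```

A cache of roughly 1.6 TB overflows the 64 GB HBM tier by more than 25x but fits within the 2 TB DDR5 tier, which is the concrete capacity argument behind the three-level design.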
Agentic AI Focus. SambaNova has positioned the SN50 specifically for agentic inference - workloads where multiple models collaborate, maintain long-running context, and call external tools in multi-step reasoning chains. The combination of the three-tier memory (keeping agent state local to the chip), the 10-million-token context window claim, and the scaleout to 10-trillion-parameter models is aimed squarely at the emerging pattern of coordinating multiple specialized models within a single inference pipeline.
The 256-RDU inference worker configuration enables serving very large models without the bandwidth penalties of cross-node communication that plague GPU clusters at similar scale. SambaNova's switched fabric interconnect runs at 2.2 TB/s bidirectional between chips, which is higher than the B200's 1,800 GB/s NVLink per GPU, though the topologies and use patterns differ enough that direct comparison isn't straightforward.
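On capacity alone, the 10-trillion-parameter claim is at least arithmetically plausible for a 256-RDU worker. The sketch below assumes FP8 weights (1 byte per parameter) and the per-RDU memory figures from the spec table, and ignores activations, KV cache, and any weight replication.

```python
# Capacity sanity check for a 256-RDU inference worker serving a
# 10T-parameter model. Assumes FP8 weights; ignores activations, KV cache,
# and replication overhead.

params = 10e12
bytes_per_param = 1   # FP8
rdus = 256

weights_tb = params * bytes_per_param / 1e12   # 10 TB of weights
hbm_tb = rdus * 64e9 / 1e12                    # ~16.4 TB aggregate HBM
ddr_tb = rdus * 2e12 / 1e12                    # 512 TB max aggregate DDR5

print(f"weights: {weights_tb:.0f} TB, HBM: {hbm_tb:.2f} TB, DDR5: {ddr_tb:.0f} TB")
```

The weights fit in aggregate HBM with headroom, and the DDR5 tier dwarfs them. Whether the interconnect and HBM bandwidth sustain useful throughput at that scale is the open question the capacity math cannot answer.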
Software Stack
The SN50 runs on SambaNova's SambaFlow software stack, which includes a Python SDK, a dataflow compiler that maps models to the RDU hardware, and a runtime for execution management. SambaTune provides profiling and optimization tools.
SambaFlow supports major model architectures including transformer-based LLMs, mixture-of-experts models, and multi-modal architectures. The compiler handles the translation from standard PyTorch model definitions to dataflow graphs mapped onto the PCU/PMU fabric. SambaNova claims this compilation is automatic and doesn't require manual hardware-level optimization from the user.
The practical question is ecosystem maturity. CUDA has decades of tooling, debugging infrastructure, and community knowledge. SambaFlow is a proprietary stack with a much smaller user base. For organizations evaluating the SN50, the software stack's readiness for production workloads is as important as the hardware specifications.
Pricing and Availability
SambaNova hasn't disclosed per-chip or per-system pricing for the SN50. The company raised $350 million in a Series E round, with Intel participating as an investor. The partnership with Intel involves deploying Xeon CPUs with SN50 RDUs for host processing in inference and agentic workloads.
The first confirmed deployment is at SoftBank's data centers in Japan, where SambaRack systems (16 SN50 chips per rack, 15-30 kW, air-cooled) will be installed. General availability is expected in the second half of 2026.
The air-cooling design is worth noting. At 15-30 kW for a 16-chip rack, the per-chip power draw falls in the roughly 940W to 1,875W range - an upper bound that includes non-RDU components such as host CPUs, memory, and networking - but SambaNova hasn't broken out the per-chip TDP. The fact that these systems are air-cooled at this power envelope is a deployment advantage over the B200, which requires liquid cooling at 1,000W per GPU. Air-cooled systems can be installed in existing data center infrastructure without plumbing modifications.
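The per-chip band follows directly from dividing the rack figures by the chip count; it is an upper bound on average RDU draw, since the rack budget also covers host CPUs, DDR5, fans, and networking.

```python
# Per-chip power bound derived from the published SambaRack figures.
# This is an upper bound: non-RDU rack components share the same budget.

rack_power_w = (15_000, 30_000)  # published rack envelope
chips_per_rack = 16

low, high = (p / chips_per_rack for p in rack_power_w)
print(f"{low:.0f}-{high:.0f} W per chip, upper bound")  # 938-1875 W
```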
Until pricing is public, rigorous cost comparisons with NVIDIA and other alternatives aren't possible, and SambaNova's claim of 8x TCO savings over the B200 remains unverifiable.
Strengths
- Dataflow architecture removes GPU scheduling overhead - computation graphs are mapped to hardware at compile time
- Three-tier memory with 432 MB SRAM, 64 GB HBM2E, and up to 2 TB DDR5 is uniquely suited for long-context agentic workloads
- TSMC 3nm process is the most advanced node used by any AI inference ASIC currently announced
- Air-cooled SambaRack design (15-30 kW) avoids the liquid cooling infrastructure required by B200 deployments
- Massive scaleout to 32,768 RDUs enables serving models up to 10 trillion parameters (claimed)
- 10 million token context window support (claimed) addresses a genuine gap in current inference infrastructure
- 2.2 TB/s chip-to-chip switched fabric provides high-bandwidth inter-chip communication
- Strong backing with $350M Series E and SoftBank as first deployment partner
Weaknesses
- Zero independent benchmark validation - all performance claims come from SambaNova's own testing
- Not shipping until H2 2026 - the B200 is available now, and next-generation NVIDIA hardware will likely be shipping by then
- 64 GB HBM2E at 1.8 TB/s is a generation behind the B200's 192 GB HBM3e at 8 TB/s
- SambaFlow is a proprietary software stack with a small ecosystem compared to CUDA
- No disclosed pricing makes TCO analysis impossible for potential buyers
- Per-chip TDP not disclosed - total system power suggests it may be comparable to or higher than the B200's 1,000W
- Limited deployment track record - SambaNova's previous-generation chips have a much smaller install base than NVIDIA or even Groq
- Comparison benchmarks use vague baselines ("average GPU provider") rather than specific, reproducible configurations
Related Coverage
- NVIDIA B200 - Blackwell Flagship GPU - The incumbent SambaNova is benchmarking against, with 9,000 TFLOPS sparse FP8 and 192GB HBM3e
- Groq LPU - Deterministic Inference at Scale - Another non-GPU inference ASIC using static scheduling and on-chip SRAM
- NVIDIA H100 SXM - The AI Training Benchmark - The previous-generation GPU still widely used for inference
