Tenstorrent Blackhole p150a - RISC-V AI Card

Complete specs, benchmarks, and analysis of the Tenstorrent Blackhole p150a - the $1,399 PCIe AI accelerator with 120 Tensix cores, 768 RISC-V processors, 32GB GDDR6, and fully open-source software.

TL;DR

  • $1,399 PCIe AI accelerator with 120 Tensix cores and 768 RISC-V processors on TSMC 6nm - the first AI card where every core runs open-source RISC-V
  • 664 TFLOPS BLOCKFP8 and 332 TFLOPS BF16 peak compute, backed by 32GB of GDDR6 at 512 GB/s
  • Fully open-source software stack - TT-Metalium, TT-NN, and TT-Forge published under Apache 2.0
  • 4x QSFP-DD 800G Ethernet ports built into the card for direct chip-to-chip networking without external switches
  • Budget variant p100a at $999 with 28GB and no QSFP-DD; QuietBox workstation with 4x p150c cards at $11,999

Overview

Tenstorrent's Blackhole p150a is the most unusual AI accelerator you can actually buy today. In a market where every serious competitor uses proprietary ISAs, proprietary interconnects, and proprietary software stacks, Tenstorrent has built a high-performance AI card around RISC-V cores with a fully open-source software stack. And they are selling it for $1,399 - roughly the price of an NVIDIA RTX 4090 and a fraction of what any datacenter-class AI accelerator costs.

The Blackhole chip is Tenstorrent's third-generation AI silicon, following Grayskull and Wormhole. It's built on TSMC's 6nm process with about 600mm^2 of die area. The core compute comes from 120 Tensix cores - Tenstorrent's custom compute tile that pairs a matrix engine (for tensor operations) with a vector engine (for element-wise operations) and local SRAM. Each Tensix core contains multiple RISC-V processors (called "Baby RISC-V" cores) that control the compute pipelines. Across the full chip, there are 768 RISC-V processors, making the Blackhole one of the largest RISC-V implementations in any shipping product.

The p150a card ships with 32GB of GDDR6 at 512 GB/s, 300W TDP, and a PCIe 5.0 x16 host interface. What makes it unusual among AI accelerators is the built-in networking: 4x QSFP-DD ports supporting 800G Ethernet each, giving the card 3.2 Tbps of direct chip-to-chip bandwidth without requiring an external network switch. For multi-card deployments, this means you can daisy-chain Blackhole cards directly, avoiding the cost and complexity of InfiniBand or RoCE switching infrastructure.
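The aggregate networking figure is line-rate arithmetic. As a quick check, the sketch below recomputes it from the port count and per-port speed quoted above and converts it to bytes for comparison with the card's 512 GB/s memory bandwidth. It ignores Ethernet protocol overhead, so the result is a ceiling rather than achievable throughput.

```python
# Line-rate arithmetic for the p150a's built-in Ethernet, using the figures quoted above.
PORTS = 4
GBIT_PER_PORT = 800        # Gb/s per QSFP-DD port
MEMORY_BW_GBYTE = 512      # GB/s of GDDR6 bandwidth

total_gbit = PORTS * GBIT_PER_PORT            # aggregate line rate in Gb/s
total_tbit = total_gbit / 1000                # same figure in Tb/s
total_gbyte = total_gbit / 8                  # bits -> bytes, still at line rate
link_to_memory = total_gbyte / MEMORY_BW_GBYTE

print(f"Aggregate link rate: {total_tbit} Tbps = {total_gbyte:.0f} GB/s")
print(f"Link bandwidth as a share of memory bandwidth: {link_to_memory:.0%}")
```

At line rate, the four ports together move about 400 GB/s, roughly 78% of what the card can read from its own GDDR6.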

Tenstorrent is led by Jim Keller, the chip architect known for his work on AMD's Zen architecture, Apple's A-series processors, and Tesla's FSD chip. Keller joined Tenstorrent as CTO in 2021, became CEO in 2023, and has been vocal about his belief that AI hardware needs to break free from the CUDA monoculture. The Blackhole is the first product that fully reflects his vision: RISC-V everywhere, open-source everything, and a price point that makes AI hardware accessible to researchers and startups, not just hyperscalers.

Key Specifications

Specification | Details
Manufacturer | Tenstorrent
Architecture | Blackhole (3rd generation)
Process Node | TSMC 6nm
Die Area | ~600 mm^2
Tensix Cores | 120
RISC-V Processors | 768 (Baby RISC-V)
BLOCKFP8 Compute | 664 TFLOPS
BF16 Compute | 332 TFLOPS
FP32 Compute | 83 TFLOPS
Memory | 32 GB GDDR6
Memory Bandwidth | 512 GB/s
Memory Channels | 8
Host Interface | PCIe 5.0 x16
Networking | 4x QSFP-DD 800G Ethernet (3.2 Tbps total)
TDP | 300W
Form Factor | Full-height, dual-slot PCIe card
Cooling | Active fan (p150a) or passive (p150c for QuietBox)
Price | $1,399

Tensix Core Architecture

Each Tensix core is a self-contained compute tile with:

  • Matrix Engine: Handles tensor (matrix multiplication) operations. Supports BLOCKFP8, BF16, FP32, and INT8 precision formats.
  • Vector Engine: Handles element-wise operations, activations, normalization, and other non-GEMM compute. Programmable via RISC-V cores.
  • Local SRAM: Each Tensix core has dedicated SRAM for intermediate data, reducing the need to round-trip to GDDR6 for temporary values.
  • RISC-V Control Processors: Multiple Baby RISC-V cores per Tensix tile manage data movement, scheduling, and compute dispatch.

The Tensix architecture is closer to a network-on-chip design than a traditional GPU. Each core operates semi-independently, with data flowing between cores through an on-chip mesh interconnect. This design enables high utilization for dataflow-style computation (like transformer inference), where different layers of the model can be mapped to different Tensix cores and data streams through the chip without global synchronization.
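The dataflow idea can be illustrated with a toy analogy in plain Python. This is not TT-Metalium code and says nothing about how Tenstorrent actually schedules work; the stage functions, tile shapes, and `run_pipeline` helper are all hypothetical. Each generator stands in for a core that consumes a tile as soon as the previous stage produces it, with no barrier between layers.

```python
from typing import Callable, Iterable, Iterator, List

Tile = List[float]

def stage(fn: Callable[[Tile], Tile], upstream: Iterable[Tile]) -> Iterator[Tile]:
    """One pipeline stage: process each tile as soon as upstream yields it."""
    for tile in upstream:
        yield fn(tile)

# Hypothetical per-layer operations, each imagined as living on its own core.
def scale(tile: Tile) -> Tile:
    return [2.0 * x for x in tile]

def bias(tile: Tile) -> Tile:
    return [x + 1.0 for x in tile]

def relu(tile: Tile) -> Tile:
    return [max(0.0, x) for x in tile]

def run_pipeline(tiles: Iterable[Tile], layers: List[Callable[[Tile], Tile]]) -> List[Tile]:
    """Chain the stages so tiles stream through without a global sync point."""
    stream: Iterator[Tile] = iter(tiles)
    for layer in layers:
        stream = stage(layer, stream)
    return list(stream)

result = run_pipeline([[-1.0, 0.5], [2.0, -3.0]], [scale, bias, relu])
print(result)
```

Because the stages are lazy, the first tile reaches the last stage before the second tile has entered the first one, which is the property the mesh design is meant to exploit.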

Performance Benchmarks

Metric | p150a (Blackhole) | NVIDIA RTX 4090 | NVIDIA H100 SXM
FP8 Compute | 664 TFLOPS (BLOCKFP8) | 660 TFLOPS (FP8) | 1,979 TFLOPS (FP8, dense)
BF16 Compute | 332 TFLOPS | 330 TFLOPS | 990 TFLOPS
Memory | 32 GB GDDR6 | 24 GB GDDR6X | 80 GB HBM3
Memory Bandwidth | 512 GB/s | 1,008 GB/s | 3,350 GB/s
TDP | 300W | 450W | 700W
Price | $1,399 | ~$1,600 | ~$25,000
Networking (built-in) | 3.2 Tbps | None | None
Software | TT-Forge (open source) | CUDA | CUDA

Raw Compute Context

On paper, the Blackhole p150a matches the RTX 4090 in FP8 and BF16 TFLOPS at a slightly lower price and clearly lower power draw (300W vs 450W). Against the H100, it delivers roughly one-third the FP8 compute at one-eighteenth the price.

The real question is utilization. NVIDIA's CUDA ecosystem has nearly two decades of kernel optimization behind it, and frameworks like TensorRT-LLM extract near-peak performance from NVIDIA hardware. Tenstorrent's software stack is younger and less optimized. Early community benchmarks suggest the Blackhole reaches 40-60% of its theoretical TFLOPS on real LLM workloads, compared to 60-80% for NVIDIA GPUs with mature software. This gap is real but narrowing as Tenstorrent's open-source community contributes optimizations.
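To see what those utilization ranges imply, the sketch below multiplies each card's peak FP8 figure by the quoted range. Applying the general "NVIDIA GPUs" range to the RTX 4090 specifically is an assumption made here for illustration, and the helper name is hypothetical.

```python
# Effective throughput = peak TFLOPS x utilization, using the ranges quoted above.
def effective_tflops(peak_tflops, utilization_range):
    """Return the (low, high) effective TFLOPS for a peak figure and a utilization range."""
    low, high = utilization_range
    return peak_tflops * low, peak_tflops * high

cards = {
    "p150a (BLOCKFP8)": (664, (0.40, 0.60)),
    "RTX 4090 (FP8)": (660, (0.60, 0.80)),   # assumption: generic NVIDIA range applied to the 4090
}

effective = {name: effective_tflops(peak, util) for name, (peak, util) in cards.items()}
for name, (low, high) in effective.items():
    print(f"{name}: {low:.0f}-{high:.0f} effective TFLOPS")
```

Under these assumptions, the p150a's best case (about 398 TFLOPS) roughly matches the RTX 4090's worst case (about 396 TFLOPS), so the paper parity in peak TFLOPS does not carry over to delivered throughput.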

Networking Advantage

The built-in 3.2 Tbps of Ethernet networking is the p150a's most underappreciated feature. In a multi-GPU training or inference setup, networking is usually the most expensive component after the GPUs themselves. An InfiniBand switch for 8 GPUs can cost $10,000-50,000 depending on the configuration. With the p150a, you connect cards directly via QSFP-DD cables - no switches, no InfiniBand, no additional cost beyond the cables themselves.

For a 4-card setup, the total cost is roughly $5,600 for cards plus $200-400 for cables, versus $6,400+ for four RTX 4090s plus $500-2,000 for networking. For larger deployments, the networking cost savings scale proportionally.
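The 4-card comparison can be reproduced from the list prices and the cable and networking ranges quoted above. This is a rough sketch: it covers only cards plus interconnect and leaves out the host system, power, and cooling.

```python
# 4-card build cost: cards plus interconnect, using the price ranges quoted above.
CARDS = 4

def cluster_cost(card_price, extras_range, cards=CARDS):
    """Return (low, high) total cost for `cards` cards plus a range of interconnect costs."""
    low, high = extras_range
    base = card_price * cards
    return base + low, base + high

p150a_total = cluster_cost(1399, (200, 400))       # extras: QSFP-DD cables
rtx4090_total = cluster_cost(1600, (500, 2000))    # extras: external networking

print(f"4x p150a:    ${p150a_total[0]:,}-${p150a_total[1]:,}")
print(f"4x RTX 4090: ${rtx4090_total[0]:,}-${rtx4090_total[1]:,}")
print(f"Savings:     ${rtx4090_total[0] - p150a_total[1]:,}-${rtx4090_total[1] - p150a_total[0]:,}")
```

With these inputs the p150a build comes to $5,796-$5,996 against $6,900-$8,400 for the RTX 4090 build.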

Key Capabilities

Fully Open-Source Software Stack. Tenstorrent's entire software stack is published under Apache 2.0:

  • TT-Metalium: Low-level runtime and kernel library. Direct access to the Tensix cores, memory management, and data movement primitives. Equivalent to CUDA's driver API and PTX assembly.
  • TT-NN: Neural network library with optimized operators for common AI operations (attention, linear, convolution, normalization). Equivalent to cuDNN.
  • TT-Forge: High-level model compiler that ingests PyTorch, ONNX, and other frameworks and compiles them to Blackhole executables. Equivalent to TensorRT.

The open-source approach means anyone can inspect, modify, and optimize the entire stack from framework integration down to individual kernel implementations. This is a fundamentally different model from NVIDIA's closed-source CUDA ecosystem, and it has attracted a growing community of contributors - particularly from the RISC-V and open-hardware communities.

RISC-V Throughout. Every processor on the Blackhole - the 768 Baby RISC-V cores, the management processors, the network controllers - runs the RISC-V instruction set. This matters for two reasons. First, RISC-V is an open ISA with a large ecosystem of compilers, debuggers, and profiling tools. Second, it means Tenstorrent avoids licensing fees to ARM, x86, or any other ISA vendor, keeping the bill of materials lower.

BLOCKFP8 Precision. Tenstorrent uses the BLOCKFP8 format rather than standard FP8 (E4M3/E5M2). BLOCKFP8 groups elements into blocks that share a common exponent, providing better dynamic range coverage for neural network weights and activations than per-element FP8. The 664 TFLOPS figure represents the peak BLOCKFP8 throughput of the matrix engines across all 120 Tensix cores.
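The shared-exponent idea can be sketched in a few lines of Python. This is a generic block floating-point illustration, not Tenstorrent's implementation: the 7-bit mantissa width, the block contents, and the function names are assumptions chosen for the example, and the real BLOCKFP8 layout may differ.

```python
import math

def quantize_block(values, mantissa_bits=7):
    """Quantize a block to one shared exponent plus a signed integer mantissa per element."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return 0, [0] * len(values)
    # frexp gives max_abs = m * 2**e with 0.5 <= m < 1; e becomes the shared exponent.
    _, shared_exp = math.frexp(max_abs)
    scale = 2.0 ** (shared_exp - mantissa_bits)
    limit = (1 << mantissa_bits) - 1   # clamp so the mantissa fits in mantissa_bits
    mantissas = [max(-limit, min(limit, round(v / scale))) for v in values]
    return shared_exp, mantissas

def dequantize_block(shared_exp, mantissas, mantissa_bits=7):
    """Rebuild approximate floats from the shared exponent and per-element mantissas."""
    scale = 2.0 ** (shared_exp - mantissa_bits)
    return [m * scale for m in mantissas]

block = [0.5, -0.25, 0.125, 0.001]
shared_exp, mantissas = quantize_block(block)
restored = dequantize_block(shared_exp, mantissas)
print(shared_exp, mantissas, restored)
```

In this example the three larger values survive exactly while 0.001 rounds to zero: elements far smaller than the block's largest value lose precision, which is why block formats work best when values within a block have similar magnitudes.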

Pricing and Availability

The Blackhole lineup launched in early 2025 with three configurations:

Product | Price | Memory | Networking | Cooling | Notes
p150a | $1,399 | 32 GB GDDR6 | 4x QSFP-DD 800G | Active fan | Flagship PCIe card
p100a | $999 | 28 GB GDDR6 | None | Active fan | Budget variant, no networking
QuietBox | $11,999 | 4x p150c cards | - | Passive cards in workstation chassis | 4-card workstation system

The p150a is available direct from Tenstorrent's website and through select channel partners. Lead times have varied from immediate availability to 2-4 week backorders depending on demand cycles.

The Core Count Controversy

The Blackhole chip was originally announced with 140 Tensix cores, but shipped with 120 enabled. Tenstorrent has stated this is a yield optimization decision - disabling 20 cores from the ~600mm^2 die on a 6nm process significantly improves manufacturing yield and allows more chips per wafer to pass quality validation. The 664 TFLOPS specification is based on the 120-core configuration, so the shipped performance matches the specification.

The community reaction was mixed. Some viewed the 14% core reduction as a bait-and-switch; others recognized it as standard practice in the semiconductor industry (NVIDIA has done this with many GPU launches, including the RTX 4090 which ships with 128 of 144 SMs enabled). The key question is whether the 120-core performance meets workload requirements at the $1,399 price point, and for most use cases, it does.
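For reference, the disabled share in each case works out as follows. The helper name is hypothetical; the unit counts are the ones quoted above.

```python
def disabled_share(total_units, enabled_units):
    """Fraction of compute units fused off relative to the full die."""
    return (total_units - enabled_units) / total_units

blackhole = disabled_share(140, 120)   # Tensix cores
rtx4090 = disabled_share(144, 128)     # streaming multiprocessors

print(f"Blackhole: {blackhole:.1%} disabled, RTX 4090: {rtx4090:.1%} disabled")
```

The Blackhole's 14.3% is somewhat higher than the RTX 4090's 11.1%, but in the same range.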

Cost-Per-TFLOPS Analysis

Metric | p150a | RTX 4090 | H100 SXM
Price | $1,399 | ~$1,600 | ~$25,000
FP8 TFLOPS | 664 | 660 | 1,979
$/TFLOPS (FP8) | $2.11 | $2.42 | $12.63
Memory (GB) | 32 | 24 | 80
$/GB Memory | $43.72 | $66.67 | $312.50

On a pure dollar-per-TFLOPS basis, the p150a is competitive with the RTX 4090 and dramatically cheaper than the H100. The 32GB of memory at $43.72/GB is also the best value in the comparison. The caveat remains software maturity - reaching those TFLOPS on real workloads depends on the quality of kernel implementations in TT-NN.
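The ratios in the table follow directly from price, peak FP8 TFLOPS, and memory capacity. A minimal sketch that recomputes them, using the approximate street prices quoted above as inputs:

```python
# Recompute $/TFLOPS and $/GB from the prices and specs quoted above.
cards = {
    "p150a": {"price": 1399, "fp8_tflops": 664, "memory_gb": 32},
    "RTX 4090": {"price": 1600, "fp8_tflops": 660, "memory_gb": 24},
    "H100 SXM": {"price": 25000, "fp8_tflops": 1979, "memory_gb": 80},
}

def cost_metrics(spec):
    """Return (dollars per peak FP8 TFLOPS, dollars per GB of memory), rounded to cents."""
    per_tflops = round(spec["price"] / spec["fp8_tflops"], 2)
    per_gb = round(spec["price"] / spec["memory_gb"], 2)
    return per_tflops, per_gb

metrics = {name: cost_metrics(spec) for name, spec in cards.items()}
for name, (per_tflops, per_gb) in metrics.items():
    print(f"{name}: ${per_tflops:.2f}/TFLOPS, ${per_gb:.2f}/GB")
```

These are peak figures; dividing `fp8_tflops` by an expected utilization before computing the ratio gives a cost per delivered TFLOPS instead.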

Strengths

  • $1,399 price point makes datacenter-class AI compute accessible to individual researchers, startups, and universities
  • Fully open-source software stack (Apache 2.0) enables community optimization and removes vendor lock-in
  • Built-in 3.2 Tbps Ethernet networking eliminates the need for expensive InfiniBand switches in multi-card deployments
  • 768 RISC-V processors make this one of the largest RISC-V implementations in any shipping product
  • 300W TDP is power-efficient relative to the compute delivered - fits in standard PCIe server chassis
  • 32GB GDDR6 is more memory than the RTX 4090's 24GB at a lower price
  • Jim Keller's leadership brings credibility and a track record of successful chip architectures
  • BLOCKFP8 precision format provides better dynamic range than standard FP8 for neural network workloads

Weaknesses

  • Software stack maturity lags NVIDIA's CUDA ecosystem by years - real-world utilization is 40-60% of peak vs 60-80% on NVIDIA GPUs
  • 512 GB/s memory bandwidth is half the RTX 4090's 1,008 GB/s - bandwidth-bound workloads (LLM decode) will underperform
  • Core count reduction from 140 to 120 Tensix cores raised community trust concerns
  • TSMC 6nm is one to two process generations behind the 4nm/3nm silicon used by NVIDIA B200 and AWS Trainium3
  • Limited model coverage - not all popular models have optimized TT-NN kernels yet
  • GDDR6 (not HBM) limits bandwidth-to-compute ratio for memory-bound workloads
  • No equivalent of TensorRT-LLM's advanced serving features (continuous batching, speculative decoding) in TT-Forge yet
  • Small company risk - Tenstorrent is privately funded and competing against NVIDIA, AMD, and cloud providers with vastly more resources
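The memory-bandwidth weakness in the list above can be made concrete with a common rule of thumb: in single-stream LLM decode, every generated token has to read all model weights once, so memory bandwidth divided by weight size bounds tokens per second. The 8B-parameter model and 8-bit weights below are hypothetical inputs chosen for illustration, and the estimate ignores KV-cache traffic and software overhead.

```python
def decode_tokens_per_second(bandwidth_gb_s, params_billions, bytes_per_param):
    """Upper bound on batch-1 decode speed when each token reads all weights once."""
    weight_gb = params_billions * bytes_per_param
    return bandwidth_gb_s / weight_gb

MODEL_PARAMS_B = 8     # hypothetical 8B-parameter model
BYTES_PER_PARAM = 1    # assumes 8-bit weights

p150a_bound = decode_tokens_per_second(512, MODEL_PARAMS_B, BYTES_PER_PARAM)
rtx4090_bound = decode_tokens_per_second(1008, MODEL_PARAMS_B, BYTES_PER_PARAM)

print(f"p150a:    at most {p150a_bound:.0f} tokens/s")
print(f"RTX 4090: at most {rtx4090_bound:.0f} tokens/s")
```

Under these assumptions the p150a tops out near 64 tokens/s against roughly 126 tokens/s for the RTX 4090, regardless of how many TFLOPS either card has on paper.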

About the author

AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.