d-Matrix's Corsair inference accelerator entered volume production in June 2026, making it one of the few non-NVIDIA inference ASICs to reach shipping hardware this year. The chip takes a fundamentally different architectural path from GPU-based alternatives: computation happens where the data already lives, inside SRAM, rather than hauling weights back and forth across a memory bus.

TL;DR

SRAM-based in-memory compute ASIC on TSMC 6nm, in volume production as of June 2026 with Microsoft backing
150 TB/s effective memory bandwidth per card via 2GB integrated SRAM - addresses the decode bottleneck that limits GPU-based LLM inference
Independent testing by Gimlet Labs showed 12x latency improvement when paired with GPUs vs GPU-only inference
Targets decode-phase offload paired with existing GPU prefill hardware, not a full GPU replacement

Overview

Corsair is built around a single core insight: LLM inference during the decode phase is starved for memory bandwidth, not compute. A traditional GPU spends most of its inference time loading model weights from HBM into the compute cores on every token step. Corsair sidesteps that by embedding computation directly inside SRAM, where the data already sits. The result is effective bandwidth measured in petabytes per second per rack - orders of magnitude beyond what HBM can deliver.
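To make the bandwidth argument concrete, here is a minimal back-of-envelope sketch. It assumes batch size 1, that every weight byte is read once per decode step, and that bandwidth is the only limit. The 3.35 TB/s HBM figure is an assumption for an H100-class GPU and does not come from d-Matrix; the 150 TB/s figure is the per-card SRAM number quoted in this article.

```python
TB = 1e12  # bytes
GB = 1e9   # bytes


def stream_time_ms(num_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time to read num_bytes once at the given bandwidth, in milliseconds."""
    return num_bytes / bandwidth_bytes_per_s * 1e3


HBM_BW = 3.35 * TB   # assumption: H100-class HBM3 bandwidth, not from this article
SRAM_BW = 150 * TB   # per-card SRAM bandwidth quoted in this article

hbm_ms = stream_time_ms(1 * GB, HBM_BW)
sram_ms = stream_time_ms(1 * GB, SRAM_BW)

print(f"Read 1 GB of weights from HBM:  {hbm_ms:.4f} ms")
print(f"Read 1 GB of weights from SRAM: {sram_ms:.4f} ms")
print(f"Bandwidth-bound speedup: {hbm_ms / sram_ms:.1f}x")
```

Real decode steps also depend on batch size, KV-cache reads, and interconnect overhead, so the ratio is best read as a ceiling on what bandwidth alone can buy.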

d-Matrix positions Corsair not as a GPU replacement but as a decode-phase accelerator that pairs with existing GPU hardware. The GPU handles prefill (processing the input prompt, which is compute-intensive), while Corsair takes over token-by-token decode (which is memory-bandwidth-intensive). In independent testing by Gimlet Labs in March 2026, this disaggregated configuration reduced response latency from 24 seconds to under 2 seconds for a representative LLM workload, a roughly 12x improvement.

The chip is backed by Microsoft and ships with a full rack-scale reference design called SquadRack, built in partnership with Arista, Broadcom, and Supermicro. Software support runs through d-Matrix's own Aviator stack, and scale-out networking through JetStream 400G Ethernet. The platform is air-cooled throughout, with no liquid cooling required, which matters for data center operators dealing with retrofit constraints.

[Figure: d-Matrix Corsair platform architecture showing DIMC chiplet layout and memory hierarchy]
The Corsair platform uses Digital In-Memory Compute (DIMC) chiplets arranged to keep computation close to SRAM, eliminating the bandwidth wall that limits GPU decode throughput. Source: servethehome.com

Key Specifications

Specification	Details
Manufacturer	d-Matrix
Product Family	Corsair
Chip Type	ASIC (DIMC - Digital In-Memory Compute)
Process Node	TSMC N6 (6nm)
Chiplet Config	2 chips × 4 chiplets per card (8 chiplets total)
Performance Memory	2GB SRAM @ 150 TB/s per card
Capacity Memory	Up to 256GB LPDDR5X per card
Peak Compute	2,400 TFLOPs FP8/MXINT8 per card; 19,200 TFLOPs MXINT4 (dual card)
Card Interface	PCIe 5.0 x16
TDP	275W at 800MHz; 550W at 1.2GHz
Power Efficiency	38 TOPS/Watt
Cooling	Air-cooled
Max Cards per Server	8
Scale-out Networking	JetStream 400G Ethernet
Software	Aviator runtime
Release Date	June 2026 (volume production)

In a dual-card configuration, peak compute reaches 4,800 TFLOPs MXINT8 and 19,200 TFLOPs MXINT4 with 300 TB/s of combined bandwidth. At rack scale with 64 cards, the platform delivers 9.6 PB/s of effective bandwidth - a figure that conventional HBM-based accelerators can't match at any price point, since HBM bandwidth scales with chip count rather than with SRAM density.
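The dual-card and rack-scale totals follow from multiplying the per-card figures. A minimal sketch of that arithmetic, assuming perfect linear scaling with no interconnect losses (an idealization, not a measured result):

```python
# Per-card figures from the specification table above.
CARD_BANDWIDTH_TBPS = 150
CARD_MXINT8_TFLOPS = 2_400


def aggregate(cards: int) -> dict:
    """Sum per-card peaks across a deployment, assuming perfect linear scaling."""
    return {
        "cards": cards,
        "bandwidth_tbps": cards * CARD_BANDWIDTH_TBPS,
        "mxint8_tflops": cards * CARD_MXINT8_TFLOPS,
    }


for label, cards in [("dual card", 2), ("8-card server", 8), ("64-card rack", 64)]:
    total = aggregate(cards)
    print(
        f"{label}: {total['bandwidth_tbps'] / 1000:.1f} PB/s, "
        f"{total['mxint8_tflops'] / 1000:.1f} PFLOPs MXINT8"
    )
```

These are peak aggregates; sustained numbers on a real workload will be lower.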

Performance Benchmarks

d-Matrix has published its own numbers, and Gimlet Labs has released independent data. Both sets of figures cover decode throughput and latency, not training.

Benchmark	Corsair (SquadRack)	NVIDIA H100 SXM	Improvement (claimed)
Llama 3-8B (8-card server)	60,000 tokens/s	~6,000 tokens/s est.	~10x
Llama 3-70B (64-card rack)	30,000 tokens/s at 2ms	~2,000 tokens/s est.	~15x
Latency reduction (Gimlet Labs)	<2s end-to-end	24s end-to-end	12x
Power efficiency	38 TOPS/W	~7-10 TOPS/W	~4-5x
Llama 3-70B per-token latency	2ms	~15-20ms est.	~7-10x

The Gimlet Labs numbers from March 2026 are the most credible independent data available and predate the production announcement. The d-Matrix-published throughput claims at rack scale are self-reported.
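The "Improvement" column is a simple ratio of the two preceding columns. The sketch below recomputes it from the table values; the GPU figures are the article's estimates, and the Gimlet Labs Corsair latency is entered as 2 seconds even though the reported figure is "under 2 seconds", so that ratio is a floor.

```python
def improvement(corsair: float, gpu: float, higher_is_better: bool = True) -> float:
    """Ratio by which Corsair beats the GPU baseline on one metric."""
    return corsair / gpu if higher_is_better else gpu / corsair


rows = [
    ("Llama 3-8B throughput, tokens/s", 60_000, 6_000, True),
    ("Llama 3-70B throughput, tokens/s", 30_000, 2_000, True),
    ("Gimlet Labs end-to-end latency, s", 2, 24, False),  # 2 stands in for "<2s"
]
for name, corsair, gpu, higher_is_better in rows:
    print(f"{name}: {improvement(corsair, gpu, higher_is_better):.0f}x")

# Where the GPU column is a range, the ratio is a range too.
per_token = [improvement(2, gpu_ms, higher_is_better=False) for gpu_ms in (15, 20)]
efficiency = [improvement(38, gpu_tops_per_w) for gpu_tops_per_w in (10, 7)]
print(f"Per-token latency: {per_token[0]:.1f}x to {per_token[1]:.1f}x")
print(f"Power efficiency:  {efficiency[0]:.1f}x to {efficiency[1]:.1f}x")
```

Because most GPU baselines here are estimates, the ratios inherit that uncertainty.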

Comparing against the NVIDIA H100: on compute-bound work such as training and prompt prefill, the H100 is the stronger part. On inference decode throughput at constrained latency targets, Corsair's architectural advantage is real. These aren't the same metric, and both matter depending on what you're running.

The Cerebras WSE-3 takes a different approach to the same problem: on-chip SRAM but at wafer scale, with 44GB of SRAM and 21 PB/s of memory bandwidth. Corsair's advantage over WSE-3 is flexibility: it's a standard PCIe card that fits into existing servers.

Key Capabilities

Digital In-Memory Compute (DIMC)

The chiplet architecture runs matrix operations inside the SRAM rather than loading weights into a separate compute engine. Each chiplet carries a 6MB local stash plus access to the card's LPDDR5X capacity memory, with die-to-die bandwidth of ~1 TB/s per chiplet for inter-chiplet communication. The 115ns all-to-all chiplet latency enables tight synchronization across all 8 chiplets without leaving the card.

This design eliminates the memory wall for decode workloads. A 70B parameter model requires about 140GB of weights in FP16, or about 70GB in FP8. On a GPU, every decode step re-loads portions of those weights from HBM. On Corsair, the weights stay resident on the card, in SRAM as performance memory with LPDDR5X as capacity memory, and DIMC performs the matrix operations inside the SRAM arrays. The effective bandwidth advantage compounds at scale.
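As a rough sizing sketch, weight footprint is parameter count times bytes per parameter. The code below assumes 2 bytes per parameter for FP16 and 1 byte for FP8, ignores KV cache, activations, and any scale metadata, and uses the per-card memory sizes from the specification table. The helper names are illustrative, not part of any d-Matrix tooling.

```python
import math

# Bytes per parameter; scale metadata for block-scaled formats is ignored here.
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0}

SRAM_GB_PER_CARD = 2        # performance memory per card (spec table)
LPDDR5X_GB_PER_CARD = 256   # maximum capacity memory per card (spec table)


def weight_gb(params_billion: float, fmt: str) -> float:
    """Weight footprint in GB (1 GB = 1e9 bytes), excluding KV cache and activations."""
    return params_billion * BYTES_PER_PARAM[fmt]


def cards_to_hold(size_gb: float, gb_per_card: float) -> int:
    """Minimum number of cards whose combined memory can hold size_gb."""
    return math.ceil(size_gb / gb_per_card)


for fmt in BYTES_PER_PARAM:
    size = weight_gb(70, fmt)
    print(
        f"70B @ {fmt}: {size:.0f} GB, "
        f"{cards_to_hold(size, SRAM_GB_PER_CARD)} cards if held only in SRAM, "
        f"{cards_to_hold(size, LPDDR5X_GB_PER_CARD)} card(s) if held in LPDDR5X"
    )
```

Under these assumptions, a 70B model at 8-bit precision fits in the combined SRAM of a 64-card rack, while a single card's LPDDR5X can hold it at either precision.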

Disaggregated Inference Architecture

Corsair doesn't try to handle the full inference pipeline. Prefill - processing the prompt - is compute-intensive and maps well to GPU tensor cores. Corsair takes over at the decode stage. This hybrid model means operators keep their existing GPU investment for prefill while offloading the expensive decode phase to Corsair. The 400G JetStream networking connects Corsair decode nodes to GPU prefill nodes with low enough latency to remain transparent to end users.
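A simple way to reason about the split is an additive latency model: prefill time, plus a handoff between nodes, plus one decode step per output token. The sketch below is illustrative only. The prefill time, output length, and handoff cost are hypothetical values chosen for the example, and the per-token figures are the 2ms Corsair claim and the midpoint of the 15-20ms GPU estimate from the benchmark table.

```python
def request_latency_s(
    prefill_s: float,
    output_tokens: int,
    per_token_s: float,
    handoff_s: float = 0.0,
) -> float:
    """End-to-end latency: prefill, optional node handoff, then sequential decode."""
    return prefill_s + handoff_s + output_tokens * per_token_s


PREFILL_S = 0.3        # hypothetical GPU prefill time
OUTPUT_TOKENS = 1_000  # hypothetical response length
HANDOFF_S = 0.01       # hypothetical cost of moving prefill state to the decode node

# GPU handles both phases; 17.5 ms is the midpoint of the 15-20 ms estimate.
gpu_only = request_latency_s(PREFILL_S, OUTPUT_TOKENS, per_token_s=0.0175)

# GPU prefill, then decode on Corsair at the claimed 2 ms per token.
disaggregated = request_latency_s(
    PREFILL_S, OUTPUT_TOKENS, per_token_s=0.002, handoff_s=HANDOFF_S
)

print(f"GPU-only:      {gpu_only:.2f} s")
print(f"Disaggregated: {disaggregated:.2f} s")
print(f"Speedup:       {gpu_only / disaggregated:.1f}x")
```

The model shows why decode dominates long responses: the per-token term is multiplied by output length, so the handoff cost matters only when it is large relative to total decode time.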

Air Cooling and Form Factor

The PCIe 5.0 card form factor matters for deployability. A standard 2U or 4U server can hold up to 8 Corsair cards, and air cooling removes the need for liquid cooling infrastructure that many data centers still lack for new-generation hardware. The 275-550W per-card range, while not trivial, is within what a PCIe slot (75W) plus auxiliary power connectors can deliver.
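For capacity planning, the per-card TDP figures translate into server-level power as follows. This sketch assumes the usual 75W allowance for an x16 slot and counts only the accelerator cards, not CPUs, fans, or networking.

```python
PCIE_SLOT_W = 75  # assumption: standard x16 slot power allowance
CARD_TDP_W = {"800MHz": 275, "1.2GHz": 550}  # from the specification table
CARDS_PER_SERVER = 8


def auxiliary_power_w(tdp_w: int, slot_w: int = PCIE_SLOT_W) -> int:
    """Power that must come from auxiliary connectors rather than the slot."""
    return max(0, tdp_w - slot_w)


def server_card_power_w(tdp_w: int, cards: int = CARDS_PER_SERVER) -> int:
    """Total accelerator power for one server, cards only."""
    return tdp_w * cards


for clock, tdp in CARD_TDP_W.items():
    print(
        f"{clock}: {auxiliary_power_w(tdp)} W auxiliary per card, "
        f"{server_card_power_w(tdp) / 1000:.1f} kW per 8-card server"
    )
```

The higher operating point doubles the card power budget per server, which is the figure to check against existing rack power and airflow limits.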

Pricing and Availability

d-Matrix describes per-card pricing as "tens of thousands of dollars" without providing specific figures. The company began volume shipping to priority hyperscalers, neoclouds, and frontier AI labs in summer 2026. Broader availability through Supermicro and OEM partners is expected in the second half of 2026.

Cloud-based pricing hasn't been announced. d-Matrix's focus appears to be direct enterprise and hyperscaler sales rather than building its own cloud offering. Microsoft, as an investor, is expected to be an early deployment partner.

For teams assessing alternatives to GPU-only inference: Corsair is available now at production quality, which puts it ahead of most announced inference ASICs. The catch is that the disaggregated architecture requires pairing with existing GPU hardware for prefill - it doesn't eliminate the GPU footprint entirely.

Strengths and Weaknesses

Strengths

In-memory compute directly addresses the decode bandwidth bottleneck that limits GPU inference throughput at scale
TSMC 6nm is a proven, low-risk process - not chasing 3nm yield risks
Air-cooled PCIe form factor fits existing data center infrastructure
Independent Gimlet Labs validation (12x latency improvement) before production announcement
SquadRack reference design with Arista/Broadcom/Supermicro reduces integration risk
Disaggregated architecture pairs with existing GPU investment rather than requiring full rack replacement

Weaknesses

All d-Matrix-published throughput benchmarks are self-reported at rack scale; head-to-head GPU comparisons need independent verification
Inference-decode-only: operators still need GPUs for prefill, so Corsair adds to total hardware cost rather than replacing GPU spend
Pricing not disclosed beyond "tens of thousands of dollars" - TCO calculations require direct vendor quotes
Small company competing against NVIDIA, AMD, and custom silicon from hyperscalers; long-term supply reliability is a legitimate risk
Ecosystem and software support (Aviator, JetStream) are far shallower than CUDA/ROCm

NVIDIA H100 - primary GPU baseline for Corsair's performance claims
Cerebras WSE-3 - wafer-scale alternative taking a similar on-chip SRAM approach
AWS Trainium3 - hyperscaler custom silicon taking a different angle on inference efficiency
SambaNova SN50 - another inference-focused custom architecture in the market

Sources: