d-Matrix Corsair - In-Memory Inference Accelerator
d-Matrix Corsair is an SRAM-based in-memory compute ASIC in production since June 2026, targeting 10x faster and 5x more power-efficient LLM inference vs GPU baselines.

D-Matrix's Corsair inference accelerator entered volume production in June 2026, making it one of the few non-NVIDIA inference ASICs to reach actual shipping hardware this year. The chip takes a fundamentally different architectural path from GPU-based alternatives: computation happens where the data already lives, inside SRAM rather than hauling weights back and forth across a memory bus.
TL;DR
- SRAM-based in-memory compute ASIC on TSMC 6nm, in volume production as of June 2026 with Microsoft backing
- 150 TB/s effective memory bandwidth per card via 2GB integrated SRAM - addresses the decode bottleneck that limits GPU-based LLM inference
- Independent testing by Gimlet Labs showed 12x latency improvement when paired with GPUs vs GPU-only inference
- Targets decode-phase offload paired with existing GPU prefill hardware, not a full GPU replacement
Overview
Corsair is built around a single core insight: LLM inference during the decode phase is starved for memory bandwidth, not compute. A traditional GPU spends most of its inference time loading model weights from HBM into the compute cores on every token step. Corsair sidesteps that by embedding computation directly inside SRAM, where the data already sits. The result is effective bandwidth measured in petabytes per second per rack - orders of magnitude beyond what HBM can deliver.
D-Matrix positions Corsair not as a GPU replacement but as a decode-phase accelerator that pairs with existing GPU hardware. The GPU handles prefill (processing the input prompt, which is compute-intensive), while Corsair takes over token-by-token decode (which is memory-bandwidth-intensive). In independent testing by Gimlet Labs in March 2026, this disaggregated configuration reduced response latency from 24 seconds to under 2 seconds for a representative LLM workload - a 12x improvement.
The chip is backed by Microsoft and ships with a full rack-scale reference design called SquadRack, built in partnership with Arista, Broadcom, and Supermicro. Software support runs through d-Matrix's own Aviator stack and JetStream 400G networking. Air cooling throughout - no liquid cooling required, which matters for data center operators dealing with retrofit constraints.
The Corsair platform uses Digital In-Memory Compute (DIMC) chiplets arranged to keep computation close to SRAM, eliminating the bandwidth wall that limits GPU decode throughput.
Source: servethehome.com
Key Specifications
| Specification | Details |
|---|---|
| Manufacturer | d-Matrix |
| Product Family | Corsair |
| Chip Type | ASIC (DIMC - Digital In-Memory Compute) |
| Process Node | TSMC N6 (6nm) |
| Chiplet Config | 2 chips × 4 chiplets per card (8 chiplets total) |
| Performance Memory | 2GB SRAM @ 150 TB/s per card |
| Capacity Memory | Up to 256GB LPDDR5X per card |
| FP8/MXINT8 Performance | 2,400 TFLOPs per card; 19,200 TFLOPs MXINT4 (dual card) |
| Card Interface | PCIe 5.0 x16 |
| TDP | 275W at 800MHz; 550W at 1.2GHz |
| Power Efficiency | 38 TOPS/Watt |
| Cooling | Air-cooled |
| Max Cards per Server | 8 |
| Scale-out Networking | JetStream 400G Ethernet |
| Software | Aviator runtime |
| Release Date | June 2026 (volume production) |
In a dual-card configuration, peak compute reaches 4,800 TFLOPs MXINT8 and 19,200 TFLOPs MXINT4 with 300 TB/s of combined bandwidth. At rack scale with 64 cards, the platform delivers 9.6 PB/s of effective bandwidth - a figure that conventional HBM-based accelerators can't match at any price point, since HBM bandwidth scales with chip count rather than with SRAM density.
Performance Benchmarks
D-Matrix has published its own numbers and Gimlet Labs released independent data. Both sets of figures are for decode throughput and latency, not training.
| Benchmark | Corsair (SquadRack) | NVIDIA H100 SXM | Improvement (claimed) |
|---|---|---|---|
| Llama 3-8B (8-card server) | 60,000 tokens/s | ~6,000 tokens/s est. | ~10x |
| Llama 3-70B (64-card rack) | 30,000 tokens/s at 2ms | ~2,000 tokens/s est. | ~15x |
| Latency reduction (Gimlet Labs) | <2s end-to-end | 24s end-to-end | 12x |
| Power efficiency | 38 TOPS/W | ~7-10 TOPS/W | ~4-5x |
| Llama 3-70B per-token latency | 2ms | ~15-20ms est. | ~7-10x |
The Gimlet Labs numbers from March 2026 are the most credible independent data available and predate the production announcement. The d-Matrix-published throughput claims at rack scale are self-reported.
Comparing against the NVIDIA H100: on raw compute, the H100 wins by large margins. On inference decode throughput at constrained latency targets, Corsair's architecture advantage is real. These aren't the same metric, and both matter depending on what you're running.
The Cerebras WSE-3 takes a different approach to the same problem - on-chip SRAM but at wafer scale, with 44GB of SRAM and 125 PB/s bandwidth. Corsair's advantage over WSE-3 is flexibility: it's a standard PCIe card that fits into existing servers.
Key Capabilities
Digital In-Memory Compute (DIMC)
The chiplet architecture runs matrix operations inside the SRAM rather than loading weights into a separate compute engine. Each chiplet carries a 6MB local stash plus access to the card's LPDDR5X capacity memory, with die-to-die bandwidth of ~1 TB/s per chiplet for inter-chiplet communication. The 115ns all-to-all chiplet latency enables tight synchronization across the 8-chiplet card without going off-chip.
This design eliminates the memory wall for decode workloads. A 70B parameter model normally requires over 140GB of weights (in FP8). On a GPU, every decode step re-loads portions of those weights. On Corsair, the weights stay resident in LPDDR5X capacity memory and the compute comes to the data via DIMC. The effective bandwidth advantage compounds at scale.
Disaggregated Inference Architecture
Corsair doesn't try to handle the full inference pipeline. Prefill - processing the prompt - is compute-intensive and maps well to GPU tensor cores. Corsair takes over at the decode stage. This hybrid model means operators keep their existing GPU investment for prefill while offloading the expensive decode phase to Corsair. The 400G JetStream networking connects Corsair decode nodes to GPU prefill nodes with low enough latency to remain transparent to end users.
Air Cooling and Form Factor
The PCIe 5.0 card form factor matters for deployability. A standard 2U or 4U server can hold up to 8 Corsair cards, and air cooling removes the liquid cooling infrastructure that many data centers still lack for new-generation hardware. The 275-550W per-card range, while not trivial, is within the envelope of standard PCIe slot power delivery with external connectors.
Pricing and Availability
D-Matrix describes per-card pricing as "tens of thousands of dollars" without providing specific figures. The company began volume shipping to priority hyperscalers, neoclouds, and frontier AI labs in summer 2026. Broader availability through Supermicro and OEM partners is expected in the second half of 2026.
Cloud-based pricing hasn't been announced. D-Matrix's focus appears to be direct enterprise and hyperscaler sales rather than building its own cloud offering. Microsoft, as an investor, is expected to be an early deployment partner.
For teams assessing alternatives to GPU-only inference: Corsair is available now at production quality, which puts it ahead of most announced inference ASICs. The catch is that the disaggregated architecture requires pairing with existing GPU hardware for prefill - it doesn't eliminate the GPU footprint entirely.
Strengths and Weaknesses
Strengths
- In-memory compute directly addresses the decode bandwidth bottleneck that limits GPU inference throughput at scale
- TSMC 6nm is proven, low-risk process - not chasing 3nm yield risks
- Air-cooled PCIe form factor fits existing data center infrastructure
- Independent Gimlet Labs validation (12x latency improvement) before production announcement
- SquadRack reference design with Arista/Broadcom/Supermicro reduces integration risk
- Disaggregated architecture pairs with existing GPU investment rather than requiring full rack replacement
Weaknesses
- All d-Matrix-published throughput benchmarks are self-reported at rack scale; head-to-head GPU comparisons need independent verification
- Inference-decode-only: operators still need GPUs for prefill, meaning Corsair adds cost rather than replacing it
- Pricing not disclosed beyond "tens of thousands of dollars" - TCO calculations require direct vendor quotes
- Small company competing against NVIDIA, AMD, and custom silicon from hyperscalers; long-term supply reliability is a legitimate risk
- Ecosystem and software support depth (Aviator, JetStream) is far smaller than CUDA/ROCm
Related Coverage
- NVIDIA H100 - primary GPU baseline for Corsair's performance claims
- Cerebras WSE-3 - wafer-scale alternative taking similar decode-first approach
- AWS Trainium3 - hyperscaler custom silicon taking a different angle on inference efficiency
- SambaNova SN50 - another inference-focused custom architecture in the market
Sources:
- d-Matrix Corsair enters full production - d-Matrix
- d-Matrix Corsair enters full production - PR Newswire
- d-Matrix Corsair In-Memory Computing at Hot Chips 2025 - ServeTheHome
- Chip startup d-Matrix raises $275M to speed up inference with in-memory compute - SiliconANGLE
- d-Matrix Corsair AI Platform - d-Matrix
- d-Matrix Corsair technical whitepaper - d-Matrix
- D-Matrix Claims Corsair Chip Outperforms Nvidia GPUs - Crypto Briefing
✓ Last verified July 1, 2026
