Qualcomm AI200 - Rack-Scale Inference ASIC

Qualcomm AI200 specs and analysis - a Hexagon-based inference accelerator with 768 GB of LPDDR5X per card, a rack-scale design, and a focus on inference TCO.

TL;DR

  • Qualcomm's first purpose-built rack-scale AI accelerator for the data center, built on the Hexagon NPU architecture
  • 768 GB LPDDR5X per card - far more memory capacity than any HBM-based accelerator
  • Rack-scale design with direct liquid cooling at 160 kW per rack
  • Per-chip performance numbers (TOPS, TFLOPS, bandwidth) remain undisclosed
  • HUMAIN partnership will deploy 200 MW of AI200-based racks starting in 2026
  • Successor AI250 planned for 2027 with near-memory computing architecture

Overview

Qualcomm has dominated mobile AI for years, but the Cloud AI 100 Ultra was its only data center product - a multi-SoC card reusing Hexagon NPU cores from Snapdragon phones. The AI200 is a different proposition. Announced October 27, 2025, it is Qualcomm's first purpose-built rack-scale inference product, designed to compete with NVIDIA's H100 and B200 on total cost of ownership for LLM serving.

The pitch: use LPDDR5X instead of HBM to pack far more memory per card at lower cost. At 768 GB per card, the AI200 offers roughly 10x the memory of an H100 (80 GB HBM3) and 4x the capacity of a B200 (192 GB HBM3e). That advantage matters for inference, where fitting model weights into memory is the primary constraint, even if per-chip throughput is likely substantially lower than GPUs.

This is a similar bet to Intel's Crescent Island approach, which also uses LPDDR instead of HBM. Qualcomm goes further with more memory per card and a complete rack system with integrated liquid cooling. The question is whether the TCO math works when you may need 2-6x more racks than an equivalent GPU deployment to match throughput.

Key Specifications

| Specification | Details |
| --- | --- |
| Manufacturer | Qualcomm |
| Product Family | Cloud AI |
| Chip Type | ASIC (Hexagon NPU, data center variant) |
| Architecture | Hexagon (likely 7th generation) |
| Process Node | TSMC 3nm (reported, not confirmed) |
| Memory per Card | 768 GB LPDDR5X |
| Per-Chip Specs | Not disclosed (bandwidth, TOPS, TFLOPS, TDP, core count) |
| Rack Power | 160 kW |
| Cooling | Direct liquid cooling |
| Interconnect | PCIe (scale-up) + Ethernet (scale-out) |
| Target Workload | Inference (LLM/LMM) |
| Predecessor | Cloud AI 100 Ultra |
| Availability | 2026 |
| Pricing | Not disclosed |

That's a lot of "not disclosed." Qualcomm has shared macro-level rack specs but nothing at the chip level - no per-chip performance, power, core count, clock speeds, or pricing. The Next Platform's Timothy Prickett Morgan attempted to back-calculate figures from the HUMAIN deployment, but those remain estimates.

Performance Analysis

Without published TOPS, TFLOPS, or memory bandwidth numbers, there are no direct benchmarks to cite. What we can analyze is the architectural trade-off.

The LPDDR calculus. LPDDR5X is cheaper per gigabyte than HBM but also significantly slower. An H100 delivers roughly 3,350 GB/s from HBM3. LPDDR5X channels run at 8,533 MT/s each, and Qualcomm hasn't shared how many channels are bonded to each SoC. The aggregate bandwidth per card is almost certainly a fraction of HBM-based accelerators.
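
To make the gap concrete, here is a back-of-the-envelope sketch of that arithmetic. The per-channel data rate is public; the channel counts are pure assumptions, since Qualcomm has disclosed nothing about the card's memory configuration:

```python
# Rough LPDDR5X bandwidth estimate. Per-channel rate is public (8,533 MT/s);
# the channel counts below are assumptions, not disclosed figures.

LPDDR5X_MT_S = 8533          # mega-transfers per second per pin
CHANNEL_WIDTH_BITS = 16      # typical LPDDR5X channel width
HBM3_H100_GBPS = 3350        # H100 HBM3 bandwidth, for comparison

def lpddr_bandwidth_gbps(channels: int) -> float:
    """Aggregate bandwidth in GB/s for a given channel count."""
    bits_per_sec = LPDDR5X_MT_S * 1e6 * CHANNEL_WIDTH_BITS * channels
    return bits_per_sec / 8 / 1e9

for channels in (16, 32, 64):  # hypothetical configurations
    bw = lpddr_bandwidth_gbps(channels)
    print(f"{channels} channels: {bw:,.0f} GB/s "
          f"({bw / HBM3_H100_GBPS:.0%} of one H100)")
```

Even the generous 64-channel case lands around 1,100 GB/s - roughly a third of a single H100.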

For inference this matters less than you might expect. A single AI200 card with 768 GB could hold a model approaching 380B parameters at FP16 (about 760 GB of weights) - capacity that would take ten H100s or four B200s. Fewer cards means less inter-card communication and simpler serving. The trade-off: each card generates tokens more slowly, and Qualcomm's bet is that memory cost savings outweigh the throughput gap on a per-dollar basis.
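
A minimal sketch of that weight-memory arithmetic - bytes per parameter times parameter count, divided by card capacity - ignoring KV cache, activations, and runtime overhead:

```python
# Weight-memory math behind the capacity comparison. Ignores KV cache
# and framework overhead, so real deployments need headroom beyond this.
import math

BYTES_PER_PARAM = {"fp16": 2, "fp8": 1}

def weights_gb(params_billion: float, dtype: str = "fp16") -> float:
    return params_billion * BYTES_PER_PARAM[dtype]

def cards_needed(params_billion: float, card_gb: int, dtype: str = "fp16") -> int:
    return math.ceil(weights_gb(params_billion, dtype) / card_gb)

model_b = 380  # roughly the largest FP16 model that fits in 768 GB
print(f"{model_b}B @ FP16 = {weights_gb(model_b):,.0f} GB of weights")
for name, cap in [("AI200", 768), ("H100", 80), ("B200", 192)]:
    print(f"{name} ({cap} GB): {cards_needed(model_b, cap)} card(s)")
```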

Analysts estimate an AI200 deployment would need 2-6x more racks than GPU equivalents for the same throughput. At 160 kW per rack, that power delta adds up. Whether cheaper hardware offsets higher power and floor space costs is the question Qualcomm has not answered.
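
A toy version of that power math, with the GPU-side rack envelope and deployment size assumed purely for illustration:

```python
# Illustrative power delta for the 2-6x rack-count estimate. The GPU
# baseline (a GB200 NVL72-class rack at ~130 kW) and the 10-rack
# deployment size are assumptions for illustration only.

AI200_RACK_KW = 160
GPU_RACK_KW = 130            # assumed GPU rack envelope
GPU_RACKS = 10               # hypothetical GPU deployment size

for multiplier in (2, 6):    # analysts' low and high estimates
    ai200_racks = GPU_RACKS * multiplier
    delta_kw = ai200_racks * AI200_RACK_KW - GPU_RACKS * GPU_RACK_KW
    print(f"{multiplier}x racks: {ai200_racks} AI200 racks, "
          f"+{delta_kw:,} kW vs the GPU deployment")
```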

Key Capabilities

Hexagon NPU Architecture. The AI200 is built on Qualcomm's Hexagon NPU, the same core architecture that powers on-device inference in Snapdragon phones. The data center variant is likely 7th generation Hexagon, scaled up from the Cloud AI 100 Ultra (which used 4 SoCs per card, 16 Hexagon cores each). Qualcomm has not disclosed the AI200's SoC count or core count. Unlike a GPU, Hexagon is a purpose-built inference engine optimized for matrix multiplication, convolution, and attention at INT8 and FP8 precision.
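
For readers unfamiliar with what "optimized for INT8" means in practice, here is a minimal NumPy sketch of the quantize-multiply-rescale pattern that NPU MAC arrays implement in hardware. This is a simplification - production stacks use per-channel scales and calibration:

```python
# Minimal sketch of INT8 inference arithmetic: quantize FP weights and
# activations to int8, matmul in integers with int32 accumulation, then
# rescale the accumulator back to float.
import numpy as np

def quantize(x: np.ndarray):
    scale = np.abs(x).max() / 127.0
    return np.round(x / scale).astype(np.int8), scale

rng = np.random.default_rng(0)
w, x = rng.standard_normal((64, 64)), rng.standard_normal((64, 16))
wq, ws = quantize(w)
xq, xs = quantize(x)

# int32 accumulation, then one float rescale at the end
y = (wq.astype(np.int32) @ xq.astype(np.int32)) * (ws * xs)
print("max abs error vs FP matmul:", np.abs(y - w @ x).max())
```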

LPDDR Economics. HBM3e runs $10-15 per GB. LPDDR5X costs $2-4 per GB. For 768 GB, that means roughly $1,500-$3,000 in memory cost per card versus $7,600-$11,500 for the same capacity in HBM3e - and no accelerator ships with that much HBM anyway. This is the foundation of the TCO argument. The risk: customers buy bandwidth, not just capacity. If the AI200 can't sustain competitive token generation rates, cheaper memory doesn't help.
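
The bill-of-materials math from those quoted price ranges (memory chips only; controllers, packaging, and board costs excluded):

```python
# Memory cost comparison at the 768 GB card capacity, using the street
# price ranges quoted above ($/GB).

CAPACITY_GB = 768
PRICE_PER_GB = {"LPDDR5X": (2, 4), "HBM3e": (10, 15)}

for mem, (lo, hi) in PRICE_PER_GB.items():
    print(f"{CAPACITY_GB} GB {mem}: "
          f"${CAPACITY_GB * lo:,} - ${CAPACITY_GB * hi:,}")
```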

Rack-Scale Design and Liquid Cooling. The AI200 ships as a complete rack-scale system with integrated direct liquid cooling, not as individual add-in cards. The 160 kW rack power envelope is comparable to NVIDIA's GB200 NVL72 systems. PCIe provides scale-up within the rack, Ethernet handles scale-out. Direct liquid cooling is a practical necessity at this power density - operators without existing liquid cooling infrastructure will face additional deployment cost.

HUMAIN Deployment. The most concrete validation of the AI200 is the HUMAIN deal - 200 MW of AI200-based racks in Saudi Arabia starting in 2026. At 160 kW per rack, that translates to roughly 1,250 racks. This is a massive commercial commitment, not a proof-of-concept pilot. HUMAIN selected the AI200 for sovereign AI workloads where inference cost efficiency matters more than peak throughput.

Software Stack. Qualcomm has announced a hyperscaler-grade software stack with one-click model deployment and Hugging Face integration. The software story will be critical - enterprise customers won't migrate from CUDA unless the tooling is comparable in maturity. The Hugging Face integration lowers the barrier for model deployment on unfamiliar hardware.
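
Qualcomm hasn't published API details, so as a stand-in, here is the generic Hugging Face pattern that such "one-click" integrations typically wrap - shown on CPU, since an AI200 stack would presumably substitute its own backend:

```python
# Generic Hugging Face deployment pattern. The model id is an arbitrary
# small example; an AI200 stack would target its own device backend
# rather than "cpu".
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="Qwen/Qwen2.5-0.5B-Instruct",  # any HF model id; small for demo
    device_map="cpu",
)
print(generator("Inference accelerators matter because",
                max_new_tokens=40)[0]["generated_text"])
```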

Pricing and Availability

Qualcomm hasn't disclosed pricing. The AI200 is scheduled for 2026 availability, with HUMAIN as the anchor deployment. The AI250 successor is planned for 2027 with a "near-memory computing architecture" claiming over 10x effective memory bandwidth - which would address the AI200's primary weakness if it delivers.

| Timeline | Product | Status |
| --- | --- | --- |
| 2024 | Cloud AI 100 Ultra | Shipping |
| 2026 | AI200 | Announced, availability 2026 |
| 2027 | AI250 | Announced, near-memory computing |

Strengths

  • 768 GB LPDDR5X per card - far more memory capacity at a fraction of HBM cost
  • Turnkey rack-scale system with integrated liquid cooling
  • Purpose-built for inference - no paying for GPU training silicon you don't need
  • HUMAIN 200 MW deal validates commercial demand
  • LPDDR supply not constrained by the HBM bottleneck limiting GPU availability
  • TSMC 3nm (if confirmed) puts the silicon on a competitive node

Weaknesses

  • Per-chip performance completely undisclosed - no TOPS, no TFLOPS, no bandwidth figures
  • LPDDR5X bandwidth far lower than HBM3/HBM3e, potentially limiting token generation speed
  • May require 2-6x more racks than GPU equivalents for the same throughput
  • No training capability whatsoever - inference only
  • No established data center ecosystem - CUDA dominance is a real adoption barrier
  • No independent benchmarks or MLPerf submissions as of early 2026
  • Pricing undisclosed, making TCO claims unverifiable

Alternatives

  • NVIDIA H100 - The incumbent data center GPU with 80 GB HBM3
  • NVIDIA B200 - Next-generation Blackwell GPU with 192 GB HBM3e
  • Groq LPU - Another non-GPU inference ASIC, using on-chip SRAM instead of external memory
