Qualcomm AI250 - Near-Memory Computing for Inference

The Qualcomm AI250 applies near-memory computing to the same 768GB LPDDR5X design as the AI200, promising 10x higher effective memory bandwidth and lower power for LLM inference at rack scale.

TL;DR

  • Same 768GB LPDDR5X memory as the AI200 but with near-memory computing that Qualcomm claims delivers 10x higher effective bandwidth
  • Notably lower power consumption than AI200 despite the same 160 kW rack envelope
  • Hexagon NPU architecture, confidential computing built-in, PCIe scale-up and Ethernet scale-out
  • Humain (Saudi Arabia, 200MW deployment) is confirmed as an early customer; availability 2027

Overview

The Qualcomm AI250 was announced in October 2025 alongside the AI200, but the two chips won't ship together. The AI200 targets 2026 commercial availability; the AI250 follows in 2027. Both carry the same 768GB of LPDDR5X memory per accelerator card - a figure that dwarfs the 192GB of HBM3E on NVIDIA's B200 - but they serve that capacity through fundamentally different memory architectures.

The AI250's headline innovation is near-memory computing. Instead of fetching data from LPDDR5X into the compute core to execute operations, the AI250 moves compute logic close to the memory arrays themselves. The effect, according to Qualcomm, is an effective memory bandwidth increase of more than 10x compared to the AI200, combined with meaningfully lower power consumption. If that claim holds up under real workloads, it's architecturally significant.

LLM inference bottlenecks on memory bandwidth, not raw TFLOPS. A model with 70 billion parameters at FP16 weighs 140GB; each generated token requires reading essentially all of those weights from memory once. The bandwidth available to the chip sets the ceiling on generation speed, regardless of how many multiply-accumulate units the chip has. Qualcomm's bet is that changing where the computation happens - at the memory rather than the compute core - can break this bottleneck more efficiently than adding more HBM bandwidth.
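That ceiling is simple arithmetic. The sketch below is a back-of-the-envelope model; the two bandwidth figures in the loop are illustrative assumptions, not published specs for any chip discussed here.

```python
# Upper bound on decode speed for a bandwidth-bound LLM:
# every generated token reads (roughly) all weights once, so
# tokens/sec <= memory_bandwidth / model_size_in_bytes.

def max_tokens_per_sec(params: float, bytes_per_param: float,
                       bandwidth_gbps: float) -> float:
    """Bandwidth-bound ceiling on single-stream token generation."""
    model_bytes = params * bytes_per_param          # 70e9 * 2 = 140 GB at FP16
    return bandwidth_gbps * 1e9 / model_bytes

# Illustrative bandwidths (assumptions, not vendor figures):
for name, bw in [("8,000 GB/s (HBM3E-class)", 8000),
                 ("800 GB/s (LPDDR-class)", 800)]:
    print(f"{name}: <= {max_tokens_per_sec(70e9, 2, bw):.1f} tokens/s")
```

At 8,000 GB/s the ceiling for a 140GB model is about 57 tokens/s per stream; at a tenth the bandwidth, about 5.7. That 10x spread is exactly the gap Qualcomm's effective-bandwidth claim would need to close.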

Key Specifications

| Specification | Details |
| --- | --- |
| Manufacturer | Qualcomm |
| Product Family | Cloud AI |
| Chip Type | ASIC (Hexagon NPU) |
| Process Node | Not disclosed |
| Memory | 768 GB LPDDR5X per card |
| Effective Memory Bandwidth | 10x AI200 (absolute figure not disclosed) |
| FP8 Performance | Not disclosed |
| Architecture Innovation | Near-memory computing |
| Security | Confidential computing built-in |
| Scale-up Interconnect | PCIe |
| Scale-out Interconnect | Ethernet |
| Cooling | Direct liquid cooling (DLC) |
| Rack Power | 160 kW |
| Target Workload | Inference |
| Availability | 2027 |

Performance Benchmarks

Qualcomm hasn't published TOPS or TFLOPS for the AI250. The company doesn't release those numbers for the AI200 either, which makes direct quantitative comparison with NVIDIA or AMD difficult.

| Metric | Qualcomm AI250 | Qualcomm AI200 | NVIDIA B200 | AMD MI300X |
| --- | --- | --- | --- | --- |
| Memory Capacity | 768 GB LPDDR5X | 768 GB LPDDR5X | 192 GB HBM3E | 192 GB HBM3 |
| Effective Memory BW | 10x AI200 | Baseline | 8,000 GB/s | 5,300 GB/s |
| FP8 TFLOPS | Not disclosed | Not disclosed | 9,000 | 5,300 |
| Power | 160 kW (rack) | 160 kW (rack) | ~700 W (per GPU) | ~750 W (per GPU) |
| Cooling | DLC | DLC | Air or liquid | Liquid |
| Availability | 2027 | 2026 | Available | Available |

The memory capacity comparison is the most concrete: 768GB per card versus 192GB on B200 or MI300X. A single Qualcomm card can hold a model of roughly 380 billion parameters at FP16 (or about twice that at FP8) without requiring model parallelism across multiple cards; holding the same weights on NVIDIA hardware takes roughly four B200s. For inference serving, fewer cards per model means simpler orchestration and lower power per model instance.
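The card-count arithmetic can be sketched with a small (hypothetical) helper; it counts weight storage only, which understates real requirements.

```python
import math

def cards_needed(params: float, bytes_per_param: int,
                 card_capacity_gb: int) -> int:
    """Minimum accelerator cards to hold the weights alone.
    Ignores KV cache, activations, and runtime overhead,
    all of which raise the real requirement."""
    model_gb = params * bytes_per_param / 1e9
    return math.ceil(model_gb / card_capacity_gb)

# 380B-parameter model at FP16 = 760 GB of weights.
print(cards_needed(380e9, 2, 768))  # 768 GB card (AI250-class): 1
print(cards_needed(380e9, 2, 192))  # 192 GB card (B200/MI300X-class): 4
```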

What's still missing: absolute bandwidth numbers. "10x AI200" is meaningful only if we know what AI200's actual bandwidth is - which Qualcomm hasn't published. For buyers assessing the AI250, this is a significant blind spot that won't resolve until third-party benchmarks appear after launch.

Key Capabilities

Near-Memory Computing Architecture. Near-memory computing places logic circuits inside or alongside the DRAM die, rather than in a separate processor die reached over a memory bus. For read-heavy workloads like LLM inference, this shortens the data path: instead of a memory-controller round-trip of several hundred nanoseconds per access, much of the work happens next to the memory arrays. The bandwidth that matters for inference isn't the raw throughput of the memory interface but how quickly the compute can consume weights during token generation, and near-memory compute changes that equation by doing work inside the memory subsystem.
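A toy traffic model makes the appeal concrete: for a matrix-vector product (the core operation of token generation), a conventional design streams every weight across the memory bus, while an idealized near-memory design moves only the input vector in and the results out. This is a simplified illustration of the general technique, not a model of Qualcomm's actual implementation.

```python
# Toy bus-traffic model for one matrix-vector product y = W @ x,
# with W of shape (d_out, d_in) stored in DRAM at 2 bytes/element.
d_in, d_out, bytes_per = 8192, 8192, 2

# Conventional: all weights cross the memory bus to reach the compute core.
conventional = d_in * d_out * bytes_per

# Idealized near-memory: multiply-accumulate happens at the arrays,
# so only x (broadcast in) and y (results out) cross the bus.
near_memory = (d_in + d_out) * bytes_per

print(conventional / near_memory)  # prints 4096.0
```

The real-world gain is far smaller than this idealized 4096x - commands, partial sums, and non-matvec work all still move data - but the direction of the effect is why "effective bandwidth" can exceed the raw interface bandwidth.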

This architectural approach isn't new in HPC - Processing-in-Memory (PIM) and near-memory compute have been research topics for decades. Qualcomm's claim is that it has made the approach commercially viable at the scale of a data center inference card. SK Hynix has demonstrated similar technology with its AiMX chip, and Samsung has explored it for HBM. If Qualcomm's implementation is production-quality, it could set a new efficiency benchmark for bandwidth-intensive workloads.

768GB LPDDR5X at Scale. LPDDR5X is a less conventional choice for data center AI than HBM. It's lower bandwidth per pin than HBM, which is why Qualcomm's near-memory compute claim is load-bearing - the architecture has to compensate for the raw bandwidth gap against HBM3E. The upside of LPDDR is cost: LPDDR5X modules are substantially cheaper per gigabyte than HBM3E stacks, which is central to Qualcomm's total cost of ownership argument. A 768GB AI250 card doesn't need the exotic HBM packaging that drives up cost on NVIDIA and AMD products.
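The cost argument can be sketched numerically. The per-gigabyte prices below are round placeholder assumptions chosen purely for illustration - Qualcomm has published no bill-of-materials figures, and actual memory pricing varies widely.

```python
# Rough memory bill-of-materials comparison.
# Both $/GB figures are placeholder assumptions, not market prices.
LPDDR5X_USD_PER_GB = 3.0   # assumption
HBM3E_USD_PER_GB = 12.0    # assumption

lpddr_card = 768 * LPDDR5X_USD_PER_GB   # 768 GB LPDDR5X card
hbm_card = 192 * HBM3E_USD_PER_GB       # 192 GB HBM3E card

print(f"768 GB LPDDR5X: ${lpddr_card:,.0f}")
print(f"192 GB HBM3E:   ${hbm_card:,.0f}")
```

Under these placeholder numbers the two memory bills come out comparable - i.e. roughly 4x the capacity for a similar memory spend - which is the shape of the TCO argument, whatever the true prices are.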

Confidential Computing. Both AI200 and AI250 include hardware-level confidential computing features - encryption and isolation for inference workloads. This is increasingly required for enterprise deployments handling sensitive data, and it's built into the Hexagon NPU architecture rather than added as a software layer. For regulated industries (finance, healthcare, government), this matters.

Pricing and Availability

Qualcomm hasn't published pricing. The company's reference point is the 200-megawatt AI infrastructure deployment with Humain in Saudi Arabia - both AI200 and AI250 are confirmed in that deal. The scale of that deployment implies production pricing exists, but hasn't been made public.

Commercial availability is 2027. That's a meaningful delay versus AI200 (2026), NVIDIA Vera Rubin (2026-H2), and AMD MI455X (2026-H2). Buyers assessing inference infrastructure choices for 2026 don't have the AI250 as an option. The AI200 ships in 2026, with AI250 following as a higher-efficiency successor for customers willing to wait.

The 160 kW rack power envelope is identical between AI200 and AI250 despite the claimed power reduction. Qualcomm appears to be using the same rack density but doing more compute or serving more throughput per kilowatt, rather than reducing the total rack footprint.

Strengths and Weaknesses

Strengths

  • 768 GB LPDDR5X per card - 4x the memory capacity of NVIDIA B200 or AMD MI300X
  • Near-memory computing architecture targets the core bottleneck in LLM inference (bandwidth, not compute)
  • Claimed 10x effective memory bandwidth improvement over AI200
  • Lower power consumption than AI200 within the same 160 kW rack envelope
  • Confidential computing built into the Hexagon NPU architecture
  • Humain partnership validates commercial traction at data center scale
  • LPDDR5X cost advantage vs HBM3E could reduce per-card pricing versus HBM alternatives

Weaknesses

  • Availability is 2027 - can't compete for 2026 infrastructure decisions
  • FP8 TFLOPS not disclosed - impossible to benchmark against published GPU numbers
  • Process node not disclosed - limits architectural analysis
  • Absolute memory bandwidth not disclosed ("10x AI200" is relative to an undisclosed baseline)
  • No track record in production data center inference - AI200 hasn't shipped widely yet
  • LPDDR5X raw bandwidth lower than HBM; near-memory compute must close the gap to compete

Related

  • Qualcomm AI200 - The current-generation counterpart shipping in 2026
  • NVIDIA B200 - The primary HBM3E-based competitor for inference
  • AMD MI300X - AMD's widely deployed 192GB inference accelerator

Sources

Last verified May 1, 2026

About the author

James Kowalski, AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.