Huawei Ascend 910B

TL;DR

  • Huawei's workhorse AI accelerator - the chip that reportedly powered DeepSeek V3's training alongside NVIDIA hardware
  • 64GB HBM2e memory with ~1,200 GB/s bandwidth and an estimated ~600 TFLOPS FP16 compute
  • Built on SMIC's 7nm process - designed to be manufacturable entirely within China under US sanctions
  • 400W TDP makes it significantly more power-efficient than Western alternatives at the cost of lower peak performance
  • Estimated 50,000+ chips deployed across Chinese hyperscalers and government AI projects

Overview

The Huawei Ascend 910B is the chip that answered the question everyone was asking after the US imposed export controls on advanced AI hardware: can China build its own AI accelerators that are good enough to train frontier models? The answer, demonstrated by DeepSeek and others throughout 2024-2025, is yes - with caveats. The 910B is not an H100 killer. It has roughly 60% of the H100's compute, 36% of its memory bandwidth, and a software ecosystem that is years behind CUDA. But it works, it is available in volume, and it has trained models that compete with the best in the world.

Released in H2 2023, the 910B was Huawei's first production AI accelerator after the October 2022 US export controls blocked Chinese companies from purchasing NVIDIA's A100 and H100 chips. The timing was not coincidental - Huawei had been developing Ascend silicon since 2018, and the export controls transformed what was a nice-to-have domestic alternative into a strategic necessity. The chip packs 64GB of HBM2e memory and delivers an estimated 600 TFLOPS of FP16 compute, placing it roughly between the NVIDIA A100 and H100 in performance.

What makes the 910B historically significant is DeepSeek. Multiple reports indicate that DeepSeek used Ascend 910B chips alongside pre-sanctions NVIDIA hardware to train their V3 and V3.1 models. The V3 technical report describes training on 2,048 NVIDIA H800 GPUs (export-compliant variants of the H100), but DeepSeek's broader infrastructure reportedly includes significant Ascend capacity. The company's decision to optimize DeepSeek V4 specifically for Ascend hardware rather than NVIDIA GPUs is a direct extension of the experience gained running production workloads on the 910B. In this sense, the 910B is the chip that bootstrapped China's independent AI hardware ecosystem.

Key Specifications

| Specification | Details |
| --- | --- |
| Manufacturer | Huawei (HiSilicon) |
| Product Family | Ascend 910 |
| Architecture | Da Vinci |
| Process Node | SMIC 7nm (N+2) |
| Chip Type | ASIC |
| AI Cores | 32 Da Vinci cores (estimated) |
| FP16 Performance | ~600 TFLOPS (estimated) |
| BF16 Performance | ~600 TFLOPS (estimated) |
| INT8 Performance | ~1,200 TOPS (estimated) |
| Memory | 64GB HBM2e |
| Memory Stacks | 4x HBM2e (estimated) |
| Memory Bandwidth | ~1,200 GB/s |
| Interconnect | HCCS (Huawei Cache Coherence System) |
| TDP | 400W |
| Form Factor | Proprietary module |
| Software Stack | CANN (Compute Architecture for Neural Networks) |
| Target Workload | Training and inference |
| Release Date | H2 2023 |
| Estimated Price | $8,000-$12,000 |

Note: Huawei does not publish detailed specifications for the Ascend 910B. Performance figures are based on third-party analysis, leaked testing data, and industry estimates. Actual specifications may differ.

Performance Benchmarks (Estimated)

| Benchmark / Metric | Ascend 910B | NVIDIA H100 SXM | NVIDIA A100 80GB | Ascend 910C |
| --- | --- | --- | --- | --- |
| FP16 Peak (TFLOPS) | ~600 | 990 | 312 | ~800 |
| Memory Capacity | 64GB | 80GB | 80GB | 96GB |
| Memory Bandwidth | ~1,200 GB/s | 3,350 GB/s | 2,039 GB/s | ~1,800 GB/s |
| LLM Inference (relative to H100) | ~0.4-0.5x | 1.0x (baseline) | ~0.5-0.6x | ~0.6-0.7x |
| Training Throughput (relative to H100) | ~0.3-0.4x | 1.0x (baseline) | ~0.5x | ~0.5-0.6x |
| Power (TDP) | 400W | 700W | 400W | 600W |
| Power Efficiency (TFLOPS/W) | ~1.5 | ~1.4 | ~0.78 | ~1.33 |
| Price (estimated) | $8,000-$12,000 | $25,000-$40,000 | $10,000-$15,000 | $12,000-$18,000 |

The 910B's performance relative to the H100 is often characterized as "roughly half," but that oversimplifies the picture. For compute-bound training on large batch sizes, the gap is closer to 2.5-3x because the H100's higher compute and memory bandwidth compound. For smaller models and memory-capacity-bound inference where 64GB is sufficient, the gap narrows to 1.5-2x. The workload-dependent performance variation is wider on the 910B than on NVIDIA hardware because CANN's operator library is less comprehensively optimized.

The power efficiency story is interesting. At 400W TDP with ~600 TFLOPS FP16, the 910B achieves approximately 1.5 TFLOPS per watt - slightly better than the H100's ~1.4 TFLOPS per watt. This is a consequence of the lower clock speeds and SMIC 7nm's power characteristics. For data center operators where electricity cost is a meaningful factor, the 910B's efficiency partially offsets its lower absolute performance. Over a multi-year deployment, the TCO calculation can be closer than the raw performance numbers suggest.
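
The arithmetic behind those efficiency figures is straightforward. A quick Python check, using this article's estimated peak specs rather than vendor-published numbers:

```python
# TFLOPS-per-watt check using this article's estimated peak figures.
def tflops_per_watt(fp16_tflops: float, tdp_watts: float) -> float:
    """Peak FP16 throughput per watt of rated TDP."""
    return fp16_tflops / tdp_watts

chips = {
    "Ascend 910B": (600, 400),
    "NVIDIA H100 SXM": (990, 700),
    "NVIDIA A100 80GB": (312, 400),
}
for name, (tflops, tdp) in chips.items():
    print(f"{name}: ~{tflops_per_watt(tflops, tdp):.2f} TFLOPS/W")
```

Keep in mind this compares peak compute per watt only; as the bandwidth-per-watt comparison later in this article shows, the picture inverts for bandwidth-bound inference.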

Key Capabilities

DeepSeek Training Validation. The 910B's strongest credential is that it has been used to train models that compete with the best in the world. DeepSeek's V3 family of models, which achieve frontier-level performance on numerous benchmarks, were trained on infrastructure that includes Ascend 910B chips. This is not a synthetic benchmark result or a marketing claim - it is a real-world demonstration that the 910B can handle the demands of frontier model training when combined with sufficient scale and software optimization. The experience DeepSeek gained on the 910B directly informed their decision to build V4 primarily for Ascend hardware.

CANN Ecosystem Foundation. The 910B is where Huawei's CANN software stack had to grow up. Early CANN releases were rough - limited operator coverage, frequent compatibility issues with PyTorch, and debugging tools that were primitive compared to CUDA's ecosystem. But the pressure of real production workloads from DeepSeek, Baidu, and others forced rapid iteration. By the time the 910C launched, CANN had matured to the point where frontier model training was feasible without heroic engineering effort. The 910B bore the cost of that maturation process, and current 910B deployments benefit from the improvements driven by early adopter pain.

Volume Deployment at Scale. Estimates suggest 50,000 or more Ascend 910B chips have been deployed across China's AI infrastructure. This installed base matters because it creates a self-reinforcing ecosystem - more chips deployed means more developers writing CANN code, which means better libraries and tools, which makes the next deployment easier. Chinese cloud providers including Huawei Cloud, Alibaba Cloud, and Baidu Cloud offer 910B-based instances for AI training. Government-funded computing centers and research institutions have also adopted the 910B as part of China's strategic push for AI hardware independence.

Pricing and Availability

The Ascend 910B is estimated to cost between $8,000 and $12,000 per accelerator, though actual pricing varies based on volume, customer relationship, and government subsidies. Huawei does not publish official MSRPs. The 910B is available through Huawei's Atlas 800 and Atlas 900 server platforms, as well as through Chinese cloud providers.

| Accelerator | Estimated Price | Memory | FP16 TFLOPS | TDP | Price per TFLOP |
| --- | --- | --- | --- | --- | --- |
| Huawei Ascend 910B | $8,000-$12,000 | 64GB | ~600 | 400W | $13-$20 |
| Huawei Ascend 910C | $12,000-$18,000 | 96GB | ~800 | 600W | $15-$23 |
| NVIDIA A100 80GB | $10,000-$15,000 | 80GB | 312 | 400W | $32-$48 |
| NVIDIA H100 SXM | $25,000-$40,000 | 80GB | 990 | 700W | $25-$40 |
| AMD MI300X | $10,000-$15,000 | 192GB | 1,307 | 750W | $8-$11 |

On a price-per-TFLOP basis, the 910B is competitive with the NVIDIA A100 and significantly cheaper than the H100. However, this metric does not account for the software ecosystem overhead - achieving those theoretical TFLOPS on the 910B requires CANN optimization work that is not necessary on CUDA. The effective price-per-useful-TFLOP is higher than the hardware numbers suggest, particularly for organizations porting existing CUDA workloads.
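
The price-per-TFLOP column follows directly from the estimated street prices and peak figures above:

```python
# Price-per-TFLOP from the table above. Prices are this article's street
# estimates; TFLOPS are estimated FP16 peak, not sustained throughput.
def price_per_tflop(price_low: float, price_high: float, tflops: float):
    """Return the ($low, $high) cost range per peak FP16 TFLOP."""
    return price_low / tflops, price_high / tflops

for name, (p_lo, p_hi, tflops) in {
    "Ascend 910B": (8_000, 12_000, 600),
    "NVIDIA H100 SXM": (25_000, 40_000, 990),
}.items():
    lo, hi = price_per_tflop(p_lo, p_hi, tflops)
    print(f"{name}: ${lo:.0f}-${hi:.0f} per TFLOP")
```

To estimate price per *useful* TFLOP, divide the peak figure by an achieved-utilization factor (MFU) before dividing into the price - on the 910B that factor is lower, which is the software-overhead point made above.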

The 910B is effectively unavailable outside of China. US export controls and Huawei's entity listing mean that the chip cannot be exported to most markets. Within China, availability has been reasonably good, though Huawei prioritizes large orders from strategic customers.

Architecture Deep Dive

The Ascend 910B's Da Vinci architecture was designed to maximize AI compute throughput on a process node that most Western chip designers would consider two generations behind. Understanding how Huawei extracted competitive performance from SMIC 7nm requires looking at the architectural trade-offs that HiSilicon made.

Da Vinci Core Design. Each Da Vinci core in the 910B is built around a 3D Cube Computing Engine - Huawei's name for a systolic array structure that performs dense matrix multiplication. The cube operates on 16x16x16 matrices per cycle, with native FP16, BF16, and INT8 data type support.
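
To make the cube's tiling concrete, here is a minimal numpy sketch of how a larger matrix multiply decomposes into the 16x16x16 tiles a cube-style engine consumes (illustrative only - the real engine pipelines these tiles in hardware rather than looping in software):

```python
import numpy as np

TILE = 16  # the cube engine consumes one 16x16x16 tile per cycle

def tiled_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """C = A @ B accumulated tile by tile, mirroring how a cube-style
    systolic engine walks the M, N, and K dimensions in 16-wide steps."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2 and m % TILE == 0 and n % TILE == 0 and k % TILE == 0
    c = np.zeros((m, n), dtype=np.float32)
    for i in range(0, m, TILE):
        for j in range(0, n, TILE):
            for p in range(0, k, TILE):  # accumulate partial products over K
                c[i:i+TILE, j:j+TILE] += a[i:i+TILE, p:p+TILE] @ b[p:p+TILE, j:j+TILE]
    return c

a = np.random.rand(64, 32).astype(np.float32)
b = np.random.rand(32, 48).astype(np.float32)
assert np.allclose(tiled_matmul(a, b), a @ b, atol=1e-4)
```

One tile is 2 x 16^3 = 8,192 FLOPs, so the estimated ~18.75 TFLOPS per core corresponds to roughly 2.3 billion tile operations per second.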

| Da Vinci Core Component | Function | 910B Specification (est.) |
| --- | --- | --- |
| Cube Unit | 16x16x16 matrix multiply | ~18.75 TFLOPS FP16 per core |
| Vector Unit | Activation, normalization, element-wise ops | 128-bit vector width (est.) |
| Scalar Unit | Control flow, index computation | Standard RISC pipeline |
| Unified Buffer | On-chip SRAM scratch pad | ~256KB per core (est.) |
| Total Cores | Full chip | 32 Da Vinci cores (est.) |

The per-core FP16 throughput of approximately 18.75 TFLOPS x 32 cores yields the chip-level ~600 TFLOPS FP16 estimate. By comparison, an NVIDIA H100 SM (Streaming Multiprocessor) delivers roughly 7.5 TFLOPS of FP16 tensor compute, and 132 SMs yield the 990 TFLOPS total (excluding sparsity). The 910B achieves fewer total TFLOPS with fewer but wider cores - a design choice that simplifies the on-chip interconnect and reduces control overhead at the cost of fewer independent execution units.

Memory Subsystem. The 910B's 64GB HBM2e memory is organized in 4 stacks:

| Memory Property | Ascend 910B | NVIDIA A100 80GB | NVIDIA H100 SXM |
| --- | --- | --- | --- |
| Total Capacity | 64GB | 80GB | 80GB |
| HBM Generation | HBM2e | HBM2e | HBM3 |
| Number of Stacks | 4 (est.) | 5 | 5 |
| Per-Stack Bandwidth | ~300 GB/s | ~408 GB/s | ~670 GB/s |
| Aggregate Bandwidth | ~1,200 GB/s | 2,039 GB/s | 3,350 GB/s |
| Bandwidth-to-Compute Ratio | 2.0 GB/s per TFLOP | 6.5 GB/s per TFLOP | 3.4 GB/s per TFLOP |

The bandwidth-to-compute ratio tells an important story. At 2.0 GB/s per TFLOP, the 910B is more compute-bound relative to its memory system than either the A100 or H100. This means the compute engine is often waiting for data, particularly during inference where memory bandwidth is the primary bottleneck. For training with large batch sizes (where compute dominates), the ratio matters less because the matrix multiplications are large enough to keep the cube units busy.
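
This is the standard roofline argument. A small sketch using this article's estimated 910B figures shows which regime a kernel falls into, given its arithmetic intensity (FLOPs per byte moved from memory):

```python
def attainable_tflops(flops_per_byte: float,
                      peak_tflops: float = 600, bw_gbs: float = 1200) -> float:
    """Roofline model: achievable throughput is the lesser of peak compute
    and (bandwidth x arithmetic intensity). Defaults are this article's
    910B estimates."""
    bandwidth_bound = bw_gbs * flops_per_byte / 1000  # GB/s * FLOP/byte -> TFLOPS
    return min(peak_tflops, bandwidth_bound)

# Batch-1 decode with 8-bit weights streams every parameter per token:
# roughly 2 FLOPs per byte moved, deep in the bandwidth-bound regime.
print(attainable_tflops(2))    # 2.4 TFLOPS - a small fraction of peak
# Large-batch training matmuls can reach hundreds of FLOPs per byte.
print(attainable_tflops(500))  # 600 TFLOPS - hits the compute roof
```

Swapping in the H100's figures (990 TFLOPS, 3,350 GB/s) shows why the same decode workload runs several times faster there despite a smaller peak-compute gap.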

HCCS Interconnect Architecture. Huawei's HCCS (Huawei Cache Coherence System) connects multiple 910B chips for distributed workloads:

| HCCS Property | 910B (est.) | Notes |
| --- | --- | --- |
| Bidirectional Bandwidth per Link | ~56 GB/s (est.) | Higher than NVLink 4.0's ~50 GB/s per link, but far fewer links per chip |
| Total Links per Chip | ~6 (est.) | Configurable per platform |
| Aggregate Bidirectional Bandwidth | ~336 GB/s (est.) | vs. 900 GB/s for NVLink 4.0 |
| Max Chips per Node (Atlas 900) | 8 | Similar to DGX H100 |
| Inter-node Communication | RoCE v2 (RDMA over Converged Ethernet) | No InfiniBand option |

The ~336 GB/s aggregate interconnect bandwidth (estimated) is roughly 37% of NVLink 4.0's 900 GB/s. This gap is the 910B's most significant architectural limitation for large-scale training, where all-reduce operations across many chips require high bandwidth and low latency. DeepSeek's training innovations on Ascend hardware focused heavily on algorithmic techniques to reduce communication volume - gradient compression, communication-computation overlap, and asynchronous all-reduce - specifically to work around this bottleneck.
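
One of those techniques, top-k gradient sparsification, is easy to sketch. This is a toy illustration of the general idea, not DeepSeek's actual implementation:

```python
import numpy as np

def topk_compress(grad: np.ndarray, k_frac: float = 0.01):
    """Keep only the largest-magnitude k% of gradient entries;
    send (indices, values) over the interconnect instead of the dense tensor."""
    flat = grad.ravel()
    k = max(1, int(flat.size * k_frac))
    idx = np.argpartition(np.abs(flat), -k)[-k:]  # indices of k largest magnitudes
    return idx, flat[idx], grad.shape

def topk_decompress(idx, vals, shape):
    """Rebuild a dense (mostly zero) gradient from the sparse message."""
    flat = np.zeros(int(np.prod(shape)), dtype=vals.dtype)
    flat[idx] = vals
    return flat.reshape(shape)

g = np.random.randn(1024, 1024).astype(np.float32)
idx, vals, shape = topk_compress(g, k_frac=0.01)  # ~100x less data on the wire
g_hat = topk_decompress(idx, vals, shape)
```

Production systems add error feedback - accumulating the dropped residual locally and re-injecting it into the next step - so the compression does not silently discard gradient signal.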

SMIC 7nm Manufacturing. The 910B is manufactured on SMIC's N+2 process, which is SMIC's most advanced production node. Key characteristics:

| Process Property | SMIC N+2 (7nm DUV) | TSMC N7+ (7nm EUV) | Difference |
| --- | --- | --- | --- |
| Lithography | DUV multi-patterning | EUV single-patterning | Higher cost, lower yield on SMIC |
| Transistor Density | ~85-90 MTr/mm² (est.) | ~96 MTr/mm² | ~10% lower on SMIC |
| Metal Layers | Comparable | Comparable | Similar backend |
| Defect Density | Higher (DUV multi-patterning) | Lower | Lower yield on SMIC |
| Wafer Cost | Higher per good die | Lower per good die | ~30-50% cost premium (est.) |

The DUV multi-patterning approach requires exposing each critical layer multiple times with different masks, aligned with nanometer precision. This increases cost per wafer and reduces yield compared to EUV single-patterning. SMIC has reportedly achieved acceptable yields for the 910B die size, but the process is inherently more expensive per transistor than TSMC's equivalent node.

Atlas Server Platform. The 910B is deployed through Huawei's Atlas server line:

| Platform | Configuration | Total HBM | Power | Target |
| --- | --- | --- | --- | --- |
| Atlas 800 (Model 9000) | 8x 910B | 512GB | ~3,200W (GPUs) | Training clusters |
| Atlas 800 (Model 9010) | 8x 910B | 512GB | ~3,200W (GPUs) | Inference serving |
| Atlas 300T Training Card | 1x 910B | 64GB | ~400W | PCIe add-in card |
| Atlas 900 PoD | 64x 910B | 4TB | ~25.6kW (GPUs) | Large-scale training |

The Atlas 800 with 8x 910B provides 512GB of aggregate HBM2e - sufficient to hold a 70B parameter model in BF16 across the 8 chips with tensor parallelism. The Atlas 900 PoD configuration connects 8 Atlas 800 nodes for a total of 64 chips with 4TB of aggregate memory, suitable for training models up to approximately 200B parameters with appropriate parallelism strategies.
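
The sizing claims above come from simple arithmetic. A hedged helper - weights only, ignoring activations, optimizer state, and KV cache, all of which consume substantial additional memory in practice:

```python
def fits_in_cluster(params_b: float, bytes_per_param: float,
                    chips: int, hbm_per_chip_gb: float = 64) -> bool:
    """Weights-only check: do the sharded parameters fit in aggregate HBM?
    One billion parameters at N bytes each is N GB."""
    total_gb = params_b * bytes_per_param
    return total_gb <= chips * hbm_per_chip_gb

# 70B model in BF16 (2 bytes/param) on an 8-chip Atlas 800: 140GB vs 512GB
print(fits_in_cluster(70, 2, chips=8))    # True
# 200B in BF16 on a 64-chip Atlas 900 PoD: 400GB vs 4,096GB
print(fits_in_cluster(200, 2, chips=64))  # True
```

The generous headroom in both cases is what training actually consumes: gradients, optimizer states, and activations typically multiply the weights-only footprint several times over.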

Software Stack Architecture. The CANN software stack is layered, with each level providing increasing abstraction:

| CANN Layer | Function | Equivalent in CUDA Ecosystem |
| --- | --- | --- |
| Ascend Hardware Driver | Low-level device control | NVIDIA kernel driver |
| ACL (Ascend Computing Language) | Runtime API, memory management | CUDA Runtime API |
| AscendCL Operators | Optimized kernels for common operations | cuDNN, cuBLAS |
| CANN Fusion Engine | Graph optimization, operator fusion | TensorRT graph optimizer |
| Framework Adapters | PyTorch/TF/MindSpore integration | PyTorch CUDA backend |
| MindSpore | Huawei's native AI framework | No direct equivalent |

The CANN stack has improved significantly since its initial release with the 910B. Early versions required MindSpore as the primary framework, but current CANN releases provide a mature PyTorch adapter that handles most standard operations. The remaining gaps are in specialized operators - custom attention implementations, sparse operations, and quantization-specific kernels that have been extensively tuned for CUDA but receive less optimization effort on CANN.
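
In practice, the PyTorch adapter ships as the torch_npu package, which registers an "npu" device type with PyTorch. The helper below is an illustrative sketch (exact entry points vary by CANN and torch_npu release) that falls back gracefully on machines without Ascend hardware:

```python
def pick_device() -> str:
    """Prefer an Ascend NPU via the torch_npu adapter, then CUDA, then CPU.
    Guarded imports let the same script run on any machine."""
    try:
        import torch
        try:
            import torch_npu  # noqa: F401 - Huawei's Ascend adapter for PyTorch
            if torch.npu.is_available():  # registered by torch_npu on import
                return "npu:0"
        except ImportError:
            pass
        if torch.cuda.is_available():
            return "cuda:0"
    except ImportError:
        pass
    return "cpu"

device = pick_device()
print(device)  # "npu:0" on an Ascend host; otherwise "cuda:0" or "cpu"
```

Model and tensor placement then works as with any PyTorch device string (`model.to(device)`), which is precisely what makes the adapter approach attractive for porting CUDA codebases.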

Real-World Performance Analysis

DeepSeek Training Performance. The 910B's most important performance data comes from its role in DeepSeek's training infrastructure. While DeepSeek's V3 technical report primarily describes training on 2,048 NVIDIA H800 GPUs, multiple industry sources confirm that DeepSeek also operates significant Ascend 910B clusters for experimentation, data processing, and model development.

| Training Metric | Ascend 910B (est.) | NVIDIA H800 (measured, DeepSeek V3 report) | Ratio |
| --- | --- | --- | --- |
| Model FLOPs Utilization (MFU) | ~35-42% | ~54-58% | 0.65-0.72x |
| Tokens/sec/chip (DeepSeek V3 scale) | ~2,000-2,500 (est.) | ~4,200-4,800 | ~0.5x |
| Training days for 14.8T tokens (2K chips) | ~75-100 days (est.) | ~55 days (reported) | ~1.4-1.8x longer |
| Communication overhead (fraction of step time) | ~25-35% (est.) | ~15-20% | Higher on Ascend |

The Model FLOPs Utilization gap (35-42% vs 54-58%) is telling. It reflects both the HCCS bandwidth limitation (more time spent on communication) and the CANN software stack's lower optimization level compared to CUDA (less efficient kernel execution). The combination means that achieving comparable training throughput on 910B requires approximately 2x the chip count of H800/H100 - which is a viable strategy if the chips are available at lower cost per unit.
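
The "approximately 2x the chip count" conclusion falls out of the peak-FLOPS and MFU estimates directly. A sketch using the midpoints of this article's estimated ranges:

```python
import math

def chips_needed(target_effective_tflops: float, peak_tflops: float, mfu: float) -> int:
    """Chips required to sustain a target *effective* throughput, given
    per-chip peak FLOPS and achieved Model FLOPs Utilization."""
    return math.ceil(target_effective_tflops / (peak_tflops * mfu))

# Match a 2,048-chip H800 cluster at ~56% MFU (midpoint of the table's range)
effective = 2048 * 990 * 0.56
n_910b = chips_needed(effective, peak_tflops=600, mfu=0.38)  # 910B midpoint MFU
print(n_910b)  # 4980 chips, i.e. ~2.4x the H800 count
```

The multiplier is sensitive to the MFU assumption: at the optimistic end of the 910B range (~42%), the ratio drops toward 2x, which is where the common rule of thumb comes from.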

Inference Serving in Production. Chinese cloud providers and AI companies have deployed 910B chips for production inference serving. Approximate performance numbers based on reported benchmarks:

| Model | Precision | 910B (single chip) | H100 SXM (single chip) | Notes |
| --- | --- | --- | --- | --- |
| ChatGLM-6B | FP16 | ~55-70 tok/s | ~150-200 tok/s | Small model, compute-limited |
| Qwen-14B | INT8 | ~25-35 tok/s | ~70-90 tok/s | Mid-size model |
| Baichuan 2-13B | INT8 | ~28-38 tok/s | ~75-95 tok/s | Popular Chinese model |
| Yi-34B | INT8 (~34GB) | ~15-20 tok/s | ~55-70 tok/s | Fits in 64GB |
| Llama 70B | INT8 (~70GB) | N/A (OOM) | ~40-50 tok/s | Exceeds 910B's 64GB |

The 64GB memory constraint is the defining limitation for 910B inference. Any model that exceeds 64GB in its target precision - which includes all 70B+ models at INT8 - requires multi-chip tensor parallelism on the 910B, with associated HCCS communication overhead. The 910C's 96GB upgrade directly addresses this by enabling INT8 serving of 70B-class models on a single chip.
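
The fits-or-doesn't arithmetic is simple - weights only, so real headroom is smaller once the KV cache and activations are accounted for:

```python
def weights_gb(params_billions: float, bits: int) -> float:
    """Weights-only footprint in GB: one billion parameters at 8 bits is ~1GB."""
    return params_billions * bits / 8

for model, params in [("Yi-34B", 34), ("Llama 70B", 70)]:
    size = weights_gb(params, bits=8)  # INT8
    verdict = "fits on one 910B" if size <= 64 else "exceeds 64GB, needs tensor parallelism"
    print(f"{model}: ~{size:.0f}GB at INT8 -> {verdict}")
```

The same arithmetic shows why the 910C's 96GB matters: 70GB of INT8 weights fits with room left for the KV cache.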

Power Efficiency Analysis. The 910B's 400W TDP delivers an interesting efficiency profile:

| Metric | Ascend 910B | NVIDIA A100 (400W) | NVIDIA H100 (700W) | MI300X (750W) |
| --- | --- | --- | --- | --- |
| FP16 TFLOPS | ~600 | 312 | 990 | 1,307 |
| TDP | 400W | 400W | 700W | 750W |
| TFLOPS/Watt (FP16) | ~1.50 | 0.78 | 1.41 | 1.74 |
| Memory Bandwidth/Watt | 3.0 GB/s/W | 5.1 GB/s/W | 4.8 GB/s/W | 7.1 GB/s/W |
| Price-per-Watt | $20-$30/W | $25-$38/W | $36-$57/W | $13-$20/W |

The 910B achieves competitive TFLOPS per watt for FP16 compute, actually exceeding the H100 on this specific metric. However, the bandwidth-per-watt story is much weaker - the 910B's 3.0 GB/s/W is the lowest in the comparison. For inference workloads where bandwidth determines throughput, the 910B is actually the least power-efficient option despite its low absolute wattage.

Generational and Competitive Context

vs. Huawei Ascend 910C. The 910C is a 50% memory and bandwidth upgrade with ~33% more compute, at roughly 50% higher price. For new deployments, the 910C is the better value. For existing 910B installations, the upgrade decision depends on workload: if 64GB memory is sufficient (models at 34B and below in INT8), the 910B remains viable. If you need to serve 70B models on single chips, the 910C's 96GB is essential.

vs. NVIDIA A100 80GB. The A100 was the 910B's direct competitor before export controls removed it from the Chinese market. On paper, the A100 has lower FP16 compute (312 vs ~600 TFLOPS) but 25% more memory (80GB vs 64GB) and 70% more bandwidth (2,039 vs ~1,200 GB/s). In practice, the A100's much more mature CUDA software stack meant that real-world workloads typically ran faster on A100 than on 910B despite the lower peak TFLOPS. This performance gap has narrowed as CANN has matured, but the A100 remains the stronger chip for organizations that have the choice. Chinese organizations no longer have that choice.

vs. NVIDIA H100. The H100 outperforms the 910B by approximately 2x on most workloads. The comparison is academic for the Chinese market because the H100 is not available. But it matters for understanding what performance level Chinese AI is operating at - and what gaps DeepSeek and others are working to close through algorithmic and software innovation rather than hardware brute force.

vs. AMD MI300X. The MI300X is in an entirely different performance class: 3x the memory (192GB), 4.4x the bandwidth (5,300 GB/s), and 2.2x the FP16 compute (1,307 TFLOPS). Like the H100, the MI300X is subject to export controls and unavailable in China.

Export Control Context. The 910B was the first mass-produced Ascend chip after the October 2022 export controls. Its existence proved that Chinese AI development could continue despite being cut off from NVIDIA and AMD hardware. The strategic significance of this cannot be overstated - the 910B demonstrated that SMIC's 7nm DUV process could produce functional AI accelerators in volume, that Huawei's Da Vinci architecture was viable for frontier workloads, and that the CANN software stack could support production training. Every subsequent Ascend chip builds on this foundation.

The October 2023 export control update added bandwidth thresholds and computing density limits to the original compute-threshold restrictions. The 910B falls within the updated thresholds primarily because its bandwidth is significantly lower than Western alternatives. Future Ascend generations that attempt to close the bandwidth gap may face additional scrutiny.

The 910B in the Context of China's AI Hardware Landscape. The 910B is not the only Chinese AI chip, but it is the most significant. Other domestic options include:

| Chinese AI Chip | Manufacturer | FP16 Compute (est.) | Memory | Status |
| --- | --- | --- | --- | --- |
| Ascend 910B | Huawei (HiSilicon) | ~600 TFLOPS | 64GB HBM2e | Mass production, 50K+ deployed |
| Ascend 910C | Huawei (HiSilicon) | ~800 TFLOPS | 96GB HBM2e | Production ramp |
| Biren BR100 | Biren Technology | ~512 TFLOPS (claimed) | 64GB HBM2e | Limited production |
| Enflame Cloudblazer i20 | Enflame | ~256 TFLOPS (est.) | 32GB HBM2e | Production |
| Moore Threads MTT S4000 | Moore Threads | ~200 TFLOPS (est.) | 48GB GDDR6X | Production (GPU) |

The Ascend line dominates the Chinese AI accelerator market in both deployment volume and performance. Biren's BR100 was initially competitive on paper but faced its own sanctions-related challenges and has not achieved the same production scale. The 910B's scale of deployment (50,000+ chips) dwarfs all other Chinese AI chip programs combined.

CANN Ecosystem Maturity Timeline. The CANN software stack has evolved significantly since the 910B's launch:

| Timeline | CANN Milestone | Impact |
| --- | --- | --- |
| H2 2023 | 910B launch, CANN 6.x | Basic PyTorch support, MindSpore primary |
| Q1 2024 | CANN 7.0 | Improved PyTorch compatibility, more operators |
| Q3 2024 | DeepSeek begins 910B optimization | Forces rapid operator development |
| Q4 2024 | CANN 7.x / 910C launch | INT8 quantization improvements, better distributed training |
| 2025 | CANN 8.x | Enhanced PyTorch integration, DeepSeek V4 optimization |

Each milestone has been driven by real production demands from Chinese AI companies. The virtuous cycle of demand (from DeepSeek, Baidu, Alibaba) forcing CANN improvements which attract more users is the same dynamic that built the CUDA ecosystem - just compressed into 2 years instead of 15.

910B Deprecation Timeline. The 910B is not being actively deprecated, but Huawei's focus has shifted to the 910C for new production. The practical implications:

| Timeline | 910B Status | Recommendation |
| --- | --- | --- |
| 2025 H1 | Actively supported, new chips available | Viable for cluster expansion |
| 2025 H2 | 910C ramp accelerates, 910B production slows | New deployments should prefer 910C |
| 2026 | CANN optimization focus shifts to 910C/next-gen | Existing 910B clusters remain functional |
| 2027+ | Legacy support, no new hardware development | Plan migration for critical workloads |

Organizations with large 910B installations should plan a gradual transition to 910C for inference-critical workloads (where the 96GB memory matters most) while continuing to use 910B for training clusters where per-chip cost matters more than per-chip performance.

Power and Cooling Comparison. The 910B's 400W TDP is notably lower than competing accelerators, which translates to meaningful differences in data center design and operational cost:

| Metric | Ascend 910B | Ascend 910C | NVIDIA H100 |
| --- | --- | --- | --- |
| TDP per chip | 400W | 600W | 700W |
| 8-chip server power (GPUs only) | 3,200W | 4,800W | 5,600W |
| Annual power cost (8 chips, $0.08/kWh, 80% util) | ~$1,794 | ~$2,691 | ~$3,140 |
| Cooling approach | Air or liquid | Liquid recommended | Liquid recommended |
| Rack density limitation | Moderate | Higher | Highest |

For Chinese data centers where electricity costs are typically $0.06-$0.10/kWh, the 910B's lower power consumption is a real operational advantage. A 2,000-chip 910B deployment saves roughly $250,000-$420,000 per year in electricity compared to an equivalent 2,000-chip H100 deployment - about $170 per chip per year at $0.08/kWh. Over a multi-year deployment, those savings accumulate and partially offset the performance gap, though they remain small next to the hardware cost.
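
The annual cost figures are simple arithmetic over TDP, utilization, and electricity price:

```python
def annual_power_cost(watts: float, price_per_kwh: float, utilization: float) -> float:
    """Electricity cost per year for a given average power draw."""
    kwh_per_year = watts / 1000 * 8760 * utilization  # 8,760 hours in a year
    return kwh_per_year * price_per_kwh

# 8x 910B (3,200W of accelerators) vs 8x H100 (5,600W), $0.08/kWh, 80% util
for label, watts in [("8x 910B", 3200), ("8x H100", 5600)]:
    print(f"{label}: ${annual_power_cost(watts, 0.08, 0.8):,.0f}/year")
```

Note this covers accelerator draw only; host power and cooling overhead (PUE) scale the real bill up further.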

DeepSeek's Impact on the Ascend Ecosystem. DeepSeek's use of 910B chips catalyzed the entire Ascend software ecosystem in several ways:

  1. CANN operator coverage expanded - DeepSeek's frontier model training required operators that CANN initially lacked, forcing Huawei to accelerate development
  2. Multi-chip training at scale proved viable - Running 1,000+ 910B clusters for training validated HCCS at production scale
  3. Communication-efficient algorithms emerged - Techniques developed for 910B's limited HCCS bandwidth (gradient compression, async all-reduce) are now used across the Ascend ecosystem
  4. Ecosystem credibility increased - DeepSeek V3's competitive benchmark results proved that Ascend-trained models are not inherently inferior

Use Case Recommendations

Strong Fit:

  • Chinese AI companies needing large-scale training clusters. The 910B's lower price ($8,000-$12,000 vs the 910C's $12,000-$18,000) makes it economical for building large clusters where per-chip performance matters less than aggregate throughput. If you need 2,000+ chips for distributed training, the 910B's cost advantage is significant.
  • Inference serving of sub-34B parameter models. At 64GB, the 910B can hold models up to ~34B parameters in INT8 on a single chip. For Chinese inference providers serving models like ChatGLM-6B, Qwen-14B, or Baichuan-13B, the 910B provides adequate single-chip capacity at the lowest price point in the Ascend line.
  • Organizations with existing 910B CANN expertise. If your team has already invested in CANN optimization for 910B, the incremental cost of additional 910B chips (versus the higher cost of 910C chips) may be justified for workloads that fit within 64GB.
  • Government and research institutions with guaranteed procurement. Chinese government AI procurement programs often specify Ascend hardware. The 910B's large installed base (50,000+ chips estimated) means better supply certainty and more reference architectures than the newer 910C.

Weak Fit:

  • Inference serving of 70B+ parameter models. The 64GB memory hard-caps the model sizes that fit on a single chip. Even with INT8 quantization, a 70B model (~70GB) exceeds the 910B's capacity. The 910C with 96GB is the minimum for this use case on Ascend hardware.
  • Workloads requiring maximum memory bandwidth. At 1,200 GB/s, the 910B has the lowest memory bandwidth of any current-generation AI accelerator in this comparison set. Memory-bandwidth-bound workloads (LLM inference decode, long-context attention) will be significantly slower per chip than on any alternative.
  • New deployments when 910C is available. Unless budget is the primary constraint, new Ascend deployments should prefer the 910C for its 50% memory and bandwidth improvements. The 910B is best used for expanding existing 910B clusters rather than starting new ones.
  • Organizations outside China. The 910B is unavailable outside China, and CANN documentation and community support are primarily in Chinese. Western organizations should use MI300X, H100, or Google TPU alternatives.
  • Latency-sensitive real-time inference. The 910B's lower bandwidth and less-optimized inference stack make it poorly suited for applications requiring sub-100ms token generation latency. The per-token decode time on 910B is roughly 2-3x slower than on H100 for equivalent models.
  • Mixed workloads spanning AI and general-purpose computing. The 910B is an ASIC optimized for matrix multiplication. Unlike NVIDIA GPUs, it cannot handle general-purpose GPU compute, rendering, or simulation workloads. If your infrastructure needs to serve dual purposes, a GPU-based solution is more flexible.

Strengths

  • Proven at frontier-model scale - DeepSeek's training validation is the strongest possible credential
  • Lowest absolute price in the comparison at $8,000-$12,000 per accelerator
  • 400W TDP provides competitive power efficiency (TFLOPS per watt)
  • Manufactured entirely in China - immune to US export control supply disruptions
  • Established CANN software ecosystem with production-grade PyTorch support
  • Large installed base (50,000+ chips estimated) creates network effects for software optimization
  • Serves as the foundation for China's independent AI hardware ecosystem

Weaknesses

  • 64GB HBM2e limits single-chip model deployment - cannot fit 70B models even with INT8 quantization
  • Memory bandwidth (~1,200 GB/s) is only 36% of the H100's 3,350 GB/s - a severe inference bottleneck
  • ~600 TFLOPS FP16 is roughly 60% of the H100 and 46% of the MI300X
  • CANN software ecosystem is significantly smaller and less mature than CUDA
  • No confirmed FP8 hardware support limits quantized inference performance
  • HCCS interconnect bandwidth trails NVLink substantially for multi-chip training communication
  • Already superseded by the 910C - new deployments are shifting to the newer chip
