Huawei Ascend 910B

TL;DR

  • Huawei's workhorse AI accelerator - the chip that reportedly powered DeepSeek V3's training alongside NVIDIA hardware
  • 64GB HBM2e memory with ~1,200 GB/s bandwidth and an estimated ~600 TFLOPS FP16 compute
  • Built on SMIC's 7nm process - designed to be manufacturable entirely within China under US sanctions
  • 400W TDP makes it significantly more power-efficient than Western alternatives at the cost of lower peak performance
  • Estimated 50,000+ chips deployed across Chinese hyperscalers and government AI projects

Overview

The Huawei Ascend 910B is the chip that answered the question everyone was asking after the US imposed export controls on advanced AI hardware: can China build its own AI accelerators that are good enough to train frontier models? The answer, demonstrated by DeepSeek and others throughout 2024-2025, is yes - with caveats. The 910B is not an H100 killer. It has roughly 60% of the H100's compute, 36% of its memory bandwidth, and a software ecosystem that is years behind CUDA. But it works, it is available in volume, and it has trained models that compete with the best in the world.

Released in H2 2023, the 910B was Huawei's first production AI accelerator after the October 2022 US export controls blocked Chinese companies from purchasing NVIDIA's A100 and H100 chips. The timing was not coincidental - Huawei had been developing Ascend silicon since 2018, and the export controls transformed what was a nice-to-have domestic alternative into a strategic necessity. The chip packs 64GB of HBM2e memory and delivers an estimated 600 TFLOPS of FP16 compute, placing it roughly between the NVIDIA A100 and H100 in performance.

What makes the 910B historically significant is DeepSeek. Multiple reports indicate that DeepSeek used Ascend 910B chips alongside pre-sanctions NVIDIA hardware to train their V3 and V3.1 models. The V3 technical report describes training on 2,048 NVIDIA H800 GPUs (export-compliant variants of the H100), but DeepSeek's broader infrastructure reportedly includes significant Ascend capacity. The company's decision to optimize DeepSeek V4 specifically for Ascend hardware rather than NVIDIA GPUs is a direct extension of the experience gained running production workloads on the 910B. In this sense, the 910B is the chip that bootstrapped China's independent AI hardware ecosystem.

Key Specifications

| Specification | Details |
| --- | --- |
| Manufacturer | Huawei (HiSilicon) |
| Product Family | Ascend 910 |
| Architecture | Da Vinci |
| Process Node | SMIC 7nm (N+2) |
| Chip Type | ASIC |
| AI Cores | 32 Da Vinci cores (estimated) |
| FP16 Performance | ~600 TFLOPS (estimated) |
| BF16 Performance | ~600 TFLOPS (estimated) |
| INT8 Performance | ~1,200 TOPS (estimated) |
| Memory | 64GB HBM2e |
| Memory Stacks | 4x HBM2e (estimated) |
| Memory Bandwidth | ~1,200 GB/s |
| Interconnect | HCCS (Huawei Cache Coherence System) |
| TDP | 400W |
| Form Factor | Proprietary module |
| Software Stack | CANN (Compute Architecture for Neural Networks) |
| Target Workload | Training and inference |
| Release Date | H2 2023 |
| Estimated Price | $8,000-$12,000 |

Note: Huawei does not publish detailed specifications for the Ascend 910B. Performance figures are based on third-party analysis, leaked testing data, and industry estimates. Actual specifications may differ.

Performance Benchmarks (Estimated)

| Benchmark / Metric | Ascend 910B | NVIDIA H100 SXM | NVIDIA A100 80GB | Ascend 910C |
| --- | --- | --- | --- | --- |
| FP16 Peak (TFLOPS) | ~600 | 990 | 312 | ~800 |
| Memory Capacity | 64GB | 80GB | 80GB | 96GB |
| Memory Bandwidth | ~1,200 GB/s | 3,350 GB/s | 2,039 GB/s | ~1,800 GB/s |
| LLM Inference (relative to H100) | ~0.4-0.5x | 1.0x (baseline) | ~0.5-0.6x | ~0.6-0.7x |
| Training Throughput (relative to H100) | ~0.3-0.4x | 1.0x (baseline) | ~0.5x | ~0.5-0.6x |
| Power (TDP) | 400W | 700W | 400W | 600W |
| Power Efficiency (TFLOPS/W) | ~1.5 | ~1.4 | ~0.78 | ~1.33 |
| Price (estimated) | $8,000-$12,000 | $25,000-$40,000 | $10,000-$15,000 | $12,000-$18,000 |

The 910B's performance relative to the H100 is often characterized as "roughly half," but that oversimplifies the picture. For compute-bound training on large batch sizes, the gap is closer to 2.5-3x because the H100's higher compute and memory bandwidth compound. For smaller models and memory-capacity-bound inference where 64GB is sufficient, the gap narrows to 1.5-2x. The workload-dependent performance variation is wider on the 910B than on NVIDIA hardware because CANN's operator library is less comprehensively optimized.

The power efficiency story is interesting. At 400W TDP with ~600 TFLOPS FP16, the 910B achieves approximately 1.5 TFLOPS per watt - slightly better than the H100's ~1.4 TFLOPS per watt. This is a consequence of the lower clock speeds and SMIC 7nm's power characteristics. For data center operators where electricity cost is a meaningful factor, the 910B's efficiency partially offsets its lower absolute performance. Over a multi-year deployment, the TCO calculation can be closer than the raw performance numbers suggest.
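
The arithmetic behind those efficiency figures is straightforward. A quick Python check, using this article's estimated peak specs rather than vendor-published numbers:

```python
# TFLOPS-per-watt check using this article's estimated peak figures.
def tflops_per_watt(fp16_tflops: float, tdp_watts: float) -> float:
    """Peak FP16 throughput per watt of rated TDP."""
    return fp16_tflops / tdp_watts

chips = {
    "Ascend 910B": (600, 400),
    "NVIDIA H100 SXM": (990, 700),
    "NVIDIA A100 80GB": (312, 400),
}
for name, (tflops, tdp) in chips.items():
    print(f"{name}: ~{tflops_per_watt(tflops, tdp):.2f} TFLOPS/W")
```

Keep in mind this compares peak compute per watt only; as the bandwidth-per-watt comparison later in this article shows, the picture inverts for bandwidth-bound inference.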

Key Capabilities

DeepSeek Training Validation. The 910B's strongest credential is that it has been used to train models that compete with the best in the world. DeepSeek's V3 family of models, which achieve frontier-level performance on numerous benchmarks, were trained on infrastructure that includes Ascend 910B chips. This is not a synthetic benchmark result or a marketing claim - it is a real-world demonstration that the 910B can handle the demands of frontier model training when combined with sufficient scale and software optimization. The experience DeepSeek gained on the 910B directly informed their decision to build V4 primarily for Ascend hardware.

CANN Ecosystem Foundation. The 910B is where Huawei's CANN software stack had to grow up. Early CANN releases were rough - limited operator coverage, frequent compatibility issues with PyTorch, and debugging tools that were primitive compared to CUDA's ecosystem. But the pressure of real production workloads from DeepSeek, Baidu, and others forced rapid iteration. By the time the 910C launched, CANN had matured to the point where frontier model training was feasible without heroic engineering effort. The 910B bore the cost of that maturation process, and current 910B deployments benefit from the improvements driven by early adopter pain.

Volume Deployment at Scale. Estimates suggest 50,000 or more Ascend 910B chips have been deployed across China's AI infrastructure. This installed base matters because it creates a self-reinforcing ecosystem - more chips deployed means more developers writing CANN code, which means better libraries and tools, which makes the next deployment easier. Chinese cloud providers including Huawei Cloud, Alibaba Cloud, and Baidu Cloud offer 910B-based instances for AI training. Government-funded computing centers and research institutions have also adopted the 910B as part of China's strategic push for AI hardware independence.

Pricing and Availability

The Ascend 910B is estimated to cost between $8,000 and $12,000 per accelerator, though actual pricing varies based on volume, customer relationship, and government subsidies. Huawei does not publish official MSRPs. The 910B is available through Huawei's Atlas 800 and Atlas 900 server platforms, as well as through Chinese cloud providers.

| Accelerator | Estimated Price | Memory | FP16 TFLOPS | TDP | Price per TFLOP |
| --- | --- | --- | --- | --- | --- |
| Huawei Ascend 910B | $8,000-$12,000 | 64GB | ~600 | 400W | $13-$20 |
| Huawei Ascend 910C | $12,000-$18,000 | 96GB | ~800 | 600W | $15-$23 |
| NVIDIA A100 80GB | $10,000-$15,000 | 80GB | 312 | 400W | $32-$48 |
| NVIDIA H100 SXM | $25,000-$40,000 | 80GB | 990 | 700W | $25-$40 |
| AMD MI300X | $10,000-$15,000 | 192GB | 1,307 | 750W | $8-$11 |

On a price-per-TFLOP basis, the 910B is competitive with the NVIDIA A100 and significantly cheaper than the H100. However, this metric does not account for the software ecosystem overhead - achieving those theoretical TFLOPS on the 910B requires CANN optimization work that is not necessary on CUDA. The effective price-per-useful-TFLOP is higher than the hardware numbers suggest, particularly for organizations porting existing CUDA workloads.
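
The price-per-TFLOP column follows directly from the estimated street prices and peak figures above:

```python
# Price-per-TFLOP from the table above. Prices are this article's street
# estimates; TFLOPS are estimated FP16 peak, not sustained throughput.
def price_per_tflop(price_low: float, price_high: float, tflops: float):
    """Return the ($low, $high) cost range per peak FP16 TFLOP."""
    return price_low / tflops, price_high / tflops

for name, (p_lo, p_hi, tflops) in {
    "Ascend 910B": (8_000, 12_000, 600),
    "NVIDIA H100 SXM": (25_000, 40_000, 990),
}.items():
    lo, hi = price_per_tflop(p_lo, p_hi, tflops)
    print(f"{name}: ${lo:.0f}-${hi:.0f} per TFLOP")
```

To estimate price per *useful* TFLOP, divide the peak figure by an achieved-utilization factor (MFU) before dividing into the price - on the 910B that factor is lower, which is the software-overhead point made above.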

The 910B is effectively unavailable outside of China. US export controls and Huawei's entity listing mean that the chip cannot be exported to most markets. Within China, availability has been reasonably good, though Huawei prioritizes large orders from strategic customers.

Architecture Deep Dive

The Ascend 910B's Da Vinci architecture was designed to maximize AI compute throughput on a process node that most Western chip designers would consider two generations behind. Understanding how Huawei extracted competitive performance from SMIC 7nm requires looking at the architectural trade-offs that HiSilicon made.

Da Vinci Core Design. Each Da Vinci core in the 910B is built around a 3D Cube Computing Engine - Huawei's name for a systolic array structure that performs dense matrix multiplication. The cube operates on 16x16x16 matrices per cycle, with native FP16, BF16, and INT8 data type support.
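
To make the cube's tiling concrete, here is a minimal numpy sketch of how a larger matrix multiply decomposes into the 16x16x16 tiles a cube-style engine consumes (illustrative only - the real engine pipelines these tiles in hardware rather than looping in software):

```python
import numpy as np

TILE = 16  # the cube engine consumes one 16x16x16 tile per cycle

def tiled_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """C = A @ B accumulated tile by tile, mirroring how a cube-style
    systolic engine walks the M, N, and K dimensions in 16-wide steps."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2 and m % TILE == 0 and n % TILE == 0 and k % TILE == 0
    c = np.zeros((m, n), dtype=np.float32)
    for i in range(0, m, TILE):
        for j in range(0, n, TILE):
            for p in range(0, k, TILE):  # accumulate partial products over K
                c[i:i+TILE, j:j+TILE] += a[i:i+TILE, p:p+TILE] @ b[p:p+TILE, j:j+TILE]
    return c

a = np.random.rand(64, 32).astype(np.float32)
b = np.random.rand(32, 48).astype(np.float32)
assert np.allclose(tiled_matmul(a, b), a @ b, atol=1e-4)
```

One tile is 2 x 16^3 = 8,192 FLOPs, so the estimated ~18.75 TFLOPS per core corresponds to roughly 2.3 billion tile operations per second.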

| Da Vinci Core Component | Function | 910B Specification (est.) |
| --- | --- | --- |
| Cube Unit | 16x16x16 matrix multiply | ~18.75 TFLOPS FP16 per core |
| Vector Unit | Activation, normalization, element-wise ops | 128-bit vector width (est.) |
| Scalar Unit | Control flow, index computation | Standard RISC pipeline |
| Unified Buffer | On-chip SRAM scratch pad | ~256KB per core (est.) |
| Total Cores | Full chip | 32 Da Vinci cores (est.) |

The per-core FP16 throughput of approximately 18.75 TFLOPS x 32 cores yields the chip-level ~600 TFLOPS FP16 estimate. By comparison, an NVIDIA H100 SM (Streaming Multiprocessor) delivers roughly 7.5 TFLOPS of FP16 tensor compute, and 132 SMs yield the 990 TFLOPS total (excluding sparsity). The 910B achieves fewer total TFLOPS with fewer but wider cores - a design choice that simplifies the on-chip interconnect and reduces control overhead at the cost of fewer independent execution units.

Memory Subsystem. The 910B's 64GB HBM2e memory is organized in 4 stacks:

| Memory Property | Ascend 910B | NVIDIA A100 80GB | NVIDIA H100 SXM |
| --- | --- | --- | --- |
| Total Capacity | 64GB | 80GB | 80GB |
| HBM Generation | HBM2e | HBM2e | HBM3 |
| Number of Stacks | 4 (est.) | 5 | 5 |
| Per-Stack Bandwidth | ~300 GB/s | ~408 GB/s | ~670 GB/s |
| Aggregate Bandwidth | ~1,200 GB/s | 2,039 GB/s | 3,350 GB/s |
| Bandwidth-to-Compute Ratio | 2.0 GB/s per TFLOP | 6.5 GB/s per TFLOP | 3.4 GB/s per TFLOP |

The bandwidth-to-compute ratio tells an important story. At 2.0 GB/s per TFLOP, the 910B is more compute-bound relative to its memory system than either the A100 or H100. This means the compute engine is often waiting for data, particularly during inference where memory bandwidth is the primary bottleneck. For training with large batch sizes (where compute dominates), the ratio matters less because the matrix multiplications are large enough to keep the cube units busy.
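
This is the standard roofline argument. A small sketch using this article's estimated 910B figures shows which regime a kernel falls into, given its arithmetic intensity (FLOPs per byte moved from memory):

```python
def attainable_tflops(flops_per_byte: float,
                      peak_tflops: float = 600, bw_gbs: float = 1200) -> float:
    """Roofline model: achievable throughput is the lesser of peak compute
    and (bandwidth x arithmetic intensity). Defaults are this article's
    910B estimates."""
    bandwidth_bound = bw_gbs * flops_per_byte / 1000  # GB/s * FLOP/byte -> TFLOPS
    return min(peak_tflops, bandwidth_bound)

# Batch-1 decode with 8-bit weights streams every parameter per token:
# roughly 2 FLOPs per byte moved, deep in the bandwidth-bound regime.
print(attainable_tflops(2))    # 2.4 TFLOPS - a small fraction of peak
# Large-batch training matmuls can reach hundreds of FLOPs per byte.
print(attainable_tflops(500))  # 600 TFLOPS - hits the compute roof
```

Swapping in the H100's figures (990 TFLOPS, 3,350 GB/s) shows why the same decode workload runs several times faster there despite a smaller peak-compute gap.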

HCCS Interconnect Architecture. Huawei's HCCS (Huawei Cache Coherence System) connects multiple 910B chips for distributed workloads:

| HCCS Property | 910B (est.) | Notes |
| --- | --- | --- |
| Bidirectional Bandwidth per Link | ~56 GB/s (est.) | Higher than NVLink 4.0's ~50 GB/s per link, but far fewer links per chip |
| Total Links per Chip | ~6 (est.) | Configurable per platform |
| Aggregate Bidirectional Bandwidth | ~336 GB/s (est.) | vs. 900 GB/s for NVLink 4.0 |
| Max Chips per Node (Atlas 900) | 8 | Similar to DGX H100 |
| Inter-node Communication | RoCE v2 (RDMA over Converged Ethernet) | No InfiniBand option |

The ~336 GB/s aggregate interconnect bandwidth (estimated) is roughly 37% of NVLink 4.0's 900 GB/s. This gap is the 910B's most significant architectural limitation for large-scale training, where all-reduce operations across many chips require high bandwidth and low latency. DeepSeek's training innovations on Ascend hardware focused heavily on algorithmic techniques to reduce communication volume - gradient compression, communication-computation overlap, and asynchronous all-reduce - specifically to work around this bottleneck.
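
One of those techniques, top-k gradient sparsification, is easy to sketch. This is a toy illustration of the general idea, not DeepSeek's actual implementation:

```python
import numpy as np

def topk_compress(grad: np.ndarray, k_frac: float = 0.01):
    """Keep only the largest-magnitude k% of gradient entries;
    send (indices, values) over the interconnect instead of the dense tensor."""
    flat = grad.ravel()
    k = max(1, int(flat.size * k_frac))
    idx = np.argpartition(np.abs(flat), -k)[-k:]  # indices of k largest magnitudes
    return idx, flat[idx], grad.shape

def topk_decompress(idx, vals, shape):
    """Rebuild a dense (mostly zero) gradient from the sparse message."""
    flat = np.zeros(int(np.prod(shape)), dtype=vals.dtype)
    flat[idx] = vals
    return flat.reshape(shape)

g = np.random.randn(1024, 1024).astype(np.float32)
idx, vals, shape = topk_compress(g, k_frac=0.01)  # ~100x less data on the wire
g_hat = topk_decompress(idx, vals, shape)
```

Production systems add error feedback - accumulating the dropped residual locally and re-injecting it into the next step - so the compression does not silently discard gradient signal.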

SMIC 7nm Manufacturing. The 910B is manufactured on SMIC's N+2 process, which is SMIC's most advanced production node. Key characteristics:

| Process Property | SMIC N+2 (7nm DUV) | TSMC N7+ (7nm EUV) | Difference |
| --- | --- | --- | --- |
| Lithography | DUV multi-patterning | EUV single-patterning | Higher cost, lower yield on SMIC |
| Transistor Density | ~85-90 MTr/mm² (est.) | ~96 MTr/mm² | ~10% lower on SMIC |
| Metal Layers | Comparable | Comparable | Similar backend |
| Defect Density | Higher (DUV multi-patterning) | Lower | Lower yield on SMIC |
| Wafer Cost | Higher per good die | Lower per good die | ~30-50% cost premium (est.) |

The DUV multi-patterning approach requires exposing each critical layer multiple times with different masks, aligned with nanometer precision. This increases cost per wafer and reduces yield compared to EUV single-patterning. SMIC has reportedly achieved acceptable yields for the 910B die size, but the process is inherently more expensive per transistor than TSMC's equivalent node.

Atlas Server Platform. The 910B is deployed through Huawei's Atlas server line:

| Platform | Configuration | Total HBM | Power | Target |
| --- | --- | --- | --- | --- |
| Atlas 800 (Model 9000) | 8x 910B | 512GB | ~3,200W (GPUs) | Training clusters |
| Atlas 800 (Model 9010) | 8x 910B | 512GB | ~3,200W (GPUs) | Inference serving |
| Atlas 300T Training Card | 1x 910B | 64GB | ~400W | PCIe add-in card |
| Atlas 900 PoD | 64x 910B | 4TB | ~25.6kW (GPUs) | Large-scale training |

The Atlas 800 with 8x 910B provides 512GB of aggregate HBM2e - sufficient to hold a 70B parameter model in BF16 across the 8 chips with tensor parallelism. The Atlas 900 PoD configuration connects 8 Atlas 800 nodes for a total of 64 chips with 4TB of aggregate memory, suitable for training models up to approximately 200B parameters with appropriate parallelism strategies.
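
The sizing claims above come from simple arithmetic. A hedged helper - weights only, ignoring activations, optimizer state, and KV cache, all of which consume substantial additional memory in practice:

```python
def fits_in_cluster(params_b: float, bytes_per_param: float,
                    chips: int, hbm_per_chip_gb: float = 64) -> bool:
    """Weights-only check: do the sharded parameters fit in aggregate HBM?
    One billion parameters at N bytes each is N GB."""
    total_gb = params_b * bytes_per_param
    return total_gb <= chips * hbm_per_chip_gb

# 70B model in BF16 (2 bytes/param) on an 8-chip Atlas 800: 140GB vs 512GB
print(fits_in_cluster(70, 2, chips=8))    # True
# 200B in BF16 on a 64-chip Atlas 900 PoD: 400GB vs 4,096GB
print(fits_in_cluster(200, 2, chips=64))  # True
```

The generous headroom in both cases is what training actually consumes: gradients, optimizer states, and activations typically multiply the weights-only footprint several times over.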

Software Stack Architecture. The CANN software stack is layered, with each level providing increasing abstraction:

| CANN Layer | Function | Equivalent in CUDA Ecosystem |
| --- | --- | --- |
| Ascend Hardware Driver | Low-level device control | NVIDIA kernel driver |
| ACL (Ascend Computing Language) | Runtime API, memory management | CUDA Runtime API |
| AscendCL Operators | Optimized kernels for common operations | cuDNN, cuBLAS |
| CANN Fusion Engine | Graph optimization, operator fusion | TensorRT graph optimizer |
| Framework Adapters | PyTorch/TF/MindSpore integration | PyTorch CUDA backend |
| MindSpore | Huawei's native AI framework | No direct equivalent |

The CANN stack has improved significantly since its initial release with the 910B. Early versions required MindSpore as the primary framework, but current CANN releases provide a mature PyTorch adapter that handles most standard operations. The remaining gaps are in specialized operators - custom attention implementations, sparse operations, and quantization-specific kernels that have been extensively tuned for CUDA but receive less optimization effort on CANN.
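
In practice, the PyTorch adapter ships as the torch_npu package, which registers an "npu" device type with PyTorch. The helper below is an illustrative sketch (exact entry points vary by CANN and torch_npu release) that falls back gracefully on machines without Ascend hardware:

```python
def pick_device() -> str:
    """Prefer an Ascend NPU via the torch_npu adapter, then CUDA, then CPU.
    Guarded imports let the same script run on any machine."""
    try:
        import torch
        try:
            import torch_npu  # noqa: F401 - Huawei's Ascend adapter for PyTorch
            if torch.npu.is_available():  # registered by torch_npu on import
                return "npu:0"
        except ImportError:
            pass
        if torch.cuda.is_available():
            return "cuda:0"
    except ImportError:
        pass
    return "cpu"

device = pick_device()
print(device)  # "npu:0" on an Ascend host; otherwise "cuda:0" or "cpu"
```

Model and tensor placement then works as with any PyTorch device string (`model.to(device)`), which is precisely what makes the adapter approach attractive for porting CUDA codebases.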

Real-World Performance Analysis

DeepSeek Training Performance. The 910B's most important performance data comes from its role in DeepSeek's training infrastructure. While DeepSeek's V3 technical report primarily describes training on 2,048 NVIDIA H800 GPUs, multiple industry sources confirm that DeepSeek also operates significant Ascend 910B clusters for experimentation, data processing, and model development.

| Training Metric | Ascend 910B (est.) | NVIDIA H800 (measured, DeepSeek V3 report) | Ratio |
| --- | --- | --- | --- |
| Model FLOPs Utilization (MFU) | ~35-42% | ~54-58% | 0.65-0.72x |
| Tokens/sec/chip (DeepSeek V3 scale) | ~2,000-2,500 (est.) | ~4,200-4,800 | ~0.5x |
| Training days for 14.8T tokens (2K chips) | ~75-100 days (est.) | ~55 days (reported) | ~1.4-1.8x longer |
| Communication overhead (fraction of step time) | ~25-35% (est.) | ~15-20% | Higher on Ascend |

The Model FLOPs Utilization gap (35-42% vs 54-58%) is telling. It reflects both the HCCS bandwidth limitation (more time spent on communication) and the CANN software stack's lower optimization level compared to CUDA (less efficient kernel execution). The combination means that achieving comparable training throughput on 910B requires approximately 2x the chip count of H800/H100 - which is a viable strategy if the chips are available at lower cost per unit.
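
The "approximately 2x the chip count" conclusion falls out of the peak-FLOPS and MFU estimates directly. A sketch using the midpoints of this article's estimated ranges:

```python
import math

def chips_needed(target_effective_tflops: float, peak_tflops: float, mfu: float) -> int:
    """Chips required to sustain a target *effective* throughput, given
    per-chip peak FLOPS and achieved Model FLOPs Utilization."""
    return math.ceil(target_effective_tflops / (peak_tflops * mfu))

# Match a 2,048-chip H800 cluster at ~56% MFU (midpoint of the table's range)
effective = 2048 * 990 * 0.56
n_910b = chips_needed(effective, peak_tflops=600, mfu=0.38)  # 910B midpoint MFU
print(n_910b)  # 4980 chips, i.e. ~2.4x the H800 count
```

The multiplier is sensitive to the MFU assumption: at the optimistic end of the 910B range (~42%), the ratio drops toward 2x, which is where the common rule of thumb comes from.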

Inference Serving in Production. Chinese cloud providers and AI companies have deployed 910B chips for production inference serving. Approximate performance numbers based on reported benchmarks:

| Model | Precision | 910B (single chip) | H100 SXM (single chip) | Notes |
| --- | --- | --- | --- | --- |
| ChatGLM-6B | FP16 | ~55-70 tok/s | ~150-200 tok/s | Small model, compute-limited |
| Qwen-14B | INT8 | ~25-35 tok/s | ~70-90 tok/s | Mid-size model |
| Baichuan 2-13B | INT8 | ~28-38 tok/s | ~75-95 tok/s | Popular Chinese model |
| Yi-34B | INT8 (~34GB) | ~15-20 tok/s | ~55-70 tok/s | Fits in 64GB |
| Llama 70B | INT8 (~70GB) | N/A (OOM) | ~40-50 tok/s | Exceeds 910B's 64GB |

The 64GB memory constraint is the defining limitation for 910B inference. Any model that exceeds 64GB in its target precision - which includes all 70B+ models at INT8 - requires multi-chip tensor parallelism on the 910B, with associated HCCS communication overhead. The 910C's 96GB upgrade directly addresses this by enabling INT8 serving of 70B-class models on a single chip.
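
The fits-or-doesn't arithmetic is simple - weights only, so real headroom is smaller once the KV cache and activations are accounted for:

```python
def weights_gb(params_billions: float, bits: int) -> float:
    """Weights-only footprint in GB: one billion parameters at 8 bits is ~1GB."""
    return params_billions * bits / 8

for model, params in [("Yi-34B", 34), ("Llama 70B", 70)]:
    size = weights_gb(params, bits=8)  # INT8
    verdict = "fits on one 910B" if size <= 64 else "exceeds 64GB, needs tensor parallelism"
    print(f"{model}: ~{size:.0f}GB at INT8 -> {verdict}")
```

The same arithmetic shows why the 910C's 96GB matters: 70GB of INT8 weights fits with room left for the KV cache.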

Power Efficiency Analysis. The 910B's 400W TDP delivers an interesting efficiency profile:

| Metric | Ascend 910B | NVIDIA A100 (400W) | NVIDIA H100 (700W) | MI300X (750W) |
| --- | --- | --- | --- | --- |
| FP16 TFLOPS | ~600 | 312 | 990 | 1,307 |
| TDP | 400W | 400W | 700W | 750W |
| TFLOPS/Watt (FP16) | ~1.50 | 0.78 | 1.41 | 1.74 |
| Memory Bandwidth/Watt | 3.0 GB/s/W | 5.1 GB/s/W | 4.8 GB/s/W | 7.1 GB/s/W |
| Price-per-Watt | $20-$30/W | $25-$38/W | $36-$57/W | $13-$20/W |

The 910B achieves competitive TFLOPS per watt for FP16 compute, actually exceeding the H100 on this specific metric. However, the bandwidth-per-watt story is much weaker - the 910B's 3.0 GB/s/W is the lowest in the comparison. For inference workloads where bandwidth determines throughput, the 910B is actually the least power-efficient option despite its low absolute wattage.

Generational and Competitive Context

vs. Huawei Ascend 910C. The 910C is a 50% memory and bandwidth upgrade with ~33% more compute, at roughly 50% higher price. For new deployments, the 910C is the better value. For existing 910B installations, the upgrade decision depends on workload: if 64GB memory is sufficient (models at 34B and below in INT8), the 910B remains viable. If you need to serve 70B models on single chips, the 910C's 96GB is essential.

vs. NVIDIA A100 80GB. The A100 was the 910B's direct competitor before export controls removed it from the Chinese market. On paper, the A100 has lower FP16 compute (312 vs ~600 TFLOPS) but 25% more memory (80GB vs 64GB) and 70% more bandwidth (2,039 vs ~1,200 GB/s). In practice, the A100's much more mature CUDA software stack meant that real-world workloads typically ran faster on A100 than on 910B despite the lower peak TFLOPS. This performance gap has narrowed as CANN has matured, but the A100 remains the stronger chip for organizations that have the choice. Chinese organizations no longer have that choice.

vs. NVIDIA H100. The H100 outperforms the 910B by approximately 2x on most workloads. The comparison is academic for the Chinese market because the H100 is not available. But it matters for understanding what performance level Chinese AI is operating at - and what gaps DeepSeek and others are working to close through algorithmic and software innovation rather than hardware brute force.

vs. AMD MI300X. The MI300X is in an entirely different performance class: 3x the memory (192GB), 4.4x the bandwidth (5,300 GB/s), and 2.2x the FP16 compute (1,307 TFLOPS). Like the H100, the MI300X is subject to export controls and unavailable in China.

Export Control Context. The 910B was the first mass-produced Ascend chip after the October 2022 export controls. Its existence proved that Chinese AI development could continue despite being cut off from NVIDIA and AMD hardware. The strategic significance of this cannot be overstated - the 910B demonstrated that SMIC's 7nm DUV process could produce functional AI accelerators in volume, that Huawei's Da Vinci architecture was viable for frontier workloads, and that the CANN software stack could support production training. Every subsequent Ascend chip builds on this foundation.

The October 2023 export control update added bandwidth thresholds and computing density limits to the original compute-threshold restrictions. The 910B falls within the updated thresholds primarily because its bandwidth is significantly lower than Western alternatives. Future Ascend generations that attempt to close the bandwidth gap may face additional scrutiny.

The 910B in the Context of China's AI Hardware Landscape. The 910B is not the only Chinese AI chip, but it is the most significant. Other domestic options include:

| Chinese AI Chip | Manufacturer | FP16 Compute (est.) | Memory | Status |
| --- | --- | --- | --- | --- |
| Ascend 910B | Huawei (HiSilicon) | ~600 TFLOPS | 64GB HBM2e | Mass production, 50K+ deployed |
| Ascend 910C | Huawei (HiSilicon) | ~800 TFLOPS | 96GB HBM2e | Production ramp |
| Biren BR100 | Biren Technology | ~512 TFLOPS (claimed) | 64GB HBM2e | Limited production |
| Enflame Cloudblazer i20 | Enflame | ~256 TFLOPS (est.) | 32GB HBM2e | Production |
| Moore Threads MTT S4000 | Moore Threads | ~200 TFLOPS (est.) | 48GB GDDR6X | Production (GPU) |

The Ascend line dominates the Chinese AI accelerator market in both deployment volume and performance. Biren's BR100 was initially competitive on paper but faced its own sanctions-related challenges and has not achieved the same production scale. The 910B's scale of deployment (50,000+ chips) dwarfs all other Chinese AI chip programs combined.

CANN Ecosystem Maturity Timeline. The CANN software stack has evolved significantly since the 910B's launch:

| Timeline | CANN Milestone | Impact |
| --- | --- | --- |
| H2 2023 | 910B launch, CANN 6.x | Basic PyTorch support, MindSpore primary |
| Q1 2024 | CANN 7.0 | Improved PyTorch compatibility, more operators |
| Q3 2024 | DeepSeek begins 910B optimization | Forces rapid operator development |
| Q4 2024 | CANN 7.x / 910C launch | INT8 quantization improvements, better distributed training |
| 2025 | CANN 8.x | Enhanced PyTorch integration, DeepSeek V4 optimization |

Each milestone has been driven by real production demands from Chinese AI companies. The virtuous cycle of demand (from DeepSeek, Baidu, Alibaba) forcing CANN improvements which attract more users is the same dynamic that built the CUDA ecosystem - just compressed into 2 years instead of 15.

910B Deprecation Timeline. The 910B is not being actively deprecated, but Huawei's focus has shifted to the 910C for new production. The practical implications:

| Timeline | 910B Status | Recommendation |
| --- | --- | --- |
| 2025 H1 | Actively supported, new chips available | Viable for cluster expansion |
| 2025 H2 | 910C ramp accelerates, 910B production slows | New deployments should prefer 910C |
| 2026 | CANN optimization focus shifts to 910C/next-gen | Existing 910B clusters remain functional |
| 2027+ | Legacy support, no new hardware development | Plan migration for critical workloads |

Organizations with large 910B installations should plan a gradual transition to 910C for inference-critical workloads (where the 96GB memory matters most) while continuing to use 910B for training clusters where per-chip cost matters more than per-chip performance.

Power and Cooling Comparison. The 910B's 400W TDP is notably lower than competing accelerators, which translates to meaningful differences in data center design and operational cost:

| Metric | Ascend 910B | Ascend 910C | NVIDIA H100 |
| --- | --- | --- | --- |
| TDP per chip | 400W | 600W | 700W |
| 8-chip server power (GPUs only) | 3,200W | 4,800W | 5,600W |
| Annual power cost (8 chips, $0.08/kWh, 80% util) | ~$1,794 | ~$2,691 | ~$3,140 |
| Cooling approach | Air or liquid | Liquid recommended | Liquid recommended |
| Rack density limitation | Moderate | Higher | Highest |

For Chinese data centers where electricity costs are typically $0.06-$0.10/kWh, the 910B's lower power consumption is a real operational advantage. A 2,000-chip 910B deployment saves roughly $250,000-$420,000 per year in electricity compared to an equivalent 2,000-chip H100 deployment - about $170 per chip per year at $0.08/kWh. Over a multi-year deployment, those savings accumulate and partially offset the performance gap, though they remain small next to the hardware cost.
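
The annual cost figures are simple arithmetic over TDP, utilization, and electricity price:

```python
def annual_power_cost(watts: float, price_per_kwh: float, utilization: float) -> float:
    """Electricity cost per year for a given average power draw."""
    kwh_per_year = watts / 1000 * 8760 * utilization  # 8,760 hours in a year
    return kwh_per_year * price_per_kwh

# 8x 910B (3,200W of accelerators) vs 8x H100 (5,600W), $0.08/kWh, 80% util
for label, watts in [("8x 910B", 3200), ("8x H100", 5600)]:
    print(f"{label}: ${annual_power_cost(watts, 0.08, 0.8):,.0f}/year")
```

Note this covers accelerator draw only; host power and cooling overhead (PUE) scale the real bill up further.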

DeepSeek's Impact on the Ascend Ecosystem. DeepSeek's use of 910B chips catalyzed the entire Ascend software ecosystem in several ways:

  1. CANN operator coverage expanded - DeepSeek's frontier model training required operators that CANN initially lacked, forcing Huawei to accelerate development
  2. Multi-chip training at scale proved viable - Running 1,000+ 910B clusters for training validated HCCS at production scale
  3. Communication-efficient algorithms emerged - Techniques developed for 910B's limited HCCS bandwidth (gradient compression, async all-reduce) are now used across the Ascend ecosystem
  4. Ecosystem credibility increased - DeepSeek V3's competitive benchmark results proved that Ascend-trained models are not inherently inferior

Use Case Recommendations

Strong Fit:

  • Chinese AI companies needing large-scale training clusters. The 910B's lower price ($8,000-$12,000 vs the 910C's $12,000-$18,000) makes it economical for building large clusters where per-chip performance matters less than aggregate throughput. If you need 2,000+ chips for distributed training, the 910B's cost advantage is significant.
  • Inference serving of sub-34B parameter models. At 64GB, the 910B can hold models up to ~34B parameters in INT8 on a single chip. For Chinese inference providers serving models like ChatGLM-6B, Qwen-14B, or Baichuan-13B, the 910B provides adequate single-chip capacity at the lowest price point in the Ascend line.
  • Organizations with existing 910B CANN expertise. If your team has already invested in CANN optimization for 910B, the incremental cost of additional 910B chips (versus the higher cost of 910C chips) may be justified for workloads that fit within 64GB.
  • Government and research institutions with guaranteed procurement. Chinese government AI procurement programs often specify Ascend hardware. The 910B's large installed base (50,000+ chips estimated) means better supply certainty and more reference architectures than the newer 910C.

Weak Fit:

  • Inference serving of 70B+ parameter models. The 64GB memory hard-caps the model sizes that fit on a single chip. Even with INT8 quantization, a 70B model (~70GB) exceeds the 910B's capacity. The 910C with 96GB is the minimum for this use case on Ascend hardware.
  • Workloads requiring maximum memory bandwidth. At 1,200 GB/s, the 910B has the lowest memory bandwidth of any current-generation AI accelerator in this comparison set. Memory-bandwidth-bound workloads (LLM inference decode, long-context attention) will be significantly slower per chip than on any alternative.
  • New deployments when 910C is available. Unless budget is the primary constraint, new Ascend deployments should prefer the 910C for its 50% memory and bandwidth improvements. The 910B is best used for expanding existing 910B clusters rather than starting new ones.
  • Organizations outside China. The 910B is unavailable outside China, and CANN documentation and community support are primarily in Chinese. Western organizations should use MI300X, H100, or Google TPU alternatives.
  • Latency-sensitive real-time inference. The 910B's lower bandwidth and less-optimized inference stack make it poorly suited for applications requiring sub-100ms token generation latency. The per-token decode time on 910B is roughly 2-3x slower than on H100 for equivalent models.
  • Mixed workloads spanning AI and general-purpose computing. The 910B is an ASIC optimized for matrix multiplication. Unlike NVIDIA GPUs, it cannot handle general-purpose GPU compute, rendering, or simulation workloads. If your infrastructure needs to serve dual purposes, a GPU-based solution is more flexible.

Strengths

  • Proven at frontier-model scale - DeepSeek's training validation is the strongest possible credential
  • Lowest absolute price in the comparison at $8,000-$12,000 per accelerator
  • 400W TDP provides competitive power efficiency (TFLOPS per watt)
  • Manufactured entirely in China - immune to US export control supply disruptions
  • Established CANN software ecosystem with production-grade PyTorch support
  • Large installed base (50,000+ chips estimated) creates network effects for software optimization
  • Serves as the foundation for China's independent AI hardware ecosystem

Weaknesses

  • 64GB HBM2e limits single-chip model deployment - cannot fit 70B models even with INT8 quantization
  • Memory bandwidth (~1,200 GB/s) is only 36% of the H100's 3,350 GB/s - a severe inference bottleneck
  • ~600 TFLOPS FP16 is roughly 60% of the H100 and 46% of the MI300X
  • CANN software ecosystem is significantly smaller and less mature than CUDA
  • No confirmed FP8 hardware support limits quantized inference performance
  • HCCS interconnect bandwidth trails NVLink substantially for multi-chip training communication
  • Already superseded by the 910C - new deployments are shifting to the newer chip
