Huawei Ascend 910C

TL;DR

  • Huawei's flagship AI accelerator - the most powerful AI chip designed and manufactured entirely within China
  • 96GB HBM2e memory with ~1,800 GB/s bandwidth, a substantial upgrade over the 910B's 64GB and ~1,200 GB/s
  • Estimated ~800 TFLOPS FP16 peak compute - roughly 2.5x the A100's paper spec, though delivered performance lands closer to A100-class and trails the H100 significantly
  • Built on SMIC's 7nm process node under US export control constraints that block access to EUV lithography
  • Powers an expanding Chinese AI ecosystem including DeepSeek, Baidu, Alibaba, and ByteDance workloads

Overview

The Huawei Ascend 910C is the most capable AI accelerator that China can currently design and manufacture domestically. That sentence carries two meanings, and both matter. It is genuinely impressive hardware - 96GB of HBM2e, an estimated 800 TFLOPS of FP16 compute, and a growing software ecosystem. It is also constrained by US export controls that deny Huawei access to TSMC's advanced nodes and EUV lithography equipment, forcing SMIC's 7nm DUV process to carry the entire manufacturing burden.

Released in H2 2024, the 910C represents a meaningful generational improvement over the Ascend 910B. Memory capacity jumps from 64GB to 96GB HBM2e, bandwidth increases from approximately 1,200 GB/s to 1,800 GB/s, and compute performance sees an estimated 30-35% improvement. These gains come from architectural optimization within the same SMIC 7nm process - Huawei cannot simply move to a smaller node the way AMD and NVIDIA can with TSMC's 3nm and 5nm processes. Every performance gain has to come from better circuit design, packaging, and architectural choices rather than transistor shrinks.

The 910C matters geopolitically as much as it matters technically. China's ability to train and deploy frontier AI models depends on domestic hardware supply. DeepSeek's V4 was reportedly optimized for Ascend chips rather than NVIDIA GPUs - a deliberate choice to prove that Chinese AI can be built on Chinese silicon. Baidu, Alibaba, Tencent, and ByteDance have all committed to Ascend-based infrastructure to varying degrees. The 910C is not competing with the H100 on pure performance - it is building an alternative ecosystem that can function independently of American technology. Whether that ecosystem can sustain frontier AI development without access to cutting-edge manufacturing is the trillion-dollar question.

Key Specifications

| Specification | Details |
|---|---|
| Manufacturer | Huawei (HiSilicon) |
| Product Family | Ascend 910 |
| Architecture | Da Vinci (enhanced) |
| Process Node | SMIC 7nm (N+2) |
| Chip Type | ASIC |
| AI Cores | 32 Da Vinci cores (estimated) |
| FP16 Performance | ~800 TFLOPS (estimated) |
| BF16 Performance | ~800 TFLOPS (estimated) |
| INT8 Performance | ~1,600 TOPS (estimated) |
| Memory | 96GB HBM2e |
| Memory Stacks | 6x HBM2e (estimated) |
| Memory Bandwidth | ~1,800 GB/s |
| Interconnect | HCCS (Huawei Cache Coherence System) |
| TDP | 600W |
| Form Factor | Proprietary module |
| Software Stack | CANN (Compute Architecture for Neural Networks) |
| Target Workload | Training and Inference |
| Release Date | H2 2024 |
| Estimated Price | $12,000-$18,000 |

Note: Huawei does not publish detailed specifications for the Ascend 910C. The numbers above are based on leaked documentation, third-party testing, analyst estimates, and inference from Huawei's marketing materials. Actual specifications may differ.

Performance Benchmarks (Estimated)

| Benchmark / Metric | Ascend 910C | Ascend 910B | NVIDIA H100 SXM | NVIDIA A100 80GB |
|---|---|---|---|---|
| FP16 Peak (TFLOPS) | ~800 | ~600 | 990 | 312 |
| Memory Capacity | 96GB | 64GB | 80GB | 80GB |
| Memory Bandwidth | ~1,800 GB/s | ~1,200 GB/s | 3,350 GB/s | 2,039 GB/s |
| LLM Inference (relative) | ~0.4-0.5x H100 | ~0.3-0.4x H100 | 1.0x (baseline) | ~0.5-0.6x H100 |
| Training Throughput (relative) | ~0.5-0.6x H100 | ~0.3-0.4x H100 | 1.0x (baseline) | ~0.5x H100 |
| Power (TDP) | 600W | 400W | 700W | 400W |
| Price (estimated) | $12,000-$18,000 | $8,000-$12,000 | $25,000-$40,000 | $10,000-$15,000 |

These relative performance figures are approximate. The 910C's performance relative to the H100 varies significantly by workload type, model architecture, and how well the model has been optimized for Huawei's CANN software stack. Workloads that have been specifically optimized for Ascend hardware - as DeepSeek's models reportedly are - can achieve higher utilization than generic ports from CUDA.

The memory bandwidth gap is the most significant technical limitation. At ~1,800 GB/s, the 910C has roughly 54% of the H100's 3,350 GB/s. For memory-bandwidth-bound inference workloads (which most LLM serving is), this directly translates to lower tokens-per-second per chip. The 910C partially compensates with its larger 96GB memory, which can reduce the need for multi-chip configurations on some model sizes.
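
A back-of-envelope roofline makes the bandwidth argument concrete: in the autoregressive decode phase, every generated token must stream at least the model's weights from HBM, so single-stream tokens per second is bounded by bandwidth divided by weight bytes. A sketch using this article's estimated figures (illustrative, not measured):

```python
# Rough roofline: each generated token must stream (at least) the model
# weights from HBM, so single-stream tokens/sec <= bandwidth / weight_bytes.
# Figures are this article's estimates, not measured numbers.

def decode_ceiling_tok_s(bandwidth_gb_s: float, params_b: float, bytes_per_param: float) -> float:
    weight_gb = params_b * bytes_per_param  # weight footprint in GB
    return bandwidth_gb_s / weight_gb       # upper bound; ignores KV cache and kernel overhead

# 70B-parameter model served in INT8 (1 byte/param, ~70GB of weights)
ascend_910c = decode_ceiling_tok_s(1800, 70, 1)
h100_sxm = decode_ceiling_tok_s(3350, 70, 1)

print(f"910C ceiling: {ascend_910c:.1f} tok/s")   # ~25.7 tok/s
print(f"H100 ceiling: {h100_sxm:.1f} tok/s")      # ~47.9 tok/s
print(f"ratio: {ascend_910c / h100_sxm:.2f}x")    # ~0.54x, mirroring the bandwidth ratio
```

Measured throughput lands well below these ceilings (the inference table later in this article shows ~15-20 tok/s for a 70B INT8 model on the 910C), but the ratio between chips tracks the bandwidth ratio closely.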

Key Capabilities

Domestic Manufacturing Independence. The 910C's most important capability is not a spec - it is the fact that it exists at all. Manufactured on SMIC's 7nm process using DUV (deep ultraviolet) lithography rather than the EUV lithography that TSMC, Samsung, and Intel use for their most advanced nodes, the 910C demonstrates that China can produce competitive (if not leading-edge) AI accelerators despite US export controls. The DUV approach requires multi-patterning techniques that increase cost and reduce yield, but Huawei and SMIC have made it work at production scale. This gives Chinese AI companies a supply chain that no US policy action can disrupt.

CANN Software Ecosystem. Huawei's CANN (Compute Architecture for Neural Networks) is the software stack that makes Ascend hardware programmable. It includes a neural network compiler, operator libraries, and framework adapters for PyTorch, TensorFlow, and MindSpore (Huawei's own framework). CANN has improved significantly since the 910B generation, with better PyTorch compatibility and more optimized operators. DeepSeek's decision to optimize V4 for Ascend hardware has forced rapid maturation of the CANN stack, because frontier model training is the most demanding test of any AI software platform. The CANN ecosystem is still far smaller than CUDA's, but it is now production-grade for Transformer workloads.

96GB HBM2e Memory. The increase from the 910B's 64GB to 96GB is significant for inference workloads. A 70B-parameter model in FP16 requires approximately 140GB - too large for a single 910C, but the 96GB capacity means that with INT8 quantization (which halves memory requirements to ~70GB), the model fits on a single chip. The 910B's 64GB could not manage this even with INT8. For Chinese inference providers serving domestic models like Qwen, Yi, and DeepSeek, the 910C's memory capacity meaningfully expands the set of models that can be served with minimal tensor parallelism.
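
The fit-or-not arithmetic behind that claim is simple: weight footprint is parameter count times bytes per parameter, with some HBM held back for KV cache and activations. A sketch with an assumed 10% serving overhead (the overhead figure is illustrative, not a Huawei number):

```python
# Fit-or-not arithmetic for single-chip serving. The 10% overhead reserved for
# KV cache and activations is an illustrative assumption, not a Huawei figure.

def fits_on_chip(params_b: float, bytes_per_param: float, hbm_gb: float,
                 overhead: float = 0.10):
    """Return (weight_gb, fits) for a model at a given precision."""
    weight_gb = params_b * bytes_per_param
    return weight_gb, weight_gb <= hbm_gb * (1 - overhead)

for chip, hbm in [("910C", 96), ("910B", 64)]:
    for precision, bpp in [("FP16", 2), ("INT8", 1)]:
        weight, ok = fits_on_chip(70, bpp, hbm)
        print(f"70B {precision} ({weight:.0f}GB) on {chip} ({hbm}GB): "
              f"{'fits' if ok else 'does not fit'}")
```

Only the 70B INT8 case on the 910C fits; FP16 on either chip and INT8 on the 910B's 64GB do not, which is exactly the single-chip serving boundary described above.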

Pricing and Availability

Huawei does not publish official pricing for the Ascend 910C. Market estimates place the price at $12,000 to $18,000 per accelerator, though actual pricing varies significantly based on volume, customer relationship, and whether the purchase includes Huawei's Atlas server platform or standalone accelerators.

| Accelerator | Estimated Price | Memory | FP16 TFLOPS | Process Node |
|---|---|---|---|---|
| Huawei Ascend 910C | $12,000-$18,000 | 96GB | ~800 | SMIC 7nm |
| Huawei Ascend 910B | $8,000-$12,000 | 64GB | ~600 | SMIC 7nm |
| NVIDIA H100 SXM | $25,000-$40,000 | 80GB | 990 | TSMC 4nm |
| NVIDIA A100 80GB | $10,000-$15,000 | 80GB | 312 | TSMC 7nm |
| AMD MI300X | $10,000-$15,000 | 192GB | 1,307 | TSMC 5nm/6nm |

The 910C is available primarily through Huawei's Atlas server platforms and through Chinese cloud providers including Huawei Cloud, Alibaba Cloud (for government and enterprise customers), and various state-backed data center operators. Export restrictions make the 910C effectively unavailable outside of China and a limited number of countries.

Availability within China has reportedly been strong, with Huawei ramping production through 2025. Several Chinese hyperscalers have placed large orders, and the Chinese government has provided subsidies and procurement guarantees that ensure demand. The supply constraint is not Huawei's assembly capacity but SMIC's 7nm wafer output, which remains limited compared to TSMC's advanced nodes.

Architecture Deep Dive

The 910C's architecture is a study in optimization under constraint. Huawei's HiSilicon design team cannot access TSMC's advanced nodes, EUV lithography, or the latest packaging technologies that AMD and NVIDIA use. Every transistor must be manufactured on SMIC's 7nm DUV (deep ultraviolet) process, which limits transistor density, clock speed, and power efficiency relative to TSMC 4nm or 3nm. The 910C's architectural gains over the 910B come entirely from circuit design, memory integration, and packaging improvements within this constrained process.

Da Vinci Core Architecture. The Ascend 910 series uses Huawei's Da Vinci AI core architecture, which is organized around a 3D Cube Computing Engine - a systolic array-like structure optimized for matrix multiplication. Each Da Vinci core contains:

| Component | Function | Estimated Specs (910C) |
|---|---|---|
| Cube Unit | Dense matrix multiplication (FP16, BF16, INT8) | 16x16x16 cube, ~25 TFLOPS FP16 per core |
| Vector Unit | Element-wise operations, activation functions, normalization | 256-bit vector width (est.) |
| Scalar Unit | Control flow, addressing, scalar math | Standard RISC pipeline |
| Local Buffer | On-chip SRAM for operand staging | ~512KB per core (est.) |

The 910C is estimated to contain 32 Da Vinci cores (same count as the 910B), but with each core running at higher clock speeds and improved microarchitecture. The per-core throughput improvement is estimated at 30-35%, which - across 32 cores - accounts for the chip-level improvement from ~600 TFLOPS to ~800 TFLOPS FP16.
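
The chip-level figure is just cores times per-core throughput, and the generational step shows up as a per-core uplift at constant core count. A quick sanity check on the estimates in this section:

```python
# Sanity check on the estimated core math: chip peak = cores x per-core TFLOPS.
cores = 32  # estimated Da Vinci core count (unchanged from the 910B)
per_core_910b = 600 / cores   # ~18.75 TFLOPS FP16 per core (est.)
per_core_910c = 800 / cores   # ~25 TFLOPS FP16 per core (est.)
uplift = per_core_910c / per_core_910b - 1
print(f"per-core: {per_core_910b:.2f} -> {per_core_910c:.2f} TFLOPS "
      f"({uplift:.0%} uplift)")  # 33%, within the estimated 30-35% range
```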

Memory Subsystem Upgrade. The memory upgrade from 910B to 910C is arguably more significant than the compute improvement:

| Memory Property | Ascend 910B | Ascend 910C | Improvement |
|---|---|---|---|
| Capacity | 64GB HBM2e | 96GB HBM2e | +50% |
| Stacks | 4x HBM2e | 6x HBM2e (est.) | +50% |
| Per-Stack Bandwidth | ~300 GB/s | ~300 GB/s | Same (same HBM2e gen) |
| Aggregate Bandwidth | ~1,200 GB/s | ~1,800 GB/s | +50% |
| HBM Generation | HBM2e | HBM2e | Same |
| Memory Controller | Da Vinci gen 1 | Da Vinci gen 2 (est.) | Improved efficiency |

The bandwidth increase is proportional to the stack count increase - Huawei added more HBM2e stacks rather than moving to a faster HBM generation. This is a practical decision given supply chain constraints: HBM3 stacks from SK Hynix and Samsung are subject to US export controls and may not be available to Huawei. Using more HBM2e stacks that can be sourced domestically or from non-restricted suppliers is a more reliable approach, even if it means forgoing the per-stack bandwidth improvements of HBM3.
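
The stack math is worth making explicit, since it shows why the 910C's bandwidth ceiling is set by the HBM generation rather than the stack count:

```python
# Aggregate bandwidth scales linearly with stack count when the HBM generation
# (and hence per-stack bandwidth) is held constant.
PER_STACK_GB_S = 300  # HBM2e per-stack bandwidth (estimated)

for chip, stacks in [("910B", 4), ("910C", 6)]:
    print(f"{chip}: {stacks} stacks x {PER_STACK_GB_S} GB/s = "
          f"{stacks * PER_STACK_GB_S} GB/s")

# An HBM3 stack delivers roughly double the per-stack bandwidth, so a move to
# HBM3 - if supply allowed it - would lift aggregate bandwidth without adding stacks.
```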

HCCS Interconnect. Huawei's HCCS (Huawei Cache Coherence System) provides chip-to-chip communication for multi-accelerator configurations. Details are sparse, but available information suggests:

| Interconnect Property | HCCS (910C) | NVLink 4.0 (H100) | Infinity Fabric (MI300X) |
|---|---|---|---|
| Bidirectional Bandwidth (per chip) | ~400 GB/s (est.) | 900 GB/s | 896 GB/s |
| Max Connected Chips | 8 (in Atlas 900 node) | 8 (DGX H100) | 8 (MI300X OAM platform) |
| Topology | Ring/mesh (est.) | NVSwitch crossbar | Point-to-point mesh |
| Inter-node | Proprietary (RoCE-based) | InfiniBand/Ethernet | InfiniBand/Ethernet |

The HCCS bandwidth gap is significant. At approximately 400 GB/s, the 910C's interconnect provides less than half the bandwidth of NVLink or Infinity Fabric. For workloads that require heavy all-reduce communication (large-batch distributed training), this bottleneck limits scaling efficiency beyond a single node. DeepSeek and others have reportedly worked around this limitation through communication-efficient training algorithms that reduce the frequency and volume of inter-chip data movement.
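
To see how the ~400 GB/s estimate translates into training overhead, consider an idealized ring all-reduce across the 8 chips of one node: each chip moves roughly 2 x (N-1)/N times the gradient payload over its links. A simplified sketch (the payload size is illustrative, and treating the bidirectional figure as usable link bandwidth is an assumption):

```python
def allreduce_time_ms(payload_gb: float, n_chips: int, link_gb_s: float) -> float:
    """Idealized ring all-reduce: each chip sends/receives 2*(N-1)/N of the payload."""
    traffic_gb = 2 * (n_chips - 1) / n_chips * payload_gb
    return traffic_gb / link_gb_s * 1000

payload_gb = 20  # e.g. FP16 gradients for a ~10B-parameter shard (illustrative)
for name, bw in [("HCCS (910C, est.)", 400), ("NVLink 4.0 (H100)", 900)]:
    print(f"{name}: {allreduce_time_ms(payload_gb, 8, bw):.1f} ms per all-reduce")
# ~87.5 ms on HCCS vs ~38.9 ms on NVLink for the same payload
```

Halving the frequency or size of these collectives, as communication-efficient training algorithms do, directly shrinks the penalty of the slower link.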

SMIC 7nm Process Constraints. SMIC's 7nm (marketed as "N+2") process uses DUV multi-patterning rather than EUV single-patterning. The practical implications:

| Property | SMIC 7nm (DUV) | TSMC 4nm (EUV) | Impact on 910C |
|---|---|---|---|
| Transistor Density | ~90 MTr/mm² (est.) | ~130 MTr/mm² | ~30% fewer transistors per mm² |
| Max Practical Die Size | ~400 mm² (est.) | ~800 mm² | Limits on-chip resources |
| Power Efficiency | ~0.7x TSMC 4nm (est.) | Baseline | Higher watts per TFLOP |
| Yield (large dies) | Lower | Higher | Higher per-chip cost |
| Clock Speed | Limited | Higher achievable | Limits per-core throughput |

These constraints mean that the 910C simply cannot match the transistor count, clock speed, or power efficiency of TSMC-manufactured chips in the same generation. Huawei compensates through architectural optimization and by accepting higher power consumption per TFLOP - the 910C's 600W TDP is moderate in absolute terms but delivers fewer TFLOPS per watt than the H100 on compute-bound workloads.
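
On paper the efficiency gap looks modest; it is delivered utilization that widens it. A quick peak-number comparison using the TDP and TFLOPS estimates in this article:

```python
# (peak FP16 TFLOPS, TDP in watts) - estimates from this article
chips = {"Ascend 910C": (800, 600), "NVIDIA H100 SXM": (990, 700)}

for name, (tflops, watts) in chips.items():
    print(f"{name}: {tflops / watts:.2f} peak TFLOPS per watt")
# 910C ~1.33 vs H100 ~1.41 on paper; the real gap is larger because delivered
# utilization on the 910C is typically lower than on the H100.
```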

Atlas Server Platform Integration. The 910C is deployed through Huawei's Atlas series server platforms, which are purpose-built for Ascend accelerators:

| Platform | Ascend Chips | Total Memory | Use Case |
|---|---|---|---|
| Atlas 800 Training Server | 8x 910C | 768GB HBM2e | Distributed training |
| Atlas 800 Inference Server | 8x 910C | 768GB HBM2e | High-throughput inference |
| Atlas 900 AI Cluster | 64x 910C (8 nodes) | 6TB HBM2e | Large-scale pre-training |
| Atlas 200 Edge Module | 1x 910C (or lower) | 96GB HBM2e | Edge inference (limited) |

The Atlas platform provides integrated cooling, power delivery, and HCCS interconnect backplane - similar to how NVIDIA's DGX systems integrate NVLink and NVSwitch. The vertical integration means that purchasing 910C chips outside of Atlas platforms is difficult, tying hardware decisions to Huawei's server ecosystem.

Comparison with Original Ascend 910 (Pre-Sanctions). The original Ascend 910, announced in 2019, was designed for TSMC 7nm manufacturing. When US sanctions blocked TSMC access, Huawei redesigned the chip for SMIC's 7nm DUV process. The 910B and 910C represent this redesigned lineage:

| Property | Ascend 910 (original, TSMC) | Ascend 910B (SMIC) | Ascend 910C (SMIC) |
|---|---|---|---|
| Process | TSMC 7nm EUV | SMIC 7nm DUV | SMIC 7nm DUV |
| FP16 Compute | ~256 TFLOPS | ~600 TFLOPS | ~800 TFLOPS |
| Memory | 32GB HBM2e | 64GB HBM2e | 96GB HBM2e |
| Status | Limited production | Mass production | Mass production |
| TDP | 310W | 400W | 600W |

The architectural improvements from 910 to 910B/910C are substantial and demonstrate that HiSilicon's design team has continued to improve the Da Vinci architecture despite the process node constraint. The 910C delivers approximately 3x the compute of the original 910 on the same effective process generation - a testament to microarchitectural optimization.

Real-World Performance Analysis

Training Performance on Chinese Frontier Models. The 910C's most important performance validation comes from its use in training Chinese frontier models. While exact numbers are scarce, leaked benchmarks and industry reports provide approximate comparisons:

| Training Workload | Ascend 910C (est.) | NVIDIA H100 SXM | Performance Ratio |
|---|---|---|---|
| Qwen 72B (tokens/sec/chip) | ~800-1,000 | ~1,600-2,000 | ~0.5x |
| DeepSeek-style MoE (per chip) | ~900-1,100 | ~1,800-2,200 | ~0.5x |
| GPT-3 175B (per chip) | ~600-800 | ~1,400-1,700 | ~0.4-0.5x |
| BERT Large fine-tuning | ~400-500 samples/s | ~800-1,000 samples/s | ~0.5x |

The consistent ~0.5x ratio to the H100 on training workloads reflects the combination of lower compute (800 vs 990 TFLOPS FP16) and significantly lower memory bandwidth (1,800 vs 3,350 GB/s). For memory-bandwidth-bound operations like attention computation in long-context training, the gap widens further.

Inference Performance. For inference serving, the 910C's 96GB memory is its most valuable asset:

| Model | Precision | 910C Throughput (est.) | H100 Throughput | Notes |
|---|---|---|---|---|
| Qwen 72B | INT8 (~70GB) | ~15-20 tok/s | ~45-55 tok/s | Single chip, 910C memory-bandwidth limited |
| ChatGLM-6B | FP16 (~12GB) | ~80-100 tok/s | ~200-250 tok/s | Small model, compute-limited |
| DeepSeek V3 (MoE) | INT8 | Multi-chip | Multi-chip | Both require distributed serving |
| Yi-34B | INT8 (~34GB) | ~25-35 tok/s | ~60-80 tok/s | Single chip on both |

The pattern is consistent: the 910C delivers approximately 35-45% of H100 inference throughput per chip. The bandwidth bottleneck (1,800 vs 3,350 GB/s) is the primary limiter for the autoregressive decode phase of LLM inference, which is dominated by memory reads.

CANN Software Stack Performance. The CANN ecosystem has matured substantially between the 910B and 910C generations. Key framework support status:

| Framework/Library | CANN Support Status | Notes |
|---|---|---|
| PyTorch (via Ascend adapter) | Production-grade | Most operators supported, some gaps remain |
| MindSpore (Huawei native) | Full support | Best-optimized path for Ascend hardware |
| TensorFlow | Functional | Less investment than PyTorch/MindSpore |
| vLLM | Community port (experimental) | Not officially supported, limited features |
| DeepSpeed | Adapted (DeepSeek fork) | Modified for HCCS communication patterns |
| ONNX Runtime | Functional | CANN execution provider available |
| Flash Attention equivalent | CANN-native implementation | Lower performance than CUDA Flash Attention |

The biggest CANN gap versus CUDA is in inference serving engines. vLLM, TensorRT-LLM, and SGLang - the workhorses of Western inference infrastructure - either do not support CANN or have experimental support only. Chinese inference providers have built custom serving solutions on top of CANN's lower-level APIs, but these are not open-source and not transferable across organizations.

Scaling Efficiency at Cluster Scale. For training workloads that require hundreds or thousands of 910C chips, scaling efficiency becomes the critical metric. The HCCS interconnect's lower bandwidth means that communication overhead grows faster with chip count than on NVLink-based systems:

| Cluster Size | Estimated Scaling Efficiency (910C) | Estimated Scaling Efficiency (H100) | Gap |
|---|---|---|---|
| 8 chips (1 node) | ~90-95% | ~95-98% | Small |
| 64 chips (8 nodes) | ~75-85% | ~88-93% | Moderate |
| 256 chips (32 nodes) | ~60-75% | ~80-88% | Significant |
| 1,024 chips (128 nodes) | ~45-60% | ~70-82% | Large |
| 2,048+ chips | ~35-50% | ~65-78% | Very large |

These are rough estimates that vary significantly by workload. MoE (Mixture of Experts) architectures like DeepSeek V3 can achieve better scaling efficiency because they require less all-reduce communication per training step. This partly explains why DeepSeek chose the MoE architecture for their frontier models - it is inherently more communication-efficient, which plays to the 910C's strengths relative to its interconnect limitations.
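
A toy model illustrates the mechanism: if the non-overlapped communication fraction per training step grows with log2 of the chip count, a chip with a weaker interconnect (a larger per-hop constant) loses efficiency faster as the cluster grows. The constants here are purely illustrative, and real clusters decay faster than this simple model because of network contention and stragglers:

```python
import math

def scaling_efficiency(n_chips: int, comm_per_hop: float) -> float:
    """Toy model: efficiency = compute / (compute + comm), with non-overlapped
    communication cost growing with log2(N). comm_per_hop is illustrative."""
    return 1 / (1 + comm_per_hop * math.log2(n_chips))

for n in [8, 64, 256, 1024, 2048]:
    weak = scaling_efficiency(n, 0.05)     # weaker interconnect (HCCS-like)
    strong = scaling_efficiency(n, 0.025)  # stronger interconnect (NVLink-like)
    print(f"{n:>5} chips: weak link {weak:.0%}, strong link {strong:.0%}")
```

The qualitative takeaway matches the table: the efficiency gap between the two interconnects widens with every doubling of cluster size.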

Domestic HBM Supply Chain. One of the most significant supply chain risks for the 910C is memory. HBM2e stacks are manufactured primarily by SK Hynix and Samsung, both South Korean companies that are subject to US pressure regarding technology exports to China. Huawei has been working to qualify domestic HBM alternatives from Chinese memory manufacturers including CXMT (ChangXin Memory Technologies), but domestic HBM production is estimated to be 2-3 generations behind Korean manufacturers. If HBM2e exports to Huawei are restricted in the future, the 910C's memory subsystem could face supply constraints that limit production volume.

Generational and Competitive Context

vs. Huawei Ascend 910B. The 910C is a meaningful but evolutionary upgrade over the 910B. The 50% memory increase (64GB to 96GB) and 50% bandwidth increase (1,200 to 1,800 GB/s) are the most impactful improvements for inference workloads. The ~33% compute improvement matters for training. For existing 910B operators, the 910C is a straightforward upgrade - CANN applications targeting the 910B run on the 910C with recompilation. New deployments in China should default to the 910C unless budget constraints push toward the cheaper 910B.

vs. NVIDIA H100 SXM. The 910C is not and will never be an H100 equivalent. It delivers roughly 50-60% of the H100's performance at 50-60% of the H100's price (if the H100 were available in China, which it is not). The comparison is academic for Chinese organizations because the H100 is not an option. The relevant question is whether the 910C is good enough to train and serve competitive models - and DeepSeek's results demonstrate that it is, when combined with sufficient scale and software optimization.

vs. AMD MI300X. The MI300X outperforms the 910C in every measurable dimension: 2x memory (192GB vs 96GB), 3x bandwidth (5,300 vs 1,800 GB/s), and ~1.6x compute (1,307 vs ~800 TFLOPS FP16). The MI300X is also less expensive on a per-TFLOP basis. But the MI300X is subject to US export controls and unavailable in China, making this comparison purely theoretical for the 910C's target market.
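
The per-TFLOP claim can be checked with midpoint prices from the pricing table earlier in this article (all estimates, not list prices):

```python
# (midpoint of estimated price range in USD, peak FP16 TFLOPS)
accelerators = {
    "Ascend 910C": (15_000, 800),
    "AMD MI300X": (12_500, 1_307),
    "NVIDIA H100 SXM": (32_500, 990),
}
for name, (price, tflops) in accelerators.items():
    print(f"{name}: ${price / tflops:.2f} per peak FP16 TFLOP")
# MI300X ~ $9.56, 910C ~ $18.75, H100 ~ $32.83 - the 910C's premium per TFLOP
# reflects constrained supply and the absence of competing options in China.
```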

vs. Google TPU v6e Trillium. The TPU v6e is cloud-only and available exclusively on Google Cloud - completely inaccessible to Chinese organizations. Architecturally, Trillium's 32GB per chip is less than the 910C's 96GB, but Trillium's pod-scale ICI interconnect at 256 chips provides a very different scaling story. These products serve entirely separate markets.

Export Control Implications. The 910C exists because of - not despite - US export controls. The October 2022 controls (updated in October 2023) banned the sale of accelerators above specific performance thresholds to Chinese entities. This forced Chinese AI companies to choose between stockpiling pre-ban NVIDIA chips (which many did), slowing down AI development, or adopting domestic alternatives like the Ascend line. The 910C's specifications are carefully designed to provide maximum performance within what SMIC's 7nm process can deliver. Any future tightening of controls (for example, restrictions on HBM2e sales to China) could impact the 910C's memory subsystem, though Huawei has been working to qualify domestic HBM suppliers.

DeepSeek's Ascend Strategy. DeepSeek's decision to optimize V4 for Ascend hardware is the most significant validation of the 910C ecosystem. DeepSeek's approach involved several key innovations that extract maximum performance from Ascend constraints:

  • Communication-efficient training algorithms that reduce all-reduce frequency, minimizing HCCS bandwidth limitations
  • Custom operator implementations in CANN that bypass generic framework overhead
  • Mixed-precision strategies optimized for the Da Vinci core's FP16/INT8 capabilities
  • Memory-efficient attention implementations that work within the 96GB budget

These optimizations are not transferable to other hardware, which deepens the Ascend ecosystem lock-in but also proves that the hardware is capable when software is purpose-built for it.

Future Ascend Roadmap. Huawei has not disclosed detailed roadmap information for next-generation Ascend chips. Industry analysts expect a 910D or next-generation Ascend within 12-18 months, likely on an improved SMIC process (possibly 5nm DUV, though SMIC's 5nm readiness is unconfirmed). Key targets for the next generation would include:

  • HBM3 or equivalent memory for higher bandwidth (addressing the 1,800 GB/s bottleneck)
  • FP8 hardware support (closing the gap with NVIDIA and AMD)
  • Improved HCCS interconnect bandwidth (addressing the scaling efficiency gap)
  • Higher core count or clock speed for increased compute throughput

Whether Huawei can deliver on these targets depends largely on SMIC's process development progress and HBM supply chain availability - both factors that are outside Huawei's direct control.

Use Case Recommendations

Strong Fit:

  • Chinese AI companies training domestic models. If you are operating within China and need to train models at 70B+ scale, the 910C is the most capable domestic option. The 96GB memory and ~800 TFLOPS FP16 provide sufficient compute for frontier model training when scaled to hundreds or thousands of chips.
  • Chinese cloud providers building inference capacity. For serving Qwen, Yi, DeepSeek, and other domestic models to Chinese users, the 910C's 96GB memory enables single-chip INT8 serving of 70B-class models. The total cost of ownership is competitive given that NVIDIA alternatives are not available.
  • Government and military AI programs. For applications where supply chain sovereignty is non-negotiable, the 910C - manufactured entirely in China - is the only option that provides complete independence from Western technology supply chains.
  • Organizations that have already invested in CANN. If your engineering team has already built CANN expertise on the 910B, the 910C offers a smooth upgrade path with immediate performance gains and no software rewrite.

Weak Fit:

  • Organizations outside China. The 910C is effectively unavailable outside China. Even if you could procure units, the CANN ecosystem is optimized for Chinese frameworks and Chinese-language documentation. Western organizations should look at AMD MI300X, NVIDIA H100, or Google TPU alternatives.
  • Workloads requiring maximum memory bandwidth. At 1,800 GB/s, the 910C's bandwidth bottleneck limits LLM inference throughput to roughly half of what an H100 or MI300X can achieve. If per-chip inference throughput is the critical metric, the 910C underperforms.
  • Teams without CANN expertise. The CANN learning curve is steep, especially for teams coming from CUDA. English-language documentation is limited, community resources are sparse, and debugging tools are less mature than CUDA's. Budget significant onboarding time.
  • Applications requiring FP8 precision. The 910C does not have confirmed FP8 hardware support. If your inference pipeline relies on FP8 quantization for throughput (as most modern serving engines do), you will be limited to INT8 or FP16 on the 910C, with lower effective throughput than FP8-capable alternatives.
  • Multi-modal AI workloads. Vision-language models, audio processing, and other multi-modal pipelines have less CANN optimization than pure language model workloads. The operator coverage for vision encoders, audio feature extractors, and cross-modal attention on CANN is thinner than for standard Transformer layers.

Decision Framework for Chinese Organizations. For AI teams in China choosing between the 910B and 910C:

| Deciding Factor | Choose 910B | Choose 910C |
|---|---|---|
| Budget per chip | Under $12,000 | $12,000-$18,000 acceptable |
| Target model size (inference) | Sub-34B INT8 | 34B-70B INT8 |
| Cluster scale needed | 1,000+ chips (cost matters) | Under 500 chips (performance per chip matters) |
| Existing infrastructure | Expanding 910B clusters | New deployment |
| Primary workload | Training (large batch) | Inference (memory-bound) |
| Power constraints | Limited facility power | Power budget available |
| Timeline | Need hardware now | Can wait for production ramp |

The 910C is the default choice for new deployments in 2025-2026. The 910B remains relevant for budget-constrained expansions of existing clusters and for training workloads where per-chip performance matters less than aggregate throughput at lower total cost.

Strategic Importance and Investment Trajectory. The Chinese government has committed significant resources to domestic AI chip development. National and provincial subsidies for Ascend hardware procurement reduce the effective cost for Chinese organizations, making the 910C's price-performance equation more favorable than the list pricing suggests. The "New Generation AI Development Plan" and related policy initiatives treat domestic AI chip adoption as a national priority, which provides demand certainty that supports Huawei's continued R&D investment in the Ascend line. This government backing is a structural advantage that no Western AI chip competitor can replicate - it provides a guaranteed customer base regardless of competitive performance gaps.

For the broader global AI industry, the 910C represents a significant data point. It demonstrates that export controls have not halted Chinese AI hardware development - they have redirected it toward domestic alternatives. The performance gap between the 910C and the H100 (roughly 0.5-0.6x) is smaller than many analysts predicted when controls were first imposed. Whether this gap will narrow or widen in future generations depends on SMIC's process development progress and Huawei's ability to source advanced memory and packaging technologies under ongoing restrictions.

Strengths

  • 96GB HBM2e - the highest memory capacity in the Ascend line, enabling INT8 serving of 70B-class models on a single chip
  • Estimated ~800 TFLOPS FP16 - roughly 2.5x the A100's 312 TFLOPS peak and competitive for many training workloads
  • Manufactured entirely within China, immune to US export control disruptions
  • CANN software stack has matured significantly with DeepSeek and other frontier model training
  • Growing domestic ecosystem with support from Baidu, Alibaba, Tencent, ByteDance, and DeepSeek
  • 600W TDP is moderate compared to H100's 700W and MI300X's 750W
  • Strategic importance ensures continued Chinese government investment and subsidies

Weaknesses

  • Memory bandwidth (~1,800 GB/s) is roughly 54% of the H100's 3,350 GB/s - a significant bottleneck for inference
  • SMIC 7nm process trails TSMC by 2+ technology generations, limiting transistor density and power efficiency
  • CANN software ecosystem is far smaller than CUDA - fewer libraries, tools, and community resources
  • No FP8 hardware support confirmed, limiting performance on quantized inference workloads
  • Higher estimated price-per-TFLOP than the AMD MI300X despite lower absolute performance
  • Effectively unavailable outside China due to export restrictions and supply priorities
  • Multi-chip interconnect (HCCS) bandwidth is not competitive with NVLink or Infinity Fabric for large-scale training

About the author: AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.