Cambricon MLU590 - China's Inference Accelerator

Full specs and analysis of the Cambricon MLU590 - 192GB HBM2e, ~2,400 GB/s bandwidth, TSMC 7nm, and what it means for AI inference outside the NVIDIA ecosystem.

TL;DR

  • 192GB HBM2e at ~2,400 GB/s - more memory and bandwidth than any consumer GPU, positioned as a datacenter inference card
  • ~780 TFLOPS estimated FP8 performance - competitive with NVIDIA's datacenter GPUs on paper, but software ecosystem is the real question
  • TSMC 7nm process at 450W TDP - two generations behind NVIDIA's current process node
  • Estimated pricing of $8,000-$12,000 puts it firmly outside consumer territory, but relevant for small business and research lab deployments in China
  • Gaining strategic importance as US export controls push Chinese AI companies toward domestic alternatives to NVIDIA

Overview

The Cambricon MLU590 is not a home lab GPU. At ~$8,000-$12,000 and 192GB of HBM2e, it is a datacenter inference accelerator that competes with NVIDIA's A100 and H100 in China's domestic market. I am including it in this hardware comparison because it represents something important for the broader AI hardware landscape: the emergence of a non-NVIDIA inference ecosystem that is actively being adopted by some of the largest AI deployments in the world, including companies like DeepSeek that are optimizing frontier models specifically for this hardware.

Cambricon (Zhongke Hanxin) is a Beijing-based chip designer that has been building neural network processors since 2016. The company was co-founded by Chen Tianshi and Chen Yunji, brothers who developed some of the earliest neural network accelerator architectures at the Chinese Academy of Sciences. The MLU590 is the flagship product in their MLU500 series, built on TSMC's 7nm process with 192GB of HBM2e memory. The raw specs are impressive on paper - ~2,400 GB/s of memory bandwidth and an estimated 780 TFLOPS of FP8 compute put it in the same general territory as NVIDIA's H100 (3,350 GB/s, 1,979 TFLOPS FP8). The gap is real, but it is narrower than many Western observers assume.

Where the MLU590 falls behind is software. NVIDIA's CUDA ecosystem has 18 years of library development, compiler optimization, and community tooling. Cambricon's BANG (Basic Architecture for Next Generation) programming model and CNToolkit are functional but immature by comparison. Model porting from CUDA to BANG requires non-trivial engineering effort. That said, DeepSeek V4 was reportedly optimized for Cambricon and Huawei Ascend chips from the ground up, suggesting that at least some frontier Chinese AI labs have committed to building software stacks around domestic hardware. If V4's performance claims hold, the MLU590's software ecosystem gains a powerful validation case.

Key Specifications

| Specification | Details |
|---|---|
| Manufacturer | Cambricon (Zhongke Hanxin) |
| Product Family | MLU500 Series |
| Architecture | MLUarch03 |
| Process Node | TSMC 7nm |
| Compute Clusters | 32 (estimated) |
| Memory | 192GB HBM2e |
| Memory Bandwidth | ~2,400 GB/s |
| FP8 Performance | ~780 TFLOPS (estimated) |
| FP16 Performance | ~390 TFLOPS (estimated) |
| INT8 Performance | ~780 TOPS (estimated) |
| BF16 Performance | ~390 TFLOPS (estimated) |
| TDP | 450W |
| Power Connector | Dual 8-pin or custom |
| Form Factor | Dual-slot PCIe card (passive cooling) |
| PCIe Interface | PCIe 5.0 x16 |
| Interconnect | MLU-Link (proprietary, multi-card) |
| Programming Model | BANG (Basic Architecture for Next Generation) |
| Software Stack | CNToolkit, CNNL, MagicMind |
| Estimated Price | $8,000-$12,000 |
| Release Date | H1 2024 |
| Primary Market | China domestic |

Performance Benchmarks

Direct benchmarks comparing the MLU590 to NVIDIA GPUs are scarce. Most available data comes from Cambricon's own publications, Chinese AI company disclosures, and limited third-party testing. The following tables include estimated figures based on available information and should be treated with appropriate skepticism.

Raw Compute Comparison

| Specification | MLU590 | NVIDIA H100 (SXM) | NVIDIA A100 (80GB) | RTX 5090 |
|---|---|---|---|---|
| FP8 TFLOPS | ~780 | 1,979 | N/A (no FP8) | ~400 |
| FP16 TFLOPS | ~390 | 990 | 312 | ~200 |
| BF16 TFLOPS | ~390 | 990 | 312 | ~200 |
| INT8 TOPS | ~780 | 1,979 | 624 | ~400 |
| Memory | 192GB HBM2e | 80GB HBM3 | 80GB HBM2e | 32GB GDDR7 |
| Memory Bandwidth | ~2,400 GB/s | 3,350 GB/s | 2,039 GB/s | 1,792 GB/s |
| TDP | 450W | 700W | 400W | 575W |
| Process Node | 7nm | 4nm | 7nm | 4NP |

The H100 is faster across every compute metric - roughly 2.5x on FP8 and FP16. But the H100 also draws 250W more power, costs 2-3x more, and most critically for the Chinese market, is not available for purchase due to US export controls. The relevant comparison for Chinese buyers is not "MLU590 vs H100" but "MLU590 vs nothing" or "MLU590 vs pre-ban A100 stock at inflated prices."

Estimated Inference Performance

| Model | MLU590 (192GB) | NVIDIA H100 (80GB) | NVIDIA A100 (80GB) | RTX 5090 (32GB) |
|---|---|---|---|---|
| Llama 3.1 70B Q4_K_M (est. tok/s) | ~40-50 | ~80-100 | ~45-60 | ~22* |
| Llama 3.1 70B FP16 (est. tok/s) | ~20-25 | ~40-55 | ~25-35 | N/A (VRAM) |
| Llama 3.1 8B FP16 (est. tok/s) | ~200-250 | ~400-500 | ~250-300 | N/A |
| Mixtral 8x7B Q4_K_M (est. tok/s) | ~60-75 | ~120-150 | ~70-90 | ~45 |
| Max Model (FP16, no quant) | ~90B | ~38B | ~38B | ~15B |
| Max Model (Q4_K_M) | ~300B+ | ~130B | ~130B | ~55B |

*RTX 5090 uses aggressive Q3_K_M quantization to fit 70B models in 32GB. The MLU590 runs at full Q4_K_M quality with massive memory headroom.

Important caveat: These inference numbers are estimates based on the compute-to-bandwidth ratio and limited available data. Actual performance depends heavily on software optimization. A well-optimized BANG kernel can approach the theoretical throughput, but a naive port from CUDA may achieve only 40-60% of theoretical performance. The software stack is the critical variable, not the hardware specs.
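To make that caveat concrete, decode throughput for large models is usually memory-bandwidth-bound: each generated token streams roughly the full weight set once, so bandwidth divided by model size caps tokens per second. A minimal sketch (the 50% default efficiency is an assumption in line with the 40-60% range for unoptimized kernels; none of these helper names come from Cambricon's tooling):

```python
# Back-of-envelope decode throughput for memory-bandwidth-bound inference.
# Assumption: each generated token reads roughly all weights once, so
# tokens/sec is capped at bandwidth / model size, scaled by kernel efficiency.

def decode_tok_s(model_gb: float, bandwidth_gb_s: float, efficiency: float = 0.5) -> float:
    """Rough upper bound on tokens/sec for single-stream decoding."""
    return bandwidth_gb_s / model_gb * efficiency

# Llama 3.1 70B at Q4_K_M is roughly 40GB of weights:
print(decode_tok_s(40, 2400))  # MLU590: ~30 tok/s at 50% efficiency
print(decode_tok_s(40, 3350))  # H100:   ~42 tok/s at 50% efficiency
```

A well-tuned kernel pushes the efficiency factor up toward the estimates in the table; a naive port pushes it down, which is why the software stack dominates the outcome.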

Memory Capacity Advantage

The MLU590's 192GB of HBM2e is its standout competitive advantage. Here is what that memory capacity enables compared to consumer hardware:

| Use Case | Memory Required | MLU590 | RTX 5090 | M4 Max 128GB |
|---|---|---|---|---|
| Llama 3.1 70B FP16 | ~140GB | Yes | No | No |
| Llama 3.1 70B Q4_K_M | ~40GB | Yes | No (32GB) | Yes |
| Mixtral 8x22B Q4_K_M | ~80GB | Yes | No | Yes (tight) |
| DeepSeek V3.2 Q4_K_M (37B active) | ~22GB | Yes | Yes | Yes |
| Multiple 70B models simultaneously | ~80GB+ | Yes | No | Yes (128GB) |
| 70B model + large KV cache (32K context) | ~60GB | Yes | No | Yes |

At 192GB, the MLU590 can hold a 70B model in full FP16 precision - something that takes at least two H100s (140GB of weights alone exceeds a single 80GB card) and is simply impossible on any consumer hardware. For research teams that need to evaluate models without quantization artifacts, or for deployments that require maximum quality from large models, this memory capacity is a fundamental enabler.
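The memory figures in the table above follow from a simple rule: parameter count times effective bytes per parameter. A sketch (the ~0.56 bytes/param figure for Q4_K_M is a common community approximation, not a vendor number):

```python
# Rough weight-memory estimate behind the capacity table.
# Effective bytes/param: FP16 = 2.0, Q4_K_M ~ 0.56 (common approximation).

def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Weights only -- KV cache and activations add more on top."""
    return params_billion * bytes_per_param

print(weight_memory_gb(70, 2.0))   # 140.0 -> 70B FP16 fits in 192GB, not in 80GB
print(weight_memory_gb(70, 0.56))  # ~39 -> the ~40GB Q4_K_M figure above
```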

Real-World Deployment Scenarios

While the MLU590 is not consumer hardware, understanding how it is deployed provides context for its role in the AI hardware landscape.

Chinese Cloud Inference

Chinese cloud providers (Alibaba Cloud, Tencent Cloud, Baidu AI Cloud) have integrated MLU590 hardware into their AI inference offerings. These cloud instances provide MLU590 access to developers who do not want to procure and manage the hardware directly. The cloud pricing is competitive with equivalent A100 instances in China, partly because the MLU590's 192GB memory reduces the number of cards needed for large model serving.

For a deployment serving Qwen2.5 72B, an MLU590 can hold the entire model in memory on a single chip at FP16 (approximately 144GB for weights). An A100 (80GB) requires at least two cards with tensor parallelism. This card-count reduction translates directly to lower rack space, power, and operational costs.
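The card-count arithmetic can be sketched directly. The ~10% runtime overhead factor for KV cache and activations is an assumption for illustration, not a vendor figure:

```python
import math

# Accelerators required to hold a model's weights plus runtime overhead,
# assuming naive even sharding via tensor parallelism.

def cards_needed(model_gb: float, card_gb: float, overhead: float = 1.10) -> int:
    """Ceiling of (weights + ~10% overhead) / per-card memory."""
    return math.ceil(model_gb * overhead / card_gb)

# Qwen2.5 72B at FP16 is ~144GB of weights:
print(cards_needed(144, 192))  # 1 -- a single MLU590
print(cards_needed(144, 80))   # 2 -- the minimum A100 deployment described above
```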

University Research

Chinese universities - including Tsinghua, Peking, Zhejiang, and the Chinese Academy of Sciences - have procured MLU590 clusters for AI research. For academic groups, the MLU590's availability and Cambricon's willingness to provide academic pricing make it a practical alternative to hoarding pre-ban NVIDIA stock.

Research workflows typically involve model evaluation, architecture experimentation, and small-scale training. The 192GB memory is particularly valuable for researchers who need to load large models in full precision for analysis, ablation studies, and architecture comparison work.

DeepSeek and Frontier Model Serving

The highest-profile MLU590 deployment is DeepSeek's optimization of V4 for Cambricon hardware. DeepSeek's engineering team reportedly built custom inference kernels in BANG that are optimized for V4's specific architecture - the Mixture-of-Experts routing, Multi-head Latent Attention, and Engram Conditional Memory operations. These optimizations are not general-purpose; they are tuned to V4's specific computation patterns.

If DeepSeek's V4 inference cost estimates (~$0.14/M input tokens, ~$0.28/M output tokens) are accurate, it would demonstrate that frontier model serving is economically viable on Cambricon hardware. This has implications beyond DeepSeek - it suggests that the software gap, while real, is bridgeable for teams willing to invest in custom kernel development.

Key Capabilities

192GB HBM2e Memory Pool. The MLU590's 192GB of HBM2e is its most consequential specification. No consumer GPU comes close - the RTX 5090 tops out at 32GB, and even NVIDIA's H100 datacenter GPU has only 80GB (or 141GB in the H200 variant). This massive memory pool enables running very large models at high precision without quantization, holding multiple models simultaneously for serving, and processing extremely long contexts without memory pressure. For Chinese AI companies running frontier models like DeepSeek V4, this memory capacity is a fundamental enabler.

The HBM2e memory technology also provides consistent bandwidth characteristics. Unlike GDDR memory, which can have variable latency depending on access patterns, HBM2e provides uniform access across the entire 192GB address space. This predictability matters for inference serving at scale, where consistent latency is more important than peak throughput.

Export Control Resilience. The MLU590 exists because of US export controls on advanced AI chips. Since October 2022, successive rounds of restrictions have limited NVIDIA and AMD's ability to sell their most capable GPUs to Chinese customers. The initial October 2022 rule targeted chips exceeding specific compute density thresholds, which blocked the A100 and H100. The October 2023 update expanded the net further, blocking the A800 and H800 (China-specific variants NVIDIA had developed to comply with the original rules). The January 2025 rules added additional restrictions on chip interconnects and manufacturing equipment.

The MLU590, designed and sold by a Chinese company (though manufactured at TSMC), is not subject to these restrictions. For Chinese AI labs, universities, and cloud providers, it represents supply chain independence - a domestically controllable hardware platform that cannot be cut off by geopolitical decisions. This strategic dimension matters as much as the technical specs for the card's adoption trajectory.

DeepSeek Optimization. The most significant recent development for the MLU590 ecosystem is DeepSeek's decision to optimize V4 for Cambricon and Huawei Ascend chips. If a frontier model with trillion-scale parameters runs optimally on the MLU590, it validates the entire Cambricon software stack and gives other Chinese AI companies confidence to adopt the platform. This is the kind of ecosystem anchor that CUDA had with every major model developer supporting NVIDIA first - except here it is happening in the opposite direction, with a model developer choosing domestic chips over NVIDIA.

The DeepSeek relationship also provides practical software benefits. DeepSeek's engineering team has reportedly contributed optimized inference kernels and operator implementations to Cambricon's software stack as part of the V4 optimization effort. These contributions improve performance not just for DeepSeek's models but for any workload using similar operations - attention mechanisms, MoE routing, KV cache management.

MLU-Link Interconnect. For multi-chip deployments, the MLU590 supports MLU-Link - Cambricon's proprietary chip-to-chip interconnect. While detailed specifications are not publicly available, MLU-Link enables tensor parallelism across multiple MLU590 cards within a server, similar to NVIDIA's NVLink. For training or inference on models that exceed a single chip's memory (models above ~90B parameters in FP16), MLU-Link enables scaling across 4 or 8 cards.

BANG Programming Model. BANG is Cambricon's equivalent of CUDA - a C/C++ based programming model for writing custom kernels that run on the MLU hardware. While significantly less mature than CUDA, BANG provides the low-level access needed to optimize inference kernels for specific model architectures. The learning curve for CUDA programmers is moderate - the core concepts (grids, blocks, shared memory) translate, but the hardware-specific optimizations and memory hierarchy differ.

Software Ecosystem

The MLU590's software stack is its most significant limitation compared to NVIDIA hardware. Here is an honest assessment of where things stand.

What Works

  • CNToolkit: Cambricon's equivalent of CUDA Toolkit. Includes compiler, runtime, profiler, and debugger. Functional for basic kernel development and deployment.
  • CNNL (Cambricon Neural Network Library): Pre-optimized operators for common neural network operations - matrix multiplication, convolution, activation functions, attention mechanisms. Covers the operations needed for most transformer inference.
  • MagicMind: High-level inference optimization framework, similar to NVIDIA's TensorRT. Accepts ONNX models and produces optimized inference graphs for MLU hardware. Supports quantization, graph optimization, and multi-stream execution.
  • PyTorch/CNTORCH: Cambricon provides a modified PyTorch distribution (CNTORCH) that routes tensor operations to MLU hardware. Model porting from standard PyTorch is possible with varying levels of effort depending on which operations the model uses.

What Is Immature

  • Community Tools: There is no equivalent of llama.cpp, vLLM, or Ollama for MLU hardware. Inference requires using Cambricon's own tools or building custom solutions.
  • Model Coverage: Not every model architecture has optimized operators in CNNL. Newer operations (flash attention variants, MoE routing, speculative decoding) may need custom kernel development.
  • Documentation: While Cambricon provides documentation, it is primarily in Chinese. English-language documentation exists but is less comprehensive. Community forums and knowledge bases are smaller than CUDA's by orders of magnitude.
  • Third-Party Integration: Most popular AI/ML tools assume CUDA. Integration with common MLOps platforms, monitoring tools, and deployment frameworks requires custom work.

The Software Gap in Practice

For a team deploying a well-known model architecture (Llama, Mistral, Qwen) on the MLU590, the software stack is workable. MagicMind can optimize the ONNX graph, CNNL provides the core operators, and the result runs at reasonable (though not optimal) performance. The pain point is on the edges - custom attention patterns, novel architectures, debugging performance issues, and the general velocity of iteration that CUDA's mature ecosystem enables.

For a home lab user accustomed to running `ollama pull llama3.1` and having inference working in 30 seconds, the MLU590 is in a different universe. This is enterprise and research hardware that requires engineering effort to deploy and optimize.

Geopolitical Context

Understanding the MLU590 requires understanding the AI chip export control landscape. Here is a brief timeline of how we got here:

October 2022: The Bureau of Industry and Security (BIS) publishes the first round of advanced computing export controls. NVIDIA's A100 and H100 are effectively blocked from sale to Chinese entities. AMD's MI250X is similarly restricted.

November 2022 - March 2023: NVIDIA develops China-specific variants (A800, H800) with reduced interconnect bandwidth to comply with the rules. These sell briskly in China.

October 2023: BIS updates the rules, closing the A800/H800 loophole. Virtually all high-end NVIDIA and AMD datacenter GPUs are now blocked.

2024: Chinese AI companies accelerate adoption of domestic alternatives. Huawei Ascend 910B and Cambricon MLU590 become primary procurement targets. DeepSeek reportedly begins optimizing V4 for Cambricon and Ascend hardware.

January 2025: Additional BIS restrictions target semiconductor manufacturing equipment and chip interconnects, further tightening the technology squeeze.

Early 2026: DeepSeek announces V4 - the first frontier model optimized primarily for Chinese hardware. Cambricon and Huawei are the primary hardware targets. NVIDIA GPU support is secondary.

The MLU590 sits at the center of this geopolitical shift. Its adoption is driven as much by supply chain security as by technical merit. Chinese buyers are not choosing the MLU590 because it is better than an H100 - they are choosing it because it is available, domestically supported, and cannot be retroactively blocked by a policy change in Washington.

Pricing and Availability

The MLU590 is primarily available through Cambricon's direct sales channels and Chinese cloud providers. It is not sold through retail channels and is not readily available outside China. Estimated pricing of $8,000-$12,000 per card puts it in the same range as NVIDIA's A100 (which sells for $10,000-$15,000 when available in China, often at significant markups on the gray market) but well below the H100's $25,000-$30,000 price tag.

Price Comparison in the Chinese Market

| Accelerator | Est. Price (China) | Memory | $/GB Memory | Availability |
|---|---|---|---|---|
| MLU590 | $8,000-$12,000 | 192GB | $42-$63/GB | Readily available |
| Huawei Ascend 910B | $8,000-$15,000 | 64GB | $125-$234/GB | Available (government priority) |
| NVIDIA A100 (80GB) | $10,000-$15,000 | 80GB | $125-$188/GB | Restricted (pre-2023 stock) |
| NVIDIA H100 (80GB) | $25,000-$30,000 | 80GB | $313-$375/GB | Effectively unavailable |
| NVIDIA H800 | $15,000-$20,000 | 80GB | $188-$250/GB | Restricted (pre-Oct 2023 stock) |
| RTX 4090 (export variant) | $2,500-$4,000 | 24GB | $104-$167/GB | Gray market, limited |

For Chinese organizations, the MLU590 offers the best memory-per-dollar ratio available without export control risk. The 192GB at ~$42-$63 per GB is roughly 3x more cost-effective on memory alone than the Huawei Ascend 910B. Whether that memory advantage translates to real-world performance depends entirely on the maturity of the software stack for your specific workload.

Availability Outside China

For buyers outside China, the MLU590 is essentially inaccessible. Cambricon does not have international distribution channels, the software ecosystem is documented primarily in Chinese, and support infrastructure is domestic. This is a China-market product - relevant for understanding the global AI hardware landscape, but not a purchasing option for a home lab in North America, Europe, or most of Asia outside mainland China.

There are niche exceptions - some Southeast Asian cloud providers and research institutions with ties to Chinese universities have procured MLU590 hardware for specific projects. But these are institutional purchases with direct vendor relationships, not retail availability.

Comparison with Huawei Ascend 910B

The MLU590's closest domestic competitor is Huawei's Ascend 910B. Both are Chinese-designed inference accelerators targeting the same post-export-control market.

| Specification | MLU590 | Huawei Ascend 910B |
|---|---|---|
| Memory | 192GB HBM2e | 64GB HBM2e |
| Memory Bandwidth | ~2,400 GB/s | ~1,600 GB/s |
| FP16 TFLOPS | ~390 | ~320 |
| BF16 TFLOPS | ~390 | ~320 |
| INT8 TOPS | ~780 | ~640 |
| Process Node | TSMC 7nm | TSMC 7nm* |
| TDP | 450W | 350W |
| Interconnect | MLU-Link | HCCS |
| Software Stack | BANG / CNToolkit | CANN / Ascend Toolkit |
| Price (est.) | $8,000-$12,000 | $8,000-$15,000 |
| Key Customer | DeepSeek | Huawei Cloud, Baidu |

*The Ascend 910B's exact process node is not publicly confirmed. Some sources suggest it may use SMIC's 7nm-equivalent process for supply chain independence, though this is unverified.

The MLU590 has a clear memory advantage - 3x the memory capacity of the Ascend 910B. For inference workloads where model size exceeds 64GB (70B models in FP16, 100B+ models at any precision), the MLU590 requires fewer cards. The Ascend 910B has a broader software ecosystem thanks to Huawei's larger engineering team and its integration with Huawei Cloud.

Both chips face the same fundamental challenge: competing with an NVIDIA ecosystem that has 18 years of head start. The fact that both Cambricon and Huawei have attracted frontier AI customers (DeepSeek for Cambricon, Baidu and other Huawei Cloud tenants for Ascend) suggests that the domestic ecosystem is crossing minimum viability thresholds, even if it remains years behind CUDA in breadth and maturity.

MLU590 vs Consumer GPUs - Why It Matters

For Western home lab users, the MLU590 is not a purchasing option. But it is important context for understanding the AI hardware landscape.

| Factor | MLU590 | RTX 5090 | M4 Max 128GB |
|---|---|---|---|
| Memory | 192GB HBM2e | 32GB GDDR7 | 128GB LPDDR5X |
| Bandwidth | ~2,400 GB/s | 1,792 GB/s | 546 GB/s |
| FP16 compute | ~390 TFLOPS | ~200 TFLOPS | ~54 TFLOPS |
| TDP | 450W | 575W | ~90W (system) |
| Can run 70B FP16? | Yes (single card) | No | No |
| Can run 70B Q4? | Yes | Tight (Q3 only) | Yes |
| Software ecosystem | BANG (limited) | CUDA (mature) | Metal/MLX (growing) |
| Availability (US/EU) | No | Yes | Yes |
| Price | $8,000-$12,000 | $2,200-$2,500 | $3,999-$4,399 |

The MLU590 fills a niche that no consumer product addresses in the Western market: high-memory, high-bandwidth inference acceleration at a price point below NVIDIA's datacenter cards. The closest Western equivalent would be an NVIDIA A100 80GB ($10,000-$15,000), which has less memory at a higher price.

The Bigger Picture

The MLU590 matters because it represents the beginning of a divergence in the global AI hardware ecosystem. For the first two decades of GPU computing, NVIDIA's CUDA was a de facto global standard. Every AI company, everywhere in the world, built on the same hardware and software stack.

That is changing. The MLU590, alongside Huawei's Ascend series, is the foundation of a parallel AI hardware ecosystem that serves the world's second-largest AI market. When DeepSeek V4 launches optimized for Cambricon and Ascend hardware, it will be the first frontier model where the primary inference target is not NVIDIA. If V4 performs as leaked benchmarks suggest - competitive with Claude Opus 4.6 and GPT-5.3 on coding tasks - it proves that frontier AI does not require NVIDIA hardware.

For home lab users in the West, the MLU590 is not a purchasing option. But it is a data point worth understanding. The NVIDIA moat has always been software (CUDA) as much as hardware (Tensor Cores). The MLU590 and the DeepSeek V4 optimization story are the first serious cracks in that moat from a non-NVIDIA ecosystem.

Strengths

  • 192GB HBM2e - more memory than any other single accelerator in its price range
  • ~2,400 GB/s memory bandwidth enables large-model inference at reasonable throughput
  • ~780 TFLOPS FP8 (estimated) puts it in the same general tier as NVIDIA's datacenter cards
  • Not subject to US export controls - reliable supply for Chinese buyers
  • DeepSeek V4 optimization provides a strong ecosystem anchor for the software stack
  • MLU-Link interconnect supports multi-card scaling for even larger deployments
  • Price-to-memory ratio significantly better than NVIDIA or Huawei alternatives in the Chinese market

Weaknesses

  • Software ecosystem (BANG, CNToolkit) is years behind NVIDIA's CUDA in maturity and tooling
  • TSMC 7nm process is two generations behind NVIDIA's current 4N/5nm offerings
  • Porting CUDA-based models and frameworks requires significant engineering effort
  • Limited third-party benchmark data makes performance claims difficult to verify independently
  • Effectively unavailable outside China - no international distribution or support
  • 450W TDP on a 7nm chip suggests lower compute-per-watt efficiency than modern alternatives
  • Community and documentation resources are primarily Chinese-language

Future Outlook

The MLU590 is not Cambricon's final chip. The company has disclosed plans for next-generation products that would use more advanced process nodes and HBM3 memory, narrowing the gap with NVIDIA's current offerings. Several factors will determine whether Cambricon becomes a lasting force in the AI hardware market.

TSMC access. The MLU590 is manufactured on TSMC's 7nm process. Future advanced nodes (5nm, 3nm) are subject to increasing US pressure on TSMC to restrict exports to Chinese chip designers. If Cambricon loses access to leading-edge TSMC nodes, it may need to rely on SMIC's domestic processes, which are approximately two generations behind TSMC. This would widen the performance gap with NVIDIA rather than narrowing it.

Software ecosystem velocity. BANG and CNToolkit are improving, but the rate of improvement matters more than the current state. If Cambricon can attract a critical mass of developers - through DeepSeek's V4 momentum, through university partnerships, through cloud provider adoption - the software ecosystem could reach a viability threshold where the gap with CUDA becomes manageable. If adoption remains concentrated in a few large customers, the ecosystem will remain fragile.

Competition from Huawei. Huawei's Ascend line is Cambricon's most direct competitor in China. Huawei has deeper pockets, a larger engineering team, and the advantage of vertical integration (Huawei Cloud runs on Ascend hardware). If Huawei captures the majority of the domestic market, Cambricon may be squeezed into a niche position. The 192GB memory advantage is a genuine differentiator today, but Huawei's next-generation Ascend is likely to close that gap.

Model ecosystem alignment. The most important factor is whether models continue to be optimized for Cambricon hardware. DeepSeek V4 is a strong start. If Baidu, Alibaba, Tencent, and other Chinese AI companies follow DeepSeek's lead in optimizing for domestic chips, the MLU590 and its successors will have a sustainable market. If Chinese companies continue to prefer NVIDIA hardware (obtained through gray market channels or pre-ban stockpiles) for model development and only use domestic chips for deployment, the optimization story remains incomplete.

Cambricon - Company Background

Understanding Cambricon as a company provides context for the MLU590's position in the market.

Cambricon was founded in 2016 by Chen Tianshi and Chen Yunji, brothers who developed some of the earliest deep learning processor architectures at the Institute of Computing Technology (ICT) at the Chinese Academy of Sciences. Their Cambricon-1A neural network unit was integrated into Huawei's Kirin 970 SoC in 2017 - meaning Cambricon's IP was inside tens of millions of smartphones before the company ever built a standalone accelerator.

The company went public on the Shanghai STAR Market in 2020, raising approximately $390 million. It has since invested heavily in the MLU product line and the BANG software ecosystem. Revenue comes primarily from cloud service providers, government and military customers, and enterprise deployments. Cambricon is not profitable - like many Chinese semiconductor companies, it operates at a loss while building market share and technology capability.

The relationship with Huawei is complex. Cambricon's IP was once inside Huawei's chips, but Huawei has since developed its own Da Vinci architecture for the Ascend line, making Huawei both a former partner and a current competitor. The market is large enough for both, but the competition for government and SOE (state-owned enterprise) procurement is intense.

Frequently Asked Questions

Can I buy an MLU590 outside China?

Effectively no. Cambricon does not have international distribution, the software documentation is primarily in Chinese, and support is domestic. Some Southeast Asian institutions have procured MLU590 hardware through direct vendor relationships, but this is not retail availability.

How does the MLU590 compare to the NVIDIA H100?

The H100 is faster on every compute metric (2.5x FP8, 40% more bandwidth) and has a vastly more mature software ecosystem. The MLU590's advantages are price ($8,000-$12,000 vs $25,000-$30,000), memory (192GB vs 80GB), and availability in China (unrestricted vs effectively banned). They serve different markets and different constraints.

Is the MLU590 relevant for Western developers?

Not as a purchasing option, but as context, yes. If DeepSeek V4 runs well on MLU590 hardware and achieves frontier performance, it demonstrates that CUDA is not a hard requirement for frontier AI. This has implications for AMD's ROCm, Intel's oneAPI, and any other project trying to break NVIDIA's dominance. The MLU590 is proof that alternative AI hardware ecosystems can reach production viability with sufficient software investment.

What happens if TSMC is forced to stop manufacturing for Cambricon?

This is the scenario that keeps Chinese chipmakers up at night. If TSMC cuts off Cambricon, the fallback is SMIC's domestic fabrication, which currently tops out at approximately 7nm-equivalent (some reports suggest limited 5nm capability). A transition to SMIC would not immediately disable Cambricon - the current MLU590 is already on 7nm - but it would prevent the company from accessing the 5nm and 3nm nodes needed to close the performance gap with NVIDIA's current 4nm products. The result would be a widening gap in compute efficiency, partially offset by memory capacity advantages (HBM stacks are less process-dependent than logic dies).

How does the MLU590 handle Mixture-of-Experts (MoE) models?

MoE models like Mixtral and DeepSeek V3/V4 are particularly well-suited to the MLU590's 192GB memory. MoE architectures have a large total parameter count but only activate a fraction of experts per token, meaning the memory requirement is high (all experts must be loaded) but the compute requirement per token is moderate. The MLU590's memory capacity allows it to hold the full expert set in memory while its compute throughput handles the per-token expert routing efficiently. DeepSeek's V4 optimization likely includes custom MoE routing kernels that are specifically tuned for the MLU590's compute cluster topology.
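The memory-versus-compute asymmetry is easy to quantify. A sketch using Mixtral 8x7B's public figures (~47B total parameters, ~13B active per token); the ~2 FLOPs-per-active-parameter-per-token rule of thumb is a standard approximation, not a Cambricon figure:

```python
# MoE asymmetry: memory scales with TOTAL params, per-token compute with ACTIVE.

def moe_profile(total_b: float, active_b: float, bytes_per_param: float = 2.0):
    """Returns (resident memory in GB, per-token compute in GFLOPs),
    using ~2 FLOPs per active parameter per generated token."""
    memory_gb = total_b * bytes_per_param
    gflops_per_token = 2 * active_b  # billions of params -> GFLOPs
    return memory_gb, gflops_per_token

mem, flops = moe_profile(47, 13)
print(mem)    # 94.0 GB at FP16 -- every expert must stay resident
print(flops)  # 26 GFLOPs/token -- dense-13B-class compute per token
```

This is why a high-memory, moderate-compute card is a natural fit for MoE serving: the full expert set demands capacity, while each token's routing only exercises a fraction of the FLOPS.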

Is the MLU590 suitable for training, or only inference?

The MLU590 is primarily marketed as an inference accelerator, but it can be used for small-to-medium scale training. Full-precision Adam training state (parameters, gradients, optimizer moments) costs roughly 16 bytes per parameter, so 192GB caps out near 12B parameters in full precision; with mixed precision, gradient checkpointing, and memory-lean optimizer state, models up to approximately 30B parameters are feasible on a single card. However, the software ecosystem for training on BANG is less mature than for inference, and most Chinese AI companies use NVIDIA hardware (where available) or Huawei Ascend for training, reserving the MLU590 for inference serving.
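The training-memory rule of thumb can be sketched as follows. The byte-per-parameter figures are common community approximations (4 bytes each for FP32 weights and gradients plus 8 for Adam moments; ~6 bytes/param for lean mixed-precision setups), not Cambricon-published numbers:

```python
# Training-state memory rule of thumb (assumption, not a vendor figure):
# full-precision Adam ~16 bytes/param (4 weights + 4 grads + 8 moments);
# mixed precision with memory-lean optimizer state ~6 bytes/param.

def training_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Parameters + gradients + optimizer state; activation memory is extra
    (gradient checkpointing keeps it small)."""
    return params_billion * bytes_per_param

print(training_memory_gb(12, 16.0))  # 192.0 -- full Adam fills the card near 12B
print(training_memory_gb(30, 6.0))   # 180.0 -- ~30B fits with mixed precision
```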

About the author: AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.