DeepSeek V4-Pro runs on Huawei's Ascend 950PR as the State Department pivots export controls from chip access to model IP, and two parallel AI stacks come into view.

Three developments arrived together: a 1.6-trillion-parameter Mixture-of-Experts model optimized for Huawei's Ascend 950PR inference chip; a hybrid attention architecture cutting per-token compute requirements by 73%; and a Friday diplomatic cable from the U.S. State Department, seen by Reuters, warning governments worldwide about alleged IP theft through model distillation. Together they describe frontier-AI development now operating across two parallel compute stacks with diverging hardware economics and policy postures.
DeepSeek released V4-Pro and V4-Flash on April 24, 2026, the same day the State Department instructed diplomatic posts to raise "concerns over adversaries' extraction and distillation of U.S. A.I. models" and named DeepSeek, Moonshot AI, and MiniMax as subjects of those allegations. V4-Pro is the first frontier-tier open-weight model explicitly engineered for an export-controlled-hardware alternative; the cable is the first formal diplomatic action targeting model IP rather than chip access.
For systems architects, model developers, and procurement leads managing AI infrastructure budgets, the V4 release carries three load-bearing technical claims that determine its operational viability: a hybrid attention stack (Compressed Sparse Attention, Heavily Compressed Attention, and DeepSeek Sparse Attention) that DeepSeek reports cuts inference FLOPs to 27% of V3.2's requirement at one-million-token context and KV-cache size to 10% of the prior generation; weights quantized to FP4 for Mixture-of-Experts parameters and FP8 for most others; and inference support confirmed on Huawei's Ascend 950PR chip, which entered volume production in Q1 2026 with 128 GB HiBL 1.0 memory at 1.6 TB/s bandwidth. NVIDIA published day-zero performance benchmarks showing over 150 tokens per second per user on the GB200 NVL72, confirming V4-Pro also runs on NVIDIA hardware without modification.
The question is not whether V4-Pro runs. The question is whether it runs economically at scale on non-NVIDIA silicon under production workloads, and whether the policy regime that spent three years tightening hardware export controls has already conceded that those controls were routed around at the model layer.
V4-Pro is a 1.6-trillion-parameter Mixture-of-Experts model with 49 billion active parameters per token. V4-Flash is the smaller sibling at 284 billion total parameters with 13 billion active per forward pass. Both carry a one-million-token context window and are released under the MIT license with weights available in MXFP4 + FP8 mixed precision. DeepSeek itself acknowledges V4-Pro lags GPT-5.4 and Claude Opus 4.6 by roughly three to six months of development. It is, however, priced at $1.74 per million input tokens and $3.48 per million output tokens, an order of magnitude below U.S. frontier API pricing.
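The disclosed numbers (1.6T total parameters, FP4/FP8 mixed precision, the published API prices) support some back-of-envelope arithmetic. The split between FP4 and FP8 parameters below is an illustrative assumption, not a figure DeepSeek has disclosed:

```python
# Back-of-envelope sketch: V4-Pro weight footprint and serving cost.
# Only the total parameter count (1.6T) and the API prices come from
# the release; the FP4 fraction is a hypothetical assumption.

def weight_footprint_tb(total_params, fp4_fraction):
    """Approximate weight size in terabytes, assuming
    FP4 = 0.5 bytes/param and FP8 = 1 byte/param."""
    fp4_bytes = total_params * fp4_fraction * 0.5
    fp8_bytes = total_params * (1 - fp4_fraction) * 1.0
    return (fp4_bytes + fp8_bytes) / 1e12

# Assume ~90% of a large MoE's parameters sit in FP4 expert weights.
size_tb = weight_footprint_tb(1.6e12, fp4_fraction=0.90)
print(f"~{size_tb:.2f} TB of weights")          # ~0.88 TB under these assumptions

# Cost of a 1M-token-in / 100K-token-out call at V4-Pro's published prices.
cost = 1.0 * 1.74 + 0.1 * 3.48
print(f"${cost:.2f} per 1M-in/100K-out call")   # $2.09
```

Under these assumptions the weights alone span roughly seven 128 GB 950PR devices before any KV cache is allocated, which is why the cache-compression claims below are load-bearing rather than cosmetic.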
The architectural claim that matters for infrastructure planners is the efficiency stack. According to DeepSeek's model card and NVIDIA's technical blog post, V4-Pro requires 73% fewer FLOPs per token than V3.2 at one-million-token context and uses 90% less KV-cache memory. Those reductions are the product of the hybrid attention mechanism: Compressed Sparse Attention handles the bulk of the context; Heavily Compressed Attention compresses further for the longest sequences; DeepSeek Sparse Attention routes activation sparsity across the Mixture-of-Experts layers. The combination means a model serving one million tokens of context fits in fewer accelerators and requires less memory bandwidth per query than architectures that apply full attention across the entire window.
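The scale of the KV-cache claim can be made concrete with the standard cache-size formula. The layer and head counts below are hypothetical placeholders, not V4-Pro's disclosed architecture; only the 10%-of-baseline ratio comes from DeepSeek's reported figures:

```python
# Sketch of why a 90% KV-cache reduction matters at 1M-token context.
# Layer/head/dim values are hypothetical, chosen only for illustration.

def kv_cache_gb(tokens, layers, kv_heads, head_dim, bytes_per_elem):
    """Uncompressed K+V cache size in GB for one sequence
    (factor of 2 covers the separate K and V tensors)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 1e9

baseline = kv_cache_gb(tokens=1_000_000, layers=64, kv_heads=8,
                       head_dim=128, bytes_per_elem=1)  # FP8 cache
compressed = baseline * 0.10  # the reported 90% reduction

print(f"baseline  : {baseline:.0f} GB per 1M-token sequence")
print(f"compressed: {compressed:.1f} GB per 1M-token sequence")
```

At ~131 GB per sequence under these placeholder dimensions, a single 128 GB 950PR could not hold even one full-attention 1M-token cache; at ~13 GB it holds several concurrently, which is the practical meaning of "fits in fewer accelerators."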
On hardware, the Ascend 950 family consists of two SKUs with distinct availability timelines and target workloads:
| Chip | Memory Type | Capacity | Bandwidth | FLOPs (FP8) | Target Workload | Availability |
|---|---|---|---|---|---|---|
| Ascend 950PR | HiBL 1.0 | 128 GB | 1.6 TB/s | 1 PFLOPS | Prefill-stage inference, recommendation | Q1 2026 (volume production) |
| Ascend 950DT | HiZQ 2.0 | 144 GB | 4.0 TB/s | 1 PFLOPS | Decode-stage inference, model training | Q4 2026 (planned) |
*Source: Huawei Connect 2025 keynote by Eric Xu; volume targets from Asia Times citing Huawei MWC 2026 disclosures.*
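The bandwidth column explains the workload split. Single-stream decode is memory-bandwidth bound: every generated token must stream the model's active weights. A roofline-style sketch, with the bytes-per-parameter figure an assumption (FP4-dominant mix) and KV-cache reads and batching ignored:

```python
# Roofline-style ceiling for single-stream decode throughput:
# tokens/s is capped by bandwidth / bytes streamed per token.
# bytes_per_param = 0.5 assumes an FP4-dominant weight mix.

def decode_tokens_per_sec(bandwidth_tbs, active_params, bytes_per_param):
    bytes_per_token = active_params * bytes_per_param
    return bandwidth_tbs * 1e12 / bytes_per_token

active = 49e9  # V4-Pro active parameters per token
for name, bw_tbs in [("950PR", 1.6), ("950DT", 4.0)]:
    ceiling = decode_tokens_per_sec(bw_tbs, active, bytes_per_param=0.5)
    print(f"{name}: ~{ceiling:.0f} tok/s ceiling per stream")
```

Under these assumptions the 950PR tops out around 65 tok/s per stream while the 950DT's 4.0 TB/s lifts the ceiling past 160 tok/s, the same regime as NVIDIA's reported 150+ tok/s per user on GB200 NVL72. That is consistent with Huawei positioning the 950PR for prefill and the 950DT for decode.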
What is available today is 950PR-class inference hardware. What is not available until Q4 2026 or later is the training-class chip and the large-scale SuperPoD configurations. That gap is not a hedging note. It is the story. V4-Pro on Ascend today means inference on 950PR. Frontier-scale training and long-context decode at one million tokens on Ascend-class bandwidth is a future capability, pending 950DT volume shipments.
DeepSeek has not publicly disclosed V4-Pro's pretraining hardware in the same detail it disclosed for V3. The V3 technical report specified 2,048 NVIDIA H800 GPUs at 2.788 million GPU-hours and approximately $5.6 million in direct compute costs. The V4 technical report, published April 24, 2026 alongside the model release, notes that the company validated its expert-parallel scheme on "both Nvidia GPUs and Ascend NPU platforms" but does not specify which hardware was used for the pretraining run. Reuters previously reported that Chinese government officials recommended DeepSeek integrate Huawei chips in its training process, framing Ascend adoption as state-directed rather than purely a commercial optimization decision. The absence of a training hardware disclosure in the V4 paper, relative to the V3 disclosure standard, is conspicuous.
V4's post-training pipeline introduces On-Policy Distillation, drawing on outputs from ten separate teacher models. According to reporting on the V4 technical paper, DeepSeek first trains specialized in-house models for math, code, agents, and instruction-following using supervised fine-tuning and a reinforcement learning technique called GRPO, then uses a single student model to learn from all of those in-house teachers in a unified consolidation phase.
The identity of those ten teacher models is not publicly specified in available sources, and that gap is what connects DeepSeek's disclosed OPD methodology to the cable's distillation allegations.
The cable, dated April 24 and seen directly by Reuters reporters Raphael Satter and Alexandra Alper, instructs diplomatic posts worldwide to warn governments about the "risks of utilizing AI models distilled from U.S. proprietary AI models" and to "lay the groundwork for potential follow-up and outreach by the U.S. government." It states that campaigns "deliberately strip security protocols from the resulting models and undo mechanisms that ensure those AI models are ideologically neutral and truth-seeking." A separate demarche was sent to Beijing. The cable names DeepSeek, Moonshot AI, and MiniMax. The Chinese Embassy called the accusations groundless allegations and deliberate attacks on China's development in a statement to Reuters.
Anthropic published a formal threat report on February 23, 2026 stating that DeepSeek, Moonshot AI, and MiniMax collectively used approximately 24,000 fraudulent accounts to generate over 16 million exchanges with Claude, with DeepSeek alone responsible for over 150,000 exchanges targeting reasoning tasks and chain-of-thought data. Anthropic's detection methodology included IP address correlation, request metadata, infrastructure indicators, and unnamed industry partner corroboration. The report describes "hydra cluster" architectures distributing traffic across API endpoints and cloud platforms to evade rate limits. One proxy network managed over 20,000 simultaneous fraudulent accounts, mixing distillation requests with legitimate customer traffic.
The open question is whether any of the ten teacher models used in V4's On-Policy Distillation phase include outputs from U.S. frontier systems obtained through the alleged fraudulent-account campaigns. If the teachers are all domestic Chinese models trained independently, the cable's framing against V4 specifically loses force. If the teachers are unnamed, or if any include U.S. systems, the cable's allegations gain empirical weight. DeepSeek uses distillation openly; the State Department alleges distillation was unauthorized; the teacher model identities are the variable that connects those two claims.
The V3 paper set the disclosure standard. V4 omits the equivalent disclosure for its post-training teacher set. That absence should be read as a finding, not a gap.
The cable is not an enforcement instrument. It is a signaling instrument. It instructs posts to "warn of the risks" and to "lay the groundwork for potential follow-up." The cable explicitly positions itself as preparatory to future action, not as policy in force.
The substance of the signal is a pivot. For three years, the U.S. export-control regime tightened hardware access: ECCN 3A090.c capped HBM exports to China above 2 GB/s/mm² bandwidth density; BIS (Bureau of Industry and Security) moved H200 exports from presumption-of-denial to case-by-case review in January 2026 with a 25% Section 232 tariff; A100 and H100 chips were restricted or modified for Chinese buyers under earlier iterations of the rule. The regime's theory was that frontier-model development requires frontier hardware, and controlling hardware access constrains model development timelines.
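The HBM control cited above operates on a single scalar: bandwidth density. A minimal illustration of how the threshold applies, with the example stack figures hypothetical rather than any specific product's specifications:

```python
# Illustrative check against the ECCN 3A090.c HBM threshold cited in
# the text: bandwidth density above 2 GB/s per mm^2 of stack footprint
# is controlled. Example inputs are hypothetical.

THRESHOLD_GBPS_PER_MM2 = 2.0

def hbm_controlled(bandwidth_gbs, footprint_mm2):
    """True if a memory stack's bandwidth density exceeds the cap."""
    return (bandwidth_gbs / footprint_mm2) > THRESHOLD_GBPS_PER_MM2

# A modern HBM-class stack (~819 GB/s over ~110 mm^2) clears the
# threshold by a wide margin, illustrating how broad the control is.
print(hbm_controlled(bandwidth_gbs=819, footprint_mm2=110))
```

The point of the arithmetic is that any HBM-class part exceeds the density cap several times over, which is why the control functioned as a de facto HBM ban rather than a graduated limit.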
The cable acknowledges through its own reframing that the hardware-access theory no longer holds. Chinese open-weight models including DeepSeek V4-Pro, Moonshot AI's Kimi K2.6, and Alibaba's Qwen3.6-35B-A3B are within benchmark striking distance of U.S. frontier systems on several standard evaluations (per their published model cards, accessed April 2026), using architectures optimized for Ascend 950PR-class hardware. NVIDIA plus HBM3e remains the dominant training platform for 1.6-trillion-parameter models; Ascend 950PR is absorbing inference now and Ascend 950DT is scheduled to absorb training at scale in Q4 2026. The two stacks are operating in parallel.
The cable's framing repositions the contest from chip-level constraints to model-output constraints. It argues that distillation campaigns "enable foreign actors to release products that appear to perform comparably on select benchmarks at a fraction of the cost but do not replicate the full performance of the original system." This is a claim about model IP as the defensible boundary, not hardware. If the claim is accurate, downstream customers evaluating V4-Pro or Kimi K2.6 or Qwen3.6-35B-A3B against U.S. frontier systems need to understand that benchmark parity does not guarantee capability parity at the tails of the performance distribution or under adversarial robustness testing.
The enforcement question is whether BIS, OFAC, or Commerce convert the cable's framing into rulemaking. A diplomatic cable dated the same day as a major model release and timed three weeks before a Trump-Xi summit scheduled for May 2026 is a posture move. If no Federal Register notice, Entity List update, or new ECCN proposal citing model distillation or model extraction appears within 90 to 180 days, the cable was signaling only. If rulemaking follows, U.S. AI labs and their downstream customers face new compliance obligations on model-output sharing, API access policies, and distillation-detection infrastructure.
For teams managing AI infrastructure budgets or evaluating which frontier models to license, deploy, or integrate, the V4 release and the cable together describe a supply-chain environment where two hardware stacks now support frontier-tier inference, where pricing asymmetries of ten-to-one or more will drive commercial adoption regardless of benchmark gaps, and where the policy regime governing both stacks is in transition.
The technical evaluation question is whether V4-Pro's reported 73% FLOP reduction and 90% KV-cache reduction hold under production workloads on non-DeepSeek serving stacks. NVIDIA's GB200 benchmarks confirm performance on NVIDIA hardware. Independent reproduction on vLLM, SGLang, or MindSpore serving stacks within the next 30 to 90 days will indicate whether the efficiency claims are hardware-portable or tuned specifically to DeepSeek's own infrastructure. If the reductions hold across serving platforms, the architecture is genuinely efficient. If they do not, the efficiency is an artifact of co-optimization with a specific stack.
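The shape of that reproduction test is simple: compare measured per-token cost on an independent stack against the claimed fraction of the V3.2 baseline, within a tolerance. The numbers below are placeholders, not real measurements:

```python
# Minimal harness shape for the hardware-portability question: does an
# independently measured per-token cost match the claimed reduction
# vs. V3.2? All numeric inputs here are placeholders.

def reduction_holds(baseline, measured, claimed_fraction, tol=0.05):
    """True if measured/baseline lands within tol of the claimed
    fraction (claimed_fraction=0.27 for a 73% FLOP reduction,
    0.10 for a 90% KV-cache reduction)."""
    return abs(measured / baseline - claimed_fraction) <= tol

# Placeholder: suppose V3.2 needed 100 cost-units per token and an
# independent vLLM run measures 29 units/token for V4-Pro.
print(reduction_holds(baseline=100.0, measured=29.0, claimed_fraction=0.27))
```

The same check applied per serving stack (vLLM, SGLang, MindSpore) is what would distinguish a genuinely efficient architecture from one co-optimized with DeepSeek's own infrastructure.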
The procurement question is whether Ascend 950DT ships on schedule in Q4 2026 and at what volume. The 950PR is available now; the 950DT is the training-class part. For national labs and AI factory operators evaluating whether next-generation training procurements can diversify away from NVIDIA silicon, the 950DT timing and volume targets are the operative variables. Huawei's Atlas 950 SuperPoD and SuperCluster announcements describe capability, not availability. First independent benchmarks from a non-Huawei lab on 950DT and first customer announcements for Atlas 950 deployments will determine whether "DeepSeek on Ascend" describes a present capability or a 2027 story.
The policy question is what follow-up the State Department cable triggers. If the cable leads to model-export controls, Entity List additions, or new distillation-detection compliance requirements, U.S. labs will need to treat API access policies and output-sharing agreements as export-control surfaces. If the cable does not lead to rulemaking, the shift from hardware controls to model-IP framing remains rhetorical.
The bottom line is that two frontier-AI compute stacks are now technically and economically viable in parallel: NVIDIA plus HBM3e on one side, Ascend 950PR for inference and Ascend 950DT (pending Q4 2026) for training on the other. The policy regime that built the first stack's export walls is now visibly catching up to the second stack's existence. What remains unresolved is whether the U.S. government will regulate model outputs with the same force it regulated chip exports, and whether the hardware economics that made the second stack necessary will prove durable once the 950DT ships at volume.
The V4 technical report exists and was published April 24, 2026. The V3 paper disclosed training hardware in detail; the V4 paper does not. That absence is reportable. If DeepSeek clarifies V4-Pro's pretraining hardware in response to press inquiries or if independent researchers fingerprint the training stack through weight analysis or infrastructure leaks, the training-hardware question resolves. If the absence persists, it should be read as deliberate.
The ten teacher models used in On-Policy Distillation remain unidentified. Community analysis of the V4 paper, independent attempts to fingerprint teacher model outputs through prompt-response correlation, or any DeepSeek clarification in response to reporting will determine whether the teacher set includes U.S. frontier systems. If all teachers are domestic Chinese models trained independently, the State Department cable's distillation allegations against V4 specifically lose empirical support. If any are U.S. systems or if the identities remain undisclosed, the cable's framing gains force.
Ascend 950DT shipment timing and volume will separate present capability from projected capability. Huawei's Q4 2026 target for the 950DT, the Atlas 950 SuperPoD, and the Atlas 950 SuperCluster describes a roadmap. First independent benchmarks on 950DT at production scale, first non-Huawei customer announcements for SuperPoD deployments, and first large-batch training runs logged on Ascend hardware at the 1.6-trillion-parameter tier will test whether the roadmap converts to operational infrastructure.
Federal Register notices, Entity List updates, or new ECCN proposals citing model distillation or model extraction within 90 to 180 days of the April 24 cable will indicate whether the State Department's pivot from chip controls to model-IP controls is backed by enforcement authority. A cable is a signal. Rulemaking is policy. The gap between the two determines whether U.S. AI labs face new compliance obligations or whether the cable was timed to the Trump-Xi summit and intended as posture.
Independent reproduction of V4-Pro's reported FLOP and KV-cache reductions on vLLM, SGLang, or MindSpore production stacks within 30 to 90 days will indicate whether the efficiency claims are hardware-portable or stack-specific. NVIDIA's day-zero benchmarks confirm the model runs on Blackwell. Community reproductions on non-NVIDIA and non-DeepSeek infrastructure will test whether the 73% FLOP reduction and 90% KV-cache reduction hold under third-party serving workloads at one-million-token context.
If V4-Pro's efficiency claims hold under independent validation, and if Ascend 950DT ships on schedule in Q4 2026 at the volumes Huawei projects, the export-control regime that spent three years tightening chip access will face a frontier-AI supply chain where two hardware stacks support model development and deployment at comparable capability tiers with order-of-magnitude pricing asymmetries. The State Department cable dated the same day as V4-Pro's release names the policy response under construction: treat model outputs, not chip access, as the defensible boundary. Whether that response converts to enforceable rules, and whether the second compute stack proves durable once the training-class Ascend hardware ships at volume, are the two variables that will determine what the Sovereignty Race looks like in 2027.