UCCL-EP vs. NCCL EP: Portability or Consolidation for MoE Communication?

Two new expert-parallel efforts point to different futures for MoE systems: one built for heterogeneous fleets, the other folded into NVIDIA’s stack.

Concept image depicting sparse activation in a mixture-of-experts system, where token traffic is routed to a small subset of active compute nodes inside a larger AI infrastructure environment. (AI-generated)

For the past year, expert-parallel communication has often been treated as a niche concern inside the broader Mixture-of-Experts boom, important mostly to teams trying to serve DeepSeek-class sparse models efficiently. That view is already getting stale.

The issue now is not just whether one dispatch kernel beats another. It is whether MoE dispatch and combine are becoming supported infrastructure instead of custom glue code. Two recent efforts make that shift visible: UCCL-EP: Portable Expert-Parallel Communication and NCCL EP: Towards a Unified Expert Parallel Communication API for NCCL.

The split is straightforward. UCCL-EP is arguing for portability. NCCL EP is arguing for consolidation.

That matters because the current baseline, DeepEP, is fast in part because it is tightly coupled to NVIDIA’s stack. DeepEP describes itself as an expert-parallel communication library with separate high-throughput and low-latency all-to-all GPU kernels for MoE dispatch and combine, and it depends on NVSHMEM, GPUDirect RDMA, and IBGDA. Its own installation notes call for NVLink-connected GPUs, RDMA across nodes, and NVIDIA’s networking path, even if CPU-assisted IBGDA is available as a fallback with some performance penalty.
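The dispatch/combine pattern itself is simple to state, even if the fast kernels are not: dispatch groups each token by its routed expert, and combine scatters expert outputs back into the original token order. A minimal NumPy sketch of that data movement (this is illustrative bucketing on one process, not DeepEP’s API; in a real EP system the grouping is an all-to-all across GPUs):

```python
import numpy as np

def dispatch(tokens, expert_ids, num_experts):
    """Group each token's hidden state by its routed expert.

    In a real EP system this step is an all-to-all across GPUs;
    here we bucket locally just to show the data movement.
    """
    buckets = [tokens[expert_ids == e] for e in range(num_experts)]
    # Remember original positions so combine can restore token order.
    order = [np.flatnonzero(expert_ids == e) for e in range(num_experts)]
    return buckets, order

def combine(expert_outputs, order, num_tokens, hidden):
    """Scatter per-expert outputs back to the original token order."""
    out = np.empty((num_tokens, hidden), dtype=expert_outputs[0].dtype)
    for outputs, idx in zip(expert_outputs, order):
        out[idx] = outputs
    return out

# Toy example: 6 tokens, hidden size 4, 3 experts, top-1 routing.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((6, 4)).astype(np.float32)
expert_ids = np.array([2, 0, 1, 0, 2, 1])
buckets, order = dispatch(tokens, expert_ids, num_experts=3)
# Identity "experts" make combine a pure inverse of dispatch.
restored = combine(buckets, order, num_tokens=6, hidden=4)
assert np.allclose(restored, tokens)
```

The hard part the libraries compete on is not this bookkeeping but moving the buckets between GPUs with minimal latency, which is where the NVSHMEM/RDMA dependencies come in.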

That is a strong answer if your world is H800s, H100s, NVLink, and NVIDIA-centric networking. It is a weaker answer for public cloud fleets, mixed-vendor environments, or operators who do not want MoE performance tied to a single vertically integrated hardware story.

UCCL-EP starts from exactly that gap. The paper, project post, and repo all make the same basic point: expert-parallel communication should not require GPU-initiated RDMA tightly bound to NVIDIA-controlled hardware paths. Instead, UCCL-EP uses a GPU-to-CPU control channel, with multithreaded CPU proxies issuing GPUDirect RDMA operations on the GPU’s behalf. That may sound less elegant than device-initiated networking, but it is also what gives the project a credible path to AWS EFA, Broadcom NICs, and AMD GPUs, rather than only the standard NVIDIA recipe.
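The proxy idea generalizes beyond any one NIC. A schematic Python sketch of the control flow UCCL-EP describes, with the GPU side reduced to a producer posting work descriptors and CPU threads standing in for the RDMA issuers (all names here are illustrative, not UCCL-EP’s actual API):

```python
import queue
import threading

class CpuProxy:
    """CPU-side proxy pool: drains GPU-posted work descriptors and
    issues network operations on the GPU's behalf (stubbed here)."""

    def __init__(self, num_threads=2):
        self.work = queue.Queue()
        self.completed = []
        self.lock = threading.Lock()
        self.threads = [threading.Thread(target=self._loop, daemon=True)
                        for _ in range(num_threads)]
        for t in self.threads:
            t.start()

    def post(self, descriptor):
        # In UCCL-EP this crossing is a GPU-to-CPU control channel;
        # a thread-safe queue plays that role in this sketch.
        self.work.put(descriptor)

    def _loop(self):
        while True:
            desc = self.work.get()
            if desc is None:  # shutdown sentinel
                break
            # Stand-in for issuing a GPUDirect RDMA write via the NIC.
            with self.lock:
                self.completed.append(desc)
            self.work.task_done()

    def shutdown(self):
        for _ in self.threads:
            self.work.put(None)
        for t in self.threads:
            t.join()

proxy = CpuProxy()
for i in range(8):
    proxy.post({"dst_rank": i % 4, "bytes": 4096})
proxy.work.join()   # wait until all posted descriptors are issued
proxy.shutdown()
assert len(proxy.completed) == 8
```

The trade-off the paper is making is visible even at this level: every operation pays a GPU-to-CPU hop, but nothing in the issuing path assumes a particular NIC vendor.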

UCCL-EP architecture graphic showing the project’s cross-vendor expert-parallel communication model across AMD and NVIDIA GPUs, multiple NIC options (AMD, Broadcom), and AWS EFA. (Source: UCCL project blog)

The authors are also clear that this is not just an API exercise. They report up to 2.1x higher dispatch and combine throughput on EFA versus the best existing EP solution, comparable performance to DeepEP on NVIDIA-only systems, up to 40 percent higher token throughput in SGLang on NVIDIA plus EFA, and up to 45 percent higher DeepSeek-V3 training throughput on a 16-node AMD plus Broadcom setup, according to the abstract. Those are substantial claims, and they are backed by public materials.

But benchmark portability is not the same as production portability. UCCL-EP looks like the clearest public push to make expert parallelism cloud-portable, but the public record still says more about benchmarks and framework experiments than about named production deployments.

NCCL EP tackles the same problem from the other direction. Instead of trying to escape the dominant stack, it tries to absorb expert parallelism into it. NVIDIA’s NCCL Device API documentation shows why that is now plausible. Since NCCL 2.28, the library has exposed a device-side communication API, including GIN, or GPU-Initiated Networking, for network communication. The catch is that GIN comes with strict requirements, including NVIDIA GPUs, NVIDIA NIC support, GPUDirect RDMA prerequisites, and specific topology assumptions.

NCCL EP stack diagram showing how expert-parallel dispatch and combine connect user frameworks to low-latency and high-throughput kernels, with inter-node communication handled through NCCL GIN and intra-node paths through NVLink. (Source: NCCL EP: Towards a Unified Expert Parallel Communication API for NCCL)

That is the point. NCCL EP is not trying to be universal. It is trying to make MoE communication feel native inside the NVIDIA software substrate that already dominates large-scale inference clusters.

The NCCL EP paper describes two user-facing primitives, ncclEpDispatch and ncclEpCombine, each with a low-latency mode for decode and a high-throughput mode for training and prefill. That split matters. Public framework documentation increasingly reflects the fact that decode and prefill need different communication behavior. DeepEP ships separate low-latency and high-throughput kernels. SGLang’s expert-parallelism docs note that DeepEP and Mooncake expose a normal mode for high-throughput prefill and a low-latency mode for decode, and recommend auto-switching between them at runtime. That is one reason generic collectives have never been a clean fit for expert parallelism.
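The framework-side consequence is a small but real piece of control logic: pick the kernel mode per batch by execution phase. A hedged sketch of the auto-switching policy described above, with illustrative names rather than any framework’s actual API:

```python
from enum import Enum

class EpMode(Enum):
    HIGH_THROUGHPUT = "normal"    # large batches: training and prefill
    LOW_LATENCY = "low_latency"   # small batches: decode

def select_ep_mode(is_decode: bool) -> EpMode:
    """Auto-switch EP kernel mode by phase, mirroring the
    prefill/decode split that DeepEP, Mooncake, and NCCL EP's
    dispatch/combine primitives all expose in some form."""
    return EpMode.LOW_LATENCY if is_decode else EpMode.HIGH_THROUGHPUT

# Prefill batches take the bandwidth-optimized path, decode the
# latency-optimized one.
assert select_ep_mode(is_decode=False) is EpMode.HIGH_THROUGHPUT
assert select_ep_mode(is_decode=True) is EpMode.LOW_LATENCY
```

The policy is trivial; the point is that it has to live somewhere, and a unified API like NCCL EP’s would let frameworks express it once instead of per side-library.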

NCCL EP’s pitch, then, is not just speed. It is that MoE communication is mature enough to warrant a standardized API inside NCCL instead of yet another standalone side library.

The open question is rollout. The public evidence says NCCL EP is real, but not yet obviously mainstream. The paper presents end-to-end vLLM integration results. A public NCCL GitHub issue references contrib/nccl_ep and explicitly asks whether NVIDIA plans to upstream the demonstrated vLLM integration. That is useful evidence that code exists and that users want it. It is not the same as broad product support.

Framework visibility makes that gap hard to miss. vLLM’s public docs already treat expert parallelism as a real deployment mode, and its public CLI documentation lists backends including allgather_reducescatter, deepep_high_throughput, deepep_low_latency, flashinfer_all2allv, mori, nixl_ep, and pplx among the user-facing choices. SGLang likewise exposes multiple EP backends, including DeepEP, Mooncake, NIXL-EP, MORI, and FlashInfer. What is still missing from the visible public control plane is NCCL EP as a standard, documented backend operators can simply enable.

NCCL EP looks like a serious step toward standardization inside NVIDIA’s stack, but the public record does not yet clearly show it as a shipped, default, broadly documented production backend in mainstream frameworks.

That leaves the market in a familiar place. Expert-parallel communication is becoming framework-visible. It is no longer confined to research-paper microbenchmarks. LMSYS and SGLang have already shown that large-scale EP serving is possible in the open, including a 96-H100 DeepSeek-style deployment path with prefill-decode disaggregation and DeepEP support. But the field is still fragmented.

So the real takeaway is not that expert-parallel communication has already gone mainstream. It is that the mainstreaming process has started, and the industry now has two distinct paths in front of it.

If portability wins, UCCL-EP matters because it treats expert parallelism as a cross-vendor systems layer that should survive heterogeneous clouds and non-NVIDIA networking. If consolidation wins, NCCL EP matters because it turns MoE communication into another supported primitive inside the dominant GPU stack.

Either way, this is no longer just a kernel story. It is a fight over who gets to define the infrastructure layer for sparse AI.

🤖 AI Disclosure

AI-assisted research and first draft. This article has been verified by a human editor.