The 50-author MRC paper gives Ethernet its first multi-vendor, open-spec, production-trace answer to the one argument InfiniBand had left at frontier-training scale.

On May 5, OpenAI, Microsoft, AMD, Broadcom, and NVIDIA published a 50-author paper describing Multipath Reliable Connection, a new RDMA transport built for lossy, multipath Ethernet AI fabrics. The reframe at the heart of MRC is simple and counterintuitive: stop chasing ever-faster ports, and start using the same aggregate switch bandwidth to build dramatically higher-radix, flatter topologies. A decade of Ethernet roadmap planning has been organized around port speed (100G → 200G → 400G → 800G → 1.6T). MRC inverts that priority. The valuable thing isn't the fattest pipe into a GPU; it's the largest number of independent paths between any two GPUs. Everything else in the protocol follows from that single move. MRC extends RoCE/RC semantics, borrows ideas from Ultra Ethernet Transport, and has already run in production on current-generation 400G and 800G NICs.
The interesting evidence in the paper isn't a lab benchmark. It's a set of production traces from frontier-model training: continuous T0–T1 link flaps, four T1 switch reboot cases during a 75,000-GPU pretraining job, and a separate 50,000-GPU optical-transceiver glitch that caused a one-minute throughput dip without crashing the job. The paper is also clear that NIC-transceiver failures remain a single point where queue pairs can fail. So this is not "every failure has been engineered away." It's that Ethernet now has a multi-vendor, open-spec counterexample to the argument that only InfiniBand can hold together at frontier-training scale.
MRC is not Ultra Ethernet 1.0. The Ultra Ethernet Consortium published its 1.0 specification in June 2025 and has been working through hardware compliance since. MRC is a smaller, more targeted thing: a minimal extension to standard RoCEv2 that picks up the most useful pieces of Ultra Ethernet Transport (per-packet spraying, selective retransmission with out-of-order memory placement, ECN-driven adaptive load balancing, packet trimming, lossy-fabric operation) and packages them so they ship on existing NIC and switch silicon today. Microsoft and partners have contributed the spec to the Open Compute Project, and the OCP Multipath Reliable Connection Specification, Revision 1.0, dated March 21, 2026, is now live. Supporting software has shipped in parallel: libMRC APIs, an NCCL plugin, an ibverbs shim intended to let supported libibverbs / NCCL / RCCL workloads run over MRC without source changes, MSCCL++ support, and SRv6 extensions for SONiC. Microsoft's tech community post and OpenAI's companion post on the protocol lay out the framing.
The open-Ethernet trajectory just stopped being a roadmap and started being a deployable artifact.
Classic RoCE deployments typically hash a QP/flow onto a single ECMP path, which creates flow-collision problems at scale. When that path congests or drops a packet, the protocol falls back on go-back-N retransmission and depends on the network being effectively lossless, which means Priority Flow Control (PFC) keeps switches from dropping anything. PFC can work in constrained environments, but at large synchronized-training scale it becomes operationally unattractive: pause propagation and head-of-line blocking can create tail-latency outliers that blow a single AllReduce step past its budget.
MRC starts from the opposite assumption: the network will lose packets, the transport handles loss, and the fabric runs PFC-off. Five mechanisms documented in Sections 1 and 2 of the arXiv paper do the work.
Per-packet spraying. The sender holds an Entropy Value (EV) set per queue pair, typically 128 to 256 EVs, and rotates through them packet by packet. Each EV maps to a different physical path through the fabric. A single QP can saturate hundreds of paths simultaneously instead of pinning itself to one.
SACK with out-of-order memory placement. Each packet carries an RDMA virtual address and remote key, so the receiver lands a payload directly in memory the moment it arrives, regardless of order. Selective ACKs tell the sender exactly which packets to resend.
ECN-driven adaptive load balancing. Switches mark ECN on congestion. The receiver echoes it. The sender pulls the offending EV out of its active set. That replaces DCQCN-plus-PFC's rate-based response with something faster and more surgical.
Packet trimming. On congestion, switches forward the header and drop the payload, generating a fast NACK back to the sender. Useful for incast.
PFC-off. Pause storms cannot happen because pause is disabled. Loss is treated as a normal signal.
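Taken together, the sender-side loop is compact enough to sketch. A toy model in Python (hypothetical names throughout; nothing here is the libMRC API, and the real logic lives in NIC hardware and firmware) of how spraying, ECN-driven EV retirement, and selective retransmit compose:

```python
class MrcSenderSketch:
    """Toy model of the sender-side mechanisms described above.

    Illustrative only: EV handling, PSN bookkeeping, and SACK processing
    all happen in NIC hardware/firmware in the real implementation.
    """

    def __init__(self, num_evs=256, num_backups=32):
        # Each Entropy Value maps to a distinct physical path in the fabric.
        self.active_evs = list(range(num_evs))
        self.backup_evs = list(range(num_evs, num_evs + num_backups))
        self.unacked = {}   # packet sequence number -> (ev, payload)
        self.next_psn = 0

    def send(self, payload):
        # Per-packet spraying: rotate through the EV set so a single QP
        # spreads across hundreds of paths instead of pinning to one.
        ev = self.active_evs[self.next_psn % len(self.active_evs)]
        psn, self.next_psn = self.next_psn, self.next_psn + 1
        self.unacked[psn] = (ev, payload)
        return psn, ev      # stand-in for putting a packet on the wire

    def on_ecn_echo(self, ev):
        # Congestion response is per-path, not per-QP: retire the marked EV
        # and swap in a backup, instead of rate-limiting the whole flow.
        if ev in self.active_evs:
            self.active_evs.remove(ev)
            if self.backup_evs:
                self.active_evs.append(self.backup_evs.pop())

    def on_sack(self, acked_psns):
        # Selective retransmit: anything the receiver confirmed is done
        # (it already landed out-of-order in memory); anything older than
        # the newest ack but unconfirmed is a gap. Resend only the gaps,
        # each on a freshly drawn EV, with no go-back-N.
        for psn in acked_psns:
            self.unacked.pop(psn, None)
        horizon = max(acked_psns, default=-1)
        gaps = [psn for psn in sorted(self.unacked) if psn < horizon]
        return [self.send(self.unacked.pop(psn)[1]) for psn in gaps]
```

The structural point the sketch makes is that congestion handling and loss recovery both operate per path and per packet; the QP as a whole never stalls behind one bad link, which is what lets the fabric run PFC-off.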
The protocol intentionally implements only a subset of RDMA Verbs (`RDMA WRITE` and `WRITE_WITH_IMMEDIATE`) because those are the operations AllReduce and AllGather actually need. That choice keeps MRC small enough to ship as an ibverbs shim library, so NCCL and MSCCL++ workloads run over it without source changes.
MRC pairs the transport with SRv6 micro-segment ID (uSID) source routing. The sender's NIC writes the full per-hop path into the IPv6 destination address, and each 16-bit uSID names a specific switch in the route. Each hop left-shifts the address to expose the next segment. The paper justifies the choice over MPLS or BGP-based routing on two grounds.
Conventional routing reacts on control-plane timescales: many RTTs and, in OpenAI's framing, potentially seconds or tens of seconds. MRC claims path bypass on a microsecond timescale. Two-tier high-radix Clos with multi-plane fanout also creates ECMP sets too large to enumerate, manage, or diagnose. Pushing path selection out to the endpoint makes that tractable; the NIC observes the result and adapts.
When a path degrades, the NIC removes the affected EV from the active set and swaps in a backup. Switches do not recompute routes; endpoints remove bad EVs and steer around the path. That is how MRC gets to the few-tens-of-microseconds reaction time the paper claims.
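The shift-and-forward mechanic from the uSID encoding is concrete enough to model. A sketch (the 32-bit block prefix and other constants are illustrative guesses, not the paper's actual uSID layout): pack 16-bit switch IDs behind a prefix in the 128-bit destination address, and let each hop consume its own ID by shifting the remainder left 16 bits:

```python
import ipaddress

BLOCK_BITS = 32   # assumed uSID block prefix width (illustrative)
USID_BITS = 16    # per the paper: each uSID names one switch

def encode_usid_path(block_prefix: int, switch_ids: list[int]) -> ipaddress.IPv6Address:
    """Pack a per-hop switch list into one IPv6 destination address."""
    addr = block_prefix << (128 - BLOCK_BITS)
    shift = 128 - BLOCK_BITS - USID_BITS
    for sid in switch_ids:
        addr |= sid << shift
        shift -= USID_BITS
    return ipaddress.IPv6Address(addr)

def shift_and_forward(addr: ipaddress.IPv6Address) -> tuple[int, ipaddress.IPv6Address]:
    """What each hop does: read its own uSID, then left-shift the carrier
    so the next hop's uSID sits in the active position."""
    raw = int(addr)
    carrier_mask = (1 << (128 - BLOCK_BITS)) - 1
    block = raw & ~carrier_mask
    carrier = raw & carrier_mask
    my_sid = carrier >> (128 - BLOCK_BITS - USID_BITS)
    carrier = (carrier << USID_BITS) & carrier_mask
    return my_sid, ipaddress.IPv6Address(block | carrier)

# The sender's NIC writes the full route; no switch keeps per-flow state.
dst = encode_usid_path(0xFC000000, [0x0101, 0x0202, 0x0303])
for _ in range(3):
    sid, dst = shift_and_forward(dst)
    print(f"hop consumes uSID {sid:#06x}, forwards toward {dst}")
```

Because the whole route rides in the packet, retiring an EV at the endpoint just means drawing a different precomputed uSID address; no switch holds per-flow state and nothing has to reconverge.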
The transport assumes a topology. The topology, described in Section 2 and Figure 1b of the paper, is the second half of the design.
An 800 Gb/s NIC is split into multiple lower-speed lanes, either 4 × 200 Gb/s or 8 × 100 Gb/s, and each lane is connected to a different top-of-rack (T0) switch in a different plane of a multi-plane Clos. Instead of one fat 800 Gb/s link into one T0, the GPU sees four or eight independent paths into four or eight parallel fabrics.
The economic case lands hardest when you put the two designs side by side at the same ASIC budget. Using 51.2 Tb/s switch silicon, a conventional three-tier RoCE Clos with 800 Gb/s ports versus a two-tier MRC design with the same ASICs split into 100 Gb/s lanes across eight planes:
| Metric | Three-tier RoCE (800G ports) | Two-tier MRC (8 × 100G planes) |
|---|---|---|
| GPU/XPU endpoints | 65,536 | 131,072 |
| Switch count | 5,120 | 6,144 |
| Max hops between endpoints | 5–7 | 3 |
| Total links | 196,608 | 1,179,648 |
The headline trade is roughly 20% more switches for 2× the compute capacity, with a flatter topology and shorter worst-case path. The catch is in the bottom row: the link count goes up by roughly 6×. That is the cost the design is actually paying, and it shows up in optics, not in switch ASICs.
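The table's totals can be reproduced from port math on the same 51.2 Tb/s ASIC, with one assumption the numbers imply but the table doesn't state: T0 downlinks run at the 100G lane speed while T0-to-T1 uplinks run at the full 800G. A sketch of the arithmetic:

```python
ASIC_TBPS = 51.2                            # same switch silicon, both designs

# Three-tier RoCE fat-tree at 800G ports (radix k = 64)
k = int(ASIC_TBPS * 1000 // 800)            # 64 x 800G ports per ASIC
endpoints_3t = k ** 3 // 4                  # 65,536 endpoints
switches_3t = 5 * k ** 2 // 4               # 5,120 switches
links_3t = 3 * endpoints_3t                 # one link per endpoint per tier

# Two-tier MRC at 8 planes of 100G lanes
planes = 8
endpoints_2t = 131_072                      # from the paper's table
nic_links = endpoints_2t * planes           # 8 lanes per NIC: 1,048,576
t0_down = int(ASIC_TBPS * 1000 // 100) // 2 # 256 NIC-facing 100G ports per T0
t0s = nic_links // t0_down                  # 4,096 T0 switches
# Assumption (mine, not the paper's wording): T0->T1 uplinks at 800G,
# so each T0 has 25.6T / 800G = 32 uplinks. This reproduces the totals.
t0_t1_links = t0s * int(ASIC_TBPS * 1000 / 2 // 800)   # 131,072
t1s = t0_t1_links // k                      # 64 x 800G ports per T1: 2,048

print(switches_3t, links_3t)                # 5120 196608
print(t0s + t1s, nic_links + t0_t1_links)   # 6144 1179648
```

The 2× endpoint count comes from halving the per-lane speed so the same ASIC bandwidth fans out to twice as many ports; the 6× link count comes from every one of those ports needing its own optic.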
The consequences show up in the failure modes. A single transceiver failure costs 12.5% of NIC bandwidth in the 8-plane case, or 25% in the 4-plane case, not 100%. A T0 switch reboot affects one plane out of four or eight; MRC's spraying keeps the QP alive across the remaining planes. Two-tier topologies become viable for 100,000-GPU clusters because each plane only fans out to a fraction of the GPUs. The paper gives an 8-plane example where each T0 connects to 256 NIC ports and 256 T1 switches.
The protocol and the topology are inseparable. The protocol assumes the topology, and the topology only works because the protocol sprays.
The link-count math is the place to be honest about what MRC actually costs. Going from ~200,000 links to ~1.2 million links at the same endpoint count is a roughly 6× increase in optical transceivers, plus the cabling and the patch infrastructure to land them. At rough public pricing of $500–$1,500 per 100G optic depending on reach and volume — and these are deployment-grade numbers, not list — the optics bill on a 130,000-GPU MRC fabric runs into hundreds of millions of dollars, sitting on top of a switch budget that itself is only 20% higher than the three-tier alternative. The trade isn't free, and the paper doesn't show the bill of materials.
Speculation flagged. The all-in TCO math probably still favors MRC, but for a reason the optics line item obscures: GPU-stall cost asymmetry. One hour of stall on a 100,000-GPU cluster, at a conservative $2/GPU-hour blended rate, is $200,000 of pure burn. A frontier pretraining run that has to restart from a checkpoint because PFC pause storms collapsed the fabric can lose six to twelve hours in a single event. At those rates, the optics delta prices out at a few thousand avoided stall-hours, a few hundred avoided events, across the life of the cluster (a break-even sketch follows below). Nobody has published this math publicly, and the absence is conspicuous. It is the next paper somebody should write, and the answer determines whether MRC's economics work at neocloud scale or only at hyperscaler scale, where the GPU-hour asymmetry is largest.
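Since nobody has published the math, here is the back-of-envelope version, using only the speculative inputs above (the per-link optic count, price band, and stall rate are all assumptions, not reported data):

```python
# Break-even sketch: every input is a speculative figure from the text above.
extra_links = 1_179_648 - 196_608       # incremental links vs. three-tier
usd_per_link_optics = (500, 1_500)      # $/100G optic band, counted once per
                                        # link (two per link would double it)
stall_usd_per_hour = 100_000 * 2.0      # 100k GPUs x $2/GPU-hour blended
event_hours = (6, 12)                   # checkpoint-restart loss per event

for usd in usd_per_link_optics:
    delta = extra_links * usd
    hours = delta / stall_usd_per_hour
    low, high = hours / event_hours[1], hours / event_hours[0]
    print(f"${delta/1e6:,.0f}M optics delta ~ {hours:,.0f} stall-hours"
          f" ~ {low:,.0f}-{high:,.0f} avoided events")
```

At the low end of the price band that is roughly 2,500 avoided stall-hours, or 200-400 avoided multi-hour events over the cluster's life. Whether a PFC-free fabric plausibly avoids events at that rate is precisely the hyperscaler-versus-neocloud split: at hyperscaler event rates and GPU-hour costs the break-even looks reachable; at neocloud scale it is much less obvious.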
A related operational cost MRC eliminates is the PFC tax. Production RoCE deployments at hyperscale have spent substantial engineering on PFC tuning, deadlock prevention, buffer sizing, and pause-storm mitigation — anecdotally, this has been the dominant operational pain point in large RoCE fabrics for years. PFC-off operation makes that engineering category go away. The cost savings there are real but hard to quantify from the outside, and the paper does not try.
OpenAI says MRC is deployed across its largest NVIDIA GB200 supercomputers, including Microsoft's Fairwater AI superfactory architecture and the Stargate Phase 1 site at Abilene operated by OCI, the latter also covered in The Deep View's interview with OpenAI networking lead Mark Handley. Section 4 of the paper documents the production traces by Cluster A through D rather than by site name. The 75,000-GPU pretraining job is reported as a Cluster A trace.
The events the paper actually documents are distinct, not a single mega-event.
The 75,000-GPU pretraining job on Cluster A had four T1 switch reboot cases over the trace. One documented reboot affected roughly a quarter of the QPs and dropped about 580,000 packets, but MRC steered around the bad paths and aggregate throughput largely recovered after the initial dip.
Continuous T0–T1 link flaps ran in the background of Cluster A for the duration of the run. Almost no impact on the job, because MRC swapped EVs onto unaffected paths.
A separate 50,000-GPU job hit a T0 switch optical transceiver that glitched and flapped four NIC–T0 links. Throughput fell roughly 25% for about a minute, then recovered. The job did not crash, no QP failed, and affected nodes were not removed.
At job startup, packet loss fell below one packet per second per NIC (about one in 25 million at 800G) within a couple of minutes. Fewer than five packets per QP were lost in the first minute.
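The "one in 25 million" figure is line-rate packet arithmetic. A quick check (assuming roughly 4 KB wire packets, a guess; the paper does not specify the packet size):

```python
line_rate_bps = 800e9          # 800G NIC at line rate
pkt_bits = 4_000 * 8           # assumed ~4 KB packets (not from the paper)
pkts_per_sec = line_rate_bps / pkt_bits
print(f"1 lost packet/s ~ 1 in {pkts_per_sec/1e6:.0f} million packets")
# -> 1 lost packet/s ~ 1 in 25 million packets
```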
One caveat the paper is explicit about: NIC-transceiver flaps remain a single point of failure. If the transceiver on the NIC itself flaps, every port on that NIC is lost and QPs can fail. MRC bounds a lot of failure modes, and the paper is honest that it doesn't claim to bound all of them.
Path-failure detection and recovery happens "in a few tens of microseconds," per Section 2.1. Short enough to bound many network faults before they become job-ending events, though the paper still shows temporary throughput dips for some failures.
Throughput numbers match the survival story at the high end. MRC hits ~770 Gb/s goodput on an 800 Gb/s link, about 96% of theoretical peak, for 32 KB point-to-point messages on Cluster B with NVIDIA ConnectX-8, whether traffic stays local to a T0 or crosses T1.
The MRC-vs-RoCE comparison is a separate, controlled experiment. The paper itself says it does not have a large deployment that can directly compare MRC and RoCE, and runs the comparison on small testbeds, including a 64-GPU AMD Pollara 400G all-reduce setup. In that testbed a single RoCE QP suffers ECMP hash collisions and reaches roughly half of possible throughput, while one MRC QP spraying across 256 paths beats 16 RoCE QPs (Figures 16 and 17). At scale, the paper reports NCCL sendrecv at 42,000 GPUs reaching up to 92 GB/s for large messages.
This is multi-vendor in a substantive way, not just a coalition press release. The paper names three NIC silicon vendors implementing MRC at 400 or 800 Gb/s and two switch silicon vendors with SRv6 support. Intel is named in OpenAI, Microsoft, NVIDIA, and OCP materials as a credited collaborator on MRC, but the arXiv author list and the evaluated implementation table do not identify Intel authors or Intel NIC/switch silicon in the tested systems. Read Intel as a credited collaborator here, rather than as a featured silicon implementer in the production results.
| Role | Publicly identified implementation |
|---|---|
| NICs | NVIDIA ConnectX-8; AMD Pollara / Vulcano; Broadcom Thor Ultra |
| Switch silicon | NVIDIA Spectrum-4, Spectrum-5; Broadcom Tomahawk 5 |
| Switch OS / NOS | Cumulus, SONiC, Arista EOS |
| Credited MRC collaborators / contributors | AMD, Broadcom, Intel, Microsoft, NVIDIA, OpenAI |
The author list is not just hyperscaler engineering. It includes deep transport and HPC lineage: Costin Raiciu's multipath-transport background, and Torsten Hoefler's large-scale HPC/AI systems work at ETH Zurich and CSCS. Hoefler, who won the 2024 ACM Prize in Computing, appears through his consulting role on the Microsoft side of the paper.
This is the part of the story most likely to get confused, because three things are happening in parallel.
The Ultra Ethernet 1.0 specification: 562 pages, released June 2025, currently at v1.0.2 (January 2026). UEC 1.0 is broader than transport alone. It defines a full Ethernet-based communication stack across NICs, switches, optics and cables, software/API, link layer, transport, and congestion control, with three profiles (HPC, AI Full, AI Base). The consortium has more than 100 member companies; founding members include AMD, Arista, Broadcom, Cisco, Eviden, HPE, Intel, Meta, and Microsoft.
MRC itself: a minimal extension to RoCEv2 that implements UET-style behavior inside the existing Verbs API. It is a production-deployed precursor that Microsoft and partners shipped while UEC hardware compliance programs were still maturing. The paper is explicit about the relationship. MRC borrows from UET; it does not replace it.
The OCP release. Microsoft is donating MRC to the Open Compute Project, not to UEC. That is a tactical choice with consequences. OCP is the open-hardware deployment path; UEC is the network-standard path. Releasing through OCP positions MRC as a deployable artifact (code, plugins, switch OS extensions, a shim library) rather than as a draft specification.
The cleanest frame is that MRC is the bridge. It is what UEC-style transport looks like when you have to ship on the silicon that exists, train ChatGPT and Codex on it, and survive real production failures before the UEC 1.0 hardware ecosystem fully reaches compliance. The two efforts feed each other; they are not in tension.
The market has been moving for two years.
Dell'Oro Group's July 2025 release put InfiniBand at over 80% share as of late 2023, when the firm initiated AI back-end network coverage. Dell'Oro's March 2026 release reports that Ethernet switch sales in AI back-end networks more than tripled and accounted for more than two-thirds of data center switch sales in AI clusters in Q4 2025 and for the full year, with Amazon, Microsoft, Meta, Oracle, and xAI named as Ethernet adopters. Dell'Oro also notes that InfiniBand revenue continues to grow.
Meta's 24,576-GPU RoCE cluster was the prior large public Ethernet proof point, with Meta saying the broader 24K-cluster design supported Llama 3 training. MRC at 75,000 GPUs is the new public ceiling: roughly three times that scale, on a production frontier-model job, with documented fault tolerance. SCN previously reported on the Slingshot-vs-InfiniBand crossover at the Aurora exascale system, where an Ethernet-derived fabric held performance under AI traffic patterns that degraded InfiniBand by 5x. That was the HPC-discipline proof point. MRC is the AI-at-scale companion.
Two constraints are now binding at the frontier of model training: networking and memory. The latter, as SCN has reported, is an HBM allocation problem rather than a supply problem. The former, after this paper, has a credible answer.
The coalition list is conspicuous for who isn't on it. AWS doesn't use RoCE in EFA, isn't on the MRC paper, and has its own Scalable Reliable Datagram (SRD) transport that solves a closely overlapping set of problems (per-packet spraying, out-of-order delivery, fast retransmit on loss, multipath fault tolerance) inside the Nitro cards. SRD has been in production at AWS scale for years. The technical precedent for "RDMA-class transport on lossy Ethernet with multipath" is genuinely Amazon's, even if the open-Ethernet community has not been in a hurry to credit it.
What changes with MRC is the openness layer. SRD is Amazon-only, runs on Amazon silicon, and is accessible only by buying EC2. MRC effectively standardizes the same architectural pattern across NVIDIA, AMD, and Broadcom NICs, two major switch silicon families, three NOSes, and an OCP spec anyone can implement. Opinionated take. MRC commoditizes what was AWS's quietly held networking moat. AWS spent years amortizing SRD development on the basis that nobody else had it; the rest of the hyperscaler ecosystem now does, on open silicon, with a publishable spec. Whether AWS responds by opening SRD, by extending Nitro further into the stack to maintain differentiation, or by accepting that the moat has eroded is a 2026 story worth watching.
One scope note that matters for both pieces, and that neither has flagged clearly enough: MRC is a scale-out story, the fabric between NVL72 / NVL144 / equivalent racks. It does not change the scale-up story inside the rack, where NVLink Switch, the emerging UALink standard, and AMD's Infinity Fabric variants continue to dominate. "Ethernet wins" is the wrong takeaway. "Ethernet wins the scale-out tier while proprietary scale-up fabrics hold inside the rack" is the right one, and that boundary is where the next architectural fight will happen.
The easy framing is that this paper routes around NVIDIA. The easy framing is wrong. NVIDIA is on the paper. ConnectX-8 implements MRC. Spectrum-4 and Spectrum-5 ship SRv6 support. NVIDIA is preserving both exits: InfiniBand stays the premium integrated option, while ConnectX / Spectrum Ethernet keeps NVIDIA in every multi-vendor Ethernet RFP.
The company has been signaling that diversified-stack posture publicly, including at GTC 2026, where Jensen Huang's keynote framed NVIDIA's reach across infrastructure layers, including ones where it does not own the silicon. MRC participation is that posture made concrete at the network layer.
The piece of the story most likely to be underweighted is the open-source release. Microsoft has contributed the MRC specification to OCP, with Revision 1.0 live as of March 21, 2026, alongside libMRC APIs, an NCCL plugin, an ibverbs shim, MSCCL++ support, and SRv6 extensions for SONiC.
The shim library is a friction-reducer, not a universal compatibility layer. Microsoft's repository is explicit that it supports RDMA WRITE and WRITE_WITH_IMM and does not support Read, Send, Atomic, or some extended APIs. Within those bounds, migration cost for an existing RoCE deployment is genuinely low, and supported NCCL / RCCL workloads can move without source changes. Neoclouds and second-tier hyperscalers can start deploying MRC-style fabric without waiting for full UEC 1.0 hardware compliance from their NIC vendor. That shifts differentiation back toward silicon execution (radix, port speed, power) and topology design, which is territory that favors Broadcom, Arista, and the wider Ethernet ecosystem over a single-source proprietary fabric.
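What the shim boundary means in practice, as a sketch (all names hypothetical; the real shim is a libibverbs-compatible library, not Python): opcodes inside the documented subset pass through to MRC, and everything else fails fast, which is why WRITE-based collectives move for free while Read-, Send-, or Atomics-based code does not.

```python
from enum import Enum, auto

class Opcode(Enum):
    RDMA_WRITE = auto()
    RDMA_WRITE_WITH_IMM = auto()
    RDMA_READ = auto()
    SEND = auto()
    ATOMIC_CMP_AND_SWP = auto()

# The subset the repository documents as supported; nothing else is.
MRC_SUPPORTED = {Opcode.RDMA_WRITE, Opcode.RDMA_WRITE_WITH_IMM}

def post_send(opcode: Opcode) -> str:
    """Toy stand-in for the shim's dispatch decision."""
    if opcode not in MRC_SUPPORTED:
        # A real shim surfaces this as a verbs-level error; applications
        # built on Read/Send/Atomics need source changes to migrate.
        raise NotImplementedError(f"{opcode.name} is outside the MRC subset")
    return f"{opcode.name} accepted; payload sprayed across the EV set"

print(post_send(Opcode.RDMA_WRITE))      # passes through
# post_send(Opcode.RDMA_READ)            # would raise NotImplementedError
```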
That shift toward silicon execution and topology design is the clearest instance in this story of the software ceiling lifting. To be clear, MRC isn't pure software. It's a full co-design: transport plus NIC implementation plus switch support plus topology plus operations. The OCP release is what makes that co-design legible and reusable outside the partner set.
A few questions worth tracking from here.
Compliance versus compatibility. MRC is UET-style but not UEC 1.0. Whether it interoperates cleanly with UEC 1.0-compliant NICs from non-founding-member vendors will determine whether OCP becomes the de facto center of gravity or the ecosystem bifurcates.
National lab adoption. Hoefler's endorsement carries the most weight in the procurement gravity field around CSCS, DOE, Argonne, and equivalents internationally. Whether MRC plus OCP becomes a next-cycle reference design or stays a hyperscaler artifact will be telling.
InfiniBand's defensive move. NVIDIA still ships InfiniBand. Its roadmap response, whether higher port speeds, lower price, or quiet repositioning, should surface within the year.
Independent verification. A few-tens-of-microseconds recovery time and 96% line-rate goodput at 75,000-GPU scale is a strong claim. Reproduction outside the partner set is the test.
Power and optics. Lane splitting means more transceivers per NIC. Whether multi-plane Clos materially changes the power-per-GPU envelope at Fairwater scale is a question for the next paper.
On the partner set's own evidence, the protocol works. The OCP release is live. The 75,000-GPU run is a frontier-model pretraining trace that survived its own production faults, not a benchmark. None of that kills InfiniBand, but it does take away one of InfiniBand's cleanest strategic arguments: the one that says Ethernet can scale, or Ethernet can be open, but Ethernet can't yet provide production-grade resilience at frontier-training scale. That argument now has a multi-vendor, open-spec counter, and the counter already trained a frontier model.