OpenAI Open-Sources MRC to Fix AI Supercomputer Jams
Six companies just released MRC, an open networking protocol that routes AI training traffic across hundreds of simultaneous paths to end GPU idle time at supercomputer scale.

On May 6, six companies - OpenAI, AMD, Broadcom, Intel, Microsoft, and NVIDIA - released what may be the most practically significant infrastructure specification of the year. MRC, short for Multipath Reliable Connection, is now a public open standard published through the Open Compute Project. It attacks a problem that gets worse the bigger you build: at 100,000-GPU scale, your network will kill your training run before your hardware does.
TL;DR
- MRC (Multipath Reliable Connection) is an open networking protocol that spreads GPU traffic across hundreds of simultaneous paths instead of one
- Failure recovery drops from seconds to microseconds, removing the straggler effect that causes expensive GPUs to idle
- Connects 100,000+ GPUs using only two Ethernet switch tiers instead of the three or four required today
- Already deployed on OpenAI's NVIDIA GB200 supercomputers at Oracle's Abilene, Texas facility and Microsoft's Fairwater data center
- Spec and research paper are publicly available via the Open Compute Project
The Problem Nobody Discusses at the Start of a Training Run
Ask most people what limits AI training performance and they say compute - more GPUs, faster GPUs, better chips. The part that gets less attention is the fabric that connects those chips.
Congestion at the Core
Training a frontier model is essentially a distributed synchronization problem. Every few hundred milliseconds, gradients have to flow between all participating GPUs. In a cluster with 100,000 accelerators, that means an enormous, tightly coordinated burst of traffic crossing a shared switch fabric simultaneously. The switches at the core of conventional Ethernet networks weren't designed for this pattern - they congest, drop packets, and force retransmits that cascade into stalls.
Conventional Ethernet backbones also rely on routing schemes decided at the switch level. That means the network has no real-time awareness of which paths are hot and which are idle, and traffic piles onto whichever routes ECMP (Equal-Cost Multi-Path) happened to hash it onto.
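To see why that hurts, here is a toy sketch of per-flow ECMP hashing. The hash function, addresses, port numbers, and uplink count are illustrative assumptions, not details of any particular switch:

Per-flow ECMP hashing (illustrative Python sketch)
--------------------------------------------------
import hashlib
from collections import Counter

NUM_UPLINKS = 8  # hypothetical number of equal-cost uplinks on a leaf switch

def ecmp_path(src, dst, src_port, dst_port):
    """Static per-flow hash: every packet of a flow is pinned to one uplink."""
    key = f"{src}:{src_port}->{dst}:{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % NUM_UPLINKS

# Eight simultaneous gradient-sync flows between GPU servers (illustrative
# addresses; 4791 is the RoCEv2 UDP port).
flows = [(f"10.0.0.{i}", "10.0.1.1", 49152 + i, 4791) for i in range(8)]
placement = Counter(ecmp_path(*f) for f in flows)

print(placement)
# With only eight flows and eight uplinks, collisions are almost guaranteed:
# some uplinks carry two or three elephant flows while others sit idle.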
The Straggler Tax
When one link in a large cluster slows down or a switch fails, the entire training run waits. This is called the straggler effect - your most expensive hardware sits idle while the collective waits for one laggard. At 100,000 GPUs, network failures aren't edge cases; they are expected. Hardware fails. Links degrade. Planned maintenance happens. With current fabrics, coordinating around any of that requires stopping or stalling training.
OpenAI put a concrete number on the recovery gap: conventional fabrics take seconds to tens of seconds to route around failures. MRC does it in microseconds.
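For a sense of scale, here is a back-of-the-envelope calculation; the stall duration and event rate are assumptions for illustration, not figures reported by OpenAI:

Straggler cost, back of the envelope (Python)
---------------------------------------------
# Hypothetical numbers purely for scale intuition, not from the MRC paper.
cluster_gpus = 100_000
stall_seconds = 30        # one fabric reconvergence event (assumed)
events_per_day = 4        # assumed failure / maintenance events per day

idle_gpu_hours = cluster_gpus * stall_seconds * events_per_day / 3600
print(f"{idle_gpu_hours:,.0f} GPU-hours idle per day")   # ~3,333 GPU-hours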
The switching fabric connecting large GPU clusters is a significant - and often underestimated - bottleneck in AI training at scale.
How MRC Works
Spraying Packets Across Every Available Path
The core idea in MRC is multipath packet spraying. Instead of sending each transfer over a single selected path, MRC distributes individual packets simultaneously across hundreds of available paths through the network. This turns congestion from a routing-decision problem into a load-balancing problem - and load balancing is something switch and NIC hardware already does well.
The effect is a dramatic flattening of traffic across the entire switch fabric. No single link becomes the bottleneck when traffic is spread this thin.
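The flattening effect is easy to demonstrate. A toy sketch, assuming a fabric with a few hundred usable paths (the count is illustrative):

Per-packet spraying vs. a pinned flow (Python sketch)
-----------------------------------------------------
import random
from collections import Counter

NUM_PATHS = 256                 # assumed number of usable paths through the fabric
PACKETS_PER_TRANSFER = 100_000

# Per-packet spraying: each packet independently picks a path.
loads = Counter(random.randrange(NUM_PATHS) for _ in range(PACKETS_PER_TRANSFER))

print("busiest path :", max(loads.values()), "packets")
print("quietest path:", min(loads.values()), "packets")
# Every path ends up carrying roughly 1/256 of the packets - a few hundred
# each. Under per-flow ECMP, a single path would carry all 100,000.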
SRv6 Source Routing - Encoding Decisions at the NIC
The mechanism MRC uses to enable multipath spraying is SRv6, or Segment Routing over IPv6. In a conventional network, switches make routing decisions hop by hop based on destination address. SRv6 moves that intelligence to the edge: the network interface card on each server encodes the entire routing path directly into the IPv6 packet header.
This approach has two significant engineering benefits. First, it removes the need for stateful routing tables maintained throughout the switch fabric - the path is already in the packet. Second, it integrates naturally into existing IPv6 infrastructure, which is why OpenAI describes MRC as combining "the losslessness of InfiniBand with the flexibility of a stateless, global IPv6 standard." You get deterministic delivery without custom switch silicon.
Here is a simplified representation of the key fields an MRC-carrying SRv6 header encodes for a GPU-to-GPU transfer:
IPv6 + SRv6 Extension Header (simplified)
------------------------------------------
Next Header: 43 (Routing)
Routing Type: 4 (SRv6)
Segments Left: 3
Segment List: [sw-tier1-plane2::slot12,
               sw-tier2-pod4::port08,
               gpu-rack-16::rdma-port]
Flow ID: gradient-sync-step-47100
Entropy Label: 0x3A2F (per-packet spray entropy)
The NIC selects entropy labels that spread packets across different paths. No switch needs to track state - it just reads the header and forwards.
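As a rough sketch of what that NIC-side selection could look like - the helper below is hypothetical and reuses the simplified field names above, not the wire format defined in the specification:

NIC-side path and entropy selection (hypothetical Python sketch)
-----------------------------------------------------------------
import zlib

# Candidate segment lists mimic the naming style of the simplified header
# above; they are illustrative, not real topology identifiers.
CANDIDATE_PATHS = [
    ["sw-tier1-plane1::slot03", "sw-tier2-pod4::port08", "gpu-rack-16::rdma-port"],
    ["sw-tier1-plane2::slot12", "sw-tier2-pod4::port11", "gpu-rack-16::rdma-port"],
    ["sw-tier1-plane3::slot07", "sw-tier2-pod4::port02", "gpu-rack-16::rdma-port"],
]

def headers(flow_id: str, num_packets: int):
    for seq in range(num_packets):
        # Cheap per-packet hash: consecutive packets of one transfer take
        # different segment lists, so no single path carries the whole burst.
        entropy = zlib.crc32(f"{flow_id}:{seq}".encode())
        yield {
            "segment_list": CANDIDATE_PATHS[entropy % len(CANDIDATE_PATHS)],
            "flow_id": flow_id,
            "entropy_label": entropy & 0xFFFF,   # switches read this, never store it
        }

for h in headers("gradient-sync-step-47100", 4):
    print(f"{h['entropy_label']:#06x}", h["segment_list"][0])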
Microsecond Failure Recovery via Multi-Plane Architecture
MRC also introduces a multi-plane network topology for very large clusters. Instead of a single monolithic switch fabric, the network is divided into independent planes, each with its own set of tier-1 and tier-2 switches. Traffic can be steered between planes at the NIC level.
When a link or switch in one plane fails, MRC detects the degradation and shifts traffic to other planes in microseconds - without signaling a central controller and without requiring any coordination with the applications running on the GPUs.
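In NIC-level terms, the failover amounts to dropping a degraded plane out of the spray set. A minimal sketch, assuming four planes and hardware telemetry that flags a plane as unhealthy (both assumptions for illustration, not spec details):

Plane failover at the NIC (illustrative Python sketch)
------------------------------------------------------
import random

NUM_PLANES = 4
healthy = {plane: True for plane in range(NUM_PLANES)}

def pick_plane():
    """Spray only over planes currently believed healthy. This runs entirely
    on the NIC: no central controller, no signal to the training job."""
    return random.choice([p for p, ok in healthy.items() if ok])

# Hardware-level loss/latency telemetry flags plane 2 as degraded...
healthy[2] = False
# ...and every packet sent from this point on simply avoids it.
assert all(pick_plane() != 2 for _ in range(1000))
print("traffic re-steered away from plane 2, no job-level coordination needed")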
OpenAI says that during a recent frontier model training run, engineers rebooted four tier-1 switches in the Abilene cluster without coordinating with the training team at all. The training continued through the maintenance window with no stall.
What MRC Replaces
MRC targets deployments running 800Gb/s network interfaces. Here is how its topology tradeoffs compare to the alternatives at the 100,000 GPU scale:
| Dimension | InfiniBand | Conventional Ethernet (ECMP) | MRC (Ethernet + SRv6) |
|---|---|---|---|
| Switch tiers for 100K GPUs | 2 (fat-tree) | 3-4 | 2 |
| Failure recovery | Seconds (fabric reconvergence) | Seconds to tens of seconds | Microseconds |
| Routing intelligence | In the fabric | In the fabric | At the NIC (SRv6) |
| Proprietary hardware required | Yes (Mellanox/ConnectX) | No | No |
| Open specification | Limited | No formal standard | Yes (OCP) |
| Packet ordering guarantee | Yes | No (ECMP reordering) | Yes (per-flow spraying) |
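The switch-tier rows come down to Clos arithmetic: a two-tier folded Clos supports roughly radix²/2 host ports, and each extra tier multiplies that by about radix/2. A quick sketch with placeholder radix values (none of them from the MRC spec):

Two-tier vs. three-tier Clos capacity (Python sketch)
-----------------------------------------------------
def clos_ports(radix: int, tiers: int) -> int:
    """Max host ports of a non-blocking folded-Clos fabric with `tiers` levels."""
    # The top tier contributes its full radix; every tier below it splits its
    # ports half down and half up, multiplying capacity by radix/2.
    return (radix // 2) ** (tiers - 1) * radix

for radix in (64, 128, 256):
    print(radix, f"{clos_ports(radix, 2):,}", f"{clos_ports(radix, 3):,}")
# radix   2-tier    3-tier
#   64     2,048      65,536
#  128     8,192     524,288
#  256    32,768   4,194,304

Reaching 100,000+ endpoints at 800Gb/s per port within two tiers is exactly where the multi-plane split comes in - along with the duplicated switch count it implies, flagged in the caveats below.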
The interesting technical bet embedded in MRC is that SRv6 at the NIC can deliver InfiniBand-grade losslessness over commodity Ethernet. That claim is now backed by production deployment at OpenAI's scale.
OpenAI's NVIDIA GB200 supercomputers at Stargate's Abilene, Texas site are already running MRC in production.
Where It Is Already Deployed
MRC is not vaporware. OpenAI has been running it in production across its largest NVIDIA GB200 supercomputers, including:
- Oracle Stargate Abilene, TX: The flagship Stargate cluster where OpenAI is launching tens of thousands of GB200 GPUs. Strikingly, Oracle and OpenAI scaled back expansion plans for this facility earlier this year, but the existing cluster is operational and running MRC.
- Microsoft Fairwater: Microsoft's next-generation AI supercomputer site, where MRC is integrated into the network fabric from day one.
The five partner companies - AMD, Broadcom, Intel, Microsoft, and NVIDIA - have all integrated MRC into their upcoming 800Gb/s network interface hardware, meaning the protocol will be available across the major accelerator ecosystems rather than locked to any one vendor.
Where It Falls Short
A few honest caveats before infrastructure teams get excited:
The hardware bar is high. MRC's SRv6 source routing requires network interface cards that can encode routing paths in packet headers. That capability is arriving in the emerging generation of 800Gb/s NICs but isn't present in most hardware already deployed today. Retrofitting existing clusters for MRC is not straightforward.
Two-tier topology has its own costs. The claim that MRC enables 100,000-GPU connectivity with only two switch tiers is technically true but depends on multi-plane architecture. Multi-plane topologies duplicate switch infrastructure across planes, so the cost reduction in per-plane switch count may be partially offset by the total number of physical switches needed.
It is an early specification. The OCP submission is dated May 6, 2026. Real-world interoperability testing across all five partner vendors in mixed-hardware environments is just beginning. The production deployments are homogeneous OpenAI environments; heterogeneous multi-vendor clusters at other hyperscalers are the real test.
Small clusters don't need it. The straggler effect MRC targets becomes acute at 50,000+ GPU scale. For teams running clusters of a few hundred to a few thousand GPUs - which covers most enterprise AI workloads and open-source training runs - conventional Ethernet fabric with ECMP is still fine. Teams running inference-serving stacks such as vLLM are not the target audience here.
No mention of RDMA compatibility. The MRC specification papers describe the SRv6 routing layer but aren't explicit about interaction with RDMA (Remote Direct Memory Access) and RoCEv2, which are used for high-bandwidth low-latency gradient synchronization in most GPU clusters today. Whether MRC complements or replaces RoCEv2 transport isn't yet answered in available public documentation.
The honest read on MRC is that it solves a real problem, at real scale, and is already proven in production. The more significant move is making it open. Prior art in this space - InfiniBand, proprietary switch firmware, custom congestion control algorithms - has historically been locked to specific vendors. Publishing through OCP means operators at other cloud providers, inference-focused infrastructure companies like Nebius, and even well-funded research labs can implement and iterate on the same standard.
Whether the five hardware partners ship interoperable silicon before the end of 2026, and whether other major cloud operators adopt the topology, are the things to watch.