RFC 9800 Compressed SRv6: The Hidden Engine Behind AI Superclusters


The explosive growth of large-scale AI training has fundamentally reshaped datacenter network design.

As GPU clusters scale from thousands to tens of thousands of accelerators, traditional Ethernet and RDMA networking architectures increasingly struggle with congestion, fault recovery latency, and operational complexity. Conventional ECMP-based routing models were designed for relatively stable cloud workloads, not for synchronized AI training jobs that continuously exchange massive volumes of latency-sensitive traffic.

A major architectural transition is now emerging inside hyperscale AI infrastructure.

Driven by the IETF RFC 9800 standard for Compressed SRv6 Segment List Encoding, a new networking paradigm is shifting path selection intelligence from the network control plane to the endpoints themselves. Combined with the MRC (Multipath Reliable Connection) transport protocol described in a production-scale AI networking paper jointly released by OpenAI, Microsoft, AMD, Broadcom, and NVIDIA, Compressed SRv6 is becoming a foundational building block for next-generation AI superclusters.

This architecture enables:

Deterministic source routing
Microsecond-level fault recovery
PFC-free lossy Ethernet operation
Massive-scale path observability
Stable operation across 100,000-GPU training clusters

The result is not merely an optimization of traditional datacenter networking, but a complete redesign of how AI infrastructure handles traffic engineering and fault resilience.

🌐 The Structural Limits of Traditional AI Datacenter Networking

Modern AI training clusters generate communication patterns fundamentally different from traditional cloud applications.

Distributed training frameworks continuously exchange gradients, parameters, synchronization signals, and tensor data across thousands of GPUs simultaneously. These workloads are highly sensitive to:

Congestion
Packet loss
Tail latency
Link imbalance
Recovery delays

Most current AI datacenter networks still rely on a mainstream architecture built from:

RoCEv2
ECMP
Dynamic routing
PFC-based lossless Ethernet

In this model:

Control Plane

Protocols such as BGP maintain global topology visibility and update routing tables after failures occur.

Data Plane

Switches use ECMP hashing to distribute traffic across multiple paths based on packet 5-tuples.

Endpoint Layer

Congestion control mechanisms such as DCQCN handle network congestion, while packet loss recovery relies on retransmission mechanisms like Go-back-N.

Although functional at moderate scale, this architecture begins to fail structurally in ultra-large AI clusters.

⚠️ ECMP Hash Collisions Become Inevitable

AI traffic patterns are highly synchronized and bursty.

When many flows are hashed across a limited number of physical paths, collisions become unavoidable.

The production paper showed that in a 64-way Ring AllReduce workload:

Single-QP RoCEv2 achieved only about 50% of theoretical throughput
Even scaling to 16 QPs failed to fully utilize network bandwidth

The root cause is straightforward:

When

$$ N \gg M $$

where \(N\) is the number of active flows and \(M\) is the number of available physical paths, multiple flows inevitably collide on the same links.

This creates:

Hotspot congestion
Bandwidth imbalance
Increased latency
Reduced overall utilization
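
The imbalance can be made concrete with a simple balls-into-bins model. The sketch below is illustrative only: it assumes every flow is hashed uniformly at random onto one path, which simplifies real ECMP hashing, and the function names are hypothetical rather than taken from the paper.

```python
import random
from collections import Counter


def expected_idle_fraction(num_flows: int, num_paths: int) -> float:
    """Expected share of paths that receive no flow under uniform random hashing."""
    return (1 - 1 / num_paths) ** num_flows


def simulate_path_loads(num_flows: int, num_paths: int, seed: int = 0) -> list[int]:
    """Hash each flow onto a random path and return the flow count per path."""
    rng = random.Random(seed)
    counts = Counter(rng.randrange(num_paths) for _ in range(num_flows))
    return [counts.get(path, 0) for path in range(num_paths)]


if __name__ == "__main__":
    loads = simulate_path_loads(num_flows=64, num_paths=64)
    print("idle paths:", loads.count(0))
    print("flows on the busiest path:", max(loads))
    print("expected idle fraction:", round(expected_idle_fraction(64, 64), 3))
```

Under this model, even with as many paths as flows, roughly a third of the paths are expected to sit idle while others carry several flows, and a flow that shares a link with others can use only a fraction of it.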

🕒 Control Plane Recovery Is Too Slow

Traditional networking handles failures through routing convergence.

The sequence typically looks like:

Link failure → detection → BGP convergence → FIB update → ECMP remapping.

This process can take hundreds of milliseconds, and often multiple seconds.

During that window:

Packets continue flowing into failed paths
Massive retransmissions occur
AI jobs stall or lose efficiency

At hyperscale AI cluster sizes, seconds of disruption can severely impact overall training throughput.

🚧 PFC Creates Head-of-Line Blocking

Lossless Ethernet depends on PFC (Priority Flow Control) to avoid packet drops.

However, PFC introduces major side effects:

Congestion spreading
Head-of-line blocking
Cascading backpressure

When multiple flows collide on the same bottleneck link, congestion propagates upstream, blocking unrelated traffic.

This phenomenon becomes increasingly destructive at large scale.

The OpenAI/Microsoft networking paper therefore made a striking architectural choice:

“We take the unusual stance of disabling dynamic routing in the switches.”

That statement represents a fundamental philosophical shift in datacenter networking.

🧠 The Three Pillars of the New AI Networking Paradigm

The new architecture combines:

Multi-plane topology
MRC transport intelligence
Compressed SRv6 source routing

Together, they fundamentally redefine traffic engineering in AI datacenters.

🔀 Multi-Plane Physical Topology

Instead of using a single large-scale fabric, the network is divided into multiple independent planes.

For example:

One 800Gb/s NIC
Split into eight 100Gb/s ports
Distributed across independent network planes

This reduces the blast radius of any individual link or switch failure.

A single failure now impacts only a fraction of total bandwidth instead of disrupting the entire cluster.
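
As a quick arithmetic check on the blast-radius claim, the sketch below assumes the example above (one 800Gb/s NIC split evenly into eight 100Gb/s ports, one per plane) and that a failure takes whole ports out of service. The function name is hypothetical.

```python
def remaining_bandwidth_gbps(nic_gbps: float, planes: int, failed_planes: int) -> float:
    """Bandwidth left on a NIC whose ports are spread evenly across planes."""
    if not 0 <= failed_planes <= planes:
        raise ValueError("failed_planes must be between 0 and planes")
    per_plane = nic_gbps / planes
    return nic_gbps - failed_planes * per_plane


# One failed plane out of eight costs 100Gb/s, i.e. 12.5% of the NIC.
print(remaining_bandwidth_gbps(800, 8, 1))  # 700.0
```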

📦 Static Compressed SRv6 Source Routing

Under RFC 9800 Compressed SRv6:

The sender encodes the exact path directly into the packet header
Switches simply forward deterministically
No ECMP hashing is required
No dynamic routing decisions occur inside the network

The network becomes intentionally simple.

All path intelligence moves to the endpoints.

⚡ Endpoint-Controlled Path Management

MRC concentrates the following entirely at the source endpoints:

Path selection
Failure handling
Retransmission logic
Traffic scheduling

The endpoints no longer depend on:

BGP convergence
Switch telemetry
Control-plane signaling

This transforms the network from a centrally reactive system into a locally adaptive one.

🧩 Why RFC 9800 Compressed SRv6 Matters

Traditional SRv6 alone would be impractical inside AI datacenters.

A standard SRv6 SID consumes 128 bits (16 bytes).

In a multi-hop Clos topology, packet headers would quickly become excessively large.

For AI training workloads involving many small packets, this overhead is unacceptable.

RFC 9800 solves this problem through compressed SID encoding.

📉 Compressed SID (CSID) Encoding

RFC 9800 defines two major compression approaches:

NEXT-CSID (uSID)

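Several compressed SIDs are packed into the IPv6 destination address. Each node processes the active one and shifts the remaining ones forward, so the SIDs are consumed sequentially.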

REPLACE-CSID (gSID)

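The destination address carries one compressed SID at a time. During processing, each node copies the next compressed SID from the Segment Routing Header over the current one.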

The OpenAI/Microsoft implementation adopts the uSID model.
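
To illustrate the sequential consumption of uSIDs, here is a minimal sketch of the shift a switch applies to the destination address. It assumes a 32-bit block followed by 16-bit uSIDs; the prefix `fcbb:bb00::/32`, the uSID values, and the function name are illustrative. It also omits the checks RFC 9800 specifies, such as handling an exhausted uSID list.

```python
import ipaddress

BLOCK_BITS = 32  # assumed locator-block length
USID_BITS = 16   # assumed uSID length


def next_csid_shift(destination: str) -> str:
    """Consume the active uSID by shifting the remaining ones toward the block."""
    tail_bits = 128 - BLOCK_BITS
    tail_mask = (1 << tail_bits) - 1
    value = int(ipaddress.IPv6Address(destination))
    block = value >> tail_bits
    # Shift left by one uSID: the mask drops the consumed uSID and the
    # vacated low-order bits are zero-filled.
    tail = ((value & tail_mask) << USID_BITS) & tail_mask
    return str(ipaddress.IPv6Address((block << tail_bits) | tail))


address = "fcbb:bb00:1:2:3:4::"
for hop in range(4):
    address = next_csid_shift(address)
    print(f"after hop {hop + 1}: {address}")
```

Each hop needs only this fixed shift plus a lookup on the new leading uSID, which is why forwarding can stay simple while the sender chooses the path.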

🧬 16-Bit uSID Compression

Instead of carrying full 128-bit SIDs, RFC 9800 compresses them into 16-bit or 32-bit segments.

A packet path may therefore look like:

32-bit locator block + multiple 16-bit uSIDs + padding.

This dramatically reduces encapsulation overhead.
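
The layout above can be sketched from the sender's side as follows. This is a minimal illustration assuming a 32-bit block and 16-bit uSIDs, which leaves room for six uSIDs in one 128-bit address. The block value and uSID values are made up, and the function name is hypothetical.

```python
import ipaddress

BLOCK_BITS = 32
USID_BITS = 16
MAX_USIDS = (128 - BLOCK_BITS) // USID_BITS  # six uSIDs fit after the block


def build_usid_destination(block: int, usids: list[int]) -> ipaddress.IPv6Address:
    """Pack a block and an ordered uSID list into one IPv6 destination address."""
    if not 0 <= block < (1 << BLOCK_BITS):
        raise ValueError("block must fit in 32 bits")
    if len(usids) > MAX_USIDS:
        raise ValueError(f"at most {MAX_USIDS} uSIDs fit in one address")
    shift = 128 - BLOCK_BITS
    value = block << shift
    for usid in usids:
        if not 0 < usid < (1 << USID_BITS):
            raise ValueError("each uSID must be a non-zero 16-bit value")
        shift -= USID_BITS
        value |= usid << shift
    # Bits that were never written stay zero and act as the padding.
    return ipaddress.IPv6Address(value)


print(build_usid_destination(0xFCBBBB00, [0x0100, 0x0200, 0x0300, 0x0400]))
```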

Efficiency Improvement

For a four-hop path, the approximate per-packet overhead is:

Traditional SRv6: ~112 bytes
Compressed uSID: ~64 bytes
Optimized inline encoding: ~40 bytes

This reduction is critical for production AI networking.

Without compression, SRv6 would remain largely impractical at hyperscale.
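
One accounting that reproduces the four-hop figures above is sketched below. It assumes the figures include the 40-byte outer IPv6 header, that the Segment Routing Header has an 8-byte fixed part plus 16 bytes per entry, and that six 16-bit uSIDs fit in one 16-byte container. The source does not spell out this breakdown, so treat it as an interpretation.

```python
import math

IPV6_HEADER_BYTES = 40
SRH_FIXED_BYTES = 8
SID_BYTES = 16
USIDS_PER_CONTAINER = 6  # assumes a 32-bit block and 16-bit uSIDs


def full_sid_overhead(hops: int) -> int:
    """Outer IPv6 header plus an SRH carrying one 128-bit SID per hop."""
    return IPV6_HEADER_BYTES + SRH_FIXED_BYTES + hops * SID_BYTES


def usid_srh_overhead(hops: int) -> int:
    """Outer IPv6 header plus an SRH carrying uSID containers."""
    containers = math.ceil(hops / USIDS_PER_CONTAINER)
    return IPV6_HEADER_BYTES + SRH_FIXED_BYTES + containers * SID_BYTES


def usid_inline_overhead(hops: int) -> int:
    """uSIDs carried entirely in the destination address, with no SRH."""
    if hops > USIDS_PER_CONTAINER:
        raise ValueError("path does not fit in a single destination address")
    return IPV6_HEADER_BYTES


print(full_sid_overhead(4), usid_srh_overhead(4), usid_inline_overhead(4))  # 112 64 40
```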

🔄 EV Mapping: Turning Paths into Deterministic Intelligence

MRC introduces the concept of an Entropy Value (EV), a 32-bit identifier composed from:

UDP source port
IPv6 flow label

In traditional ECMP systems, entropy values are merely hash inputs.

The endpoint has no visibility into the resulting physical path.

MRC changes this completely.

Each EV explicitly maps to:

A specific plane
A specific uplink
A specific downlink
A complete deterministic path

The sender can therefore construct exact SRv6 destination addresses algorithmically.

This provides full end-to-end path awareness at the source.

The endpoint no longer guesses where traffic will flow.

It knows precisely.
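
The idea of an EV that decodes to an exact path can be sketched with mixed-radix arithmetic. The bit layout MRC actually uses is not described here, so everything below is hypothetical: the `Path` type, the `ev_to_path` function, and the fabric dimensions are assumptions chosen only to show a deterministic, collision-free mapping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    plane: int
    uplink: int
    downlink: int


def ev_to_path(ev: int, planes: int, uplinks: int, downlinks: int) -> Path:
    """Decode an EV into path choices (hypothetical layout, not MRC's actual one)."""
    if not 0 <= ev < 2**32:
        raise ValueError("EV must be a 32-bit value")
    index = ev % (planes * uplinks * downlinks)
    plane, rest = index % planes, index // planes
    uplink, downlink = rest % uplinks, rest // uplinks
    return Path(plane, uplink, downlink)


print(ev_to_path(13, planes=8, uplinks=4, downlinks=4))
```

Because the mapping is pure arithmetic, the sender can enumerate every distinct path and knows in advance which links a given EV will traverse, unlike a switch-side hash whose result it cannot see.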

⚡ From Seconds to Microseconds: Reinventing Failure Recovery

The most transformative benefit of MRC + Compressed SRv6 is fault recovery speed.

❌ Traditional ECMP Recovery

In conventional networks:

Link fails
Control plane detects failure
Routing converges
FIB updates propagate
Hash mappings change
Endpoints retransmit

Total recovery time: hundreds of milliseconds, often several seconds.

This is catastrophic for synchronized AI workloads.

✅ MRC + SRv6 Recovery

Under MRC:

Packet loss detected locally
EV immediately marked bad
EV removed from active path set
New packets rerouted through backup paths
Lost packets selectively retransmitted via SACK

Total recovery time: tens of microseconds.

This is several orders of magnitude faster.
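
The endpoint-side steps listed above can be sketched as a small path-set structure. This is not MRC's actual algorithm: the class name, the round-robin spraying policy, and the EV values are assumptions, used only to show that excluding a bad EV is a local operation that needs no signal from the network.

```python
class EvPathSet:
    """Tracks which EVs (paths) a sender currently considers healthy."""

    def __init__(self, evs):
        self.active = list(evs)
        self.bad = []
        self._cursor = 0

    def pick(self) -> int:
        """Choose the EV for the next packet, round-robin over healthy EVs."""
        if not self.active:
            raise RuntimeError("no healthy paths left")
        ev = self.active[self._cursor % len(self.active)]
        self._cursor += 1
        return ev

    def mark_bad(self, ev: int) -> None:
        """Remove an EV from the active set after loss is detected on it."""
        if ev in self.active:
            self.active.remove(ev)
            self.bad.append(ev)


paths = EvPathSet([10, 11, 12, 13])
before = [paths.pick() for _ in range(4)]
paths.mark_bad(11)  # loss detected on EV 11
after = [paths.pick() for _ in range(6)]
print(before, after)
```

The next packet after `mark_bad` already avoids the failed path, so recovery time in this model is bounded by loss detection rather than by routing convergence.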

📊 Real Production Results

The production paper reported remarkable results:

50,000-GPU Cluster

During repeated T0 link flapping:

Throughput briefly dipped ~25%
Instantly recovered
No node dropouts
No QP disconnects

75,000-GPU Cluster

A T1 switch experienced silent forwarding failure while still appearing healthy to the control plane.

MRC automatically:

Identified affected EVs
Rerouted traffic
Maintained training continuity

The failed switch was rebooted live with effectively zero impact on training throughput.

This demonstrates the true power of endpoint-driven recovery.

🚫 Why PFC Can Finally Be Disabled

One of the most important implications of deterministic SRv6 routing is the ability to operate on lossy Ethernet.

This becomes possible because:

Deterministic Paths Reduce Flow Collisions

Explicit source routing prevents random ECMP collisions.

Fast SACK Retransmission Replaces Lossless Guarantees

Instead of preventing packet drops entirely, MRC rapidly retransmits only the missing packets.

This avoids:

Go-back-N inefficiency
Congestion spreading
PFC backpressure storms

The result is significantly better bandwidth sharing under congestion.
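
The gap between Go-back-N and selective retransmission can be shown with a deliberately simplified count. The sketch assumes a single burst of packets numbered from 0, a single recovery round in which all retransmissions arrive, and hypothetical function names.

```python
def go_back_n_resends(sent: int, lost: set[int]) -> int:
    """Packets resent when everything from the first loss onward is replayed."""
    return sent - min(lost) if lost else 0


def selective_resends(lost: set[int]) -> int:
    """Packets resent when only the reported gaps are retransmitted."""
    return len(lost)


sent = 64
lost = {5, 40}
print(go_back_n_resends(sent, lost), selective_resends(lost))  # 59 2
```

With two losses in a 64-packet burst, replaying from the first gap resends 59 packets while selective retransmission resends 2, which is why occasional drops are tolerable without a lossless fabric.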

🔍 The O&M Paradigm Also Changes

The shift to deterministic SRv6 networking also transforms operations and maintenance.

🛠️ Link Failures Become Routine Events

Traditional datacenters treat link flaps as urgent incidents.

Under MRC:

Endpoints automatically bypass failed paths
Traffic self-heals
Links reintegrate automatically

Many failures no longer require immediate human intervention.

🔄 Switch Reboots Become Transparent

Because forwarding logic no longer depends heavily on centralized control-plane convergence:

Switches can be rebooted live
Training jobs continue uninterrupted
Maintenance complexity drops dramatically

📡 Clustermapper Enables Precise Observability

The paper also describes a tool called Clustermapper.

Using deterministic SRv6 paths, it sends probes along exact production forwarding routes.

This provides:

Precise path observability
Real forwarding-state verification
Accurate fault localization

Traditional probabilistic telemetry approaches become unnecessary.

🚀 Conclusion

RFC 9800 Compressed SRv6, combined with MRC, represents a major architectural shift in AI datacenter networking.

Its core principles are remarkably simple:

Paths are determined by endpoints
Network states are sensed locally
Failures are bypassed instantly

This eliminates dependence on:

ECMP randomness
Slow control-plane convergence
PFC-driven lossless fabrics

The impact is profound.

Failure recovery moves from seconds to microseconds.

Network failures evolve from catastrophic cluster events into minor transient bandwidth fluctuations.

Most importantly, this architecture makes stable operation of 100,000-GPU AI superclusters not merely possible, but operationally practical.

As AI infrastructure continues scaling toward ever larger clusters, endpoint-controlled deterministic networking may ultimately become the dominant networking paradigm for the next generation of intelligent computing systems.

Appendix: 📘 What Is RFC 9800?

RFC 9800, officially titled "Compressed SRv6 Segment List Encoding", was published by the IETF in June 2025.

The standard defines mechanisms for compressing SRv6 Segment Identifiers, enabling SRv6 deployment at scale inside hardware-forwarded production environments.

RFC 9800 support has already been implemented across:

Linux Kernel
SONiC
Cisco
Huawei
Juniper
Broadcom
Marvell
ZTE
and other major vendors

By the end of 2025, more than 300 WAN deployments using compressed SRv6 had reportedly been completed globally.

The adoption of RFC 9800 inside production AI infrastructure from OpenAI, Microsoft, NVIDIA, AMD, and Broadcom strongly suggests that compressed SRv6 is evolving from a telecom-focused technology into a foundational protocol for hyperscale AI computing.

Notably, RFC 9800’s first author is Chinese networking expert Cheng Weiqiang from China Mobile, who also serves as co-chair of the IETF SRv6 OPS Working Group.

The standard’s contributor list includes experts from:

Cisco
Huawei
Orange
Alibaba
NTT
ZTE

and multiple other global networking organizations, reflecting the increasingly international nature of next-generation Internet protocol development.
