
RFC 9800 Compressed SRv6: The Hidden Engine Behind AI Superclusters


The explosive growth of large-scale AI training has fundamentally reshaped datacenter network design.

As GPU cluster sizes scale from thousands to tens of thousands of accelerators, traditional Ethernet and RDMA networking architectures are increasingly struggling with congestion, fault recovery latency, and operational complexity. Conventional ECMP-based routing models were originally designed for relatively stable cloud workloads, not for synchronized AI training jobs that continuously exchange massive volumes of latency-sensitive traffic.

A major architectural transition is now emerging inside hyperscale AI infrastructure.

Driven by the IETF RFC 9800 standard for Compressed SRv6 Segment List Encoding, a new networking paradigm is shifting path selection intelligence from the network control plane to the endpoints themselves. Combined with the MRC (Multipath Reliable Connection) transport protocol described in a production-scale AI networking paper jointly released by OpenAI, Microsoft, AMD, Broadcom, and NVIDIA, Compressed SRv6 is becoming a foundational building block for next-generation AI superclusters.

This architecture enables:

  • Deterministic source routing
  • Microsecond-level fault recovery
  • PFC-free lossy Ethernet operation
  • Massive-scale path observability
  • Stable operation across 100,000-GPU training clusters

The result is not merely an optimization of traditional datacenter networking, but a complete redesign of how AI infrastructure handles traffic engineering and fault resilience.


  1. 🌐 The Structural Limits of Traditional AI Datacenter Networking

Modern AI training clusters generate communication patterns fundamentally different from traditional cloud applications.

Distributed training frameworks continuously exchange gradients, parameters, synchronization signals, and tensor data across thousands of GPUs simultaneously. These workloads are highly sensitive to:

  • Congestion
  • Packet loss
  • Tail latency
  • Link imbalance
  • Recovery delays

Most current AI datacenter networks still rely on the mainstream architecture built from:

  • RoCEv2
  • ECMP
  • Dynamic routing
  • PFC-based lossless Ethernet

In this model:

Control Plane
#

Protocols such as BGP maintain global topology visibility and update routing tables after failures occur.

Data Plane
#

Switches use ECMP hashing to distribute traffic across multiple paths based on packet 5-tuples.

Endpoint Layer
#

Congestion control mechanisms such as DCQCN handle network congestion, while packet loss recovery relies on retransmission mechanisms like Go-back-N.

Although functional at moderate scale, this architecture begins to fail structurally in ultra-large AI clusters.


⚠️ ECMP Hash Collisions Become Inevitable
#

AI traffic patterns are highly synchronized and bursty.

When many flows are hashed across a limited number of physical paths, collisions become unavoidable.

The production paper showed that in a 64-way Ring AllReduce workload:

  • Single-QP RoCEv2 achieved only about 50% of theoretical throughput
  • Even scaling to 16 QPs failed to fully utilize network bandwidth

The root cause is straightforward:

When:

$$ N \gg M $$

where:

  • N = active flows
  • M = available physical paths

multiple flows inevitably collide on the same links.

This creates:

  • Hotspot congestion
  • Bandwidth imbalance
  • Increased latency
  • Reduced overall utilization
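
The collision effect can be illustrated with a small simulation. The sketch below is a toy model, not the paper's methodology: it assumes equal-rate flows and lets a uniform random pick stand in for a switch's 5-tuple hash (`hashed_load`), then compares that with an idealized even placement (`sprayed_load`). All function names are hypothetical.

```python
import random
from collections import Counter


def hashed_load(n_flows: int, n_paths: int, seed: int = 0) -> list[int]:
    """Flows per path when each flow lands on a uniformly random path.

    Assumption: a uniform random pick stands in for a switch's 5-tuple hash.
    """
    rng = random.Random(seed)
    counts = Counter(rng.randrange(n_paths) for _ in range(n_flows))
    return [counts[path] for path in range(n_paths)]


def sprayed_load(n_flows: int, n_paths: int) -> list[int]:
    """Flows per path under an idealized round-robin placement."""
    base, extra = divmod(n_flows, n_paths)
    return [base + (1 if path < extra else 0) for path in range(n_paths)]


def imbalance(loads: list[int]) -> float:
    """Load of the busiest path relative to the mean (1.0 is perfectly even)."""
    return max(loads) / (sum(loads) / len(loads))


if __name__ == "__main__":
    for n_flows in (8, 64, 512):
        hashed = imbalance(hashed_load(n_flows, 8))
        even = imbalance(sprayed_load(n_flows, 8))
        print(f"{n_flows} flows over 8 paths: hashed {hashed:.2f}x, even {even:.2f}x")
```

In this model the busiest path always carries at least ceil(N/M) flows, and random placement can only match or exceed that bound. For a synchronized collective, the most loaded path sets the pace for everyone.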

🕒 Control Plane Recovery Is Too Slow
#

Traditional networking handles failures through routing convergence.

The sequence typically looks like:

Link failure → detection → BGP convergence → FIB update → ECMP remapping.

This process typically takes hundreds of milliseconds, and often multiple seconds.

During that window:

  • Packets continue flowing into failed paths
  • Massive retransmissions occur
  • AI jobs stall or lose efficiency
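
A back-of-the-envelope calculation shows why the length of this window matters. The sketch below assumes a sender that keeps transmitting at line rate onto the failed path for the whole outage. The 100 Gb/s rate and the two window lengths are illustrative values, not measurements from the paper, and `blackholed_bytes` is a hypothetical helper.

```python
def blackholed_bytes(link_gbps: float, outage_seconds: float) -> float:
    """Bytes sent into a dead path at line rate during the outage window."""
    bits = link_gbps * 1e9 * outage_seconds
    return bits / 8


if __name__ == "__main__":
    windows = (("500 ms convergence", 0.5), ("50 us local reroute", 50e-6))
    for label, seconds in windows:
        lost_mb = blackholed_bytes(100, seconds) / 1e6
        print(f"{label}: {lost_mb:.3f} MB sent into the failed path")
```

Under these assumptions the difference is between gigabytes and hundreds of kilobytes of traffic that must be retransmitted per affected port.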

At hyperscale AI cluster sizes, seconds of disruption can severely impact overall training throughput.


🚧 PFC Creates Head-of-Line Blocking
#

Lossless Ethernet depends on PFC (Priority Flow Control) to avoid packet drops.

However, PFC introduces major side effects:

  • Congestion spreading
  • Head-of-line blocking
  • Cascading backpressure

When multiple flows collide on the same bottleneck link, congestion propagates upstream, blocking unrelated traffic.

This phenomenon becomes increasingly destructive at large scale.

The OpenAI/Microsoft networking paper therefore made a striking architectural choice:

“We take the unusual stance of disabling dynamic routing in the switches.”

That statement represents a fundamental philosophical shift in datacenter networking.


  2. 🧠 The Three Pillars of the New AI Networking Paradigm

The new architecture combines:

  • Multi-plane topology
  • MRC transport intelligence
  • Compressed SRv6 source routing

Together, they fundamentally redefine traffic engineering in AI datacenters.


🔀 Multi-Plane Physical Topology
#

Instead of using a single large-scale fabric, the network is divided into multiple independent planes.

For example:

  • One 800Gb/s NIC
  • Split into eight 100Gb/s ports
  • Distributed across independent network planes

This reduces the blast radius of any individual link or switch failure.

A single failure now impacts only a fraction of total bandwidth instead of disrupting the entire cluster.
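
The arithmetic behind that claim is simple. The sketch below assumes one NIC port per plane, as in the 8 x 100 Gb/s example above; both function names are hypothetical.

```python
def surviving_bandwidth_gbps(ports: int, port_gbps: float, failed_planes: int) -> float:
    """NIC bandwidth left when some planes are down (one port per plane assumed)."""
    if not 0 <= failed_planes <= ports:
        raise ValueError("failed_planes must be between 0 and ports")
    return (ports - failed_planes) * port_gbps


def blast_radius(ports: int, failed_planes: int) -> float:
    """Fraction of a NIC's bandwidth lost to the failed planes."""
    return failed_planes / ports


if __name__ == "__main__":
    left = surviving_bandwidth_gbps(8, 100, 1)
    lost = blast_radius(8, 1)
    print(f"one plane down: {left:.0f} Gb/s left, {lost:.1%} lost")
```

With eight planes, losing one costs an endpoint 12.5% of its bandwidth instead of its whole connection.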


📦 Static Compressed SRv6 Source Routing
#

Under RFC 9800 Compressed SRv6:

  • The sender encodes the exact path directly into the packet header
  • Switches simply forward deterministically
  • No ECMP hashing is required
  • No dynamic routing decisions occur inside the network

The network becomes intentionally simple.

All path intelligence moves to the endpoints.


⚑ Endpoint-Controlled Path Management
#

MRC places the following functions entirely at the source endpoints:

  • Path selection
  • Failure handling
  • Retransmission logic
  • Traffic scheduling

The endpoints no longer depend on:

  • BGP convergence
  • Switch telemetry
  • Control-plane signaling

This transforms the network from a centrally reactive system into a locally adaptive one.


  3. 🧩 Why RFC 9800 Compressed SRv6 Matters

Traditional SRv6 alone would be impractical inside AI datacenters.

A standard SRv6 SID consumes:

$$ 128\text{ bits} $$

In a multi-hop Clos topology, packet headers would quickly become excessively large.

For AI training workloads involving many small packets, this overhead is unacceptable.

RFC 9800 solves this problem through compressed SID encoding.


📉 Compressed SID (CSID) Encoding
#

RFC 9800 defines two major compression approaches:

NEXT-CSID (uSID)
#

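The destination address carries a container of several CSIDs. Each node processes the first CSID, then shifts the remaining ones forward so that the next CSID becomes active.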

REPLACE-CSID (gSID)
#

The destination address carries one active CSID. Each node replaces it with the next CSID read from the segment list in the SRH.

The OpenAI/Microsoft implementation adopts the uSID model.


🧬 16-Bit uSID Compression
#

Instead of storing full 128-bit SIDs, RFC 9800 compresses them into 16-bit or 32-bit CSIDs.

A packet path may therefore look like:

A 32-bit locator block, followed by up to six 16-bit uSIDs, with any unused bits zero-padded, all inside one 128-bit IPv6 destination address.

This dramatically reduces encapsulation overhead.
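
To make the layout concrete, here is a minimal sketch that packs uSIDs into a single IPv6 destination address and then advances it one hop. It assumes a 32-bit locator block and 16-bit uSIDs as described above. The `2001:db8::/32` documentation prefix and the uSID values are placeholders, and `pack_usid_container` and `shift_and_forward` are hypothetical names that only approximate the NEXT-CSID behavior.

```python
import ipaddress

BLOCK_BITS = 32   # locator block length (assumed)
USID_BITS = 16    # uSID length (assumed)
MAX_USIDS = (128 - BLOCK_BITS) // USID_BITS  # six uSIDs fit after the block


def pack_usid_container(block: int, usids: list[int]) -> ipaddress.IPv6Address:
    """Build a 128-bit address: block, then uSIDs, then zero padding."""
    if len(usids) > MAX_USIDS:
        raise ValueError(f"at most {MAX_USIDS} uSIDs fit in one container")
    value = block << (128 - BLOCK_BITS)
    shift = 128 - BLOCK_BITS - USID_BITS
    for usid in usids:
        if not 0 < usid <= 0xFFFF:
            raise ValueError("each uSID must be a non-zero 16-bit value")
        value |= usid << shift
        shift -= USID_BITS
    return ipaddress.IPv6Address(value)


def shift_and_forward(addr: ipaddress.IPv6Address) -> ipaddress.IPv6Address:
    """Drop the active uSID and shift the rest forward, keeping the block."""
    rest_bits = 128 - BLOCK_BITS
    mask = (1 << rest_bits) - 1
    block = int(addr) >> rest_bits
    rest = ((int(addr) & mask) << USID_BITS) & mask
    return ipaddress.IPv6Address((block << rest_bits) | rest)


if __name__ == "__main__":
    dst = pack_usid_container(0x20010DB8, [0x0100, 0x0200, 0x0300])
    print(dst)                     # 2001:db8:100:200:300::
    print(shift_and_forward(dst))  # 2001:db8:200:300::
```

Because the whole path fits in the destination address, a short path needs no extra routing header in this layout.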

Efficiency Improvement
#

For a four-hop path:

Traditional SRv6
#

  • ~112 bytes overhead per packet

Compressed uSID
#

  • ~64 bytes overhead

Optimized Inline Encoding
#

  • ~40 bytes overhead

This reduction is critical for production AI networking.

Without compression, SRv6 would remain largely impractical at hyperscale.
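
The three figures above can be reproduced with simple header arithmetic. The sketch below is one plausible accounting, and whether the paper counts overhead exactly this way is an assumption: a 40-byte IPv6 header, an 8-byte fixed Segment Routing Header (SRH), and 16 bytes per 128-bit segment-list entry, with up to six 16-bit uSIDs per entry. The function names are hypothetical.

```python
import math

IPV6_HEADER_BYTES = 40    # fixed IPv6 header
SRH_FIXED_BYTES = 8       # fixed part of the Segment Routing Header
SEGMENT_ENTRY_BYTES = 16  # one 128-bit entry in the SRH segment list
USIDS_PER_CONTAINER = 6   # assumes a 32-bit block and 16-bit uSIDs


def full_sid_overhead(hops: int) -> int:
    """IPv6 header plus an SRH carrying one full 128-bit SID per hop."""
    return IPV6_HEADER_BYTES + SRH_FIXED_BYTES + hops * SEGMENT_ENTRY_BYTES


def usid_srh_overhead(hops: int) -> int:
    """IPv6 header plus an SRH carrying uSID containers."""
    containers = math.ceil(hops / USIDS_PER_CONTAINER)
    return IPV6_HEADER_BYTES + SRH_FIXED_BYTES + containers * SEGMENT_ENTRY_BYTES


def usid_inline_overhead(hops: int) -> int:
    """IPv6 header only: all uSIDs ride in the destination address."""
    if hops > USIDS_PER_CONTAINER:
        raise ValueError("path does not fit in a single destination address")
    return IPV6_HEADER_BYTES


if __name__ == "__main__":
    hops = 4
    print("full SIDs in SRH:", full_sid_overhead(hops))     # 112
    print("uSIDs in SRH:    ", usid_srh_overhead(hops))     # 64
    print("uSIDs inline:    ", usid_inline_overhead(hops))  # 40
```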


  4. 🔄 EV Mapping: Turning Paths into Deterministic Intelligence

MRC introduces the concept of:

$$ \text{EV} = \text{Entropy Value} $$

a 32-bit identifier composed from:

  • UDP source port
  • IPv6 flow label

In traditional ECMP systems, entropy values are merely hash inputs.

The endpoint has no visibility into the resulting physical path.

MRC changes this completely.

Each EV explicitly maps to:

  • A specific plane
  • A specific uplink
  • A specific downlink
  • A complete deterministic path

The sender can therefore construct exact SRv6 destination addresses algorithmically.

This provides full end-to-end path awareness at the source.

The endpoint no longer guesses where traffic will flow.

It knows precisely.
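
The idea of an EV that decodes to an exact path can be sketched as a mixed-radix index. This is purely illustrative: the paper's actual EV layout and its encoding in the UDP source port and flow label are not modelled here, and `Path`, `ev_to_path`, and `path_to_ev` are hypothetical names.

```python
from typing import NamedTuple


class Path(NamedTuple):
    plane: int
    uplink: int
    downlink: int


def ev_to_path(ev: int, planes: int, uplinks: int, downlinks: int) -> Path:
    """Decode an EV into (plane, uplink, downlink); every EV names one path."""
    if not 0 <= ev < planes * uplinks * downlinks:
        raise ValueError("EV outside the path space")
    rest, downlink = divmod(ev, downlinks)
    plane, uplink = divmod(rest, uplinks)
    return Path(plane, uplink, downlink)


def path_to_ev(path: Path, uplinks: int, downlinks: int) -> int:
    """Inverse mapping: the sender can compute the EV for any path it wants."""
    return (path.plane * uplinks + path.uplink) * downlinks + path.downlink


if __name__ == "__main__":
    # Assumed toy fabric: 8 planes, 4 uplinks and 4 downlinks per plane.
    print(ev_to_path(0, 8, 4, 4))
    print(ev_to_path(127, 8, 4, 4))
```

The key property is that the mapping is a bijection: unlike a hash, two different EVs can never land on the same path by accident.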


  5. ⚡ From Seconds to Microseconds: Reinventing Failure Recovery

The most transformative benefit of MRC + Compressed SRv6 is fault recovery speed.


❌ Traditional ECMP Recovery
#

In conventional networks:

  1. Link fails
  2. Control plane detects failure
  3. Routing converges
  4. FIB updates propagate
  5. Hash mappings change
  6. Endpoints retransmit

Total recovery time is hundreds of milliseconds, and often several seconds.

This is catastrophic for synchronized AI workloads.


✅ MRC + SRv6 Recovery
#

Under MRC:

  1. Packet loss detected locally
  2. EV immediately marked bad
  3. EV removed from active path set
  4. New packets rerouted through backup paths
  5. Lost packets selectively retransmitted via SACK

Total recovery time:

  • Tens of microseconds

This is several orders of magnitude faster.
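
The five MRC steps listed above can be sketched as a small piece of sender-side state. This is not MRC's actual algorithm: how a loss is attributed to an EV and when a bad EV is retried are not described here and are not modelled. `EvPathSet` and `sack_holes` are hypothetical names.

```python
class EvPathSet:
    """Tracks which entropy values (EVs) a sender may still use."""

    def __init__(self, evs):
        self.active = list(evs)
        self.bad = set()
        self._cursor = 0

    def mark_bad(self, ev):
        """Steps 2-3: flag the EV and drop it from the active set."""
        if ev in self.active:
            self.active.remove(ev)
            self.bad.add(ev)

    def next_ev(self):
        """Step 4: rotate new packets over the EVs that remain."""
        if not self.active:
            raise RuntimeError("no healthy paths left")
        ev = self.active[self._cursor % len(self.active)]
        self._cursor += 1
        return ev


def sack_holes(highest_sent, sacked):
    """Step 5: sequence numbers the receiver has not acknowledged."""
    return [seq for seq in range(highest_sent + 1) if seq not in sacked]


if __name__ == "__main__":
    paths = EvPathSet([10, 11, 12, 13])
    paths.mark_bad(11)                           # loss attributed to EV 11
    print([paths.next_ev() for _ in range(6)])   # EV 11 is never chosen
    print(sack_holes(5, {0, 1, 3, 5}))           # only these are resent
```

Every step is a local memory operation at the sender, which is why no routing protocol has to converge before traffic moves to a healthy path.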


📊 Real Production Results
#

The production paper reported remarkable results:

50,000-GPU Cluster
#

During repeated T0 link flapping:

  • Throughput briefly dipped ~25%
  • Instantly recovered
  • No node dropouts
  • No QP disconnects

75,000-GPU Cluster
#

A T1 switch experienced silent forwarding failure while still appearing healthy to the control plane.

MRC automatically:

  • Identified affected EVs
  • Rerouted traffic
  • Maintained training continuity

The failed switch was rebooted live with effectively zero impact on training throughput.

This demonstrates the true power of endpoint-driven recovery.


  6. 🚫 Why PFC Can Finally Be Disabled

One of the most important implications of deterministic SRv6 routing is the ability to operate on lossy Ethernet.

This becomes possible because:

Deterministic Paths Reduce Flow Collisions
#

Explicit source routing prevents random ECMP collisions.

Fast SACK Retransmission Replaces Lossless Guarantees
#

Instead of preventing packet drops entirely, MRC rapidly retransmits only the missing packets.

This avoids:

  • Go-back-N inefficiency
  • Congestion spreading
  • PFC backpressure storms

The result is significantly better bandwidth sharing under congestion.
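
The cost difference between the two retransmission styles is easy to quantify. The sketch below counts resent packets for a window numbered from 0 to `highest_sent`. It is a simplified model that ignores timers and repeated losses, and the function names are hypothetical.

```python
def go_back_n_resends(highest_sent: int, lost: set[int]) -> int:
    """Go-back-N resends everything from the first lost packet onward."""
    if not lost:
        return 0
    return highest_sent - min(lost) + 1


def selective_resends(lost: set[int]) -> int:
    """SACK-style recovery resends only the packets that went missing."""
    return len(lost)


if __name__ == "__main__":
    lost = {10}
    print("go-back-N:", go_back_n_resends(99, lost), "packets resent")
    print("selective:", selective_resends(lost), "packet resent")
```

On a lossy fabric, a single drop early in a window therefore costs one packet rather than most of the window.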


  7. 🔍 The O&M Paradigm Also Changes

The shift to deterministic SRv6 networking also transforms operations and maintenance.


🛠️ Link Failures Become Routine Events
#

Traditional datacenters treat link flaps as urgent incidents.

Under MRC:

  • Endpoints automatically bypass failed paths
  • Traffic self-heals
  • Links reintegrate automatically

Many failures no longer require immediate human intervention.


🔄 Switch Reboots Become Transparent
#

Because forwarding logic no longer depends heavily on centralized control-plane convergence:

  • Switches can be rebooted live
  • Training jobs continue uninterrupted
  • Maintenance complexity drops dramatically

📡 Clustermapper Enables Precise Observability
#

The paper also describes a tool called Clustermapper.

Using deterministic SRv6 paths, it sends probes along exact production forwarding routes.

This provides:

  • Precise path observability
  • Real forwarding-state verification
  • Accurate fault localization

This reduces reliance on traditional probabilistic telemetry approaches.
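
Because each probe follows a known list of links, fault localization reduces to set arithmetic. The sketch below is not Clustermapper's actual algorithm, which is not described here. It assumes a single faulty link and hypothetical link names: the suspects are the links shared by every failed probe, minus any link that a successful probe has crossed.

```python
def suspect_links(probe_results: dict[tuple[str, ...], bool]) -> set[str]:
    """Links common to all failed probes and absent from all healthy ones.

    Assumption: a single faulty link explains every failed probe.
    """
    healthy: set[str] = set()
    failed: list[set[str]] = []
    for path, ok in probe_results.items():
        if ok:
            healthy.update(path)
        else:
            failed.append(set(path))
    if not failed:
        return set()
    return set.intersection(*failed) - healthy


if __name__ == "__main__":
    probes = {
        ("h1-t0a", "t0a-t1a", "t1a-t0b"): False,
        ("h1-t0a", "t0a-t1b", "t1b-t0b"): True,
        ("h2-t0a", "t0a-t1a", "t1a-t0c"): False,
    }
    print(suspect_links(probes))
```

With hashed ECMP a probe cannot be steered onto a chosen link, so this kind of exact intersection is not available.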


  8. 🚀 Conclusion

RFC 9800 Compressed SRv6, combined with MRC, represents a major architectural shift in AI datacenter networking.

Its core principles are remarkably simple:

  1. Paths are determined by endpoints
  2. Network states are sensed locally
  3. Failures are bypassed instantly

This eliminates dependence on:

  • ECMP randomness
  • Slow control-plane convergence
  • PFC-driven lossless fabrics

The impact is profound.

Failure recovery moves from seconds to microseconds.

Network failures evolve from catastrophic cluster events into minor transient bandwidth fluctuations.

Most importantly, this architecture makes stable operation of 100,000-GPU AI superclusters not merely possible, but operationally practical.

As AI infrastructure continues scaling toward ever larger clusters, endpoint-controlled deterministic networking may ultimately become the dominant networking paradigm for the next generation of intelligent computing systems.


Appendix: 📘 What Is RFC 9800?

RFC 9800, officially titled:

Compressed SRv6 Segment List Encoding

was published by the IETF in June 2025.

The standard defines mechanisms for compressing SRv6 Segment Identifiers, enabling SRv6 deployment at scale inside hardware-forwarded production environments.

RFC 9800 support has already been implemented across:

  • Linux Kernel
  • SONiC
  • Cisco
  • Huawei
  • Juniper
  • Broadcom
  • Marvell
  • ZTE
  • and other major vendors

By the end of 2025, more than 300 WAN deployments using compressed SRv6 had reportedly been completed globally.

The adoption of RFC 9800 inside production AI infrastructure from OpenAI, Microsoft, NVIDIA, AMD, and Broadcom strongly suggests that compressed SRv6 is evolving from a telecom-focused technology into a foundational protocol for hyperscale AI computing.

Notably, RFC 9800’s first author is Chinese networking expert Cheng Weiqiang from China Mobile, who also serves as co-chair of the IETF SRv6 OPS Working Group.

The standard’s contributor list includes experts from:

  • Cisco
  • Huawei
  • Orange
  • Alibaba
  • NTT
  • ZTE

and multiple other global networking organizations, reflecting the increasingly international nature of next-generation Internet protocol development.
