RFC 9800 Compressed SRv6: The Hidden Engine Behind AI Superclusters
The explosive growth of large-scale AI training has fundamentally reshaped datacenter network design.
As GPU cluster sizes scale from thousands to tens of thousands of accelerators, traditional Ethernet and RDMA networking architectures are increasingly struggling with congestion, fault recovery latency, and operational complexity. Conventional ECMP-based routing models were originally designed for relatively stable cloud workloads, not for synchronized AI training jobs that continuously exchange massive volumes of latency-sensitive traffic.
A major architectural transition is now emerging inside hyperscale AI infrastructure.
Driven by the IETF RFC 9800 standard for Compressed SRv6 Segment List Encoding, a new networking paradigm is shifting path selection intelligence from the network control plane to the endpoints themselves. Combined with the MRC (Multipath Reliable Connection) transport protocol described in a production-scale AI networking paper jointly released by OpenAI, Microsoft, AMD, Broadcom, and NVIDIA, Compressed SRv6 is becoming a foundational building block for next-generation AI superclusters.
This architecture enables:
- Deterministic source routing
- Microsecond-level fault recovery
- PFC-free lossy Ethernet operation
- Massive-scale path observability
- Stable operation across 100,000-GPU training clusters
The result is not merely an optimization of traditional datacenter networking, but a complete redesign of how AI infrastructure handles traffic engineering and fault resilience.
The Structural Limits of Traditional AI Datacenter Networking
Modern AI training clusters generate communication patterns fundamentally different from traditional cloud applications.
Distributed training frameworks continuously exchange gradients, parameters, synchronization signals, and tensor data across thousands of GPUs simultaneously. These workloads are highly sensitive to:
- Congestion
- Packet loss
- Tail latency
- Link imbalance
- Recovery delays
Most current AI datacenter networks still rely on the mainstream architecture built on:
- RoCEv2
- ECMP
- Dynamic routing
- PFC-based lossless Ethernet
In this model:
Control Plane
Protocols such as BGP maintain global topology visibility and update routing tables after failures occur.
Data Plane
Switches use ECMP hashing to distribute traffic across multiple paths based on packet 5-tuples.
Endpoint Layer
Congestion control mechanisms such as DCQCN handle network congestion, while packet loss recovery relies on retransmission mechanisms like Go-back-N.
Although functional at moderate scale, this architecture begins to fail structurally in ultra-large AI clusters.
ECMP Hash Collisions Become Inevitable
AI traffic patterns are highly synchronized and bursty.
When many flows are hashed across a limited number of physical paths, collisions become unavoidable.
The production paper showed that in a 64-way Ring AllReduce workload:
- Single-QP RoCEv2 achieved only about 50% of theoretical throughput
- Even scaling to 16 QPs failed to fully utilize network bandwidth
The root cause is straightforward:
When
$$ N \gg M $$
where
- $N$ = number of active flows
- $M$ = number of available physical paths

multiple flows inevitably collide on the same links.
This creates:
- Hotspot congestion
- Bandwidth imbalance
- Increased latency
- Reduced overall utilization
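To make the collision argument concrete, here is a minimal sketch. It assumes an idealized ECMP hash that places each flow on one of $M$ equal-cost paths uniformly at random; real switch hashes are deterministic functions of the 5-tuple, so this is only an approximation. The function names are illustrative and do not come from the paper.

```python
import random
from collections import Counter


def expected_idle_fraction(n_flows: int, n_paths: int) -> float:
    """Expected fraction of paths that carry no flow under uniform random hashing."""
    return (1 - 1 / n_paths) ** n_flows


def simulate_max_load(n_flows: int, n_paths: int, seed: int = 0) -> int:
    """Hash n_flows onto n_paths at random and return the flow count on the busiest path."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    load = Counter(rng.randrange(n_paths) for _ in range(n_flows))
    return max(load.values())


if __name__ == "__main__":
    # 64 equal flows over 64 paths: a sizeable share of paths stays idle
    # while other paths carry two or more flows.
    print(f"idle fraction: {expected_idle_fraction(64, 64):.3f}")
    print(f"busiest path carries {simulate_max_load(64, 64)} flows")
```

Under this model, even with exactly as many flows as paths, roughly 36% of the paths are expected to sit idle while others are oversubscribed. This is the kind of imbalance the throughput figures above describe.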
Control Plane Recovery Is Too Slow
Traditional networking handles failures through routing convergence.
The sequence typically looks like:
Link failure → detection → BGP convergence → FIB update → ECMP remapping.
This process typically takes hundreds of milliseconds, and often multiple seconds.
During that window:
- Packets continue flowing into failed paths
- Massive retransmissions occur
- AI jobs stall or lose efficiency
At hyperscale AI cluster sizes, seconds of disruption can severely impact overall training throughput.
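A rough back-of-the-envelope calculation shows why the convergence window matters. The sketch assumes a sender keeps transmitting at line rate toward the failed path for the entire window, which is a simplifying worst case. The 100 Gb/s rate and the durations are illustrative values, not measurements from the paper.

```python
def bytes_at_risk(link_gbps: float, outage_seconds: float) -> float:
    """Bytes a sender can push toward a dead path before traffic is steered away."""
    bytes_per_second = link_gbps * 1e9 / 8
    return bytes_per_second * outage_seconds


if __name__ == "__main__":
    for outage in (0.2, 1.0, 3.0):  # assumed convergence times in seconds
        lost_gb = bytes_at_risk(100, outage) / 1e9
        print(f"{outage:>4} s outage at 100 Gb/s -> {lost_gb:.1f} GB sent into the failure")
```

At 100 Gb/s, a one-second outage corresponds to 12.5 GB of traffic per port that has to be detected as lost and retransmitted.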
PFC Creates Head-of-Line Blocking
Lossless Ethernet depends on PFC (Priority Flow Control) to avoid packet drops.
However, PFC introduces major side effects:
- Congestion spreading
- Head-of-line blocking
- Cascading backpressure
When multiple flows collide on the same bottleneck link, congestion propagates upstream, blocking unrelated traffic.
This phenomenon becomes increasingly destructive at large scale.
The OpenAI/Microsoft networking paper therefore made a striking architectural choice:
"We take the unusual stance of disabling dynamic routing in the switches."
That statement represents a fundamental philosophical shift in datacenter networking.
The Three Pillars of the New AI Networking Paradigm
The new architecture combines:
- Multi-plane topology
- MRC transport intelligence
- Compressed SRv6 source routing
Together, they fundamentally redefine traffic engineering in AI datacenters.
Multi-Plane Physical Topology
Instead of using a single large-scale fabric, the network is divided into multiple independent planes.
For example:
- One 800Gb/s NIC
- Split into eight 100Gb/s ports
- Distributed across independent network planes
This reduces the blast radius of any individual link or switch failure.
A single failure now impacts only a fraction of total bandwidth instead of disrupting the entire cluster.
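The blast-radius claim is simple arithmetic, sketched below for the 800 Gb/s NIC split into eight 100 Gb/s ports mentioned above. The function name is illustrative, and the model assumes bandwidth is spread evenly across planes.

```python
def remaining_bandwidth_gbps(nic_gbps: float, planes: int, failed_planes: int) -> float:
    """Bandwidth still usable when some planes are lost, assuming an even split."""
    if not 0 <= failed_planes <= planes:
        raise ValueError("failed_planes must be between 0 and planes")
    per_plane = nic_gbps / planes
    return per_plane * (planes - failed_planes)


if __name__ == "__main__":
    left = remaining_bandwidth_gbps(800, 8, 1)
    print(f"{left:.0f} Gb/s remain ({1 - left / 800:.1%} lost)")
```

With one of eight planes unavailable, the endpoint keeps 700 Gb/s, so a single-plane failure costs 12.5% of bandwidth rather than connectivity.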
Static Compressed SRv6 Source Routing
Under RFC 9800 Compressed SRv6:
- The sender encodes the exact path directly into the packet header
- Switches simply forward deterministically
- No ECMP hashing is required
- No dynamic routing decisions occur inside the network
The network becomes intentionally simple.
All path intelligence moves to the endpoints.
Endpoint-Controlled Path Management
MRC places the following functions entirely at the source endpoints:
- Path selection
- Failure handling
- Retransmission logic
- Traffic scheduling
The endpoints no longer depend on:
- BGP convergence
- Switch telemetry
- Control-plane signaling
This transforms the network from a centrally reactive system into a locally adaptive one.
Why RFC 9800 Compressed SRv6 Matters
Traditional SRv6 alone would be impractical inside AI datacenters.
A standard SRv6 SID consumes:
$$ 128 \text{ bits} $$
In a multi-hop Clos topology, packet headers would quickly become excessively large.
For AI training workloads involving many small packets, this overhead is unacceptable.
RFC 9800 solves this problem through compressed SID encoding.
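Compressed SID (C-SID) Encoding
RFC 9800 defines two compression flavors:
NEXT-C-SID (uSID)
Several compressed SIDs are packed into one 128-bit container. Each node consumes its own C-SID and shifts the remaining ones forward, so the next C-SID becomes active.
REPLACE-C-SID (gSID)
The active C-SID in the destination address is overwritten by the next C-SID, which is copied from the segment list carried in the packet.
The OpenAI/Microsoft implementation adopts the uSID model.
16-Bit uSID Compression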
Instead of storing full 128-bit SIDs, RFC 9800 compresses them into 16-bit or 32-bit segments.
A packet path may therefore look like:
32-bit locator + multiple 16-bit uSIDs + padding.
This dramatically reduces encapsulation overhead.
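The layout above (32-bit locator, 16-bit uSIDs, zero padding) can be sketched with Python's standard `ipaddress` module. This is a simplified illustration, not the paper's code. The `2001:db8::/32` block is the IPv6 documentation prefix, and the uSID values are placeholders. The second function shows, under the same assumptions, how a node consumes the first uSID by shifting the remaining ones forward.

```python
import ipaddress

BLOCK_BITS = 32  # locator block length, as in the layout described above
USID_BITS = 16   # one micro-SID
REST_BITS = 128 - BLOCK_BITS  # bits available for uSIDs and padding


def encode_usid_da(block: int, usids: list[int]) -> ipaddress.IPv6Address:
    """Pack a locator block and a list of 16-bit uSIDs into one IPv6 destination address."""
    if len(usids) > REST_BITS // USID_BITS:
        raise ValueError("too many uSIDs for a single 128-bit container")
    value = block << REST_BITS
    shift = REST_BITS - USID_BITS
    for usid in usids:
        if not 0 <= usid < (1 << USID_BITS):
            raise ValueError("uSID must fit in 16 bits")
        value |= usid << shift
        shift -= USID_BITS
    return ipaddress.IPv6Address(value)  # unused trailing bits stay zero (padding)


def shift_next_usid(da: ipaddress.IPv6Address) -> ipaddress.IPv6Address:
    """Consume the first uSID: keep the block, shift the rest left by 16 bits, zero-fill."""
    raw = int(da)
    block = raw >> REST_BITS
    rest = (raw & ((1 << REST_BITS) - 1)) << USID_BITS
    rest &= (1 << REST_BITS) - 1
    return ipaddress.IPv6Address((block << REST_BITS) | rest)


if __name__ == "__main__":
    da = encode_usid_da(0x20010DB8, [0x0100, 0x0200, 0x0300, 0x0400])
    print(da)                   # 2001:db8:100:200:300:400::
    print(shift_next_usid(da))  # 2001:db8:200:300:400::
```

Because the whole path fits inside the destination address, a switch only needs to match on the block plus the active uSID and perform the shift. No per-flow state or hashing is involved.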
Efficiency Improvement
For a four-hop path, the approximate per-packet overhead is:
- Traditional SRv6: ~112 bytes
- Compressed uSID: ~64 bytes
- Optimized inline encoding: ~40 bytes
This reduction is critical for production AI networking.
Without compression, SRv6 would remain largely impractical at hyperscale.
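One plausible accounting that reproduces these figures is sketched below. It relies on the 40-byte fixed IPv6 header and the 8-byte fixed part of the Segment Routing Header (RFC 8754). How the article's numbers were actually derived is an assumption on our part.

```python
IPV6_HEADER_BYTES = 40  # fixed IPv6 header
SRH_FIXED_BYTES = 8     # fixed part of the Segment Routing Header (RFC 8754)
SID_BYTES = 16          # one 128-bit SID, or one uSID container


def classic_srv6_overhead(hops: int) -> int:
    """Outer IPv6 header + SRH + one full 128-bit SID per hop."""
    return IPV6_HEADER_BYTES + SRH_FIXED_BYTES + hops * SID_BYTES


def usid_with_srh_overhead(containers: int = 1) -> int:
    """Outer IPv6 header + SRH carrying uSID containers."""
    return IPV6_HEADER_BYTES + SRH_FIXED_BYTES + containers * SID_BYTES


def usid_inline_overhead() -> int:
    """All uSIDs fit in the destination address, so no SRH is needed."""
    return IPV6_HEADER_BYTES


def usids_per_container(block_bits: int = 32, usid_bits: int = 16) -> int:
    """How many uSIDs fit after the locator block in one 128-bit address."""
    return (128 - block_bits) // usid_bits


if __name__ == "__main__":
    print(classic_srv6_overhead(4), usid_with_srh_overhead(), usid_inline_overhead())
    print("uSIDs per address:", usids_per_container())
```

With a 32-bit block, six 16-bit uSIDs fit in a single address, so a four-hop path can be carried inline without any extension header.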
EV Mapping: Turning Paths into Deterministic Intelligence
MRC introduces the concept of the Entropy Value (EV),
$$ \text{EV} = \text{Entropy Value} $$
a 32-bit identifier composed from:
- UDP source port
- IPv6 flow label
In traditional ECMP systems, entropy values are merely hash inputs.
The endpoint has no visibility into the resulting physical path.
MRC changes this completely.
Each EV explicitly maps to:
- A specific plane
- A specific uplink
- A specific downlink
- A complete deterministic path
The sender can therefore construct exact SRv6 destination addresses algorithmically.
This provides full end-to-end path awareness at the source.
The endpoint no longer guesses where traffic will flow.
It knows precisely.
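The article does not spell out how an EV is decoded, so the following is a purely hypothetical sketch of what "algorithmic" mapping could look like. It treats the EV as a mixed-radix number over planes, uplinks, and downlinks. The `Path` type, the radix order, and the uSID numbering scheme are all assumptions made for illustration.

```python
from typing import NamedTuple


class Path(NamedTuple):
    plane: int
    uplink: int
    downlink: int


def ev_to_path(ev: int, planes: int, uplinks: int, downlinks: int) -> Path:
    """Decode an EV into (plane, uplink, downlink) using a hypothetical mixed-radix layout."""
    index = ev % (planes * uplinks * downlinks)
    plane, rest = divmod(index, uplinks * downlinks)
    uplink, downlink = divmod(rest, downlinks)
    return Path(plane, uplink, downlink)


def path_to_usids(path: Path) -> list[int]:
    """Hypothetical uSID numbering: one 16-bit uSID per path element."""
    return [0x1000 + path.plane, 0x2000 + path.uplink, 0x3000 + path.downlink]


if __name__ == "__main__":
    path = ev_to_path(37, planes=8, uplinks=4, downlinks=4)
    print(path)  # Path(plane=2, uplink=1, downlink=1)
    print([hex(u) for u in path_to_usids(path)])
```

The point of the sketch is that the mapping is a pure function. Given an EV, the sender can compute the path and the matching uSID list locally, without querying the network.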
From Seconds to Microseconds: Reinventing Failure Recovery
The most transformative benefit of MRC + Compressed SRv6 is fault recovery speed.
Traditional ECMP Recovery
In conventional networks:
- Link fails
- Control plane detects failure
- Routing converges
- FIB updates propagate
- Hash mappings change
- Endpoints retransmit
Total recovery time: hundreds of milliseconds, and often several seconds.
This is catastrophic for synchronized AI workloads.
MRC + SRv6 Recovery
Under MRC:
- Packet loss detected locally
- EV immediately marked bad
- EV removed from active path set
- New packets rerouted through backup paths
- Lost packets selectively retransmitted via SACK
Total recovery time: tens of microseconds.
This is several orders of magnitude faster.
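The MRC recovery steps listed above can be sketched as endpoint-local bookkeeping. This is a simplified illustration, not the MRC implementation. The class and function names are hypothetical, any packet missing from the SACK is treated as lost, and a single loss is enough to mark an EV bad.

```python
class PathSet:
    """Tracks which EVs (paths) a sender currently considers healthy."""

    def __init__(self, evs: list[int]) -> None:
        self.active = list(evs)
        self.bad: set[int] = set()
        self._next = 0

    def pick(self) -> int:
        """Round-robin over healthy EVs so packets are sprayed across paths."""
        if not self.active:
            raise RuntimeError("no healthy paths left")
        ev = self.active[self._next % len(self.active)]
        self._next += 1
        return ev

    def mark_bad(self, ev: int) -> None:
        """Remove an EV from the active set after a loss is attributed to it."""
        if ev in self.active:
            self.active.remove(ev)
            self.bad.add(ev)


def handle_sack(sent: dict[int, int], sacked: set[int], paths: PathSet) -> list[tuple[int, int]]:
    """sent maps sequence number -> EV used. Returns (seq, new_ev) pairs to retransmit."""
    missing = sorted(seq for seq in sent if seq not in sacked)
    for seq in missing:
        paths.mark_bad(sent[seq])  # stop using the path that dropped the packet
    return [(seq, paths.pick()) for seq in missing]  # resend only what was lost


if __name__ == "__main__":
    paths = PathSet([10, 11, 12, 13])
    sent = {0: 10, 1: 11, 2: 12, 3: 13, 4: 10, 5: 11}
    print(handle_sack(sent, sacked={0, 2, 3, 4}, paths=paths))
    print("still active:", paths.active)
```

Every step is local to the sender, so the reaction time is bounded by loss detection at the endpoint rather than by routing convergence in the fabric.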
Real Production Results
The production paper reported remarkable results:
50,000-GPU Cluster
During repeated T0 link flapping:
- Throughput briefly dipped ~25%
- Instantly recovered
- No node dropouts
- No QP disconnects
75,000-GPU Cluster
A T1 switch experienced silent forwarding failure while still appearing healthy to the control plane.
MRC automatically:
- Identified affected EVs
- Rerouted traffic
- Maintained training continuity
The failed switch was rebooted live with effectively zero impact on training throughput.
This demonstrates the true power of endpoint-driven recovery.
Why PFC Can Finally Be Disabled
One of the most important implications of deterministic SRv6 routing is the ability to operate on lossy Ethernet.
This becomes possible because:
Deterministic Paths Reduce Flow Collisions
Explicit source routing prevents random ECMP collisions.
Fast SACK Retransmission Replaces Lossless Guarantees
Instead of preventing packet drops entirely, MRC rapidly retransmits only the missing packets.
This avoids:
- Go-back-N inefficiency
- Congestion spreading
- PFC backpressure storms
The result is significantly better bandwidth sharing under congestion.
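The cost difference between Go-back-N and selective retransmission is easy to quantify for a single window of packets. The sketch below uses a deliberately simple model: one round, no further losses during retransmission. The function names are illustrative.

```python
def go_back_n_retransmits(window: list[int], lost: set[int]) -> int:
    """Go-back-N resends the first lost packet and everything sent after it."""
    for position, seq in enumerate(window):
        if seq in lost:
            return len(window) - position
    return 0


def selective_retransmits(window: list[int], lost: set[int]) -> int:
    """With SACK, only the packets that are actually missing are resent."""
    return sum(1 for seq in window if seq in lost)


if __name__ == "__main__":
    window = list(range(100))
    lost = {10}
    print("Go-back-N:", go_back_n_retransmits(window, lost))  # 90
    print("Selective:", selective_retransmits(window, lost))  # 1
```

With one early loss in a 100-packet window, Go-back-N resends 90 packets where selective retransmission resends one. This is why an occasional drop on a lossy fabric becomes tolerable once SACK-style recovery is available.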
The O&M Paradigm Also Changes
The shift to deterministic SRv6 networking also transforms operations and maintenance.
Link Failures Become Routine Events
Traditional datacenters treat link flaps as urgent incidents.
Under MRC:
- Endpoints automatically bypass failed paths
- Traffic self-heals
- Links reintegrate automatically
Many failures no longer require immediate human intervention.
Switch Reboots Become Transparent
Because forwarding logic no longer depends heavily on centralized control-plane convergence:
- Switches can be rebooted live
- Training jobs continue uninterrupted
- Maintenance complexity drops dramatically
Clustermapper Enables Precise Observability
The paper also describes a tool called Clustermapper.
Using deterministic SRv6 paths, it sends probes along exact production forwarding routes.
This provides:
- Precise path observability
- Real forwarding-state verification
- Accurate fault localization
Traditional probabilistic telemetry approaches become unnecessary.
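The article does not describe Clustermapper's internals, so the following is only a sketch of why deterministic paths make fault localization tractable. If every probe's exact link sequence is known, a single faulty link can be isolated by set arithmetic. The link names, the single-fault assumption, and the function itself are hypothetical.

```python
def localize_fault(probes: dict[tuple[str, ...], bool]) -> set[str]:
    """probes maps a path (tuple of link names) to True if the probe came back.

    Returns links that appear on every failed path and on no healthy path.
    Assumes a single faulty link.
    """
    failed = [set(path) for path, ok in probes.items() if not ok]
    if not failed:
        return set()
    healthy_links = set().union(*(set(path) for path, ok in probes.items() if ok))
    suspects = set.intersection(*failed)
    return suspects - healthy_links


if __name__ == "__main__":
    probes = {
        ("h1-t0a", "t0a-t1a", "t1a-t0b"): False,
        ("h1-t0a", "t0a-t1b", "t1b-t0b"): True,
        ("h2-t0a", "t0a-t1a", "t1a-t0c"): False,
    }
    print(localize_fault(probes))  # {'t0a-t1a'}
```

With hash-based ECMP the link sequence of each probe is unknown to the sender, so this kind of direct intersection is not available.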
Conclusion
RFC 9800 Compressed SRv6, combined with MRC, represents a major architectural shift in AI datacenter networking.
Its core principles are remarkably simple:
- Paths are determined by endpoints
- Network states are sensed locally
- Failures are bypassed instantly
This eliminates dependence on:
- ECMP randomness
- Slow control-plane convergence
- PFC-driven lossless fabrics
The impact is profound.
Failure recovery moves from seconds to microseconds.
Network failures evolve from catastrophic cluster events into minor transient bandwidth fluctuations.
Most importantly, this architecture makes stable operation of 100,000-GPU AI superclusters not merely possible, but operationally practical.
As AI infrastructure continues scaling toward ever larger clusters, endpoint-controlled deterministic networking may ultimately become the dominant networking paradigm for the next generation of intelligent computing systems.
Appendix: What Is RFC 9800?
RFC 9800, officially titled "Compressed SRv6 Segment List Encoding", was published by the IETF in June 2025.
The standard defines mechanisms for compressing SRv6 Segment Identifiers, enabling SRv6 deployment at scale inside hardware-forwarded production environments.
RFC 9800 support has already been implemented across:
- Linux Kernel
- SONiC
- Cisco
- Huawei
- Juniper
- Broadcom
- Marvell
- ZTE
- and other major vendors
By the end of 2025, more than 300 WAN deployments using compressed SRv6 had reportedly been completed globally.
The adoption of RFC 9800 inside production AI infrastructure from OpenAI, Microsoft, NVIDIA, AMD, and Broadcom strongly suggests that compressed SRv6 is evolving from a telecom-focused technology into a foundational protocol for hyperscale AI computing.
Notably, RFC 9800's first author is Chinese networking expert Cheng Weiqiang from China Mobile, who also serves as co-chair of the IETF SRv6 OPS Working Group.
The standard's contributor list includes experts from:
- Cisco
- Huawei
- Orange
- Alibaba
- NTT
- ZTE
and multiple other global networking organizations, reflecting the increasingly international nature of next-generation Internet protocol development.