🚀 RDMA Meets the Reality of Modern AI Clusters #
As AI training and inference systems scale out, GPU-to-GPU communication efficiency has become a primary performance bottleneck. Traditional RDMA, particularly RoCEv2, offers low latency and CPU offload, but only under one strict condition: the network must be lossless.
This requirement forces operators to deploy Priority Flow Control (PFC) across the fabric, which introduces head-of-line blocking, complex tuning, and fragility at scale.
To address this, ByteDance has introduced veRoCE, a proprietary transport protocol that remains backward compatible with RoCEv2 while fundamentally removing RDMA’s dependency on lossless Ethernet. veRoCE allows RDMA traffic to operate correctly—even efficiently—on lossy, ECMP-based data center networks.
🧠 Core Design Philosophy: Evolve, Don’t Replace #
veRoCE does not abandon the RDMA programming model. Instead, it preserves:
- The standard ibverbs API
- RDMA semantics (Read / Write / Send)
- Existing NIC offload logic where possible
This ensures that applications and frameworks require minimal or no changes, while the transport layer gains resilience against real-world network behavior.
🧩 Key Architectural Innovations #
Native Out-of-Order (OOO) Delivery #
In standard RoCE, the receiver accepts packets only in PSN order, so a single missing packet stalls every packet behind it. veRoCE removes this bottleneck by enabling Direct Data Placement (DDP) for out-of-order packets:
- Each packet carries precise offset information
- The NIC can place payload data directly into the correct memory location
- No reassembly stall while waiting for earlier packets
This single change unlocks true tolerance to packet reordering.
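To make the placement idea concrete, here is a minimal Python sketch of offset-based placement. It is only an illustration: the `Packet` fields and the `place_packets` helper are hypothetical names, not part of veRoCE or any NIC API, and a real NIC would do this in hardware against registered memory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    psn: int         # packet sequence number
    offset: int      # byte offset within the destination buffer (hypothetical field)
    payload: bytes


def place_packets(buffer: bytearray, packets) -> int:
    """Write each payload at its own offset, in arrival order; return bytes placed."""
    placed = 0
    for pkt in packets:
        end = pkt.offset + len(pkt.payload)
        if pkt.offset < 0 or end > len(buffer):
            raise ValueError(f"PSN {pkt.psn} falls outside the registered buffer")
        buffer[pkt.offset:end] = pkt.payload  # no dependency on earlier PSNs
        placed += len(pkt.payload)
    return placed


message = b"ABCDEFGHIJKL"
mtu = 4  # toy payload size
packets = [
    Packet(psn=i, offset=i * mtu, payload=message[i * mtu:(i + 1) * mtu])
    for i in range(len(message) // mtu)
]
arrival_order = [packets[2], packets[0], packets[1]]  # reordered in the fabric

buf = bytearray(len(message))
placed = place_packets(buf, arrival_order)
print(bytes(buf))  # the buffer ends up identical to the original message
```

Because every packet names its own destination, the receiver never has to hold a later packet while it waits for an earlier one.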
Packet-Level Multi-Pathing #
veRoCE supports packet spraying across multiple ECMP paths:
- Maximizes bandwidth utilization
- Eliminates the need for strict in-order delivery
- Avoids flow pinning that limits scalability in large fabrics
Out-of-order arrival is no longer an error condition—it is expected behavior.
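The contrast with flow pinning can be sketched with a toy model. The article does not say how veRoCE steers individual packets, so this example assumes one common approach: varying an entropy field (here the UDP source port) per packet so that ECMP hashing picks different paths. The hash function is a deliberately simple stand-in, and all names are hypothetical.

```python
from collections import Counter

NUM_PATHS = 4
VEROCE_UDP_PORT = 4794  # destination port named in this article


def toy_ecmp_path(src_port: int, dst_port: int, num_paths: int = NUM_PATHS) -> int:
    """Toy stand-in for a switch ECMP hash; real switches hash the full 5-tuple."""
    return (src_port + dst_port) % num_paths


def pinned_flow(num_packets: int, src_port: int) -> Counter:
    """Flow-level ECMP: a fixed source port pins every packet to one path."""
    return Counter(toy_ecmp_path(src_port, VEROCE_UDP_PORT) for _ in range(num_packets))


def sprayed_flow(num_packets: int, base_src_port: int) -> Counter:
    """Packet spraying: vary the entropy field per packet so paths rotate."""
    return Counter(
        toy_ecmp_path(base_src_port + psn, VEROCE_UDP_PORT) for psn in range(num_packets)
    )


pinned = pinned_flow(8, src_port=49152)
sprayed = sprayed_flow(8, base_src_port=49152)
print("pinned :", dict(pinned))    # all 8 packets on a single path
print("sprayed:", dict(sprayed))   # 2 packets on each of the 4 paths
```

In this toy model the pinned flow can use at most one path's worth of bandwidth, while the sprayed flow spreads evenly, at the cost of packets arriving out of order.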
Selective Retransmission (SACK) #
Instead of RoCE’s coarse Go-Back-N retransmission model, veRoCE introduces Selective Acknowledgment (SACK):
- Only lost packets are retransmitted
- Reduces redundant traffic
- Improves recovery latency under congestion
This makes RDMA behave like a high-performance, hardware-accelerated transport protocol rather than a fragile messaging layer.
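A rough way to see the difference is to count how many packets each scheme resends for the same loss pattern. The sketch below is a simplified model, not veRoCE's implementation: it ignores PSN wraparound and assumes retransmissions themselves are never lost. Function names are illustrative.

```python
def go_back_n_retransmits(sent_psns, lost_psns):
    """Go-Back-N: resend the first lost packet and everything sent after it."""
    if not lost_psns:
        return []
    first_lost = min(lost_psns)
    return [psn for psn in sent_psns if psn >= first_lost]


def selective_retransmits(sent_psns, lost_psns):
    """Selective repeat: resend only the packets reported missing."""
    return [psn for psn in sent_psns if psn in lost_psns]


sent = list(range(100, 116))   # 16 packets in flight
lost = {103, 110}              # two drops inside the window

gbn = go_back_n_retransmits(sent, lost)
sel = selective_retransmits(sent, lost)
print(len(gbn), "packets resent with Go-Back-N")       # 13
print(len(sel), "packets resent with selective ACKs")  # 2
```

With two drops in a 16-packet window, Go-Back-N resends 13 packets where selective retransmission resends 2, and the gap widens as the window grows.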
🧱 Extended Transport Headers #
To support these capabilities, veRoCE extends the RoCEv2 packet format and operates on UDP port 4794 (distinct from RoCEv2’s 4791).
| Header | Purpose |
|---|---|
| BTH | Base Transport Header (destination QP, opcode, PSN); adds a new Retrans bit to mark retransmitted packets |
| MSNETH | Tracks message-level ordering (WQE granularity) |
| POETH | Provides byte-level offsets for DDP placement |
| SACKETH | Carries a 128-bit bitmap indicating missing packets |
| RQETH | Maps out-of-order Send packets to the correct Receive WQE |
These extensions enable reliability without sacrificing hardware efficiency.
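As an illustration of how a 128-bit SACK bitmap could be packed and unpacked, consider the sketch below. The article only states that SACKETH carries a 128-bit bitmap of missing packets; the bit layout here (bit `i` set means PSN `base_psn + i` is missing, big-endian byte order) is an assumption made for the example, and PSN wraparound is ignored.

```python
BITMAP_BITS = 128


def encode_sack_bitmap(base_psn: int, missing_psns) -> bytes:
    """Pack missing PSNs into a 16-byte bitmap relative to base_psn (assumed layout)."""
    bitmap = 0
    for psn in missing_psns:
        index = psn - base_psn
        if not 0 <= index < BITMAP_BITS:
            raise ValueError(f"PSN {psn} is outside the {BITMAP_BITS}-packet window")
        bitmap |= 1 << index
    return bitmap.to_bytes(BITMAP_BITS // 8, "big")


def decode_sack_bitmap(base_psn: int, raw: bytes):
    """Recover the sorted list of missing PSNs from a 16-byte bitmap."""
    if len(raw) != BITMAP_BITS // 8:
        raise ValueError("SACK bitmap must be exactly 16 bytes")
    bitmap = int.from_bytes(raw, "big")
    return [base_psn + i for i in range(BITMAP_BITS) if (bitmap >> i) & 1]


raw = encode_sack_bitmap(1000, [1003, 1010, 1127])
print(len(raw))                        # 16
print(decode_sack_bitmap(1000, raw))   # [1003, 1010, 1127]
```

A fixed-size bitmap keeps the header cheap to parse in hardware, and it bounds how far ahead of the first gap a single SACK can report.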
🔁 Dual-Sequence Reliability Model #
veRoCE separates transport correctness into two layers:
- PSN (Packet Sequence Number)
  - Tracks individual packets
  - Handles loss and reordering
- MSN (Message Sequence Number)
  - Tracks completion of entire RDMA work requests
  - Ensures application-level correctness
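The split between the two sequence spaces can be sketched as a small receiver-side tracker. This is a simplified model under stated assumptions: packets may arrive in any PSN order, but a message is reported complete only when all of its packets have arrived and every earlier MSN has completed. The `MessageTracker` class and its fields are hypothetical.

```python
class MessageTracker:
    """Track per-packet arrival (PSN) and in-order message completion (MSN)."""

    def __init__(self, packets_per_message: dict):
        self.remaining = dict(packets_per_message)  # msn -> packets still missing
        self.seen_psns = set()
        self.next_msn = min(packets_per_message)    # next message allowed to complete
        self.completed = []

    def on_packet(self, psn: int, msn: int):
        """Record one arrival; return the MSNs that complete as a result."""
        if psn in self.seen_psns:
            return []  # duplicate, e.g. a spurious retransmission
        self.seen_psns.add(psn)
        self.remaining[msn] -= 1

        newly_completed = []
        while self.remaining.get(self.next_msn) == 0:
            newly_completed.append(self.next_msn)
            self.next_msn += 1
        self.completed.extend(newly_completed)
        return newly_completed


# Message 0 spans PSN 0-1, message 1 spans PSN 2-3.
tracker = MessageTracker({0: 2, 1: 2})
print(tracker.on_packet(psn=2, msn=1))  # []
print(tracker.on_packet(psn=3, msn=1))  # [] (message 1 is whole, but message 0 is not)
print(tracker.on_packet(psn=0, msn=0))  # []
print(tracker.on_packet(psn=1, msn=0))  # [0, 1]
```

Packet-level bookkeeping absorbs the reordering, while the message-level counter keeps completions in the order the application posted its work requests.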
Smart Retransmission Logic #
- Lazy SACK: SACKs are sent only when disorder exceeds a threshold, avoiding ACK storms
- Fast Retransmit: Upon SACK reception, the sender retransmits only missing PSNs
- RxtPSN Tracking: Prevents duplicate retransmissions within a single RTT
This design minimizes both latency and network overhead.
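The three rules above can be approximated in a few lines. The article does not specify how disorder is measured or how RxtPSN is stored, so this sketch makes two assumptions: disorder is the number of packets received beyond the first gap, and duplicate suppression is modeled as a per-PSN timestamp of the last retransmission. All names and the microsecond clock are illustrative.

```python
def should_send_sack(expected_psn: int, received_psns, threshold: int) -> bool:
    """Lazy SACK: report only once enough packets have arrived past the gap."""
    out_of_order = sum(1 for psn in received_psns if psn > expected_psn)
    return out_of_order > threshold


class RetransmitFilter:
    """Sender side: resend SACK-reported PSNs, but at most once per RTT each."""

    def __init__(self, rtt_us: int):
        self.rtt_us = rtt_us
        self.last_rxt_us = {}  # psn -> time of the last retransmission

    def on_sack(self, missing_psns, now_us: int):
        to_send = []
        for psn in sorted(missing_psns):
            last = self.last_rxt_us.get(psn)
            if last is not None and now_us - last < self.rtt_us:
                continue  # an earlier retransmission may still be in flight
            self.last_rxt_us[psn] = now_us
            to_send.append(psn)
        return to_send


print(should_send_sack(5, {6}, threshold=2))        # False
print(should_send_sack(5, {6, 7, 8}, threshold=2))  # True

sender = RetransmitFilter(rtt_us=10)
print(sender.on_sack({5, 9}, now_us=0))       # [5, 9]
print(sender.on_sack({5, 9, 12}, now_us=4))   # [12]
print(sender.on_sack({9}, now_us=15))         # [9]
```

The second SACK arrives before the first retransmissions could have been acknowledged, so only the newly reported PSN is sent; once a full RTT has passed, a still-missing PSN becomes eligible again.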
🌐 Congestion Awareness Without Fragility #
Packet Trimming #
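veRoCE supports packet trimming, in which a congested switch cuts a packet down to its headers instead of dropping it outright:
- The switch drops the payload but forwards the header
- The receiver detects the trimmed packet
- The receiver immediately returns a Packet Drop NAK, which triggers fast retransmission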
This is significantly faster than waiting for timeout-based recovery.
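A receiver-side check for trimmed packets might look like the sketch below. How veRoCE actually marks or recognizes a trimmed packet is not described in the article; this example assumes the surviving header carries a declared payload length that can be compared with the bytes that actually arrived. The `ReceivedPacket` type and the return values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReceivedPacket:
    psn: int
    declared_len: int  # payload length promised by the header (assumed field)
    payload: bytes     # bytes that actually arrived


def handle_packet(pkt: ReceivedPacket):
    """Return ("NAK", psn) for a trimmed packet, ("ACCEPT", psn) otherwise."""
    if len(pkt.payload) < pkt.declared_len:
        # Header survived but the payload did not: ask for this PSN right away
        # instead of letting the sender discover the loss by timeout.
        return ("NAK", pkt.psn)
    return ("ACCEPT", pkt.psn)


print(handle_packet(ReceivedPacket(psn=7, declared_len=1024, payload=b"")))
print(handle_packet(ReceivedPacket(psn=8, declared_len=4, payload=b"abcd")))
```

The key property is that the loss signal travels with the trimmed header, so the receiver learns exactly which PSN to request about as quickly as it would have received the data.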
Flexible Congestion Control (FCC) #
Congestion signaling is decoupled from reliability mechanisms:
- In-band: BECN bits in ACK or SACK packets
- Out-of-band: Dedicated CNP (Congestion Notification Packets)
This allows modern rate-based congestion control algorithms to coexist with aggressive packet coalescing and batching.
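The decoupling can be illustrated with a toy rate controller that accepts congestion signals from either channel. This is not veRoCE's congestion control algorithm, which the article does not describe; it is a simple halve-on-congestion, add-on-quiet scheme with made-up parameters, shown only to make the in-band and out-of-band paths concrete.

```python
class RateController:
    """Toy rate-based controller: halve on congestion, recover additively."""

    def __init__(self, line_rate_mbps: int, min_rate_mbps: int = 100, step_mbps: int = 5000):
        self.line_rate_mbps = line_rate_mbps
        self.min_rate_mbps = min_rate_mbps
        self.step_mbps = step_mbps
        self.rate_mbps = line_rate_mbps

    def on_congestion_signal(self) -> None:
        self.rate_mbps = max(self.min_rate_mbps, self.rate_mbps // 2)

    def on_quiet_interval(self) -> None:
        self.rate_mbps = min(self.line_rate_mbps, self.rate_mbps + self.step_mbps)


def on_ack_or_sack(ctrl: RateController, becn: bool) -> None:
    """In-band path: the congestion bit rides on an ACK or SACK."""
    if becn:
        ctrl.on_congestion_signal()


def on_cnp(ctrl: RateController) -> None:
    """Out-of-band path: a dedicated congestion notification packet."""
    ctrl.on_congestion_signal()


ctrl = RateController(line_rate_mbps=400_000)
on_ack_or_sack(ctrl, becn=False)
print(ctrl.rate_mbps)  # 400000
on_ack_or_sack(ctrl, becn=True)
print(ctrl.rate_mbps)  # 200000
on_cnp(ctrl)
print(ctrl.rate_mbps)  # 100000
ctrl.on_quiet_interval()
print(ctrl.rate_mbps)  # 105000
```

Both handlers feed the same controller and neither touches retransmission state, which is the sense in which congestion signaling is independent of reliability.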
🧠 Why veRoCE Matters for AI Infrastructure #
veRoCE transforms RDMA from a delicate, lossless-only optimization into a robust, scalable transport suitable for hyperscale AI clusters. By treating packet loss and reordering as normal conditions rather than fatal errors, it aligns RDMA with how modern data center networks actually behave.
For AI workloads where thousands of GPUs communicate continuously across multi-tier fabrics, veRoCE represents a foundational shift:
RDMA without PFC, without fragility, and without sacrificing performance.
In effect, veRoCE redefines RDMA as the communication bedrock for the next generation of AI infrastructure.