ByteDance veRoCE Makes RDMA Work on Lossy Networks

🚀 RDMA Meets the Reality of Modern AI Clusters

As large-scale AI training and inference systems continue to scale out, GPU-to-GPU communication efficiency has become a primary performance bottleneck. Traditional RDMA, particularly RoCEv2, offers low latency and CPU offload, but in practice only under one strict condition: the network must be effectively lossless, because its Go-Back-N loss recovery degrades sharply when packets are dropped.

This requirement forces operators to deploy Priority Flow Control (PFC) everywhere, introducing head-of-line blocking, complex tuning, and fragility at scale.

To address this, ByteDance has introduced veRoCE, a proprietary transport protocol that remains backward compatible with RoCEv2 while removing RDMA’s dependency on lossless Ethernet. veRoCE lets RDMA traffic operate correctly and efficiently on lossy, ECMP-based data center networks.


🧠 Core Design Philosophy: Evolve, Don’t Replace

veRoCE does not abandon the RDMA programming model. Instead, it preserves:

  • The standard verbs API (libibverbs)
  • RDMA semantics (Read / Write / Send)
  • Existing NIC offload logic where possible

This ensures that applications and frameworks require minimal or no changes, while the transport layer gains resilience against real-world network behavior.


🧩 Key Architectural Innovations

Native Out-of-Order (OOO) Delivery

In standard RoCEv2, the receiver expects packets in PSN order, so a single missing packet stalls everything behind it until the sender retransmits. veRoCE removes this bottleneck by enabling Direct Data Placement (DDP) for out-of-order packets:

  • Each packet carries precise offset information
  • The NIC can place payload data directly into the correct memory location
  • No reassembly stall while waiting for earlier packets

This single change unlocks true tolerance to packet reordering.
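
The placement idea can be illustrated with a minimal sketch in Python. It is not veRoCE's implementation: the `(offset, payload)` tuple format and the helper names `place_packet` and `receive_message` are hypothetical, and the sketch assumes no duplicate packets. It only shows that, once every packet carries its own byte offset, payloads can be written to their final location in arrival order without a reordering queue.

```python
def place_packet(buffer: bytearray, offset: int, payload: bytes) -> None:
    """Write one packet's payload at its byte offset in the destination buffer."""
    if offset < 0 or offset + len(payload) > len(buffer):
        raise ValueError("payload falls outside the registered buffer")
    buffer[offset:offset + len(payload)] = payload


def receive_message(total_len: int, packets) -> bytes:
    """Place (offset, payload) packets in arrival order.

    No reassembly queue is kept. Assumption: each packet arrives exactly once,
    so counting received bytes is enough to detect completion.
    """
    buffer = bytearray(total_len)
    received = 0
    for offset, payload in packets:
        place_packet(buffer, offset, payload)
        received += len(payload)
    if received != total_len:
        raise ValueError("message incomplete")
    return bytes(buffer)


# Packets arrive out of order, yet each lands at its final position.
arrivals = [(8, b"lossy "), (0, b"RDMA on "), (14, b"networks")]
message = receive_message(22, arrivals)
print(message)
```

A real NIC would also have to validate memory keys and handle duplicates; those steps are left out here to keep the offset-based placement visible.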

Packet-Level Multi-Pathing
#

veRoCE supports packet spraying across multiple ECMP paths:

  • Maximizes bandwidth utilization
  • Relies on the out-of-order tolerance described above rather than strict in-order delivery
  • Avoids flow pinning that limits scalability in large fabrics

Out-of-order arrival is no longer an error condition—it is expected behavior.
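
To make the contrast with flow pinning concrete, here is a small Python sketch. The round-robin rotation is a simplifying assumption, not veRoCE's actual path-selection policy, and the function names and the example five-tuple are made up for illustration. Classic ECMP hashes the flow's five-tuple, so every packet of a flow takes the same path; spraying varies the choice per packet.

```python
import zlib
from collections import Counter


def flow_hash_path(five_tuple: tuple, num_paths: int) -> int:
    """Classic ECMP: every packet of a flow hashes to the same path."""
    key = repr(five_tuple).encode()
    return zlib.crc32(key) % num_paths


def spray_path(five_tuple: tuple, packet_index: int, num_paths: int) -> int:
    """Packet spraying (simplified assumption): rotate across paths per packet."""
    return (flow_hash_path(five_tuple, num_paths) + packet_index) % num_paths


# Hypothetical flow: (src IP, dst IP, src port, dst port, protocol).
flow = ("10.0.0.1", "10.0.0.2", 49152, 4794, "udp")
pinned = Counter(flow_hash_path(flow, 4) for _ in range(8))
sprayed = Counter(spray_path(flow, i, 4) for i in range(8))
print("paths used with flow hashing:", len(pinned))
print("paths used with spraying:", len(sprayed))
```

With four equal-cost paths, the pinned flow uses one of them while the sprayed flow spreads its eight packets evenly across all four, which is why the receiver must accept reordering.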

Selective Retransmission (SACK)

Instead of RoCE’s coarse Go-Back-N retransmission model, veRoCE introduces Selective Acknowledgment (SACK):

  • Only lost packets are retransmitted
  • Reduces redundant traffic
  • Improves recovery latency under congestion

This makes RDMA behave more like a high-performance, hardware-accelerated transport protocol rather than a fragile messaging layer.
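
The difference in retransmitted volume is easy to see in a toy model. The sketch below is a single-round simplification that ignores PSN wraparound and timing; the function names are hypothetical and do not come from veRoCE.

```python
def go_back_n_resend(sent_psns: list[int], lost: set[int]) -> list[int]:
    """Go-Back-N: resend everything from the first lost PSN onward."""
    if not lost:
        return []
    first_lost = min(lost)
    return [psn for psn in sent_psns if psn >= first_lost]


def selective_resend(sent_psns: list[int], lost: set[int]) -> list[int]:
    """Selective retransmission: resend only the PSNs reported missing."""
    return [psn for psn in sent_psns if psn in lost]


sent = list(range(100))
lost = {10, 50}
gbn = go_back_n_resend(sent, lost)
sel = selective_resend(sent, lost)
print(len(gbn), "packets resent with Go-Back-N")
print(len(sel), "packets resent with selective retransmission")
```

For two losses in a 100-packet window, Go-Back-N resends 90 packets in this model while selective retransmission resends 2.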


🧱 Extended Transport Headers

To support these capabilities, veRoCE extends the RoCEv2 packet format and operates on UDP port 4794 (distinct from RoCEv2’s 4791).

| Header | Purpose |
|---|---|
| BTH | Base transport header (carried over from RoCEv2); includes a new Retrans bit to mark retransmitted packets |
| MSNETH | Tracks message-level ordering (WQE granularity) |
| POETH | Provides byte-level offsets for DDP placement |
| SACKETH | Carries a 128-bit bitmap indicating missing packets |
| RQETH | Maps out-of-order Send packets to the correct Receive WQE |

These extensions enable reliability without sacrificing hardware efficiency.
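
As an illustration of how a fixed 128-bit bitmap can report gaps, here is a Python sketch. Only the bitmap width comes from the table above. The rest is assumed for the example: that the bitmap is interpreted relative to a base PSN, that bit `i` set means `base_psn + i` is missing, and that the field is big-endian. PSN wraparound is ignored.

```python
BITMAP_BITS = 128  # bitmap width stated for SACKETH


def encode_sack(base_psn: int, missing: set[int]) -> bytes:
    """Pack missing PSNs into a 128-bit bitmap (assumed layout: bit i = base_psn + i)."""
    bitmap = 0
    for psn in missing:
        index = psn - base_psn
        if not 0 <= index < BITMAP_BITS:
            raise ValueError(f"PSN {psn} is outside the bitmap window")
        bitmap |= 1 << index
    return bitmap.to_bytes(BITMAP_BITS // 8, "big")


def decode_sack(base_psn: int, raw: bytes) -> list[int]:
    """Recover the sorted list of missing PSNs from the bitmap."""
    bitmap = int.from_bytes(raw, "big")
    return [base_psn + i for i in range(BITMAP_BITS) if (bitmap >> i) & 1]


raw = encode_sack(1000, {1001, 1005, 1127})
print(len(raw), "bytes on the wire")
print(decode_sack(1000, raw))
```

A fixed-width bitmap keeps the header size constant, which suits hardware parsing; the cost is that one SACK can only describe a 128-packet window.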


🔁 Dual-Sequence Reliability Model
#

veRoCE separates transport correctness into two layers:

  1. PSN (Packet Sequence Number)

    • Tracks individual packets
    • Handles loss and reordering
  2. MSN (Message Sequence Number)

    • Tracks completion of entire RDMA work requests
    • Ensures application-level correctness
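
One way to picture the two layers is a receiver that records packets by PSN as they arrive but reports completions by MSN. The Python sketch below is a hypothetical model, not veRoCE's state machine: the `Receiver` class, the per-message packet counts supplied up front, and the rule that completions are surfaced strictly in MSN order are all assumptions made for the example.

```python
class Receiver:
    """Tracks packets by PSN and surfaces message completions in MSN order."""

    def __init__(self, packets_per_message: dict[int, int]):
        self.expected = dict(packets_per_message)   # msn -> number of packets
        self.seen = {msn: set() for msn in self.expected}
        self.completed_msn = -1                      # all MSNs up to this are done

    def on_packet(self, psn: int, msn: int) -> list[int]:
        """Record one packet; return the MSNs that complete as a result."""
        self.seen[msn].add(psn)                      # the set ignores duplicates
        newly_completed = []
        nxt = self.completed_msn + 1
        while nxt in self.expected and len(self.seen[nxt]) == self.expected[nxt]:
            newly_completed.append(nxt)
            self.completed_msn = nxt
            nxt += 1
        return newly_completed


# Message 0 uses PSNs 0-1, message 1 uses PSNs 2-3; message 1 arrives first.
rx = Receiver({0: 2, 1: 2})
events = [rx.on_packet(psn, msn) for psn, msn in [(2, 1), (3, 1), (0, 0), (1, 0)]]
print(events)
```

In this model message 1 is fully received after the second packet, but its completion is held back until message 0 finishes, so the application still sees work requests complete in order.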

Smart Retransmission Logic

  • Lazy SACK: SACKs are sent only when disorder exceeds a threshold, avoiding ACK storms
  • Fast Retransmit: Upon SACK reception, the sender retransmits only missing PSNs
  • RxtPSN Tracking: Prevents duplicate retransmissions within a single RTT

This design minimizes both latency and network overhead.
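
The interplay of these rules can be sketched in Python. Everything below is an assumed model: the article does not specify how the disorder threshold is measured or how RxtPSN state is stored, so the sketch counts buffered out-of-order packets for the threshold and keeps a per-PSN timestamp to suppress repeats within one RTT. The class names are hypothetical.

```python
class LazySackReceiver:
    """Sends a SACK only once enough out-of-order packets have accumulated."""

    def __init__(self, threshold: int):
        self.threshold = threshold
        self.next_expected = 0
        self.out_of_order = set()

    def on_packet(self, psn: int):
        """Return the missing PSNs if a SACK should be sent, else None."""
        if psn == self.next_expected:
            self.next_expected += 1
            while self.next_expected in self.out_of_order:
                self.out_of_order.remove(self.next_expected)
                self.next_expected += 1
        elif psn > self.next_expected:
            self.out_of_order.add(psn)
        if len(self.out_of_order) >= self.threshold:
            highest = max(self.out_of_order)
            return [p for p in range(self.next_expected, highest)
                    if p not in self.out_of_order]
        return None


class RetransmitFilter:
    """Sender side: skip PSNs that were already retransmitted within one RTT."""

    def __init__(self, rtt: float):
        self.rtt = rtt
        self.last_rxt = {}  # psn -> time of the most recent retransmission

    def select(self, missing: list[int], now: float) -> list[int]:
        to_send = []
        for psn in missing:
            last = self.last_rxt.get(psn)
            if last is None or now - last >= self.rtt:
                self.last_rxt[psn] = now
                to_send.append(psn)
        return to_send


rx = LazySackReceiver(threshold=3)
sacks = [rx.on_packet(psn) for psn in [0, 2, 3, 4, 1]]
tx = RetransmitFilter(rtt=10.0)
first = tx.select([1], now=0.0)
second = tx.select([1, 5], now=4.0)
print(sacks, first, second)
```

Here the receiver stays silent until three packets are buffered out of order, and the sender ignores a repeated report for PSN 1 that arrives 4 time units after it was already retransmitted.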


🌐 Congestion Awareness Without Fragility

Packet Trimming

veRoCE introduces support for Packet Trimming in congested switches:

  • Payload is dropped, header is preserved
  • Receiver detects partial packets
  • Immediate Packet Drop NAK triggers fast retransmission

This is significantly faster than waiting for timeout-based recovery.
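
A simplified Python model of the receive path shows why trimming speeds up recovery. The `Packet` fields and the detection rule (payload shorter than the length declared in the header) are assumptions for illustration; the article does not say how veRoCE marks or detects trimmed packets.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    psn: int
    declared_len: int  # payload length recorded in the header
    payload: bytes     # empty if a switch trimmed the packet


def trim(packet: Packet) -> Packet:
    """What a congested switch does in this model: keep the header, drop the payload."""
    return Packet(packet.psn, packet.declared_len, b"")


def on_receive(packet: Packet) -> tuple[str, int]:
    """Receiver: a header without its full payload triggers an immediate NAK."""
    if len(packet.payload) < packet.declared_len:
        return ("NAK", packet.psn)
    return ("ACK", packet.psn)


original = Packet(psn=42, declared_len=4, payload=b"data")
print(on_receive(original))
print(on_receive(trim(original)))
```

Because the header survives, the receiver learns exactly which PSN was affected as soon as the trimmed packet arrives, instead of inferring the loss from a gap or a timer.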

Flexible Congestion Control (FCC)

Congestion signaling is decoupled from reliability mechanisms:

  • In-band: BECN bits in ACK or SACK packets
  • Out-of-band: Dedicated CNP (Congestion Notification Packets)

This allows modern rate-based congestion control algorithms to coexist with aggressive packet coalescing and batching.
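
The decoupling can be pictured as a rate controller that does not care how a congestion signal reached it. The Python sketch below uses a generic multiplicative-decrease, additive-increase rule chosen purely for illustration; it is not veRoCE's algorithm, and the class name, the signal labels `"becn"` and `"cnp"`, and all constants are hypothetical.

```python
class RateController:
    """Generic rate-based controller fed by in-band or out-of-band signals."""

    def __init__(self, line_rate_gbps: float, min_rate_gbps: float = 1.0):
        self.line_rate = line_rate_gbps
        self.min_rate = min_rate_gbps
        self.rate = line_rate_gbps

    def on_congestion_signal(self, source: str) -> float:
        """Halve the rate; 'becn' (in-band) and 'cnp' (out-of-band) are treated alike."""
        if source not in ("becn", "cnp"):
            raise ValueError(f"unknown congestion signal source: {source}")
        self.rate = max(self.min_rate, self.rate * 0.5)
        return self.rate

    def on_quiet_interval(self, step_gbps: float = 5.0) -> float:
        """Recover additively when no congestion signal arrived in the last interval."""
        self.rate = min(self.line_rate, self.rate + step_gbps)
        return self.rate


rc = RateController(line_rate_gbps=100.0)
after_becn = rc.on_congestion_signal("becn")
after_cnp = rc.on_congestion_signal("cnp")
after_quiet = rc.on_quiet_interval()
print(after_becn, after_cnp, after_quiet)
```

Since the controller only consumes signals, the reliability path is free to coalesce or delay ACKs without starving congestion control of feedback.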


🧠 Why veRoCE Matters for AI Infrastructure

veRoCE transforms RDMA from a delicate, lossless-only optimization into a robust, scalable transport suitable for hyperscale AI clusters. By treating packet loss and reordering as normal conditions rather than fatal errors, it aligns RDMA with how modern data center networks actually behave.

For AI workloads where thousands of GPUs communicate continuously across multi-tier fabrics, veRoCE represents a foundational shift:
RDMA without PFC, without fragility, and without sacrificing performance.

In effect, veRoCE redefines RDMA as the communication bedrock for the next generation of AI infrastructure.
