
RDMA Explained: How Remote Direct Memory Access Works

·653 words·4 mins
AI RDMA HPC RoCEv2 Networking

Remote Direct Memory Access (RDMA) is a key technology in High-Performance Computing (HPC), enabling ultra-fast and efficient data transfer between compute nodes. With RDMA over Converged Ethernet v2 (RoCEv2), organizations can achieve low-latency, lossless communication across data centers and AI workloads.

This article explains how RDMA works step-by-step — from memory registration to RDMA write operations — with diagrams to help you understand the process.


Figure 1: Fine-Grained DRAM High-Level Architecture


What Is RDMA and Why Does It Matter?

RDMA allows direct access to memory on remote systems without involving the CPU or OS, which significantly reduces latency and improves bandwidth efficiency. This is critical for:

  • AI and machine learning training (fast dataset transfer)
  • High-performance computing clusters (efficient inter-node communication)
  • Data-intensive workloads like real-time analytics, NVMe-over-Fabrics, and distributed storage

RoCEv2 carries RDMA traffic over UDP/IP, which provides no congestion handling of its own, so packet loss caused by congestion must be prevented in the network itself. To maintain lossless data delivery, modern networks use:

  • Priority Flow Control (PFC) – prevents buffer overflow
  • Explicit Congestion Notification (ECN) – signals senders to slow down

RDMA Process Overview

When a client compute node (CCN) writes data to a server compute node (SCN), the process involves four main steps:

  1. Memory Allocation and Registration
  2. Queue Pair Creation
  3. Connection Initialization
  4. RDMA Write Operation

Each step is explained in detail below.


Step 1: Memory Allocation and Registration

  • Allocate a Protection Domain (PD), similar to a tenant or VRF in networking.
  • Register memory blocks, defining size and access permissions.
  • Receive keys:
    • L_Key (Local Key) for local access
    • R_Key (Remote Key) for remote write access

In our example:

  • CCN memory → local read access
  • SCN memory → remote write access
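
The snippet below is a minimal libibverbs sketch of this step in C (the standard RDMA verbs API). The device choice, buffer size, and access flags are illustrative assumptions, and error handling is trimmed for readability.

```c
/* Sketch of Step 1 with libibverbs: allocate a Protection Domain and
 * register a memory region. Buffer size and device choice are illustrative;
 * error handling is abbreviated for clarity. */
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>

#define BUF_SIZE (1 << 20)              /* 1 MiB example buffer */

int main(void)
{
    struct ibv_device **devs = ibv_get_device_list(NULL);
    struct ibv_context *ctx  = ibv_open_device(devs[0]);   /* first RDMA NIC */

    /* Protection Domain: scopes QPs and memory regions (tenant/VRF analogy) */
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* Register the buffer; REMOTE_WRITE is what the SCN grants in this example */
    void *buf = malloc(BUF_SIZE);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, BUF_SIZE,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);

    /* The NIC returns the keys described above */
    printf("L_Key = 0x%x, R_Key = 0x%x\n", mr->lkey, mr->rkey);

    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    free(buf);
    return 0;
}
```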


Figure 2: Memory Allocation and Registration


Step 2: Queue Pair (QP) Creation

  • A Queue Pair (QP) = Send Queue + Receive Queue.
  • A Completion Queue (CQ) reports operation status.
  • Each QP is assigned a service type (Reliable Connection or Unreliable Datagram).

For reliable data transfer, we use RC (Reliable Connection).

  • Bind QP to PD and memory region.
  • Assign a Partition Key (P_Key), similar to VXLAN VNI.

Example:

  • CCN QP ID: 0x12345678
  • Associated P_Key: 0x8012
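
Continuing the sketch, this is roughly how the CQ and an RC Queue Pair are created with libibverbs. Queue depths and SGE limits are illustrative assumptions; note that the P_Key index is actually programmed later, when the QP is moved out of the RESET state (Step 3).

```c
/* Sketch: create a Completion Queue and a Reliable Connection QP bound to
 * the PD from Step 1. Queue depths and SGE limits are illustrative. */
#include <infiniband/verbs.h>

static struct ibv_qp *create_rc_qp(struct ibv_context *ctx, struct ibv_pd *pd,
                                   struct ibv_cq **cq_out)
{
    /* Completion Queue: reports the status of finished work requests */
    struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);

    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .qp_type = IBV_QPT_RC,            /* Reliable Connection service type */
        .cap     = {
            .max_send_wr  = 16,
            .max_recv_wr  = 16,
            .max_send_sge = 1,
            .max_recv_sge = 1,
        },
    };

    /* Binding to the PD happens here; qp->qp_num is the QP ID
     * (e.g. 0x12345678) exchanged during connection setup. */
    struct ibv_qp *qp = ibv_create_qp(pd, &attr);

    *cq_out = cq;
    return qp;
}
```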


Figure 3: Queue Pair Creation


Step 3: RDMA Connection Initialization

Connection setup involves REQ → REP → RTU messages:

  • REQ (Request): CCN sends its Local ID, QP number, P_Key, and starting PSN.
  • REP (Reply): SCN responds with its own IDs, QP info, and PSN.
  • RTU (Ready to Use): CCN confirms the connection.

At the end of this exchange, the QP state transitions from INIT → Ready to Receive (RTR) → Ready to Send (RTS).
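
In verbs terms, the parameters carried by REQ/REP/RTU are what drive those state transitions via ibv_modify_qp. The sketch below is illustrative only: the remote QP number, PSNs, and RoCEv2 path (GID/IP) information are placeholders that a real application obtains from the exchange, and the timer/retry values are common defaults.

```c
/* Sketch: drive the QP through INIT -> RTR -> RTS using parameters learned
 * from the REQ/REP/RTU exchange. All arguments are placeholders supplied by
 * connection setup; timer/retry values are typical defaults. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

static int connect_qp(struct ibv_qp *qp, uint32_t remote_qpn,
                      uint32_t remote_psn, uint32_t local_psn,
                      struct ibv_ah_attr *path)   /* RoCEv2 GID/IP path info */
{
    struct ibv_qp_attr attr;

    /* RESET -> INIT: bind the port, P_Key index and remote access rights */
    memset(&attr, 0, sizeof(attr));
    attr.qp_state        = IBV_QPS_INIT;
    attr.port_num        = 1;
    attr.pkey_index      = 0;
    attr.qp_access_flags = IBV_ACCESS_REMOTE_WRITE;
    if (ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_PORT |
                      IBV_QP_PKEY_INDEX | IBV_QP_ACCESS_FLAGS))
        return -1;

    /* INIT -> RTR (Ready to Receive): program the peer QP number and PSN */
    memset(&attr, 0, sizeof(attr));
    attr.qp_state           = IBV_QPS_RTR;
    attr.path_mtu           = IBV_MTU_1024;
    attr.dest_qp_num        = remote_qpn;
    attr.rq_psn             = remote_psn;
    attr.max_dest_rd_atomic = 1;
    attr.min_rnr_timer      = 12;
    attr.ah_attr            = *path;
    if (ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                      IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                      IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER))
        return -1;

    /* RTR -> RTS (Ready to Send): the QP can now post RDMA Writes */
    memset(&attr, 0, sizeof(attr));
    attr.qp_state      = IBV_QPS_RTS;
    attr.sq_psn        = local_psn;
    attr.timeout       = 14;
    attr.retry_cnt     = 7;
    attr.rnr_retry     = 7;
    attr.max_rd_atomic = 1;
    return ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_SQ_PSN |
                         IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
                         IBV_QP_RNR_RETRY | IBV_QP_MAX_QP_RD_ATOMIC);
}
```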


Figure 4: RDMA Connection Initialization


Step 4: RDMA Write Operation

Once connected, the CCN application issues a Work Request (WR) containing:

  • OpCode: RDMA Write
  • Local buffer address + L_Key
  • Remote buffer address + R_Key
  • Payload length

The NIC then builds the required headers:

  • InfiniBand Base Transport Header (IB BTH) – includes P_Key and QP ID
  • RDMA Extended Transport Header (RETH) – includes R_Key and data length
  • UDP header (destination port 4791) – identifies the packet as RoCEv2, signaling that an IB BTH follows

Data is encapsulated and sent over Ethernet/IP/UDP/IB BTH/RETH.
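
A minimal sketch of how such a Work Request can be posted with libibverbs follows; the helper name and parameters are illustrative, and the remote address and R_Key are assumed to have been exchanged earlier.

```c
/* Sketch: post the RDMA Write Work Request described above. remote_addr and
 * rkey are the SCN buffer address and R_Key learned during setup. */
#include <infiniband/verbs.h>
#include <stdint.h>

static int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
                           void *local_buf, uint32_t len,
                           uint64_t remote_addr, uint32_t rkey)
{
    /* Scatter/gather entry: local buffer address, payload length, and L_Key */
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = len,
        .lkey   = mr->lkey,
    };

    struct ibv_send_wr wr = {
        .opcode     = IBV_WR_RDMA_WRITE,   /* OpCode: RDMA Write */
        .sg_list    = &sge,
        .num_sge    = 1,
        .send_flags = IBV_SEND_SIGNALED,   /* ask for a completion entry */
    };
    wr.wr.rdma.remote_addr = remote_addr;  /* SCN buffer address */
    wr.wr.rdma.rkey        = rkey;         /* R_Key granted by the SCN */

    struct ibv_send_wr *bad_wr = NULL;
    return ibv_post_send(qp, &wr, &bad_wr);
}
```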


Figure 5: Generating and Posting an RDMA Write Operation

On the SCN side:

  • Validate the P_Key and R_Key
  • Translate the virtual address to physical memory
  • DMA the payload directly into the registered memory region
  • Notify the application via the completion queue (when the write carries immediate data)
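
Either side learns that a signaled operation has finished by polling its Completion Queue. A rough sketch (busy-polling for simplicity; the helper name is illustrative):

```c
/* Sketch: harvest one work completion from a Completion Queue by busy-polling.
 * Real applications often use completion channels/events instead of spinning. */
#include <infiniband/verbs.h>
#include <stdio.h>

static int wait_for_completion(struct ibv_cq *cq)
{
    struct ibv_wc wc;
    int n;

    /* Spin until exactly one work completion becomes available */
    do {
        n = ibv_poll_cq(cq, 1, &wc);
    } while (n == 0);

    if (n < 0 || wc.status != IBV_WC_SUCCESS) {
        fprintf(stderr, "operation failed: %s\n",
                n < 0 ? "poll error" : ibv_wc_status_str(wc.status));
        return -1;
    }
    return 0;   /* e.g. the RDMA Write has been placed in remote memory */
}
```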


Figure 6: Receiving and Processing an RDMA Write Operation


Key Benefits of RDMA

  • 🚀 Ultra-low latency data transfer
  • ⚡ High bandwidth efficiency for HPC and AI
  • 🔒 Bypasses CPU/OS overhead for direct memory access
  • 📡 Lossless transport with PFC + ECN
  • 🔄 Scalable design for large clusters and data centers

Conclusion

RDMA technology, especially with RoCEv2 over IP Fabrics, is essential for modern data-intensive workloads. By offloading memory operations from CPUs and enabling direct memory access across compute nodes, RDMA improves performance in AI training, big data analytics, distributed storage, and cloud-scale HPC systems.

Organizations adopting RDMA can expect:

  • Faster application performance
  • Lower latency in AI/ML pipelines
  • Better efficiency in multi-node HPC environments

🔗 Original article: Detailed Explanation of the RDMA Working Process
