Remote Direct Memory Access (RDMA) is a key technology in High-Performance Computing (HPC), enabling ultra-fast and efficient data transfer between compute nodes. With RDMA over Converged Ethernet v2 (RoCEv2), organizations can achieve low-latency, lossless communication for data-center and AI workloads.
This article explains how RDMA works step-by-step — from memory registration to RDMA write operations — with diagrams to help you understand the process.
Figure 1: Fine-Grained DRAM High-Level Architecture
What Is RDMA and Why Does It Matter? #
RDMA allows direct access to memory on remote systems without involving the CPU or OS, which significantly reduces latency and improves bandwidth efficiency. This is critical for:
- AI and machine learning training (fast dataset transfer)
- High-performance computing clusters (efficient inter-node communication)
- Data-intensive workloads like real-time analytics, NVMe-over-Fabrics, and distributed storage
 
RoCEv2 runs over UDP transport, which provides no built-in recovery from packet loss, so drops caused by congestion must be prevented in the network itself. To keep data delivery lossless, modern fabrics rely on:
- Priority Flow Control (PFC) – prevents buffer overflow
- Explicit Congestion Notification (ECN) – signals senders to slow down
 
RDMA Process Overview #
When a client compute node (CCN) writes data to a server compute node (SCN), the process involves four main steps:
- Memory Allocation and Registration
- Queue Pair Creation
- Connection Initialization
- RDMA Write Operation
 
Each step is explained in detail below.
Step 1: Memory Allocation and Registration #
- Allocate a Protection Domain (PD), similar to a tenant or VRF in networking.
- Register memory blocks, defining their size and access permissions.
- Receive keys:
  - L_Key (Local Key) for local access
  - R_Key (Remote Key) for remote write access
 
In our example (a verbs-level sketch follows Figure 2):
- CCN memory → local read access
- SCN memory → remote write access
 
Figure 2: Memory Allocation and Registration
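The sketch below shows this step with the libibverbs API that RoCEv2 NICs commonly expose. It is a minimal illustration, not the article's reference implementation: device index, buffer size, and access flags are assumptions, and error handling is omitted.

```c
/* Step 1 sketch: allocate a Protection Domain and register a memory region. */
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    struct ibv_device **dev_list = ibv_get_device_list(NULL);
    struct ibv_context *ctx = ibv_open_device(dev_list[0]);   /* first RDMA device (illustrative) */

    /* Protection Domain: groups QPs and memory regions, much like a tenant/VRF. */
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* Register a buffer; the NIC returns an L_Key for local access and, because
     * IBV_ACCESS_REMOTE_WRITE is requested, an R_Key usable by the remote peer. */
    size_t len = 4096;                       /* illustrative size */
    void *buf = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);

    printf("L_Key=0x%x R_Key=0x%x\n", mr->lkey, mr->rkey);

    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(dev_list);
    free(buf);
    return 0;
}
```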
Step 2: Queue Pair (QP) Creation #
- A Queue Pair (QP) = Send Queue + Receive Queue.
- A Completion Queue (CQ) reports operation status.
- Each QP is assigned a service type (Reliable Connection or Unreliable Datagram).
 
For reliable data transfer, we use RC (Reliable Connection).
- Bind the QP to the PD and the registered memory region.
- Assign a Partition Key (P_Key), similar to a VXLAN VNI.
 
Example (a code sketch follows Figure 3):
- CCN QP ID: 0x12345678
- Associated P_Key: 0x8012
Figure 3: Queue Pair Creation
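A minimal continuation of the Step 1 sketch for this step, assuming the ctx and pd variables from above; the queue depths are illustrative, not recommendations.

```c
/* Step 2 sketch: create a Completion Queue and an RC Queue Pair bound to the PD. */
struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);

struct ibv_qp_init_attr qp_init = {
    .send_cq = cq,                 /* completions for send / RDMA write WRs */
    .recv_cq = cq,                 /* completions for receive WRs           */
    .qp_type = IBV_QPT_RC,         /* Reliable Connection service type      */
    .cap     = {
        .max_send_wr  = 16,        /* queue depths are illustrative         */
        .max_recv_wr  = 16,
        .max_send_sge = 1,
        .max_recv_sge = 1,
    },
};
struct ibv_qp *qp = ibv_create_qp(pd, &qp_init);

/* qp->qp_num is the QP ID (0x12345678 in the example above); the P_Key is
 * selected by index when the QP is moved to the INIT state in Step 3. */
```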
Step 3: RDMA Connection Initialization #
Connection setup involves a REQ → REP → RTU message exchange:
- REQ (Request): the CCN sends its Local ID, QP number, P_Key, and starting PSN.
- REP (Reply): the SCN responds with its own IDs, QP information, and PSN.
- RTU (Ready to Use): the CCN confirms the connection.
 
At the end of this exchange, the QP state transitions from INIT → RTR (Ready to Receive) → RTS (Ready to Send); a modify-QP sketch follows Figure 4.
Figure 4: RDMA Connection Initialization
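A sketch of those state transitions using ibv_modify_qp, continuing the earlier snippets. Here remote_qpn, remote_psn, and remote_gid stand for values learned from the REQ/REP messages, local_psn is the CCN's own starting PSN, and the MTU, timer, and retry values are illustrative.

```c
/* RESET -> INIT: bind to a port, select the P_Key by index, grant remote-write access. */
struct ibv_qp_attr init = {
    .qp_state        = IBV_QPS_INIT,
    .pkey_index      = 0,                    /* index of the partition key, e.g. 0x8012 */
    .port_num        = 1,
    .qp_access_flags = IBV_ACCESS_REMOTE_WRITE,
};
ibv_modify_qp(qp, &init, IBV_QP_STATE | IBV_QP_PKEY_INDEX |
                         IBV_QP_PORT  | IBV_QP_ACCESS_FLAGS);

/* INIT -> RTR (Ready to Receive): needs the peer's QP number, PSN and GID. */
struct ibv_qp_attr rtr = {
    .qp_state           = IBV_QPS_RTR,
    .path_mtu           = IBV_MTU_1024,
    .dest_qp_num        = remote_qpn,
    .rq_psn             = remote_psn,
    .max_dest_rd_atomic = 1,
    .min_rnr_timer      = 12,
    .ah_attr = {
        .is_global = 1,                      /* RoCEv2 addresses peers by GID */
        .grh       = { .dgid = remote_gid, .sgid_index = 0, .hop_limit = 64 },
        .port_num  = 1,
    },
};
ibv_modify_qp(qp, &rtr, IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                        IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                        IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER);

/* RTR -> RTS (Ready to Send): local send PSN plus retry/timeout policy. */
struct ibv_qp_attr rts = {
    .qp_state      = IBV_QPS_RTS,
    .sq_psn        = local_psn,
    .timeout       = 14,
    .retry_cnt     = 7,
    .rnr_retry     = 7,
    .max_rd_atomic = 1,
};
ibv_modify_qp(qp, &rts, IBV_QP_STATE | IBV_QP_SQ_PSN | IBV_QP_TIMEOUT |
                        IBV_QP_RETRY_CNT | IBV_QP_RNR_RETRY |
                        IBV_QP_MAX_QP_RD_ATOMIC);
```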
Step 4: RDMA Write Operation #
Once connected, the CCN application issues a Work Request (WR) containing (see the sketch after this list):
- OpCode: RDMA Write
- Local buffer address + L_Key
- Remote buffer address + R_Key
- Payload length
 
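A sketch of posting such a Work Request with ibv_post_send, continuing the earlier snippets; buf and mr come from Step 1, while remote_addr and remote_rkey are assumed to have been learned from the SCN out of band.

```c
/* Scatter/gather entry describing the local source buffer. */
struct ibv_sge sge = {
    .addr   = (uint64_t)(uintptr_t)buf,    /* local buffer address          */
    .length = 4096,                        /* payload length (illustrative) */
    .lkey   = mr->lkey,                    /* L_Key from Step 1             */
};

/* Work Request: RDMA Write to the remote buffer identified by address + R_Key. */
struct ibv_send_wr wr = {
    .wr_id               = 1,
    .sg_list             = &sge,
    .num_sge             = 1,
    .opcode              = IBV_WR_RDMA_WRITE,
    .send_flags          = IBV_SEND_SIGNALED,   /* ask for a completion entry */
    .wr.rdma.remote_addr = remote_addr,         /* SCN buffer address          */
    .wr.rdma.rkey        = remote_rkey,         /* R_Key granting remote write */
};

struct ibv_send_wr *bad_wr = NULL;
ibv_post_send(qp, &wr, &bad_wr);                /* hand the WR to the NIC */
```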
The NIC then builds the required headers:
- InfiniBand Base Transport Header (IB BTH) – carries the P_Key and destination QP number
- RDMA Extended Transport Header (RETH) – carries the remote address, R_Key, and data length
- UDP header with destination port 4791, which indicates that an IB BTH follows
 
Data is encapsulated and sent over Ethernet/IP/UDP/IB BTH/RETH.
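For reference, here is an illustrative view of the two RoCEv2-specific headers that follow the UDP header (destination port 4791). The struct names are ours and packing directives are omitted; the field widths follow the IBTA BTH and RETH definitions (12 and 16 bytes respectively), with all multi-byte fields big-endian on the wire.

```c
#include <stdint.h>

/* Base Transport Header (IB BTH), 12 bytes. */
struct ib_bth {
    uint8_t  opcode;        /* e.g. RC RDMA WRITE Only                        */
    uint8_t  flags;         /* solicited event, migration, pad count, version */
    uint16_t pkey;          /* Partition Key, e.g. 0x8012                     */
    uint32_t dest_qp;       /* 8 reserved bits + 24-bit destination QP number */
    uint32_t psn;           /* ack-request bit + 7 reserved + 24-bit PSN      */
};

/* RDMA Extended Transport Header (RETH), 16 bytes. */
struct ib_reth {
    uint64_t remote_va;     /* remote virtual address to write to */
    uint32_t rkey;          /* R_Key authorizing the write        */
    uint32_t dma_length;    /* payload length in bytes            */
};
```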
Figure 5: Generating and Posting an RDMA Write Operation
On the SCN side, the NIC:
- Validates the P_Key and R_Key
- Translates the virtual address to a physical memory address
- Writes the data into the registered memory region targeted by the QP
- Notifies the application via the completion queue (see the polling sketch after Figure 6)
 
Figure 6: Receiving and Processing an RDMA Write Operation
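For completeness, a sketch of how an application reaps a completion with ibv_poll_cq. Note that for a plain RDMA Write the work completion appears on the sender's (CCN's) completion queue; the SCN application is notified this way only when the sender uses RDMA Write with Immediate, which also consumes a posted receive.

```c
/* Busy-poll the completion queue from Step 2 for a single work completion. */
struct ibv_wc wc;
int n;
do {
    n = ibv_poll_cq(cq, 1, &wc);       /* non-blocking; returns number of entries */
} while (n == 0);

if (n < 0 || wc.status != IBV_WC_SUCCESS)
    fprintf(stderr, "WR %llu failed: %s\n",
            (unsigned long long)wc.wr_id, ibv_wc_status_str(wc.status));
```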
Key Benefits of RDMA #
- 🚀 Ultra-low latency data transfer
- ⚡ High bandwidth efficiency for HPC and AI
- 🔒 Bypasses CPU/OS overhead for direct memory access
- 📡 Lossless transport with PFC + ECN
- 🔄 Scalable design for large clusters and data centers
 
Conclusion #
RDMA technology, especially with RoCEv2 over IP Fabrics, is essential for modern data-intensive workloads. By offloading memory operations from CPUs and enabling direct memory access across compute nodes, RDMA improves performance in AI training, big data analytics, distributed storage, and cloud-scale HPC systems.
Organizations adopting RDMA can expect:
- Faster application performance
- Lower latency in AI/ML pipelines
- Better efficiency in multi-node HPC environments
 
🔗 Original article: Detailed Explanation of the RDMA Working Process