
RDMA Explained: How Remote Direct Memory Access Works


Remote Direct Memory Access (RDMA) is a foundational technology in high-performance computing (HPC) and modern AI infrastructure, enabling extremely fast data movement between compute nodes with minimal CPU involvement. When implemented using RoCEv2 (RDMA over Converged Ethernet v2), it offers low latency, high throughput, and lossless transport across large-scale data center networks.

This guide provides a step-by-step walkthrough of how RDMA works — from memory registration to queue pairs, connection establishment, and RDMA write operations.

Figure 1: Fine-Grained DRAM High-Level Architecture


🚀 What Is RDMA and Why Does It Matter?

RDMA allows one machine to read or write memory on another machine without involving the target CPU or OS kernel. This eliminates memory copies, reduces overhead, and achieves microsecond-level latency.

RDMA is essential for:

  • AI and ML training pipelines
  • HPC clusters and supercomputers
  • Distributed storage systems (Ceph, DAOS, NVMe-oF)
  • Real-time data analytics and streaming workloads

Because RoCEv2 runs over UDP, it depends on the underlying network to provide lossless delivery using:

  • Priority Flow Control (PFC) — prevents packet drops during congestion
  • Explicit Congestion Notification (ECN) — signals senders to reduce rate

🔧 RDMA Workflow Overview

A typical RDMA write from a Client Compute Node (CCN) to a Server Compute Node (SCN) involves four stages:

  1. Memory allocation & registration
  2. Queue Pair (QP) creation
  3. Connection initialization
  4. RDMA write operation

Each stage is explained in detail below.


🧩 Step 1: Memory Allocation and Registration

To ensure safe and controlled access, RDMA requires applications to register memory regions with the NIC.

Key steps:

  • Create a Protection Domain (PD)
  • Register memory regions with size and access rights
  • Obtain:
    • L_Key (Local Key) — authorizes local NIC access
    • R_Key (Remote Key) — authorizes remote RDMA write/read

In this example:

  • CCN memory → local read access
  • SCN memory → remote write access

Figure 2: Memory Allocation and Registration
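For readers who want to see what this step looks like in code, here is a minimal sketch using the libibverbs C API. The buffer size, the access flags, and the assumption that a device context (ctx) has already been opened with ibv_open_device are illustrative, not details from this article.

/* Illustrative sketch: allocate a Protection Domain and register a buffer.
 * 'ctx' is assumed to be an already-opened device context. */
#include <infiniband/verbs.h>
#include <stdlib.h>

struct ibv_mr *register_buffer(struct ibv_context *ctx, size_t len)
{
    struct ibv_pd *pd = ibv_alloc_pd(ctx);          /* Protection Domain */
    if (!pd)
        return NULL;

    void *buf = malloc(len);                         /* memory to expose */
    if (!buf)
        return NULL;

    /* The access rights decide what a peer holding the R_Key may do.
     * For the SCN buffer in this example: allow remote RDMA writes. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);

    /* mr->lkey is the L_Key (authorizes the local NIC),
     * mr->rkey is the R_Key handed to the peer for remote access. */
    return mr;
}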


🧱 Step 2: Queue Pair (QP) Creation

A Queue Pair contains:

  • Send Queue (SQ)
  • Receive Queue (RQ)
  • A shared Completion Queue (CQ) for reporting results

QP attributes:

  • Tied to a PD (and, through it, to the registered memory regions)
  • Uses service types (Reliable Connection commonly used)
  • Assigned a Partition Key (P_Key) — similar to a VXLAN VNI for isolation

Example values:

  • QP ID: 0x12345678
  • P_Key: 0x8012

Figure 3: Queue Pair Creation
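A rough libibverbs sketch of this step is shown below; the queue depths and the choice to share one CQ between the send and receive queues are illustrative assumptions. The P_Key index is applied later, when the QP is moved out of the RESET state.

/* Illustrative sketch: create a CQ and a Reliable Connection (RC) QP.
 * 'ctx' and 'pd' are assumed to exist from the previous step. */
#include <infiniband/verbs.h>

struct ibv_qp *create_qp(struct ibv_context *ctx, struct ibv_pd *pd)
{
    /* One CQ shared by the Send and Receive Queues for completion reports. */
    struct ibv_cq *cq = ibv_create_cq(ctx, 128, NULL, NULL, 0);
    if (!cq)
        return NULL;

    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .qp_type = IBV_QPT_RC,              /* Reliable Connection service */
        .cap = {
            .max_send_wr  = 64,             /* Send Queue depth */
            .max_recv_wr  = 64,             /* Receive Queue depth */
            .max_send_sge = 1,
            .max_recv_sge = 1,
        },
    };
    return ibv_create_qp(pd, &attr);        /* QP number assigned by the NIC */
}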


🔗 Step 3: RDMA Connection Initialization

RDMA connection setup uses a three-message handshake:

  1. REQ (Request)
  2. REP (Reply)
  3. RTU (Ready to Use)

The messages exchange:

  • Local IDs
  • QP numbers
  • P_Key
  • PSN (Packet Sequence Number)

After handshake:

  • QP transitions: INIT → RTR → RTS
    • (Ready to Receive / Ready to Send)

Figure 4: RDMA Connection Initialization
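In practice, the state transitions are driven with ibv_modify_qp once the handshake has delivered the peer's QP number, PSN, and address. The sketch below is a simplified outline; the MTU, timer, and retry values and the hard-coded P_Key/GID indexes are placeholders, not values from this article.

/* Illustrative sketch: drive an RC QP through INIT -> RTR -> RTS.
 * remote_qpn, remote_psn, local_psn and remote_gid would come from the
 * REQ/REP/RTU exchange. */
#include <infiniband/verbs.h>
#include <string.h>
#include <stdint.h>

int connect_qp(struct ibv_qp *qp, uint32_t remote_qpn,
               uint32_t remote_psn, uint32_t local_psn,
               union ibv_gid remote_gid)
{
    struct ibv_qp_attr attr = {
        .qp_state        = IBV_QPS_INIT,
        .pkey_index      = 0,                 /* selects the P_Key */
        .port_num        = 1,
        .qp_access_flags = IBV_ACCESS_REMOTE_WRITE,
    };
    if (ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_PKEY_INDEX |
                      IBV_QP_PORT | IBV_QP_ACCESS_FLAGS))
        return -1;

    memset(&attr, 0, sizeof(attr));
    attr.qp_state              = IBV_QPS_RTR;   /* Ready to Receive */
    attr.path_mtu              = IBV_MTU_1024;
    attr.dest_qp_num           = remote_qpn;
    attr.rq_psn                = remote_psn;
    attr.max_dest_rd_atomic    = 1;
    attr.min_rnr_timer         = 12;
    attr.ah_attr.is_global     = 1;             /* RoCEv2 routes via GID/IP */
    attr.ah_attr.grh.dgid      = remote_gid;
    attr.ah_attr.grh.sgid_index = 0;
    attr.ah_attr.grh.hop_limit = 64;
    attr.ah_attr.port_num      = 1;
    if (ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_PATH_MTU |
                      IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                      IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER |
                      IBV_QP_AV))
        return -1;

    memset(&attr, 0, sizeof(attr));
    attr.qp_state      = IBV_QPS_RTS;           /* Ready to Send */
    attr.sq_psn        = local_psn;
    attr.timeout       = 14;
    attr.retry_cnt     = 7;
    attr.rnr_retry     = 7;
    attr.max_rd_atomic = 1;
    return ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_SQ_PSN |
                         IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
                         IBV_QP_RNR_RETRY | IBV_QP_MAX_QP_RD_ATOMIC);
}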


📤 Step 4: Performing an RDMA Write

When the application wants to send data, it posts a Work Request (WR) to the send queue.

WR includes:

  • Operation code → RDMA Write
  • Local buffer address + L_Key
  • Remote buffer address + R_Key
  • Payload size

The NIC constructs protocol headers:

  • IB BTH (Base Transport Header) — includes P_Key and QP ID
  • RETH (RDMA Extended Transport Header) — includes R_Key & data length
  • UDP header (destination port 4791)
  • Encapsulation: Ethernet → IP → UDP → IB BTH → RETH → payload

Figure 5: Posting an RDMA Write Operation
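In code, posting the write and waiting for its local completion might look roughly like the sketch below (libibverbs, sender side). The wr_id value, the busy-poll loop, and the assumption that the remote address and R_Key were already exchanged are illustrative.

/* Illustrative sketch: post an RDMA Write and wait for the local completion.
 * The local MR and the peer's remote address and R_Key are assumed known. */
#include <infiniband/verbs.h>
#include <stdint.h>

int rdma_write(struct ibv_qp *qp, struct ibv_cq *cq, struct ibv_mr *mr,
               uint64_t remote_addr, uint32_t rkey, uint32_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)mr->addr,           /* local buffer address */
        .length = len,                           /* payload size */
        .lkey   = mr->lkey,                      /* L_Key for the local NIC */
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE,         /* operation code */
        .send_flags = IBV_SEND_SIGNALED,         /* request a CQ entry */
        .wr.rdma.remote_addr = remote_addr,      /* target buffer on the SCN */
        .wr.rdma.rkey        = rkey,             /* R_Key from the peer */
    };
    struct ibv_send_wr *bad_wr = NULL;
    if (ibv_post_send(qp, &wr, &bad_wr))         /* hand the WR to the NIC */
        return -1;

    /* Busy-poll the Completion Queue until the write is confirmed. */
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;
    return wc.status == IBV_WC_SUCCESS ? 0 : -1;
}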

On the SCN:

  • NIC validates R_Key & P_Key
  • Translates virtual to physical address
  • Writes data into memory
  • Posts a CQ event to notify the application (only for RDMA Write with Immediate; a plain RDMA Write completes silently on the SCN, with the completion reported on the CCN's CQ)

Figure 6: Receiving and Processing an RDMA Write
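If the SCN application does need to be notified, the sender can use RDMA Write with Immediate, in which case the SCN must have a Receive Work Request posted and sees the event on its own CQ. A hedged sketch under that assumption (function name and wr_id are hypothetical):

/* Illustrative sketch (SCN side): wait for an RDMA Write with Immediate.
 * The immediate data consumes a posted Receive WR; the payload itself is
 * placed directly at the address named by the RETH, so no SGE is needed. */
#include <infiniband/verbs.h>

int wait_for_write_with_imm(struct ibv_qp *qp, struct ibv_cq *cq)
{
    struct ibv_recv_wr rwr = { .wr_id = 2, .sg_list = NULL, .num_sge = 0 };
    struct ibv_recv_wr *bad = NULL;
    if (ibv_post_recv(qp, &rwr, &bad))
        return -1;

    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)     /* busy-poll for the notification */
        ;
    if (wc.status != IBV_WC_SUCCESS || wc.opcode != IBV_WC_RECV_RDMA_WITH_IMM)
        return -1;
    return 0;                                /* data has landed in the MR */
}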


🌟 Benefits of RDMA

  • Ultra-low latency — microsecond-level transfers
  • High bandwidth — ideal for HPC and AI
  • Zero-copy architecture — data moves NIC-to-memory with no intermediate buffer copies or kernel involvement
  • Lossless networking with PFC + ECN
  • Scales to large clusters and multi-node AI systems

🧭 Conclusion

RDMA — especially in the form of RoCEv2 over IP fabrics — is a cornerstone technology for high-performance, distributed systems. By enabling applications to directly access memory across nodes without CPU intervention, RDMA delivers:

  • Faster AI/ML training
  • Lower latency in HPC workloads
  • Higher throughput for storage and analytics
  • Efficient scaling across modern data centers

For organizations building next-generation compute clusters, RDMA provides an essential foundation for performance and scalability.
