
RDMA Explained: How Remote Direct Memory Access Works


Remote Direct Memory Access (RDMA) is a foundational technology in high-performance computing (HPC) and modern AI infrastructure, enabling extremely fast data movement between compute nodes with minimal CPU involvement. When implemented using RoCEv2 (RDMA over Converged Ethernet v2), it offers low latency, high throughput, and lossless transport across large-scale data center networks.

This guide provides a step-by-step walkthrough of how RDMA works — from memory registration to queue pairs, connection establishment, and RDMA write operations.

Figure 1: Fine-Grained DRAM High-Level Architecture


🚀 What Is RDMA and Why Does It Matter?

RDMA allows one machine to read or write memory on another machine without involving the target CPU or OS kernel. This eliminates memory copies, reduces overhead, and achieves microsecond-level latency.

RDMA is essential for:

  • AI and ML training pipelines
  • HPC clusters and supercomputers
  • Distributed storage systems (Ceph, DAOS, NVMe-oF)
  • Real-time data analytics and streaming workloads

Because RoCEv2 runs over UDP, it depends on the underlying network to provide lossless delivery using:

  • Priority Flow Control (PFC) — prevents packet drops during congestion
  • Explicit Congestion Notification (ECN) — signals senders to reduce rate

🔧 RDMA Workflow Overview

A typical RDMA write from a Client Compute Node (CCN) to a Server Compute Node (SCN) involves four stages:

  1. Memory allocation & registration
  2. Queue Pair (QP) creation
  3. Connection initialization
  4. RDMA write operation

Each stage is explained in detail below.


🧩 Step 1: Memory Allocation and Registration

To ensure safe and controlled access, RDMA requires applications to register memory regions with the NIC.

Key steps:

  • Create a Protection Domain (PD)
  • Register memory regions with size and access rights
  • Obtain:
    • L_Key (Local Key) — authorizes local NIC access
    • R_Key (Remote Key) — authorizes remote RDMA write/read

In this example:

  • CCN memory → local read access
  • SCN memory → remote write access

Figure 2: Memory Allocation and Registration
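For readers who want to see what this step looks like in code, here is a minimal sketch using the libibverbs C API. The buffer size, the access flags, and the assumption that a device context (ctx) has already been opened with ibv_open_device are illustrative, not details from this article.

/* Illustrative sketch: allocate a Protection Domain and register a buffer.
 * 'ctx' is assumed to be an already-opened device context. */
#include <infiniband/verbs.h>
#include <stdlib.h>

struct ibv_mr *register_buffer(struct ibv_context *ctx, size_t len)
{
    struct ibv_pd *pd = ibv_alloc_pd(ctx);          /* Protection Domain */
    if (!pd)
        return NULL;

    void *buf = malloc(len);                         /* memory to expose */
    if (!buf)
        return NULL;

    /* The access rights decide what a peer holding the R_Key may do.
     * For the SCN buffer in this example: allow remote RDMA writes. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);

    /* mr->lkey is the L_Key (authorizes the local NIC),
     * mr->rkey is the R_Key handed to the peer for remote access. */
    return mr;
}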


🧱 Step 2: Queue Pair (QP) Creation

A Queue Pair contains:

  • Send Queue (SQ)
  • Receive Queue (RQ)
  • A shared Completion Queue (CQ) for reporting results

QP attributes:

  • Tied to a PD (and, through it, to the registered memory regions)
  • Uses service types (Reliable Connection commonly used)
  • Assigned a Partition Key (P_Key) — similar to a VXLAN VNI for isolation

Example values:

  • QP ID: 0x12345678
  • P_Key: 0x8012

Figure 3: Queue Pair Creation
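A rough libibverbs sketch of this step is shown below; the queue depths and the choice to share one CQ between the send and receive queues are illustrative assumptions. The P_Key index is applied later, when the QP is moved out of the RESET state.

/* Illustrative sketch: create a CQ and a Reliable Connection (RC) QP.
 * 'ctx' and 'pd' are assumed to exist from the previous step. */
#include <infiniband/verbs.h>

struct ibv_qp *create_qp(struct ibv_context *ctx, struct ibv_pd *pd)
{
    /* One CQ shared by the Send and Receive Queues for completion reports. */
    struct ibv_cq *cq = ibv_create_cq(ctx, 128, NULL, NULL, 0);
    if (!cq)
        return NULL;

    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .qp_type = IBV_QPT_RC,              /* Reliable Connection service */
        .cap = {
            .max_send_wr  = 64,             /* Send Queue depth */
            .max_recv_wr  = 64,             /* Receive Queue depth */
            .max_send_sge = 1,
            .max_recv_sge = 1,
        },
    };
    return ibv_create_qp(pd, &attr);        /* QP number assigned by the NIC */
}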


🔗 Step 3: RDMA Connection Initialization

RDMA connection setup uses a three-message handshake:

  1. REQ (Request)
  2. REP (Reply)
  3. RTU (Ready to Use)

The messages exchange:

  • Local IDs
  • QP numbers
  • P_Key
  • PSN (Packet Sequence Number)

After handshake:

  • QP transitions: INIT → RTR → RTS
    • (Ready to Receive / Ready to Send)

Figure 4: RDMA Connection Initialization
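In practice, the state transitions are driven with ibv_modify_qp once the handshake has delivered the peer's QP number, PSN, and address. The sketch below is a simplified outline; the MTU, timer, and retry values and the hard-coded P_Key/GID indexes are placeholders, not values from this article.

/* Illustrative sketch: drive an RC QP through INIT -> RTR -> RTS.
 * remote_qpn, remote_psn, local_psn and remote_gid would come from the
 * REQ/REP/RTU exchange. */
#include <infiniband/verbs.h>
#include <string.h>
#include <stdint.h>

int connect_qp(struct ibv_qp *qp, uint32_t remote_qpn,
               uint32_t remote_psn, uint32_t local_psn,
               union ibv_gid remote_gid)
{
    struct ibv_qp_attr attr = {
        .qp_state        = IBV_QPS_INIT,
        .pkey_index      = 0,                 /* selects the P_Key */
        .port_num        = 1,
        .qp_access_flags = IBV_ACCESS_REMOTE_WRITE,
    };
    if (ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_PKEY_INDEX |
                      IBV_QP_PORT | IBV_QP_ACCESS_FLAGS))
        return -1;

    memset(&attr, 0, sizeof(attr));
    attr.qp_state              = IBV_QPS_RTR;   /* Ready to Receive */
    attr.path_mtu              = IBV_MTU_1024;
    attr.dest_qp_num           = remote_qpn;
    attr.rq_psn                = remote_psn;
    attr.max_dest_rd_atomic    = 1;
    attr.min_rnr_timer         = 12;
    attr.ah_attr.is_global     = 1;             /* RoCEv2 routes via GID/IP */
    attr.ah_attr.grh.dgid      = remote_gid;
    attr.ah_attr.grh.sgid_index = 0;
    attr.ah_attr.grh.hop_limit = 64;
    attr.ah_attr.port_num      = 1;
    if (ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_PATH_MTU |
                      IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                      IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER |
                      IBV_QP_AV))
        return -1;

    memset(&attr, 0, sizeof(attr));
    attr.qp_state      = IBV_QPS_RTS;           /* Ready to Send */
    attr.sq_psn        = local_psn;
    attr.timeout       = 14;
    attr.retry_cnt     = 7;
    attr.rnr_retry     = 7;
    attr.max_rd_atomic = 1;
    return ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_SQ_PSN |
                         IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
                         IBV_QP_RNR_RETRY | IBV_QP_MAX_QP_RD_ATOMIC);
}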


📤 Step 4: Performing an RDMA Write

When the application wants to send data, it posts a Work Request (WR) to the send queue.

WR includes:

  • Operation code → RDMA Write
  • Local buffer address + L_Key
  • Remote buffer address + R_Key
  • Payload size

The NIC constructs protocol headers:

  • IB BTH (Base Transport Header) — includes P_Key and QP ID
  • RETH (RDMA Extended Transport Header) — includes R_Key & data length
  • UDP header (destination port 4791)
  • Encapsulation: Ethernet → IP → UDP → IB BTH → RETH → payload

Figure 5: Posting an RDMA Write Operation
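In code, posting the write and waiting for its local completion might look roughly like the sketch below (libibverbs, sender side). The wr_id value, the busy-poll loop, and the assumption that the remote address and R_Key were already exchanged are illustrative.

/* Illustrative sketch: post an RDMA Write and wait for the local completion.
 * The local MR and the peer's remote address and R_Key are assumed known. */
#include <infiniband/verbs.h>
#include <stdint.h>

int rdma_write(struct ibv_qp *qp, struct ibv_cq *cq, struct ibv_mr *mr,
               uint64_t remote_addr, uint32_t rkey, uint32_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)mr->addr,           /* local buffer address */
        .length = len,                           /* payload size */
        .lkey   = mr->lkey,                      /* L_Key for the local NIC */
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE,         /* operation code */
        .send_flags = IBV_SEND_SIGNALED,         /* request a CQ entry */
        .wr.rdma.remote_addr = remote_addr,      /* target buffer on the SCN */
        .wr.rdma.rkey        = rkey,             /* R_Key from the peer */
    };
    struct ibv_send_wr *bad_wr = NULL;
    if (ibv_post_send(qp, &wr, &bad_wr))         /* hand the WR to the NIC */
        return -1;

    /* Busy-poll the Completion Queue until the write is confirmed. */
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;
    return wc.status == IBV_WC_SUCCESS ? 0 : -1;
}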

On the SCN:

  • NIC validates R_Key & P_Key
  • Translates virtual to physical address
  • Writes data into memory
  • Posts a CQ event to notify the application (only for RDMA Write with Immediate; a plain RDMA Write completes silently on the SCN, with the completion reported on the CCN's CQ)

Figure 6: Receiving and Processing an RDMA Write
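If the SCN application does need to be notified, the sender can use RDMA Write with Immediate, in which case the SCN must have a Receive Work Request posted and sees the event on its own CQ. A hedged sketch under that assumption (function name and wr_id are hypothetical):

/* Illustrative sketch (SCN side): wait for an RDMA Write with Immediate.
 * The immediate data consumes a posted Receive WR; the payload itself is
 * placed directly at the address named by the RETH, so no SGE is needed. */
#include <infiniband/verbs.h>

int wait_for_write_with_imm(struct ibv_qp *qp, struct ibv_cq *cq)
{
    struct ibv_recv_wr rwr = { .wr_id = 2, .sg_list = NULL, .num_sge = 0 };
    struct ibv_recv_wr *bad = NULL;
    if (ibv_post_recv(qp, &rwr, &bad))
        return -1;

    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)     /* busy-poll for the notification */
        ;
    if (wc.status != IBV_WC_SUCCESS || wc.opcode != IBV_WC_RECV_RDMA_WITH_IMM)
        return -1;
    return 0;                                /* data has landed in the MR */
}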


🌟 Benefits of RDMA

  • Ultra-low latency — microsecond-level transfers
  • High bandwidth — ideal for HPC and AI
  • Zero-copy architecture — data moves NIC-to-memory with no intermediate buffer copies or kernel involvement
  • Lossless networking with PFC + ECN
  • Scales to large clusters and multi-node AI systems

🧭 Conclusion

RDMA — especially in the form of RoCEv2 over IP fabrics — is a cornerstone technology for high-performance, distributed systems. By enabling applications to directly access memory across nodes without CPU intervention, RDMA delivers:

  • Faster AI/ML training
  • Lower latency in HPC workloads
  • Higher throughput for storage and analytics
  • Efficient scaling across modern data centers

For organizations building next-generation compute clusters, RDMA provides an essential foundation for performance and scalability.
