Beluga: CXL-Based KV Cache Architecture Cuts TTFT by 89.6%
As Large Language Models (LLMs) scale and long-context inference becomes standard, memory capacity and latency have emerged as critical bottlenecks in GPU-accelerated serving. While GPU HBM is fast, its limited capacity forces systems to store large KV Caches in CPU DRAM—a resource constrained by the number of memory channels per socket.
To extend memory capacity, many serving systems adopt RDMA-based disaggregated memory pools, but this introduces new challenges: high latency, complex communication paths, and heavy synchronization overheads.
Alibaba Cloud proposes Beluga, a CXL-based shared memory architecture that enables direct GPU access to a large-scale memory pool via a CXL switch. Built atop this architecture, Beluga-KVCache dramatically accelerates large-scale KV Cache operations for LLM inference. Compared with the RDMA-based MoonCake, Beluga-KVCache:
- Reduces Time-to-First-Token (TTFT) by 89.6%
- Improves vLLM throughput by 7.35×
Beluga is the first system enabling GPU direct access to large pooled memory through a CXL switch—an important step toward low-latency access to massive shared memory.
🧩 Beluga Architecture Overview #
Beluga uses CXL switches to build a scalable, shared memory pool accessible directly by both CPUs and GPUs via load/store operations. This approach removes the complex protocol and synchronization overhead typical of RDMA.
Hardware Deployment and Connectivity #
Beluga replaces four RDMA NICs with two PCIe/CXL adapters:
- Two-socket NUMA servers connect to the CXL switch via PCIe 5.0 ×16.
- The shared CXL memory pool includes a switch node plus a memory box.
- The CXL switch uses dual XConn XC50256 chips, supporting 2 TB/s forwarding bandwidth with 256 lanes of PCIe 5.0.
- Up to 16 servers can attach, forming an 8 TB pooled memory cluster with 1 TB/s aggregate bandwidth.
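As a rough sanity check on that aggregate figure: a PCIe 5.0 ×16 link delivers on the order of 63 GB/s of usable bandwidth per direction, so 16 host links add up to roughly 1 TB/s.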
This design enables efficient, concurrent multi-host access through hardware-controlled address mapping and forwarding.
Advantages Over RDMA #
Switching from RDMA’s network protocol to CXL’s memory-semantic interface brings major performance and simplicity benefits.
Performance Improvements #
| Component | Beluga Access Method | Advantage |
|---|---|---|
| CPU | Direct load/store; DMA via Intel DSA | Removes multi-stage RDMA data paths and bounce buffers |
| GPU | cudaMemcpy P2P transfers; custom CUDA kernels | Eliminates RDMA's cross-component sync and polling |
Beluga’s direct memory access model reduces latency and eliminates the need for CPU-driven or GPU-driven RDMA communication paths.
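To make one possible form of the GPU path concrete, here is a minimal sketch. It assumes the CXL pool is already exposed to the host as a byte-addressable mapping (see the DAX example in the next section) and that the platform allows registering that mapping with the CUDA runtime; ordinary host memory stands in for the pool so the sketch is runnable anywhere, and none of the names below are Beluga's actual API.

```cpp
#include <cuda_runtime.h>
#include <cstdlib>
#include <cstdio>

int main() {
    // Stand-in for the CXL pool: in a real deployment this pointer would
    // come from mmap()-ing the DAX device that exposes the pooled memory.
    const size_t kvBlockBytes = 2 * 1024 * 1024;   // one illustrative KV Cache block
    void* cxlBlock = std::aligned_alloc(4096, kvBlockBytes);

    // Make the (CXL-backed) range visible to the GPU. Whether a CXL HDM
    // mapping can be registered like this is platform-dependent; it is an
    // assumption of this sketch, not a documented Beluga requirement.
    cudaHostRegister(cxlBlock, kvBlockBytes, cudaHostRegisterMapped);
    void* devView = nullptr;
    cudaHostGetDevicePointer(&devView, cxlBlock, 0);

    // Destination slot in GPU HBM; a single copy replaces the multi-stage
    // RDMA path (doorbells, completion polling, bounce buffers).
    void* hbmBlock = nullptr;
    cudaMalloc(&hbmBlock, kvBlockBytes);
    cudaMemcpyAsync(hbmBlock, devView, kvBlockBytes, cudaMemcpyDefault);
    cudaDeviceSynchronize();

    cudaFree(hbmBlock);
    cudaHostUnregister(cxlBlock);
    std::free(cxlBlock);
    printf("copied %zu bytes from (simulated) CXL pool into HBM\n", kvBlockBytes);
    return 0;
}
```

With unified addressing, a custom kernel could equally dereference `devView` directly, which is how the custom-kernel path in the table above would operate.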
System Simplification #
Beluga simplifies system design in three key ways:
- Programming Model: Access resembles local DRAM—no RDMA verbs, queue pairs, or network stack.
- Memory Management: A unified address space enables hosts to manage CXL memory in DAX mode via mmap() (a minimal sketch follows this list).
- Hardware Cost: CXL components are cheaper and avoid the over-provisioning requirements of high-end RDMA NICs.
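Here is a minimal sketch of that DAX-style mapping, assuming the pooled CXL memory is exposed to the host as a character DAX device; the path `/dev/dax0.0` is illustrative, not Beluga's documented naming.

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>

int main() {
    // Illustrative device path; the actual DAX device name depends on how
    // the CXL memory region is configured on the host.
    const char* dax_path = "/dev/dax0.0";
    const size_t pool_bytes = 1ULL << 30;   // map 1 GiB of the pool for this sketch

    int fd = open(dax_path, O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    // One mmap() gives the process a load/store window into the shared
    // pool: no RDMA verbs, queue pairs, or memory registration steps.
    void* pool = mmap(nullptr, pool_bytes, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (pool == MAP_FAILED) { perror("mmap"); return 1; }

    // Ordinary stores and loads now reach CXL memory directly.
    auto* words = static_cast<uint64_t*>(pool);
    words[0] = 0xBE1CAULL;                  // write a tag into the pool
    printf("read back: %#llx\n", (unsigned long long)words[0]);

    munmap(pool, pool_bytes);
    close(fd);
    return 0;
}
```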
Beluga-KVCache Integration #
Beluga-KVCache integrates with vLLM to optimize large-scale KV Cache handling. The KV Cache pipeline consists of:
- A pooled memory region for KV Cache blocks
- A global index mapping token blocks to physical addresses
- A scheduler distributing requests among LLM instances
Beluga-KVCache improves this pipeline by:
- Using CXL load/store for KV Cache reads/writes (sketched after this list)
- Replacing RDMA-based index service communication with CXL RPC
- Simplifying scheduling thanks to flatter memory hierarchy
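To make the read path concrete under the assumptions above, the sketch below has a global index map a hash of a token block to an offset in the pool, so a cache hit becomes a single copy from CXL memory into the GPU's KV Cache slot. All types and names here (`KVIndex`, `BlockKey`, `fetch_block`) are illustrative, not Beluga-KVCache's actual interfaces.

```cpp
#include <cuda_runtime.h>
#include <cstdint>
#include <unordered_map>
#include <optional>

// Illustrative key for a token block: a hash over the prefix tokens.
using BlockKey = uint64_t;

// Illustrative global index: key -> byte offset inside the pooled region.
// In a shared deployment this table would itself live in CXL memory and be
// reachable by every serving instance; a host-local map is used here.
struct KVIndex {
    std::unordered_map<BlockKey, size_t> offsets;
    std::optional<size_t> lookup(BlockKey k) const {
        auto it = offsets.find(k);
        if (it == offsets.end()) return std::nullopt;
        return it->second;
    }
};

// Cache-hit path: resolve the block and copy it straight from the CXL pool
// (already registered with CUDA, as in the earlier sketch) into the GPU slot.
bool fetch_block(const KVIndex& index, BlockKey key,
                 const uint8_t* cxl_pool_dev,   // device-visible view of the pool
                 uint8_t* gpu_slot, size_t block_bytes,
                 cudaStream_t stream) {
    auto off = index.lookup(key);
    if (!off) return false;                     // miss: caller recomputes the block
    cudaMemcpyAsync(gpu_slot, cxl_pool_dev + *off, block_bytes,
                    cudaMemcpyDefault, stream);
    return true;
}
```

On a miss, the serving instance computes the block during prefill and writes it back into the pool through the same load/store mapping, after which the index entry makes it visible to other instances.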
📊 Evaluation Results #
Beluga-KVCache consistently outperforms MoonCake across all scenarios.
| Metric | Cache-Populate | Cache-Hit |
|---|---|---|
| TTFT | 12.4% lower | 89.6% lower |
| Throughput (QPS) | 21.5% higher | 7.35× higher |
Latency and Throughput (TTFT / TPOT) #
In cache-hit runs (second-run inference), KV Cache read latency dominates. Beluga’s CXL semantics allow faster access than MoonCake’s RDMA protocol:
- Lower TTFT
- Lower TPOT
- Elimination of cross-device synchronization penalties
Impact of Context Length #
Beluga’s advantage grows as input sequence length increases:
- At 8K tokens, KV Cache operations dominate total latency.
- Beluga-KVCache consistently delivers lower average and P99 TTFT.
Deployment: Prefill–Decode Architecture #
In the common Prefill–Decode decoupled setup:
- Beluga accelerates KV Cache load/store paths, achieving 3.41×–9.47× higher QPS.
- MoonCake struggles when the KV Cache block size is small (e.g., 16 tokens), causing cache-hit TTFT to spike to 76.8 seconds.
- Beluga directly uses vLLM’s native 16-token blocks—with no batching—maintaining low latency.
🚀 Conclusion #
Beluga and Beluga-KVCache demonstrate how CXL memory semantics can fundamentally reshape LLM serving architecture:
- Massive shared memory pools
- Uniform CPU/GPU access
- Load/store simplicity instead of RDMA complexity
- Dramatically reduced TTFT
- Major throughput improvements
Beluga shows that moving beyond RDMA to CXL-based memory pooling is a powerful architectural shift—one that may define the next generation of high-performance LLM serving infrastructure.