Beluga: CXL-Based KV Cache Architecture Cuts TTFT by 89.6%
As Large Language Models (LLMs) scale and long-context inference becomes standard, memory capacity and latency have emerged as critical bottlenecks in GPU-accelerated serving. While GPU HBM is fast, its limited capacity forces systems to store large KV Caches in CPU DRAM—a resource constrained by the number of memory channels per socket.
To extend memory capacity, many serving systems adopt RDMA-based disaggregated memory pools, but this introduces new challenges: high latency, complex communication paths, and heavy synchronization overheads.
Alibaba Cloud proposes Beluga, a CXL-based shared memory architecture that enables direct GPU access to a large-scale memory pool via a CXL switch. Built atop this architecture, Beluga-KVCache dramatically accelerates large-scale KV Cache operations for LLM inference. Compared with the RDMA-based MoonCake, Beluga-KVCache:
- Reduces Time-to-First-Token (TTFT) by 89.6%
- Improves vLLM throughput by 7.35×
Beluga is the first system enabling GPU direct access to large pooled memory through a CXL switch—an important step toward low-latency access to massive shared memory.
🧩 Beluga Architecture Overview #
Beluga uses CXL switches to build a scalable, shared memory pool accessible directly by both CPUs and GPUs via load/store operations. This approach removes the complex protocol and synchronization overhead typical of RDMA.
Hardware Deployment and Connectivity #
Beluga replaces four RDMA NICs with two PCIe/CXL adapters:
- Two-socket NUMA servers connect to the CXL switch via PCIe 5.0 ×16.
- The shared CXL memory pool includes a switch node plus a memory box.
- The CXL switch uses dual XConn XC50256 chips, supporting 2 TB/s forwarding bandwidth with 256 lanes of PCIe 5.0.
- Up to 16 servers can attach, forming an 8 TB pooled memory cluster with 1 TB/s aggregate bandwidth.
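As a rough sanity check on that aggregate figure: a PCIe 5.0 ×16 link delivers on the order of 63 GB/s of usable bandwidth per direction, so 16 host links add up to roughly 1 TB/s.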
This design enables efficient, concurrent multi-host access through hardware-controlled address mapping and forwarding.
Advantages Over RDMA #
Switching from RDMA’s network protocol to CXL’s memory-semantic interface brings major performance and simplicity benefits.
Performance Improvements #
| Component | Beluga Access Method | Advantage |
|---|---|---|
| CPU | Direct load/store; DMA via Intel DSA | Removes multi-stage RDMA data paths and bounce buffers |
| GPU | cudaMemcpy P2P transfers; custom CUDA kernels | Eliminates RDMA's cross-component sync and polling |
Beluga’s direct memory access model reduces latency and eliminates the need for CPU-driven or GPU-driven RDMA communication paths.
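To make one possible form of the GPU path concrete, here is a minimal sketch. It assumes the CXL pool is already exposed to the host as a byte-addressable mapping (see the DAX example in the next section) and that the platform allows registering that mapping with the CUDA runtime; ordinary host memory stands in for the pool so the sketch is runnable anywhere, and none of the names below are Beluga's actual API.

```cpp
#include <cuda_runtime.h>
#include <cstdlib>
#include <cstdio>

int main() {
    // Stand-in for the CXL pool: in a real deployment this pointer would
    // come from mmap()-ing the DAX device that exposes the pooled memory.
    const size_t kvBlockBytes = 2 * 1024 * 1024;   // one illustrative KV Cache block
    void* cxlBlock = std::aligned_alloc(4096, kvBlockBytes);

    // Make the (CXL-backed) range visible to the GPU. Whether a CXL HDM
    // mapping can be registered like this is platform-dependent; it is an
    // assumption of this sketch, not a documented Beluga requirement.
    cudaHostRegister(cxlBlock, kvBlockBytes, cudaHostRegisterMapped);
    void* devView = nullptr;
    cudaHostGetDevicePointer(&devView, cxlBlock, 0);

    // Destination slot in GPU HBM; a single copy replaces the multi-stage
    // RDMA path (doorbells, completion polling, bounce buffers).
    void* hbmBlock = nullptr;
    cudaMalloc(&hbmBlock, kvBlockBytes);
    cudaMemcpyAsync(hbmBlock, devView, kvBlockBytes, cudaMemcpyDefault);
    cudaDeviceSynchronize();

    cudaFree(hbmBlock);
    cudaHostUnregister(cxlBlock);
    std::free(cxlBlock);
    printf("copied %zu bytes from (simulated) CXL pool into HBM\n", kvBlockBytes);
    return 0;
}
```

With unified addressing, a custom kernel could equally dereference `devView` directly, which is how the custom-kernel path in the table above would operate.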
System Simplification #
Beluga simplifies system design in three key ways:
- Programming Model: Access resembles local DRAM—no RDMA verbs, queue pairs, or network stack.
- Memory Management: A unified address space enables hosts to manage CXL memory in DAX mode via mmap() (a minimal sketch follows this list).
- Hardware Cost: CXL components are cheaper and avoid the over-provisioning requirements of high-end RDMA NICs.
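Here is a minimal sketch of that DAX-style mapping, assuming the pooled CXL memory is exposed to the host as a character DAX device; the path `/dev/dax0.0` is illustrative, not Beluga's documented naming.

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>

int main() {
    // Illustrative device path; the actual DAX device name depends on how
    // the CXL memory region is configured on the host.
    const char* dax_path = "/dev/dax0.0";
    const size_t pool_bytes = 1ULL << 30;   // map 1 GiB of the pool for this sketch

    int fd = open(dax_path, O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    // One mmap() gives the process a load/store window into the shared
    // pool: no RDMA verbs, queue pairs, or memory registration steps.
    void* pool = mmap(nullptr, pool_bytes, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (pool == MAP_FAILED) { perror("mmap"); return 1; }

    // Ordinary stores and loads now reach CXL memory directly.
    auto* words = static_cast<uint64_t*>(pool);
    words[0] = 0xBE1CAULL;                  // write a tag into the pool
    printf("read back: %#llx\n", (unsigned long long)words[0]);

    munmap(pool, pool_bytes);
    close(fd);
    return 0;
}
```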
Beluga-KVCache Integration #
Beluga-KVCache integrates with vLLM to optimize large-scale KV Cache handling. The KV Cache pipeline consists of:
- A pooled memory region for KV Cache blocks
- A global index mapping token blocks to physical addresses
- A scheduler distributing requests among LLM instances
Beluga-KVCache improves this pipeline by:
- Using CXL load/store for KV Cache reads/writes (sketched after this list)
- Replacing RDMA-based index service communication with CXL RPC
- Simplifying scheduling thanks to flatter memory hierarchy
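To make the read path concrete under the assumptions above, the sketch below has a global index map a hash of a token block to an offset in the pool, so a cache hit becomes a single copy from CXL memory into the GPU's KV Cache slot. All types and names here (`KVIndex`, `BlockKey`, `fetch_block`) are illustrative, not Beluga-KVCache's actual interfaces.

```cpp
#include <cuda_runtime.h>
#include <cstdint>
#include <unordered_map>
#include <optional>

// Illustrative key for a token block: a hash over the prefix tokens.
using BlockKey = uint64_t;

// Illustrative global index: key -> byte offset inside the pooled region.
// In a shared deployment this table would itself live in CXL memory and be
// reachable by every serving instance; a host-local map is used here.
struct KVIndex {
    std::unordered_map<BlockKey, size_t> offsets;
    std::optional<size_t> lookup(BlockKey k) const {
        auto it = offsets.find(k);
        if (it == offsets.end()) return std::nullopt;
        return it->second;
    }
};

// Cache-hit path: resolve the block and copy it straight from the CXL pool
// (already registered with CUDA, as in the earlier sketch) into the GPU slot.
bool fetch_block(const KVIndex& index, BlockKey key,
                 const uint8_t* cxl_pool_dev,   // device-visible view of the pool
                 uint8_t* gpu_slot, size_t block_bytes,
                 cudaStream_t stream) {
    auto off = index.lookup(key);
    if (!off) return false;                     // miss: caller recomputes the block
    cudaMemcpyAsync(gpu_slot, cxl_pool_dev + *off, block_bytes,
                    cudaMemcpyDefault, stream);
    return true;
}
```

On a miss, the serving instance computes the block during prefill and writes it back into the pool through the same load/store mapping, after which the index entry makes it visible to other instances.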
📊 Evaluation Results #
Beluga-KVCache consistently outperforms MoonCake across all scenarios.
| Metric | Cache-Populate | Cache-Hit |
|---|---|---|
| TTFT | 12.4% lower | 89.6% lower |
| Throughput (QPS) | 21.5% higher | 7.35× higher |
Latency and Throughput (TTFT / TPOT) #
In cache-hit runs (second-run inference), KV Cache read latency dominates. Beluga’s CXL semantics allow faster access than MoonCake’s RDMA protocol:
- Lower TTFT
- Lower TPOT
- Elimination of cross-device synchronization penalties
Impact of Context Length #
Beluga’s advantage grows as input sequence length increases:
- At 8K tokens, KV Cache operations dominate total latency.
- Beluga-KVCache consistently delivers lower average and P99 TTFT.
Deployment: Prefill–Decode Architecture #
In the common Prefill–Decode decoupled setup:
- Beluga accelerates KV Cache load/store paths, achieving 3.41×–9.47× higher QPS.
- MoonCake struggles when the KV Cache block size is small (e.g., 16 tokens), causing cache-hit TTFT to spike to 76.8 seconds.
- Beluga directly uses vLLM’s native 16-token blocks—with no batching—maintaining low latency.
🚀 Conclusion #
Beluga and Beluga-KVCache demonstrate how CXL memory semantics can fundamentally reshape LLM serving architecture:
- Massive shared memory pools
- Uniform CPU/GPU access
- Load/store simplicity instead of RDMA complexity
- Dramatically reduced TTFT
- Major throughput improvements
Beluga shows that moving beyond RDMA to CXL-based memory pooling is a powerful architectural shift—one that may define the next generation of high-performance LLM serving infrastructure.