🚀 The HBM Bottleneck in Modern GPU Servers #
High-bandwidth memory (HBM) has become the primary performance limiter for large-scale AI inference. As large language models push toward longer context windows and higher concurrency, even the most advanced GPUs quickly run out of on-package memory.
Once the key-value (KV) cache required for those contexts exceeds available HBM, inference performance collapses: cache entries are evicted, recomputed, and reloaded repeatedly, causing latency spikes and sharply reduced throughput.
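To see the scale of the problem, a back-of-the-envelope KV-cache sizing helps. The model dimensions below are illustrative assumptions (roughly a 70B-class model with grouped-query attention), not figures from Pliops:

```python
# Rough KV-cache sizing for a transformer decoder (illustrative numbers only).
layers     = 80        # decoder layers (assumed, 70B-class model)
kv_heads   = 8         # grouped-query KV heads (assumed)
head_dim   = 128       # per-head dimension (assumed)
dtype_size = 2         # bytes per element (FP16/BF16)

# Each token stores one key and one value vector per layer.
bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_size
print(f"KV cache per token: {bytes_per_token / 1024:.0f} KiB")   # ~320 KiB

context_len = 128_000
concurrent  = 32
total = bytes_per_token * context_len * concurrent
print(f"KV cache for {concurrent} x {context_len}-token requests: {total / 2**40:.1f} TiB")  # ~1.2 TiB
```

Under these assumptions, a modest batch of long-context requests already needs far more KV cache than a single GPU's HBM can hold.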
Pliops is targeting this exact pain point.
🧠 FusIOnX and the XDP LightningAI Card #
Pliops has unveiled FusIOnX, an end-to-end inference acceleration stack built around its XDP LightningAI PCIe card. Rather than replacing GPUs or requiring proprietary memory fabrics, LightningAI adds a memory tier below HBM, backed by fast SSDs accessed over NVMe and RDMA.
The card is powered by a purpose-built ASIC and software stack that transparently augments GPU memory for inference workloads.
At its core, LightningAI functions as:
- A key-value cache extension for LLM inference
- A shared memory tier across one or more GPU servers
- A vendor-agnostic accelerator, independent of GPU supplier or storage backend
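A minimal sketch of that tiering idea in Python: an HBM-resident block cache that spills evicted KV blocks to an SSD-backed store instead of dropping them. The class, block layout, and eviction policy are illustrative assumptions, not Pliops' implementation.

```python
import torch
from collections import OrderedDict

class TieredKVCache:
    """HBM-first KV-block cache with an SSD-backed overflow tier (illustrative sketch)."""

    def __init__(self, hbm_budget_blocks: int, ssd_store):
        self.hbm = OrderedDict()     # block_id -> GPU tensor, kept in LRU order
        self.budget = hbm_budget_blocks
        self.ssd = ssd_store         # any object exposing put(key, tensor) / get(key)

    def put(self, block_id: str, kv_block: torch.Tensor) -> None:
        self.hbm[block_id] = kv_block
        self.hbm.move_to_end(block_id)
        while len(self.hbm) > self.budget:
            evicted_id, evicted = self.hbm.popitem(last=False)
            # Spill to the SSD tier instead of discarding and recomputing later.
            self.ssd.put(evicted_id, evicted.cpu())

    def get(self, block_id: str) -> torch.Tensor | None:
        if block_id in self.hbm:                 # hot path: block still in HBM
            self.hbm.move_to_end(block_id)
            return self.hbm[block_id]
        spilled = self.ssd.get(block_id)         # warm path: reload from the SSD tier
        if spilled is not None:
            self.put(block_id, spilled.to("cuda", non_blocking=True))
            return self.hbm[block_id]
        return None                              # true miss: caller must recompute
```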
⚡ Accelerating vLLM and Nvidia Dynamo #
Modern inference frameworks rely heavily on KV caching. UC Berkeley’s vLLM, widely used for high-throughput serving, stores intermediate attention states to avoid recomputation. Nvidia’s Dynamo framework orchestrates inference engines such as TensorRT-LLM and vLLM for optimal scheduling.
Pliops integrates directly into this software layer.
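Conceptually, the hook point resembles a KV-cache connector that the serving engine calls around block eviction and lookup. The interface below is a hypothetical illustration, not vLLM's or Dynamo's actual connector API:

```python
from typing import Optional, Protocol
import torch

class KVConnector(Protocol):
    """Hypothetical hook an inference engine could call around its paged KV cache."""

    def offload(self, block_hash: str, kv_block: torch.Tensor) -> None:
        """Called when a block is evicted from GPU memory; persists it to the fast tier."""

    def fetch(self, block_hash: str) -> Optional[torch.Tensor]:
        """Called on a prefix-cache miss; a returned tensor skips prefill recomputation."""
```

For a shared prefix that has been evicted from HBM, `fetch` turns a full prefill pass into a sequential read from the cache tier.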
Key results claimed by Pliops: #
- 2.5× higher requests per second for standard vLLM production stacks
- Up to 8× faster end-to-end inference in memory-constrained scenarios
- Reduced latency growth as context windows scale
By storing already-computed KV cache entries on fast SSDs instead of recomputing them after eviction, FusIOnX allows inference performance to scale beyond native HBM capacity—without adding more GPUs.
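A rough first-principles comparison shows why reloading can beat recomputing. Model size, sustained GPU throughput, and SSD bandwidth below are illustrative assumptions, not Pliops' benchmark conditions:

```python
# Recompute-vs-reload estimate for a 32k-token evicted prefix (illustrative numbers).
params         = 70e9      # model parameters (assumed)
achieved_flops = 400e12    # sustained FLOP/s during prefill (assumed)
nvme_bandwidth = 25e9      # aggregate SSD read bandwidth in bytes/s (assumed)

prefix_tokens  = 32_000
kv_bytes_token = 320 * 1024            # per-token KV size from the sizing sketch above

recompute_s = prefix_tokens * 2 * params / achieved_flops   # ~2 FLOPs per parameter per token
reload_s    = prefix_tokens * kv_bytes_token / nvme_bandwidth

print(f"recompute prefill: {recompute_s:.1f} s")   # ~11.2 s
print(f"reload KV cache:   {reload_s:.2f} s")      # ~0.42 s
```

Under these assumptions, reloading the prefix is more than an order of magnitude faster than recomputing it, which is the kind of gap that makes the claimed memory-constrained speedups plausible.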
🧩 FusIOnX Stack Variants #
Pliops positions FusIOnX as “AI stack glue,” offering multiple deployment models:
FusIOnX vLLM Production Stack #
- vLLM KV-cache acceleration
- Smart request routing across multiple GPU nodes
- Full upstream vLLM compatibility
FusIOnX vLLM + Dynamo + SGLang #
- Integrated KV-cache acceleration
- Support for prefill/decode node separation
- Single-node and multi-node configurations
FusIOnX KVIO #
- Distributed Key-Value I/O over the network
- Serves any GPU within a server
- Planned support for RAG and vector databases on CPU servers
FusIOnX KV Store #
- XDP AccelKV distributed key-value store
- RAIDplus self-healing storage
- Designed for long-term LLM memory persistence
This architecture is particularly well-suited for emerging models with persistent memory concepts, such as Google’s Titans.
🧱 Deployment Models: Disaggregated or Hyperconverged #
The XDP LightningAI card can be deployed in two primary ways:
- Disaggregated Mode: Accelerates GPU servers connected to external storage arrays or shared data pools.
- Hyperconverged “LLM-in-a-Box” Mode: Installed directly inside a GPU server with 24 SSD slots, providing both storage and inference acceleration in a single system.
Pliops demonstrates this configuration in a 2RU Dell server, positioning it as a turnkey inference appliance.
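For a sense of scale, here is a rough capacity comparison for such a box. GPU count, HBM per GPU, and SSD capacity are assumed for illustration (and raw SSD capacity also serves general storage, not only KV cache):

```python
# Rough capacity comparison for a hyperconverged "LLM-in-a-Box" node (assumed specs).
gpus         = 8
hbm_per_gpu  = 80e9        # bytes of HBM per GPU (assumed)
ssd_slots    = 24          # per the chassis described above
ssd_capacity = 7.68e12     # bytes per NVMe SSD (assumed)

hbm_total = gpus * hbm_per_gpu
ssd_total = ssd_slots * ssd_capacity

print(f"HBM tier: {hbm_total / 1e12:.2f} TB")    # 0.64 TB
print(f"SSD tier: {ssd_total / 1e12:.1f} TB")    # ~184 TB, roughly 288x the HBM pool
```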
🔮 What’s Coming Next #
Pliops is expanding FusIOnX beyond LLM inference:
- FusIOnX RAG & Vector Databases: Proof-of-concept stage, targeting accelerated index build and retrieval.
- FusIOnX GNN: Designed to store and retrieve node embeddings for large-scale graph neural networks.
- FusIOnX DLRM: Focused on deep learning recommendation models, enabling TB–PB scale embedding access with simplified storage pipelines.
🆚 Competitive Landscape #
Pliops is not alone in tackling memory pressure for AI workloads. Competing approaches include:
- GridGain — Distributed in-memory data grids for AI and RAG pipelines
- Hammerspace Tier Zero — Unified data access across memory tiers
- WEKA Augmented Memory Grid — Memory pooling with GPUDirect support
- VAST Data VUA — Unified storage and memory abstraction for GPU workloads
What differentiates Pliops is its PCIe add-in card approach, offering memory-tier expansion without forcing architectural lock-in or cluster-wide reconfiguration.
💡 Final Takeaway #
As LLM context windows grow faster than GPU HBM capacity, inference is becoming a memory problem, not a compute problem.
Pliops’ FusIOnX and XDP LightningAI card address this gap by inserting a fast, scalable memory tier beneath GPUs—reducing recomputation, stabilizing latency, and extending the useful life of existing GPU infrastructure.
For operators running large-scale inference, this approach may prove cheaper than simply buying more GPUs, and nearly as fast.