HBF: The Next Memory Layer for AI Accelerators
The explosive growth of AI workloads is pushing memory architectures to their limits. As models scale and inference workloads demand ever-larger datasets, traditional memory solutions are struggling to keep pace.
Industry leaders—including NVIDIA, AMD, and Google—are exploring a new approach: HBF (High-Bandwidth Flash). This technology introduces a new memory tier designed to complement High-Bandwidth Memory (HBM) with dramatically larger capacity while maintaining relatively high data throughput.
🧩 The HBM–HBF Tiered Memory Architecture #
HBM (High-Bandwidth Memory) currently acts as the ultra-fast working memory for GPUs and AI accelerators. It provides extremely high bandwidth for operations such as reading and processing KV (Key–Value) cache data in large language models.
However, HBM has two major limitations:
- High cost
- Limited capacity
HBF aims to solve this by introducing a high-capacity flash-based memory tier directly connected to the accelerator.
A Simple Analogy #
Professor Kim Joungho of KAIST compares the relationship between HBM and HBF to a library system:
- HBM: Like a small bookshelf at home—very fast and easy to access.
- HBF: Like a massive library—slightly slower, but able to store vastly more information.
In practical terms:
- HBF capacity: ~10× larger than HBM
- HBF speed: Lower than DRAM but significantly faster than traditional storage systems
This layered architecture enables accelerators to balance speed and scale.
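The tiering logic above can be illustrated with a toy placement function. The 48 GB HBM figure below is an assumed per-stack size for illustration, not a number from this article; only the ~10× capacity ratio comes from the text:

```python
# Illustrative two-tier placement: data goes to the smallest tier
# that can hold it. Capacities are assumptions, not vendor specs.

HBM_CAPACITY_GB = 48                      # assumed per-stack HBM capacity
HBF_CAPACITY_GB = HBM_CAPACITY_GB * 10    # ~10x larger, per the article

def fits_in_tier(data_gb: float) -> str:
    """Pick the smallest tier that can hold the working set."""
    if data_gb <= HBM_CAPACITY_GB:
        return "HBM"
    if data_gb <= HBF_CAPACITY_GB:
        return "HBF"
    return "external storage"

print(fits_in_tier(30))    # small KV cache fits in HBM
print(fits_in_tier(300))   # large model shards spill to HBF
```

In a real system the placement decision would also weigh access frequency and bandwidth needs, not just size, but the size-based sketch captures the basic speed-versus-scale trade-off.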
⚙️ Technical Design and Performance #
HBF adopts a stacked architecture similar to HBM but uses 3D NAND flash instead of DRAM.
Multiple NAND layers are vertically integrated and connected using TSV (Through-Silicon Via) technology.
Key characteristics include:
- Stacked 3D NAND layers
- TSV vertical interconnects
- Integrated logic die at the base
Performance Characteristics #
Typical design targets include:
- Capacity: Up to 512 GB per HBF unit
- Bandwidth: Up to 1.638 TB/s
- Form factor: Directly integrated near the accelerator
This bandwidth dramatically exceeds conventional SSD performance.
For comparison:
| Storage Type | Typical Bandwidth |
|---|---|
| PCIe 4.0 NVMe SSD | ~7 GB/s |
| PCIe 5.0 NVMe SSD | ~14 GB/s |
| HBF Stack | Up to ~1.6 TB/s |
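A quick back-of-envelope calculation shows what the table means in practice: the time to stream a hypothetical 500 GB working set (an assumed size, chosen for illustration) at each tier's approximate bandwidth:

```python
# Time to read a working set at each approximate bandwidth
# from the comparison table above.

def load_seconds(size_gb: float, bandwidth_gb_s: float) -> float:
    """Seconds needed to stream size_gb at a sustained bandwidth."""
    return size_gb / bandwidth_gb_s

SIZE_GB = 500  # hypothetical working set, e.g. large model shards

for name, bw in [("PCIe 4.0 SSD", 7), ("PCIe 5.0 SSD", 14), ("HBF", 1600)]:
    print(f"{name}: {load_seconds(SIZE_GB, bw):.2f} s")
# PCIe 4.0 SSD: ~71 s, PCIe 5.0 SSD: ~36 s, HBF: ~0.31 s
```

The two-orders-of-magnitude gap is what makes an accelerator-attached flash tier qualitatively different from storage behind a PCIe link.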
Major manufacturers—including SK hynix and SanDisk—have demonstrated prototype designs where NAND stacks connect to a base logic die to create a fully integrated storage module.
🔁 A Shift Toward Read-Centric Software #
Because HBF relies on flash memory, it introduces a new constraint: limited write endurance.
Typical HBF flash cells tolerate roughly:
- ~100,000 write (program/erase) cycles
However:
- Read operations are effectively unlimited
Implications for AI Software #
This endurance constraint requires AI frameworks to adapt.
Future accelerator software will likely adopt a read-heavy memory model, where:
- Model parameters and KV caches are frequently read
- Writes are minimized and carefully managed
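A rough wear estimate makes this concrete. Assuming ideal wear leveling and the ~100,000-cycle figure above (both idealized assumptions), a 512 GB stack written at a modest 1 TB/day would last for over a century, while sustained writes at full HBF bandwidth would exhaust it within hours:

```python
# Idealized endurance arithmetic: total writable bytes = capacity x cycles.
# Assumes perfect wear leveling; real controllers fall short of this.

def lifetime_years(capacity_gb: float, write_cycles: int,
                   writes_gb_per_day: float) -> float:
    """Years until the stack's write budget is exhausted."""
    total_writable_gb = capacity_gb * write_cycles
    return total_writable_gb / writes_gb_per_day / 365

def drain_hours(capacity_gb: float, write_cycles: int,
                write_gb_s: float) -> float:
    """Hours to exhaust the write budget at a sustained write rate."""
    return capacity_gb * write_cycles / write_gb_s / 3600

print(round(lifetime_years(512, 100_000, 1024), 1))  # ~137 years at 1 TB/day
print(round(drain_hours(512, 100_000, 1600), 1))     # ~8.9 hours at ~1.6 TB/s
```

The asymmetry between the two numbers is the whole argument: HBF is effectively inexhaustible under a read-dominant workload, and fragile under a write-heavy one.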
For example, during inference:
- The accelerator retrieves KV cache data from HBM or HBF.
- The model processes tokens sequentially.
- Output is generated one token at a time.
Because inference workloads are naturally read-dominant, they align well with HBF’s characteristics.
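The read path above can be sketched as a two-tier lookup in which new writes land only in the HBM tier and the flash tier stays read-mostly. Class and key names are illustrative, not a real framework API:

```python
# Minimal sketch of a read-centric tiered KV cache: reads fall back
# from HBM to HBF (promoting on hit), writes go only to HBM so the
# flash tier absorbs almost no program/erase cycles.

class TieredKVCache:
    def __init__(self):
        self.hbm = {}   # small, fast tier (DRAM-based)
        self.hbf = {}   # large, read-mostly tier (flash-based)

    def get(self, key):
        if key in self.hbm:
            return self.hbm[key]
        if key in self.hbf:
            value = self.hbf[key]
            self.hbm[key] = value   # promote into HBM on read
            return value
        return None

    def put(self, key, value):
        self.hbm[key] = value       # new entries land in the DRAM tier only

cache = TieredKVCache()
cache.hbf["layer0"] = b"kv-block"   # pre-staged parameters / KV data
print(cache.get("layer0"))          # served from HBF, then cached in HBM
```

A production design would also need eviction from HBM and a carefully batched, wear-aware path for the rare writes into flash, which the sketch omits.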
🚀 Future Roadmap: HBM6 and Beyond #
HBF is expected to enter the market alongside next-generation HBM technologies.
HBM6 Era Integration #
During the HBM6 generation:
- Multiple HBM stacks will form the high-speed compute memory layer.
- HBF modules will provide large-scale storage close to the accelerator.
This creates a multi-tier accelerator memory hierarchy.
Toward the “Storage Factory” #
Future generations (often referred to conceptually as HBM7) envision a system where accelerators access a distributed storage pool sometimes described as a “Storage Factory.”
In such architectures:
- Data could be processed directly within storage modules
- Intermediate storage networks may be bypassed
- Latency between compute and data could shrink dramatically
Early Industry Milestones #
Kioxia has already demonstrated a 5 TB HBF prototype module using:
- PCIe Gen6 x8 connectivity
- 64 GT/s per-lane signaling (roughly 64 GB/s across the link)
These early prototypes hint at the massive scale possible for future AI memory systems.
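The prototype's link bandwidth can be sanity-checked from the PCIe 6.0 specification: 64 GT/s per lane is effectively 64 Gb/s per lane, so an x8 link carries about 64 GB/s before protocol overhead:

```python
# PCIe 6.0 link-bandwidth arithmetic: 64 GT/s per lane, one bit
# per transfer per lane, 8 lanes, ignoring FLIT/protocol overhead.

PCIE6_GT_PER_LANE = 64   # gigatransfers/s per lane (PCIe 6.0 spec)
LANES = 8

raw_gb_s = PCIE6_GT_PER_LANE * LANES / 8   # bits -> bytes
print(raw_gb_s)   # 64.0 GB/s raw, before encoding/protocol overhead
```

That is still a far cry from an HBF stack's on-package bandwidth, which is why such modules are positioned as a capacity tier behind the link rather than a replacement for accelerator-attached memory.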
🏭 Manufacturing Challenges #
Building HBF stacks is technically complex.
Key manufacturing challenges include:
- Wafer warpage control at the base die
- Micro-bump interconnect density
- Thermal management for dense NAND stacks
- TSV reliability across multiple layers
As NAND layer counts increase, maintaining mechanical stability and yield becomes increasingly difficult.
📈 Market Outlook #
Major semiconductor manufacturers are moving quickly to commercialize HBF.
Current projections suggest:
- 24-month development window for early integration
- Initial deployment in AI accelerators from companies such as NVIDIA, AMD, and Google
- Potential introduction alongside next-generation HBM platforms
Long-term forecasts are even more ambitious.
Professor Kim Joungho predicts that by 2038, the HBF market could surpass the HBM market, as capacity becomes the dominant constraint in scaling AI systems.
🧠 The Bigger Picture #
The future of AI computing will depend on balancing three resources:
- Compute
- Memory bandwidth
- Memory capacity
HBM solved the bandwidth challenge. HBF may solve the capacity problem.
Together, they represent the next step toward a multi-tier memory hierarchy optimized for AI-scale workloads.