
LLM Training Storage Demands: Data & Checkpoints


Large language model (LLM) training involves two major storage consumers:

  • Training datasets, which feed the model’s learning.
  • Model checkpoints, which persist model state during long-running training.

This article explains how each category scales, where bottlenecks arise, and why checkpointing is a critical part of trillion-parameter LLM training.


📚 Training Data

Transformer workloads are compute-heavy, not data-heavy. Tokenized datasets are surprisingly compact:

  • English token ≈ 4 bytes
  • 1 trillion tokens → only a few terabytes

SOTA models train on tens of trillions of tokens, resulting in tens of terabytes of processed text—large, but manageable compared to checkpoint sizes.
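
As a quick back-of-the-envelope check of these figures, here is the arithmetic in a few lines of Python (assuming ~4 bytes of raw text per token, as above):

```python
# Back-of-the-envelope dataset sizing, assuming ~4 bytes of text per token.
BYTES_PER_TOKEN = 4

def dataset_size_tb(num_tokens: float) -> float:
    """Approximate on-disk size of a tokenized corpus, in terabytes."""
    return num_tokens * BYTES_PER_TOKEN / 1e12

print(dataset_size_tb(1e12))   # 1 trillion tokens  -> ~4 TB
print(dataset_size_tb(15e12))  # 15 trillion tokens -> ~60 TB
```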


💾 Checkpoints: Scaling With Model Size

Checkpoint size grows linearly with model parameters, not GPU count. A common rule of thumb:

16 bytes per parameter
(weights + gradients + optimizer state)
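
As a minimal sketch, assuming a mixed-precision Adam setup (fp16 weights and gradients plus fp32 master weights and two optimizer moments), the rule of thumb translates directly into a size estimate; the exact byte count varies by framework and precision recipe:

```python
# Rule-of-thumb checkpoint size: ~16 bytes per parameter.
# Assumed breakdown (illustrative): fp16 weights (2) + fp16 gradients (2)
# + fp32 master weights (4) + Adam first and second moments (4 + 4).
BYTES_PER_PARAM = 16

def checkpoint_size_tb(num_params: float) -> float:
    """Approximate size of the full training state, in terabytes."""
    return num_params * BYTES_PER_PARAM / 1e12

print(checkpoint_size_tb(1e12))  # ~16 TB for a 1T-parameter model; the analysis
                                 # later in this article assumes a slightly leaner
                                 # ~14 bytes/parameter, giving ~13.8 TB.
```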

Modern training systems optimize checkpoint performance with multi-tier asynchronous pipelines:

  1. Copy GPU memory → host memory (brief stall).
  2. Resume training while CPU flushes to NVM/SSD.
  3. Background processes upload final data to distributed object storage.

ByteDance MegaScale and Microsoft Nebula implement such multi-layer designs for durability and speed.
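
The sketch below shows the general shape of such a pipeline, not any particular system's implementation; `snapshot_to_host`, `flush_to_ssd`, and `upload_to_object_store` are hypothetical placeholders for the real tier handoffs:

```python
import threading
from queue import Queue

def snapshot_to_host(train_state) -> bytes:
    """Tier 1: copy GPU-resident state into host memory (brief training stall)."""
    return bytes(train_state)              # placeholder for a device-to-host copy

def flush_to_ssd(blob: bytes, path: str) -> str:
    """Tier 2: persist the host-memory snapshot to local NVMe/SSD."""
    with open(path, "wb") as f:
        f.write(blob)
    return path

def upload_to_object_store(path: str) -> None:
    """Tier 3: copy the local file to distributed object storage (e.g. HDFS)."""
    pass                                   # placeholder for the actual upload

def background_flusher(q: Queue) -> None:
    """Runs off the critical path, draining snapshots through tiers 2 and 3."""
    while True:
        step, blob = q.get()
        upload_to_object_store(flush_to_ssd(blob, f"/tmp/ckpt_{step}.bin"))
        q.task_done()

flush_queue: Queue = Queue()
threading.Thread(target=background_flusher, args=(flush_queue,), daemon=True).start()

def checkpoint(step: int, train_state) -> None:
    """Only the host-memory snapshot blocks training; the rest is asynchronous."""
    flush_queue.put((step, snapshot_to_host(train_state)))
```

Real systems add sharding, replication, and integrity checks around this basic flow.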


⚡ Fast Checkpointing & Recovery (ByteDance MegaScale)

Because large-scale training may run for weeks or months, checkpointing and recovery must be efficient.

Two-Phase MegaScale Design

  • Phase 1: Each GPU writes state to host memory in seconds.
  • Phase 2: Background thread flushes data to HDFS asynchronously.

Optimized Recovery

One GPU per data-parallel group loads a partitioned state from HDFS and broadcasts it to peers—greatly reducing read pressure and accelerating recovery.
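
A minimal sketch of that pattern with torch.distributed, assuming PyTorch 2.x, one process per GPU, and a `dp_group` containing the ranks that hold identical replicas of a given shard; `load_partition_from_hdfs` is a hypothetical loader:

```python
import torch
import torch.distributed as dist

def load_partition_from_hdfs(path: str) -> torch.Tensor:
    """Hypothetical placeholder for reading one checkpoint shard from HDFS."""
    raise NotImplementedError

def recover_shard(dp_group, shard_path: str, shard_numel: int) -> torch.Tensor:
    """One rank per data-parallel group reads the shard and broadcasts it, so
    shared storage serves one read per replica group instead of one per GPU."""
    src = dist.get_process_group_ranks(dp_group)[0]            # designated reader
    if dist.get_rank() == src:
        shard = load_partition_from_hdfs(shard_path).cuda()
    else:
        shard = torch.empty(shard_numel, dtype=torch.float16, device="cuda")
    dist.broadcast(shard, src=src, group=dp_group)             # peers receive a copy
    return shard
```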


🧱 The Checkpointing Challenge in LLMs

As datasets reach petabyte scale and models reach hundreds of billions to trillions of parameters, parallelism becomes essential.

Full-Dimension Parallelism

Efficient LLM training relies on combining:

  • Data parallelism
  • Model (tensor) parallelism
  • Pipeline parallelism

For example, GPT-3 (175B parameters, ~500 GB) cannot fit in the 80 GB of memory on a single GPU, and even if it could, training on one GPU would take 300+ years.

Blending all three parallel modes—as done by Stanford, NVIDIA, and Microsoft—enables practical throughput and memory efficiency.
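
As a rough illustration of how the three modes compose (the degrees below are made up for this example, not a published configuration), the product of the three parallel degrees must equal the total GPU count:

```python
# Illustrative only: how the three parallelism degrees compose.
data_parallel     = 6    # model replicas, each consuming different batches
tensor_parallel   = 8    # each layer's weights split across 8 GPUs
pipeline_parallel = 16   # layers split into 16 sequential stages

world_size = data_parallel * tensor_parallel * pipeline_parallel
print(world_size)        # 768 GPUs in this hypothetical configuration
```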


🔐 Checkpoints for Recoverability

Because training spans months:

  • Failures must not lose progress
  • Hyperparameters may need rollback
  • Results require reproducibility

Training is GPU-bound, but checkpointing is I/O-bound:

  • Writes dominate checkpoint creation
  • Reads bottleneck recovery time

📦 Checkpoint Size = Model Size

Checkpoint size is determined only by the number of parameters, not by dataset size, GPU count, or checkpoint frequency.

Example — GPT-3 (175B)

  • Model ≈ 500 GB
  • Uses data + tensor + pipeline parallelism
  • Only one pipeline group writes the checkpoint

This scales efficiently even to trillion-parameter models.
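
One way to express the write-eligibility rule in code, sketched under the same torch.distributed assumptions as the recovery example above (the first replica in each data-parallel group writes; its peers hold identical weights and skip the write):

```python
import torch.distributed as dist

def is_checkpoint_writer(dp_group) -> bool:
    """True only for the designated writer in each data-parallel group."""
    return dist.get_rank() == dist.get_process_group_ranks(dp_group)[0]
```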


🧮 Mathematical Checkpoint Analysis

For a 1T-parameter model:

  • Checkpoint size: ~13.8 TB
  • Write bandwidth: ~273 GB/s
  • Latency: ~50 seconds (every 4 hours)
  • Overhead: 0.3%
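
A quick sanity check of these figures, using the ~13.8 TB checkpoint size quoted above:

```python
# Reproducing the 1T-parameter figures above (approximate).
ckpt_bytes    = 13.8e12        # ~13.8 TB checkpoint (~14 bytes per parameter)
write_latency = 50             # seconds allowed per checkpoint write
interval      = 4 * 3600       # one checkpoint every 4 hours

write_bw = ckpt_bytes / write_latency   # ~2.8e11 B/s, i.e. roughly 270-280 GB/s
overhead = write_latency / interval     # ~0.0035, i.e. roughly 0.3%
print(f"{write_bw / 1e9:.0f} GB/s write bandwidth, {overhead:.2%} overhead")
```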

Recovery Requirements

Read bandwidth must be much higher:

  • Read bandwidth = data parallelism × write bandwidth
  • With a data-parallel degree of 6, that is 6 × 273 GB/s ≈ 1.64 TB/s of read bandwidth

Storage must sustain:

  • ≈ 1.6 TB/s read
  • ≈ 280 GB/s write

to hit the 50-second recovery target.
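
The same arithmetic for the read side, using the write bandwidth from above and a data-parallel degree of 6:

```python
# Recovery-side sizing, following the formula above.
data_parallelism = 6        # replica groups reading the checkpoint in parallel
write_bw_gbs     = 273      # GB/s, from the write-side analysis

read_bw_gbs = data_parallelism * write_bw_gbs
print(read_bw_gbs)          # 1638 GB/s, i.e. ~1.64 TB/s
```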


🧭 Common Misconceptions

  • Myth: “Each GPU needs 1 GB/s for checkpointing.”
    Reality: Only pipeline groups write; per-GPU demand is much lower.
  • Myth: “Checkpoint size depends on dataset size.”
    Reality: Checkpoint size depends only on model parameters.

🏁 Conclusion

As LLMs enter the trillion-parameter era, storage architecture becomes just as important as computation.

Accurate checkpoint analysis—not rough estimates—helps avoid massive inefficiencies. With proper checkpointing design, training systems remain:

  • Resilient
  • Efficient
  • Scalable

even under extreme model and dataset growth.
