LLM Training Storage Demands: Data & Checkpoints
Large language model (LLM) training involves two major storage consumers:
- Training datasets, which feed the model’s learning.
- Model checkpoints, which persist model state during long-running training.
This article explains how each category scales, where bottlenecks arise, and why checkpointing is a critical part of trillion-parameter LLM training.
📚 Training Data #
Transformer workloads are compute-heavy, not data-heavy. Tokenized datasets are surprisingly compact:
- One English token ≈ 4 bytes (e.g., a 32-bit token ID)
- 1 trillion tokens → roughly 4 TB on disk
SOTA models train on tens of trillions of tokens, resulting in tens of terabytes of processed text—large, but manageable compared to checkpoint sizes.
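As a quick sanity check, the back-of-the-envelope arithmetic is simple; the sketch below assumes the ~4 bytes-per-token figure above, and the token counts are illustrative.

```python
# Rough dataset-size estimate: tokenized text is compact on disk.
BYTES_PER_TOKEN = 4  # assumption: one 32-bit token ID per token

def dataset_size_tb(num_tokens: float) -> float:
    """Approximate stored size of a tokenized dataset, in terabytes (10^12 bytes)."""
    return num_tokens * BYTES_PER_TOKEN / 1e12

print(dataset_size_tb(1e12))   # 1 trillion tokens  -> ~4 TB
print(dataset_size_tb(15e12))  # 15 trillion tokens -> ~60 TB
```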
💾 Checkpoints: Scaling With Model Size #
Checkpoint size grows linearly with model parameters, not with GPU count. A common rule of thumb:
- ≈ 16 bytes per parameter (weights + gradients + optimizer state)
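A minimal sketch of how that 16-byte figure typically breaks down, assuming a mixed-precision Adam-style setup; the exact split varies by framework and optimizer.

```python
# Hypothetical per-parameter byte budget behind the 16 bytes/parameter rule.
BYTES_PER_PARAM = {
    "weights (fp16)": 2,
    "gradients (fp16)": 2,
    "fp32 master weights": 4,
    "fp32 momentum (Adam)": 4,
    "fp32 variance (Adam)": 4,
}  # total: 16 bytes per parameter

def checkpoint_size_tb(num_params: float) -> float:
    """Approximate checkpoint size in TB; it depends only on parameter count."""
    return num_params * sum(BYTES_PER_PARAM.values()) / 1e12

print(checkpoint_size_tb(175e9))  # GPT-3 scale:   ~2.8 TB
print(checkpoint_size_tb(1e12))   # 1T parameters: ~16 TB (the analysis later in
                                  # this article uses ~13.8 TB, a leaner budget)
```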
Modern training systems optimize checkpoint performance with multi-tier asynchronous pipelines:
- Copy GPU memory → host memory (a brief training stall).
- Resume training while the host flushes the copy to local NVMe/SSD.
- Background processes upload the data to distributed object storage.
ByteDance MegaScale and Microsoft Nebula implement such multi-layer designs for durability and speed.
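A hedged sketch of the general pattern in PyTorch-style Python is shown below; the helper names and the object-storage upload step are placeholders, not MegaScale or Nebula APIs.

```python
import threading
import torch

def snapshot_to_host(model: torch.nn.Module) -> dict:
    """Tier 1: copy GPU tensors into host memory; this is the only step that
    briefly stalls training. Optimizer state would be copied the same way."""
    host_state = {name: t.detach().to("cpu", non_blocking=True)
                  for name, t in model.state_dict().items()}
    torch.cuda.synchronize()  # ensure the async device-to-host copies finished
    return host_state

def flush_in_background(host_state: dict, path: str) -> threading.Thread:
    """Tiers 2-3: persist the host copy while training continues on the GPUs."""
    def _flush() -> None:
        torch.save(host_state, path)   # local NVMe/SSD copy
        # upload_to_object_store(path) # placeholder: HDFS / S3 / blob storage
    thread = threading.Thread(target=_flush, daemon=True)
    thread.start()
    return thread

# Usage: training only blocks on the snapshot, not on the disk or network writes.
# host_state = snapshot_to_host(model)
# flush_in_background(host_state, "/nvme/ckpt_step_001000.pt")
```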
⚡ Fast Checkpointing & Recovery (ByteDance MegaScale) #
Because large-scale training may run for weeks or months, checkpointing and recovery must be efficient.
Two-Phase MegaScale Design #
- Phase 1: Each GPU writes state to host memory in seconds.
- Phase 2: Background thread flushes data to HDFS asynchronously.
Optimized Recovery #
One GPU in each data-parallel group loads its partition of the checkpoint from HDFS and broadcasts it to its peers, greatly reducing read pressure on storage and accelerating recovery.
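A minimal sketch of that recovery pattern with `torch.distributed`; the group handle, path, and in-place broadcast loop are illustrative assumptions, not MegaScale's actual code.

```python
import torch
import torch.distributed as dist

def load_and_broadcast(model: torch.nn.Module, path: str, dp_group) -> None:
    """One rank per data-parallel group reads its shard from shared storage and
    broadcasts the tensors to its peers over the interconnect, so storage sees
    one reader per group instead of one reader per GPU."""
    src = dist.get_global_rank(dp_group, 0)   # group-local rank 0 does the read
    if dist.get_rank() == src:
        state = torch.load(path, map_location="cuda")
        model.load_state_dict(state)

    # Peers receive the freshly loaded weights in place via broadcast.
    for tensor in model.state_dict().values():
        dist.broadcast(tensor, src=src, group=dp_group)
```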
🧱 The Checkpointing Challenge in LLMs #
As datasets reach petabyte scale and models reach hundreds of billions to trillions of parameters, parallelism becomes essential.
Full-Dimension Parallelism #
Efficient LLM training relies on combining:
- Data parallelism
- Model (tensor) parallelism
- Pipeline parallelism
For example, GPT-3 (175B parameters, ~500 GB) cannot fit in the 80 GB memory of a single GPU, and even if it could, training it on one GPU would take 300+ years.
Blending all three forms of parallelism, as demonstrated in joint work by Stanford, NVIDIA, and Microsoft, delivers practical throughput and memory efficiency.
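The sketch below shows one way the three dimensions compose into a GPU grid; the rank ordering (tensor innermost, data outermost) is a common convention but not universal, and the group sizes are illustrative.

```python
# Decompose a cluster of tp * pp * dp GPUs into the three parallel dimensions.

def rank_to_coords(rank: int, tp: int, pp: int, dp: int) -> tuple[int, int, int]:
    """Map a global rank to (data, pipeline, tensor) indices, tensor innermost."""
    assert 0 <= rank < tp * pp * dp
    tp_idx = rank % tp
    pp_idx = (rank // tp) % pp
    dp_idx = rank // (tp * pp)
    return dp_idx, pp_idx, tp_idx

# Example: 1024 GPUs as dp=16, pp=8, tp=8.
# Each model replica spans pp * tp = 64 GPUs; there are dp = 16 replicas.
print(rank_to_coords(0,   tp=8, pp=8, dp=16))  # (0, 0, 0)
print(rank_to_coords(513, tp=8, pp=8, dp=16))  # (8, 0, 1)
```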
🔐 Checkpoints for Recoverability #
Because training spans months:
- Failures must not lose progress
- Runs may need to roll back (e.g., after a divergence or a hyperparameter change)
- Results require reproducibility
Training is GPU-bound, but checkpointing is I/O-bound:
- Writes dominate checkpoint creation
- Reads bottleneck recovery time
📦 Checkpoint Size = Model Size #
Checkpoint size is determined by the number of model parameters (times the bytes stored per parameter), not by dataset size, GPU count, or checkpoint frequency.
Example — GPT-3 (175B) #
- Model ≈ 500 GB
- Uses data + tensor + pipeline parallelism
- Only one data-parallel replica (a single pipeline of model shards) writes the checkpoint
This scales efficiently even to trillion-parameter models.
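A small sketch of the "only one replica writes" rule; the rank checks and the shard naming scheme are purely illustrative.

```python
def should_write(dp_rank: int) -> bool:
    """Only the first data-parallel replica persists its shards; the other
    replicas hold identical state and skip the write entirely."""
    return dp_rank == 0

def shard_filename(step: int, pp_rank: int, tp_rank: int) -> str:
    """Each writer saves only its own tensor/pipeline partition of the model."""
    return f"ckpt_step{step:08d}_pp{pp_rank:02d}_tp{tp_rank:02d}.pt"

print(should_write(dp_rank=0))   # True  -> this replica writes
print(should_write(dp_rank=3))   # False -> identical copy, skip
print(shard_filename(step=1000, pp_rank=2, tp_rank=5))
```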
🧮 Mathematical Checkpoint Analysis #
For a 1T-parameter model:
- Checkpoint size: ~13.8 TB
- Required write bandwidth: ~273 GB/s (≈ 13.8 TB / 50 s)
- Write time: ~50 seconds per checkpoint, taken every 4 hours
- Training overhead: ~0.3% (50 s out of every 4 hours)
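Reproducing the write-side arithmetic; small rounding differences from the figures above are expected.

```python
# Write-side requirements for checkpointing a 1T-parameter model.
CKPT_SIZE_TB   = 13.8        # checkpoint size from the analysis above
WRITE_WINDOW_S = 50          # target time to drain the checkpoint
INTERVAL_S     = 4 * 3600    # one checkpoint every 4 hours

write_bw_gbs = CKPT_SIZE_TB * 1000 / WRITE_WINDOW_S
overhead     = WRITE_WINDOW_S / INTERVAL_S

print(f"required write bandwidth ≈ {write_bw_gbs:.0f} GB/s")  # ≈ 276 GB/s
print(f"training-time overhead ≈ {overhead:.1%}")             # ≈ 0.3%
```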
Recovery Requirements #
Read bandwidth must be much higher, because every data-parallel replica reloads the checkpoint:
- Required read bandwidth = data-parallel degree × write bandwidth
- With a data-parallel degree of 6 → ~1.64 TB/s of reads
To hit the 50-second checkpoint and recovery targets, storage must sustain:
- ≈ 1.6 TB/s of reads (recovery)
- ≈ 280 GB/s of writes (checkpoint creation)
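The read-side arithmetic follows directly from the data-parallel degree, using the figures above.

```python
# Recovery-side requirement: every data-parallel replica reloads the checkpoint.
WRITE_BW_GBS = 273   # per-checkpoint write bandwidth from the analysis above
DP_DEGREE    = 6     # number of data-parallel replicas reading simultaneously

read_bw_tbs = WRITE_BW_GBS * DP_DEGREE / 1000
print(f"required read bandwidth ≈ {read_bw_tbs:.2f} TB/s")   # ≈ 1.64 TB/s
```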
🧭 Common Misconceptions #
| Myth | Reality |
|---|---|
| “Each GPU needs 1 GB/s for checkpointing.” | Only one data-parallel replica's GPUs write; per-GPU bandwidth demand is far lower. |
| “Checkpoint size depends on dataset size.” | Checkpoint size depends only on the number of model parameters. |
🏁 Conclusion #
As LLMs enter the trillion-parameter era, storage architecture becomes just as important as computation.
Accurate checkpoint analysis—not rough estimates—helps avoid massive inefficiencies. With proper checkpointing design, training systems remain:
- Resilient
- Efficient
- Scalable
even under extreme model and dataset growth.