LLM Training Storage Demands: Data & Checkpoints
Large language model (LLM) training involves two major storage consumers:
- Training datasets, which feed the model’s learning.
- Model checkpoints, which persist model state during long-running training.
This article explains how each category scales, where bottlenecks arise, and why checkpointing is a critical part of trillion-parameter LLM training.
📚 Training Data #
Transformer workloads are compute-heavy, not data-heavy. Tokenized datasets are surprisingly compact:
- One English token ≈ 4 bytes (e.g., a 32-bit token ID)
- 1 trillion tokens → roughly 4 TB on disk
SOTA models train on tens of trillions of tokens, resulting in tens of terabytes of processed text—large, but manageable compared to checkpoint sizes.
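As a quick sanity check, the back-of-the-envelope arithmetic is simple; the sketch below assumes the ~4 bytes-per-token figure above, and the token counts are illustrative.

```python
# Rough dataset-size estimate: tokenized text is compact on disk.
BYTES_PER_TOKEN = 4  # assumption: one 32-bit token ID per token

def dataset_size_tb(num_tokens: float) -> float:
    """Approximate stored size of a tokenized dataset, in terabytes (10^12 bytes)."""
    return num_tokens * BYTES_PER_TOKEN / 1e12

print(dataset_size_tb(1e12))   # 1 trillion tokens  -> ~4 TB
print(dataset_size_tb(15e12))  # 15 trillion tokens -> ~60 TB
```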
💾 Checkpoints: Scaling With Model Size #
Checkpoint size grows linearly with model parameters, not with GPU count. A common rule of thumb:
- ≈ 16 bytes per parameter (weights + gradients + optimizer state)
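A minimal sketch of how that 16-byte figure typically breaks down, assuming a mixed-precision Adam-style setup; the exact split varies by framework and optimizer.

```python
# Hypothetical per-parameter byte budget behind the 16 bytes/parameter rule.
BYTES_PER_PARAM = {
    "weights (fp16)": 2,
    "gradients (fp16)": 2,
    "fp32 master weights": 4,
    "fp32 momentum (Adam)": 4,
    "fp32 variance (Adam)": 4,
}  # total: 16 bytes per parameter

def checkpoint_size_tb(num_params: float) -> float:
    """Approximate checkpoint size in TB; it depends only on parameter count."""
    return num_params * sum(BYTES_PER_PARAM.values()) / 1e12

print(checkpoint_size_tb(175e9))  # GPT-3 scale:   ~2.8 TB
print(checkpoint_size_tb(1e12))   # 1T parameters: ~16 TB (the analysis later in
                                  # this article uses ~13.8 TB, a leaner budget)
```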
Modern training systems optimize checkpoint performance with multi-tier asynchronous pipelines:
- Copy GPU memory → host memory (a brief training stall).
- Resume training while the host flushes the copy to local NVMe/SSD.
- Background processes upload the data to distributed object storage.
ByteDance MegaScale and Microsoft Nebula implement such multi-layer designs for durability and speed.
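A hedged sketch of the general pattern in PyTorch-style Python is shown below; the helper names and the object-storage upload step are placeholders, not MegaScale or Nebula APIs.

```python
import threading
import torch

def snapshot_to_host(model: torch.nn.Module) -> dict:
    """Tier 1: copy GPU tensors into host memory; this is the only step that
    briefly stalls training. Optimizer state would be copied the same way."""
    host_state = {name: t.detach().to("cpu", non_blocking=True)
                  for name, t in model.state_dict().items()}
    torch.cuda.synchronize()  # ensure the async device-to-host copies finished
    return host_state

def flush_in_background(host_state: dict, path: str) -> threading.Thread:
    """Tiers 2-3: persist the host copy while training continues on the GPUs."""
    def _flush() -> None:
        torch.save(host_state, path)   # local NVMe/SSD copy
        # upload_to_object_store(path) # placeholder: HDFS / S3 / blob storage
    thread = threading.Thread(target=_flush, daemon=True)
    thread.start()
    return thread

# Usage: training only blocks on the snapshot, not on the disk or network writes.
# host_state = snapshot_to_host(model)
# flush_in_background(host_state, "/nvme/ckpt_step_001000.pt")
```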
⚡ Fast Checkpointing & Recovery (ByteDance MegaScale) #
Because large-scale training may run for weeks or months, checkpointing and recovery must be efficient.
Two-Phase MegaScale Design #
- Phase 1: Each GPU writes state to host memory in seconds.
- Phase 2: Background thread flushes data to HDFS asynchronously.
Optimized Recovery #
One GPU in each data-parallel group loads its partition of the checkpoint from HDFS and broadcasts it to its peers, greatly reducing read pressure on storage and accelerating recovery.
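A minimal sketch of that recovery pattern with `torch.distributed`; the group handle, path, and in-place broadcast loop are illustrative assumptions, not MegaScale's actual code.

```python
import torch
import torch.distributed as dist

def load_and_broadcast(model: torch.nn.Module, path: str, dp_group) -> None:
    """One rank per data-parallel group reads its shard from shared storage and
    broadcasts the tensors to its peers over the interconnect, so storage sees
    one reader per group instead of one reader per GPU."""
    src = dist.get_global_rank(dp_group, 0)   # group-local rank 0 does the read
    if dist.get_rank() == src:
        state = torch.load(path, map_location="cuda")
        model.load_state_dict(state)

    # Peers receive the freshly loaded weights in place via broadcast.
    for tensor in model.state_dict().values():
        dist.broadcast(tensor, src=src, group=dp_group)
```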
🧱 The Checkpointing Challenge in LLMs #
As datasets reach petabyte scale and models reach hundreds of billions to trillions of parameters, parallelism becomes essential.
Full-Dimension Parallelism #
Efficient LLM training relies on combining:
- Data parallelism
- Model (tensor) parallelism
- Pipeline parallelism
For example, GPT-3 (175B parameters, ~500 GB) cannot fit in the 80 GB memory of a single GPU, and even if it could, training it on one GPU would take 300+ years.
Blending all three forms of parallelism, as demonstrated in joint work by Stanford, NVIDIA, and Microsoft, delivers practical throughput and memory efficiency.
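The sketch below shows one way the three dimensions compose into a GPU grid; the rank ordering (tensor innermost, data outermost) is a common convention but not universal, and the group sizes are illustrative.

```python
# Decompose a cluster of tp * pp * dp GPUs into the three parallel dimensions.

def rank_to_coords(rank: int, tp: int, pp: int, dp: int) -> tuple[int, int, int]:
    """Map a global rank to (data, pipeline, tensor) indices, tensor innermost."""
    assert 0 <= rank < tp * pp * dp
    tp_idx = rank % tp
    pp_idx = (rank // tp) % pp
    dp_idx = rank // (tp * pp)
    return dp_idx, pp_idx, tp_idx

# Example: 1024 GPUs as dp=16, pp=8, tp=8.
# Each model replica spans pp * tp = 64 GPUs; there are dp = 16 replicas.
print(rank_to_coords(0,   tp=8, pp=8, dp=16))  # (0, 0, 0)
print(rank_to_coords(513, tp=8, pp=8, dp=16))  # (8, 0, 1)
```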
🔐 Checkpoints for Recoverability #
Because training spans months:
- Failures must not lose progress
- Runs may need to roll back (e.g., after a divergence or a hyperparameter change)
- Results require reproducibility
Training is GPU-bound, but checkpointing is I/O-bound:
- Writes dominate checkpoint creation
- Reads bottleneck recovery time
📦 Checkpoint Size = Model Size #
Checkpoint size is determined by the number of model parameters (times the bytes stored per parameter), not by dataset size, GPU count, or checkpoint frequency.
Example — GPT-3 (175B) #
- Model ≈ 500 GB
- Uses data + tensor + pipeline parallelism
- Only one data-parallel replica (a single pipeline of model shards) writes the checkpoint
This scales efficiently even to trillion-parameter models.
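A small sketch of the "only one replica writes" rule; the rank checks and the shard naming scheme are purely illustrative.

```python
def should_write(dp_rank: int) -> bool:
    """Only the first data-parallel replica persists its shards; the other
    replicas hold identical state and skip the write entirely."""
    return dp_rank == 0

def shard_filename(step: int, pp_rank: int, tp_rank: int) -> str:
    """Each writer saves only its own tensor/pipeline partition of the model."""
    return f"ckpt_step{step:08d}_pp{pp_rank:02d}_tp{tp_rank:02d}.pt"

print(should_write(dp_rank=0))   # True  -> this replica writes
print(should_write(dp_rank=3))   # False -> identical copy, skip
print(shard_filename(step=1000, pp_rank=2, tp_rank=5))
```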
🧮 Mathematical Checkpoint Analysis #
For a 1T-parameter model:
- Checkpoint size: ~13.8 TB
- Required write bandwidth: ~273 GB/s (≈ 13.8 TB / 50 s)
- Write time: ~50 seconds per checkpoint, taken every 4 hours
- Training overhead: ~0.3% (50 s out of every 4 hours)
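Reproducing the write-side arithmetic; small rounding differences from the figures above are expected.

```python
# Write-side requirements for checkpointing a 1T-parameter model.
CKPT_SIZE_TB   = 13.8        # checkpoint size from the analysis above
WRITE_WINDOW_S = 50          # target time to drain the checkpoint
INTERVAL_S     = 4 * 3600    # one checkpoint every 4 hours

write_bw_gbs = CKPT_SIZE_TB * 1000 / WRITE_WINDOW_S
overhead     = WRITE_WINDOW_S / INTERVAL_S

print(f"required write bandwidth ≈ {write_bw_gbs:.0f} GB/s")  # ≈ 276 GB/s
print(f"training-time overhead ≈ {overhead:.1%}")             # ≈ 0.3%
```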
Recovery Requirements #
Read bandwidth must be much higher, because every data-parallel replica reloads the checkpoint:
- Required read bandwidth = data-parallel degree × write bandwidth
- With a data-parallel degree of 6 → ~1.64 TB/s of reads
To hit the 50-second checkpoint and recovery targets, storage must sustain:
- ≈ 1.6 TB/s of reads (recovery)
- ≈ 280 GB/s of writes (checkpoint creation)
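The read-side arithmetic follows directly from the data-parallel degree, using the figures above.

```python
# Recovery-side requirement: every data-parallel replica reloads the checkpoint.
WRITE_BW_GBS = 273   # per-checkpoint write bandwidth from the analysis above
DP_DEGREE    = 6     # number of data-parallel replicas reading simultaneously

read_bw_tbs = WRITE_BW_GBS * DP_DEGREE / 1000
print(f"required read bandwidth ≈ {read_bw_tbs:.2f} TB/s")   # ≈ 1.64 TB/s
```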
🧭 Common Misconceptions #
| Myth | Reality |
|---|---|
| “Each GPU needs 1 GB/s for checkpointing.” | Only one data-parallel replica's GPUs write; per-GPU bandwidth demand is far lower. |
| “Checkpoint size depends on dataset size.” | Checkpoint size depends only on the number of model parameters. |
🏁 Conclusion #
As LLMs enter the trillion-parameter era, storage architecture becomes just as important as computation.
Accurate checkpoint analysis—not rough estimates—helps avoid massive inefficiencies. With proper checkpointing design, training systems remain:
- Resilient
- Efficient
- Scalable
even under extreme model and dataset growth.