
Inside Meta’s 24K-GPU AI Superclusters


Meta has revealed new details about its next-generation 24,576-GPU AI clusters, representing one of the largest and most open high-performance AI training infrastructures in the world. These clusters, built for training Llama 3 and future GenAI models, showcase Meta’s commitment to open hardware, open software, and industry-wide collaboration.

The systems are built on Grand Teton, OpenRack, and PyTorch, reinforcing Meta’s long-standing strategy of advancing AI through open compute ecosystems.

By the end of 2024, Meta expects its GPU fleet to scale to 350,000 NVIDIA H100 GPUs, equivalent to roughly 600,000 H100s of compute when including other accelerators—an unprecedented leap in AI infrastructure.

(Figure: Meta GenAI Model)

🚀 Why Meta Is Building Such Massive AI Clusters

Meta’s long-term objective is to develop open and responsible AGI. Achieving that vision requires the ability to train large, complex models at extreme scale. The new clusters support the entire range of Meta’s AI workloads—from Llama research to multimodal models, recommendation systems, and new AI-native devices.

Meta previously disclosed its AI Research SuperCluster (RSC) in 2022, a 16,000-GPU A100 system that powered early Llama development. The new clusters represent the next stage in scale, performance, and reliability.

🧩 Key Components of Meta’s 24,576-GPU AI Clusters

1. Networking: RoCE and InfiniBand at 400 Gbps

Meta built two versions of its new cluster:

  • A RoCE (RDMA over Converged Ethernet) version
    built on Arista 7800 switches with OCP Wedge400 and Minipack2 rack switches
  • An NVIDIA Quantum2 InfiniBand version

Both interconnect GPUs at 400 Gbps, giving Meta the ability to directly compare scalability, resilience, and performance between Ethernet and InfiniBand fabrics at extreme scale.

Through careful co-design of the networking software and training stack, both clusters run large-scale AI training successfully in production, including the current Llama 3 training run on the RoCE cluster.
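Because NCCL runs over both InfiniBand and RoCE, the same PyTorch training code can target either fabric unchanged. Here is a minimal sketch, assuming a launch via torchrun (which supplies the rendezvous environment variables); this is an illustration, not Meta's internal tooling:

```python
# Minimal distributed setup: NCCL transparently uses whichever RDMA
# fabric (InfiniBand or RoCE) the cluster provides.
import os

import torch
import torch.distributed as dist

def init_training_process_group() -> None:
    # torchrun sets RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, MASTER_PORT.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

if __name__ == "__main__":
    init_training_process_group()
    # A small all-reduce exercises the 400 Gbps interconnect end to end.
    x = torch.ones(1, device="cuda")
    dist.all_reduce(x)
    print(f"rank {dist.get_rank()}: all-reduce sum = {x.item()}")
    dist.destroy_process_group()
```

Launched with, e.g., `torchrun --nproc_per_node=8 train.py` on each node, the same script runs on either fabric, which is what makes the apples-to-apples comparison possible.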

2. Compute: Grand Teton, Meta’s Open GPU Platform

Each cluster is built using Grand Teton, Meta’s internally designed GPU system contributed to the Open Compute Project.

Key benefits of Grand Teton:

  • Integrated power, control, compute, and fabric interfaces
  • Strong thermal and signal-integrity performance
  • Rapid deployment and simplified maintenance
  • Flexible design for future GPUs, interconnects, and power architectures

Meta has been publicly designing open GPU platforms since Big Sur in 2015, and Grand Teton is its most advanced iteration.

3. Storage: Tectonic + Hammerspace for Extreme Scale

Training frontier AI models requires massive, fast, and resilient storage. Meta’s solution integrates:

  • Tectonic, Meta’s flash-optimized distributed storage system
  • A Parallel NFS deployment co-developed with Hammerspace
  • Local FUSE interfaces for checkpointing and data operations
  • YV3 Sierra Point OCP servers with high-capacity E1.S SSDs

This architecture allows:

  • Thousands of GPUs to load/save checkpoints simultaneously
  • Exabyte-scale data access
  • Interactive debugging across thousands of nodes
  • Flexible scaling for next-generation clusters
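To illustrate the first point, PyTorch's distributed checkpointing API lets every rank write its own shard in parallel, which is exactly the access pattern a parallel filesystem is built for. A minimal sketch, assuming an already-initialized distributed job; the mount path is hypothetical, and Meta's actual stack goes through the FUSE interfaces mentioned above:

```python
# Hedged sketch of parallel checkpointing with torch.distributed.checkpoint.
# Assumes dist.init_process_group() has already run (see networking section).
import torch
import torch.distributed.checkpoint as dcp

def save_checkpoint(model: torch.nn.Module, step: int) -> None:
    # Every rank streams its own shard concurrently, so thousands of
    # GPUs can save (and later load) checkpoints at the same time.
    state = {"model": model.state_dict(), "step": step}
    # Hypothetical path standing in for a Hammerspace-backed NFS mount.
    dcp.save(state, checkpoint_id=f"/mnt/ckpt/run/step_{step}")
```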

⚙️ Performance Optimization at Scale

Meta stresses that building massive clusters is only the first step—the real challenge is sustaining performance.

A few key improvements:

  • Topology-aware job scheduling
    reduces congestion and latency.
  • Customized routing policies + NCCL optimizations
    ensure efficient collective operations across thousands of GPUs.
  • FP8 and new parallelization strategies
    enable faster, more efficient training on H100 GPUs
    (see the sketch after this list).
  • Advanced debugging tools such as
    desync debugging and a distributed collective flight recorder
    help pinpoint bottlenecks down to a single GPU.
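For the FP8 point above, here is a minimal sketch using NVIDIA's open-source Transformer Engine library, which provides FP8 support on Hopper GPUs. This illustrates the technique in general, not Meta's internal training stack; it assumes the transformer_engine package and an H100-class GPU are available:

```python
# Hedged FP8 sketch with NVIDIA Transformer Engine on an H100.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# HYBRID keeps E4M3 for forward activations/weights and E5M2 for gradients.
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)

layer = te.Linear(4096, 4096).cuda()
x = torch.randn(8, 4096, device="cuda", requires_grad=True)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)

y.sum().backward()  # gradients flow back through the FP8 GEMMs
```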

Startup times for large distributed jobs—once hours—have been reduced to minutes through PyTorch improvements.

(Figure: Meta GenAI Performance)

🤝 Meta’s Commitment to Open AI and Open Hardware

Meta continues to champion open AI ecosystems:

  • Founding member of the Open Compute Project (OCP)
  • Largest contributor to PyTorch worldwide
  • Member of the AI Alliance
  • Creator of the Open Innovation AI Research Community

Meta believes that open-source tools enable transparency, trust, and shared progress—especially critical as AI systems grow more powerful.

⏭️ What’s Next for Meta’s AI Infrastructure?

Meta fully expects infrastructure requirements to expand again as models grow more sophisticated. As a result, the company continues to evaluate:

  • New interconnect fabrics
  • Advanced memory and storage solutions
  • Next-gen open GPU platforms
  • Larger-scale PyTorch optimizations
  • Future clusters involving hundreds of thousands of GPUs

Meta’s iterative philosophy remains the same: build, test, optimize, repeat—all in real production environments.

As AGI research accelerates globally, Meta’s massive open-architecture AI clusters will play a pivotal role in shaping the next decade of AI development.
