
Inside Meta’s 24K-GPU AI Superclusters


Meta has revealed new details about its next-generation 24,576-GPU AI clusters, representing one of the largest and most open high-performance AI training infrastructures in the world. These clusters, built for training Llama 3 and future GenAI models, showcase Meta’s commitment to open hardware, open software, and industry-wide collaboration.

The systems are built on Grand Teton, OpenRack, and PyTorch, reinforcing Meta’s long-standing strategy of advancing AI through open compute ecosystems.

By the end of 2024, Meta expects its GPU fleet to scale to 350,000 NVIDIA H100 GPUs, equivalent to roughly 600,000 H100s of compute when including other accelerators—an unprecedented leap in AI infrastructure.

(Figure: Meta GenAI Model)

🚀 Why Meta Is Building Such Massive AI Clusters

Meta’s long-term objective is to develop open and responsible AGI. Achieving that vision requires the ability to train large, complex models at extreme scale. The new clusters support the entire range of Meta’s AI workloads—from Llama research to multimodal models, recommendation systems, and new AI-native devices.

Meta previously disclosed its AI Research SuperCluster (RSC) in 2022, a 16,000-GPU A100 system that powered early Llama development. The new clusters represent the next stage in scale, performance, and reliability.

🧩 Key Components of Meta’s 24,576-GPU AI Clusters

1. Networking: RoCE and InfiniBand at 400 Gbps

Meta built two versions of its new cluster:

  • A RoCE (RDMA over Converged Ethernet) version
    built on Arista 7800 switches with OCP Wedge400 and Minipack2 rack switches
  • An NVIDIA Quantum2 InfiniBand version

Both interconnect GPUs at 400 Gbps, giving Meta the ability to directly compare scalability, resilience, and performance between Ethernet and InfiniBand fabrics at extreme scale.

Through careful co-design of the networking software and training stack, both clusters run large-scale AI training successfully in production, including the current Llama 3 training run on the RoCE cluster.
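Because NCCL runs over both InfiniBand and RoCE, the same PyTorch training code can target either fabric unchanged. Here is a minimal sketch, assuming a launch via torchrun (which supplies the rendezvous environment variables); this is an illustration, not Meta's internal tooling:

```python
# Minimal distributed setup: NCCL transparently uses whichever RDMA
# fabric (InfiniBand or RoCE) the cluster provides.
import os

import torch
import torch.distributed as dist

def init_training_process_group() -> None:
    # torchrun sets RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, MASTER_PORT.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

if __name__ == "__main__":
    init_training_process_group()
    # A small all-reduce exercises the 400 Gbps interconnect end to end.
    x = torch.ones(1, device="cuda")
    dist.all_reduce(x)
    print(f"rank {dist.get_rank()}: all-reduce sum = {x.item()}")
    dist.destroy_process_group()
```

Launched with, e.g., `torchrun --nproc_per_node=8 train.py` on each node, the same script runs on either fabric, which is what makes the apples-to-apples comparison possible.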

2. Compute: Grand Teton, Meta’s Open GPU Platform

Each cluster is built using Grand Teton, Meta’s internally designed GPU system contributed to the Open Compute Project.

Key benefits of Grand Teton:

  • Integrated power, control, compute, and fabric interfaces
  • Strong thermal and signal-integrity performance
  • Rapid deployment and simplified maintenance
  • Flexible design for future GPUs, interconnects, and power architectures

Meta has been publicly designing open GPU platforms since Big Sur in 2015, and Grand Teton is its most advanced iteration.

3. Storage: Tectonic + Hammerspace for Extreme Scale

Training frontier AI models requires massive, fast, and resilient storage. Meta’s solution integrates:

  • Tectonic, Meta’s flash-optimized distributed storage system
  • A Parallel NFS deployment co-developed with Hammerspace
  • Local FUSE interfaces for checkpointing and data operations
  • YV3 Sierra Point OCP servers with high-capacity E1.S SSDs

This architecture allows:

  • Thousands of GPUs to load/save checkpoints simultaneously
  • Exabyte-scale data access
  • Interactive debugging across thousands of nodes
  • Flexible scaling for next-generation clusters
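To illustrate the first point, PyTorch's distributed checkpointing API lets every rank write its own shard in parallel, which is exactly the access pattern a parallel filesystem is built for. A minimal sketch, assuming an already-initialized distributed job; the mount path is hypothetical, and Meta's actual stack goes through the FUSE interfaces mentioned above:

```python
# Hedged sketch of parallel checkpointing with torch.distributed.checkpoint.
# Assumes dist.init_process_group() has already run (see networking section).
import torch
import torch.distributed.checkpoint as dcp

def save_checkpoint(model: torch.nn.Module, step: int) -> None:
    # Every rank streams its own shard concurrently, so thousands of
    # GPUs can save (and later load) checkpoints at the same time.
    state = {"model": model.state_dict(), "step": step}
    # Hypothetical path standing in for a Hammerspace-backed NFS mount.
    dcp.save(state, checkpoint_id=f"/mnt/ckpt/run/step_{step}")
```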

⚙️ Performance Optimization at Scale

Meta stresses that building massive clusters is only the first step—the real challenge is sustaining performance.

A few key improvements:

  • Topology-aware job scheduling
    reduces congestion and latency.
  • Customized routing policies + NCCL optimizations
    ensure efficient collective operations across thousands of GPUs.
  • FP8 and new parallelization strategies
    enable faster, more efficient training on H100 GPUs
    (see the sketch after this list).
  • Advanced debugging tools such as
    desync debugging and a distributed collective flight recorder
    help pinpoint bottlenecks down to a single GPU.
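For the FP8 point above, here is a minimal sketch using NVIDIA's open-source Transformer Engine library, which provides FP8 support on Hopper GPUs. This illustrates the technique in general, not Meta's internal training stack; it assumes the transformer_engine package and an H100-class GPU are available:

```python
# Hedged FP8 sketch with NVIDIA Transformer Engine on an H100.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# HYBRID keeps E4M3 for forward activations/weights and E5M2 for gradients.
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)

layer = te.Linear(4096, 4096).cuda()
x = torch.randn(8, 4096, device="cuda", requires_grad=True)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)

y.sum().backward()  # gradients flow back through the FP8 GEMMs
```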

Startup times for large distributed jobs—once hours—have been reduced to minutes through PyTorch improvements.

(Figure: Meta GenAI Performance)

🤝 Meta’s Commitment to Open AI and Open Hardware

Meta continues to champion open AI ecosystems:

  • Founding member of the Open Compute Project (OCP)
  • Largest contributor to PyTorch worldwide
  • Member of the AI Alliance
  • Creator of the Open Innovation AI Research Community

Meta believes that open-source tools enable transparency, trust, and shared progress—especially critical as AI systems grow more powerful.

⏭️ What’s Next for Meta’s AI Infrastructure?

Meta fully expects infrastructure requirements to expand again as models grow more sophisticated. As a result, the company continues to evaluate:

  • New interconnect fabrics
  • Advanced memory and storage solutions
  • Next-gen open GPU platforms
  • Larger-scale PyTorch optimizations
  • Future clusters involving hundreds of thousands of GPUs

Meta’s iterative philosophy remains the same: build, test, optimize, repeat—all in real production environments.

As AGI research accelerates globally, Meta’s massive open-architecture AI clusters will play a pivotal role in shaping the next decade of AI development.
