AI Super-Cluster Interconnects: NVIDIA, Google, and China's Networking Strategies

As AI models continue to scale, networking has become just as important as compute performance. Modern training clusters routinely interconnect thousands of accelerators, while inference platforms increasingly support massive numbers of concurrent requests across geographically distributed infrastructure.

This shift has transformed network topology from an infrastructure concern into a core architectural decision.

Recent developments, including Google’s decision to retain a 3D Torus topology for training-oriented TPU 8t while introducing the Boardfly topology for inference-focused TPU 8i, highlight a broader industry trend: network architectures are now designed around workload-specific communication patterns rather than a universal topology.

Today, three distinct design philosophies have emerged:

  • NVIDIA’s universal networking platform
  • Google’s workload-specialized architecture
  • Layered supernode designs adopted by several Chinese hyperscalers

Each reflects a different approach to balancing bandwidth, latency, scalability, and deployment cost.

Collective Communication Drives Network Design

Large-scale AI training depends on collective communication operations that synchronize computation across thousands of processors.

Rather than optimizing for generic network throughput, modern AI fabrics are increasingly engineered around these communication primitives.

AllReduce

AllReduce is the foundation of data parallelism.

Each accelerator contributes locally computed gradients, which are aggregated and redistributed across the cluster.

Characteristics include:

  • Extremely high bandwidth requirements
  • Strict synchronization
  • Large data transfers
  • Sensitivity to tail latency

Even a single slow node can delay an entire training iteration.
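
A minimal sketch of why this is so, assuming the widely used ring algorithm (the article does not say which algorithm any given vendor runs). The buffer is split into one chunk per rank, partial sums travel around the ring in a reduce-scatter phase, and the finished chunks travel around again in an all-gather phase. The name `ring_allreduce` and the in-memory lists standing in for accelerators are illustrative only.

```python
def ring_allreduce(buffers):
    """Simulate a sum AllReduce over a ring; buffers[r] is rank r's gradient list."""
    n = len(buffers)
    size = len(buffers[0])
    if size % n != 0:
        raise ValueError("this sketch assumes the buffer splits evenly into n chunks")
    chunk = size // n
    data = [list(b) for b in buffers]  # copies, so the inputs stay untouched

    def exchange(step, first_chunk_offset, combine):
        # Snapshot all payloads first so every rank "sends" before anyone "receives".
        sends = []
        for rank in range(n):
            c = (rank + first_chunk_offset - step) % n
            lo = c * chunk
            sends.append((rank, lo, data[rank][lo:lo + chunk]))
        for rank, lo, payload in sends:
            dst = (rank + 1) % n  # each rank only talks to its ring successor
            for i, value in enumerate(payload):
                data[dst][lo + i] = combine(data[dst][lo + i], value)

    # Phase 1 (reduce-scatter): partial sums accumulate as chunks move around the ring.
    for step in range(n - 1):
        exchange(step, 0, lambda mine, incoming: mine + incoming)
    # Rank r now owns the fully reduced chunk (r + 1) % n.
    # Phase 2 (all-gather): finished chunks are forwarded and simply overwrite.
    for step in range(n - 1):
        exchange(step, 1, lambda mine, incoming: incoming)
    return data


grads = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400], [1000, 2000, 3000, 4000]]
reduced = ring_allreduce(grads)
```

Every one of the 2(n - 1) steps is a lockstep exchange with a ring neighbor, so the slowest participant sets the pace for the whole operation.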

ReduceScatter and AllGather

Tensor parallelism and other forms of model parallelism rely heavily on ReduceScatter and AllGather.

These operations partition and reconstruct distributed tensors while supporting pipeline execution across accelerators.

Compared with AllReduce, they generally exhibit:

  • High sustained bandwidth
  • Moderate latency sensitivity
  • Efficient pipeline utilization
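
The semantics of the two primitives can be sketched in a few lines of plain Python. This only models what each rank ends up holding, not how the data moves; the function names and list-based "ranks" are illustrative assumptions.

```python
def reduce_scatter(buffers):
    """Rank r ends with shard r of the element-wise sum of all buffers."""
    n = len(buffers)
    if len(buffers[0]) % n != 0:
        raise ValueError("this sketch assumes the buffer splits evenly into n shards")
    shard = len(buffers[0]) // n
    total = [sum(column) for column in zip(*buffers)]
    return [total[r * shard:(r + 1) * shard] for r in range(n)]


def all_gather(shards):
    """Every rank ends with the concatenation of all shards, in rank order."""
    full = [value for shard in shards for value in shard]
    return [list(full) for _ in shards]


buffers = [[1, 2, 3, 4], [10, 20, 30, 40]]
shards = reduce_scatter(buffers)   # each rank holds half of the summed tensor
gathered = all_gather(shards)      # each rank holds the whole summed tensor
```

Running ReduceScatter followed by AllGather yields the same result as an AllReduce, which is why the two are often treated as its building blocks.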

AllToAll

The rapid adoption of Mixture-of-Experts (MoE) models has dramatically increased the importance of AllToAll communication.

Unlike other collective operations, AllToAll distributes unique data from every processor to every other processor.

Its requirements include:

  • Very low latency
  • Uniform network bandwidth
  • Low hop counts
  • High bisection bandwidth

Token routing in modern MoE models depends heavily on efficient AllToAll performance, making network topology a critical factor in inference scalability.
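
As a rough illustration, AllToAll can be viewed as a transpose of per-destination send buffers. The sketch below assumes one expert per rank and a precomputed routing table `expert_of`; both are simplifications introduced here, not details from any specific MoE implementation.

```python
def all_to_all(send):
    """send[src][dst] is what src holds for dst; the result is indexed recv[dst][src]."""
    n = len(send)
    return [[send[src][dst] for src in range(n)] for dst in range(n)]


def dispatch_tokens(tokens_per_rank, expert_of, num_ranks):
    """Bucket each rank's tokens by the rank hosting the chosen expert, then exchange."""
    send = [[[] for _ in range(num_ranks)] for _ in range(num_ranks)]
    for src, tokens in enumerate(tokens_per_rank):
        for token in tokens:
            send[src][expert_of[token]].append(token)
    return all_to_all(send)


tokens_per_rank = [["a", "b"], ["c"], ["d", "e"]]
expert_of = {"a": 1, "b": 2, "c": 0, "d": 1, "e": 0}  # hypothetical router output
received = dispatch_tokens(tokens_per_rank, expert_of, num_ranks=3)
```

Because the router decides destinations per token, any rank may need to reach any other rank in the same step, which is what makes uniform bandwidth and low hop counts matter.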

Comparing Three AI Networking Strategies

The industry’s leading AI infrastructure providers have adopted different networking philosophies based on their workloads and deployment goals.

| Category | NVIDIA | Google | Chinese Supernode Architecture |
| --- | --- | --- | --- |
| Scale-Up Network | NVLink / NVSwitch | 3D Torus (Training), Boardfly (Inference) | High-bandwidth intra-cabinet interconnect |
| Scale-Out Fabric | Rail-Optimized InfiniBand / RoCE | Virgo Network | RDMA or InfiniBand |
| Collective Offload | SHARP, NVLS | Collectives Acceleration Engine (CAE) | Custom hardware acceleration |
| Design Goal | Universal infrastructure | Workload specialization | Performance-to-cost optimization |

Although the three strategies differ significantly, all of them seek to minimize communication overhead as AI workloads continue to expand.

NVIDIA: A Universal Networking Platform

NVIDIA’s architecture emphasizes flexibility.

Rather than designing separate infrastructures for training and inference, NVIDIA provides a unified networking platform capable of supporting diverse AI workloads.

NVLink and NVSwitch

Within a server or tightly coupled compute domain, GPUs communicate through NVLink and NVSwitch.

These technologies provide:

  • Extremely high bandwidth
  • Low communication latency
  • Unified memory access
  • Efficient tensor exchange

Recent generations significantly expand the number of GPUs that can participate within a single NVLink domain.

Rail-Optimized Scale-Out Networking

Beyond individual compute nodes, NVIDIA employs a Rail-Optimized topology.

In this design, GPUs occupying identical physical positions across multiple servers connect to the same leaf switch.

Benefits include:

  • Reduced spine congestion
  • Predictable communication paths
  • Improved locality
  • Lower latency for synchronized workloads

This organization aligns naturally with many distributed training algorithms.
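
A small sketch of the wiring rule, under stated assumptions: each GPU slot has its own NIC, one leaf switch serves each rail, and the alternative layout (`tor_leaf`, a conventional top-of-rack wiring with two servers per leaf) exists only for comparison. All names are hypothetical.

```python
def rail_leaf(server, slot):
    """Rail-optimized: GPU slot k on every server attaches to leaf switch k."""
    return slot


def tor_leaf(server, slot, servers_per_leaf=2):
    """Comparison layout: all GPUs of a server share that server's top-of-rack leaf."""
    return server // servers_per_leaf


def spine_crossings(leaf_of, num_servers, gpus_per_server):
    """Count same-slot GPU pairs on different servers whose traffic must leave the leaf."""
    count = 0
    for slot in range(gpus_per_server):
        for a in range(num_servers):
            for b in range(a + 1, num_servers):
                if leaf_of(a, slot) != leaf_of(b, slot):
                    count += 1
    return count


rail = spine_crossings(rail_leaf, num_servers=4, gpus_per_server=8)
tor = spine_crossings(tor_leaf, num_servers=4, gpus_per_server=8)
```

In this toy setup, traffic between GPUs in the same slot never crosses the spine under the rail layout, while the top-of-rack layout sends many of those pairs through it.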

Topology-Aware Communication

NVIDIA’s NCCL library automatically selects communication paths based on hardware topology.

Depending on where the communicating GPUs sit, NCCL traffic may:

  • Remain entirely within NVLink
  • Traverse a single network rail
  • Route through the broader InfiniBand or RoCE fabric

Additional acceleration technologies such as SHARP move portions of reduction operations into the network itself, reducing CPU and GPU overhead while improving scaling efficiency.
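
The three tiers listed above can be expressed as a simple classification. This is a hedged illustration of the idea only, not NCCL's actual path-selection logic; it assumes the NVLink domain is a single server and that a GPU is identified by a hypothetical `(server, slot)` pair.

```python
from enum import Enum


class Path(Enum):
    NVLINK = "stays inside the NVLink domain"
    RAIL = "one rail, a single leaf switch"
    FABRIC = "crosses rails through the wider fabric"


def classify(src, dst):
    """src and dst are (server, slot) tuples; returns the cheapest tier that connects them."""
    src_server, src_slot = src
    dst_server, dst_slot = dst
    if src_server == dst_server:
        return Path.NVLINK
    if src_slot == dst_slot:
        return Path.RAIL
    return Path.FABRIC


paths = [classify((0, 3), (0, 5)), classify((0, 3), (7, 3)), classify((0, 3), (7, 5))]
```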

Google: Separate Architectures for Training and Inference

Google has adopted a different philosophy.

Instead of pursuing one universal network, the company designs distinct architectures for fundamentally different workloads.

TPU 8t: Optimized for Training

Training workloads emphasize repeated synchronization among neighboring processors.

Accordingly, TPU 8t retains a 3D Torus topology.

Characteristics include:

  • Neighbor-to-neighbor communication
  • Predictable routing
  • Efficient synchronization
  • Excellent scalability for dense model training

Although the network diameter grows as clusters expand, deterministic communication patterns help amortize latency across training iterations.
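
The growth in diameter follows directly from torus geometry. The sketch below covers generic 3D torus arithmetic (wraparound neighbors, shortest hop count, diameter); the dimensions used are arbitrary examples, not the size of any TPU pod.

```python
def torus_neighbors(coord, dims):
    """Direct neighbors of a chip in a 3D torus, with wraparound on every axis."""
    neighbors = set()
    for axis in range(3):
        for step in (-1, 1):
            nxt = list(coord)
            nxt[axis] = (nxt[axis] + step) % dims[axis]
            neighbors.add(tuple(nxt))
    neighbors.discard(tuple(coord))  # only matters for degenerate axes of size 1
    return sorted(neighbors)


def torus_hops(a, b, dims):
    """Shortest hop count: per axis, take the shorter way around the ring."""
    return sum(min((x - y) % d, (y - x) % d) for x, y, d in zip(a, b, dims))


def torus_diameter(dims):
    """Worst-case hop count between any two chips."""
    return sum(d // 2 for d in dims)


small = torus_diameter((4, 4, 4))      # 64 chips
large = torus_diameter((16, 16, 16))   # 4096 chips
```

Each chip keeps the same six neighbors no matter how large the torus gets, but the worst-case distance scales with the side length of the cube.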

Large TPU deployments are connected through Google’s Virgo network, a high-radix, non-blocking fabric designed to support extremely large accelerator clusters.

TPU 8i: Boardfly for Inference

Inference workloads exhibit very different communication behavior.

MoE routing requires tokens to travel dynamically among experts located throughout the cluster.

To reduce communication latency, Google introduced Boardfly, a hierarchical topology inspired by Dragonfly networks.

The architecture organizes hardware into multiple layers:

  1. Small groups of chips connected locally
  2. Boards interconnected with high-bandwidth backplanes
  3. Larger board groups linked through Optical Circuit Switches (OCS)

Compared with large Torus networks, this structure significantly reduces the maximum number of communication hops while improving routing flexibility for inference workloads.
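
To make the hop-count argument concrete, the sketch below builds a textbook Dragonfly (full mesh inside each group, one global link per router, one link between every pair of groups) and measures its diameter with breadth-first search. This is the generic topology that Boardfly is described as being inspired by; Boardfly's actual wiring is not given in the article and is not modeled here.

```python
from collections import deque


def build_dragonfly(routers_per_group):
    """Generic Dragonfly with one global link per router, so groups = routers_per_group + 1."""
    a = routers_per_group
    g = a + 1
    adj = {(grp, r): set() for grp in range(g) for r in range(a)}
    for grp in range(g):
        for r in range(a):
            for s in range(a):
                if r != s:
                    adj[(grp, r)].add((grp, s))  # local link inside the group
        for other in range(g):
            if other != grp:
                mine = (other - grp) % g - 1    # router in this group owning the link
                theirs = (grp - other) % g - 1  # router in the peer group owning it
                adj[(grp, mine)].add((other, theirs))
    return adj


def diameter(adj):
    """Longest shortest path, found by running BFS from every node."""
    worst = 0
    for start in adj:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        worst = max(worst, max(dist.values()))
    return worst


fabric = build_dragonfly(routers_per_group=8)  # 9 groups x 8 routers = 72 routers
hops = diameter(fabric)
```

Here the worst case is three hops (local, global, local) for 72 routers, whereas a 4×4×4 torus with 64 nodes has a diameter of six.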

Dedicated Collective Acceleration

Google further accelerates communication using a Collectives Acceleration Engine (CAE) integrated directly into the hardware.

By offloading collective communication operations from the TPU's compute units, CAE reduces latency while allowing compute resources to remain focused on AI inference.

Chinese Supernode Architectures

Several Chinese hyperscale cloud providers have adopted a layered networking strategy that separates local communication from cluster-wide communication.

Rather than treating every server equally, these architectures divide infrastructure into two distinct domains.

High-Bandwidth Local Domain

Within each cabinet or enclosure, processors communicate through proprietary high-bandwidth interconnects.

These tightly integrated compute pools function as large “supernodes” capable of executing communication-intensive workloads locally.

Operations such as:

  • Tensor Parallelism
  • Expert Parallelism
  • Local reductions

remain inside these high-speed domains whenever possible.

Cost-Optimized Global Network

Communication between supernodes occurs through conventional high-performance, RDMA-capable networking technologies such as RoCE or InfiniBand.

This layered design provides several advantages:

  • Lower infrastructure costs
  • Smaller fault domains
  • Simplified expansion
  • Efficient utilization of premium networking resources

By containing the most demanding communication within local hardware pools, overall network complexity can be significantly reduced.
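
One way to picture the two-domain split is a two-level AllReduce: sum inside each supernode first, exchange one partial result per supernode over the global network, then fan the answer back out locally. The sketch is a simplified model with hypothetical names; it counts buffers entering the scale-out network rather than modeling real link traffic.

```python
def hierarchical_allreduce(supernodes):
    """supernodes[s][c] is the gradient list held by chip c of supernode s."""
    # Step 1: reduce inside each supernode over the local high-bandwidth domain.
    local_sums = [[sum(column) for column in zip(*chips)] for chips in supernodes]
    # Step 2: combine one partial sum per supernode over the global network.
    global_sum = [sum(column) for column in zip(*local_sums)]
    # Step 3: broadcast the result back to every chip inside each supernode.
    return [[list(global_sum) for _ in chips] for chips in supernodes]


def scale_out_buffers(supernodes, hierarchical):
    """Simplified count of full-size buffers that take part in the global exchange."""
    if hierarchical:
        return len(supernodes)
    return sum(len(chips) for chips in supernodes)


cluster = [[[1, 2], [3, 4]], [[10, 20], [30, 40]]]  # 2 supernodes x 2 chips
result = hierarchical_allreduce(cluster)
flat = scale_out_buffers(cluster, hierarchical=False)
layered = scale_out_buffers(cluster, hierarchical=True)
```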

Networking Becomes an Active Compute Layer

Perhaps the most important industry trend is that AI networks are evolving beyond passive data transport.

Increasingly, networking hardware participates directly in distributed computation.

Three major developments illustrate this transformation.

Topology-Aware Communication

Communication libraries such as NCCL and HCCL now optimize message routing according to physical topology.

Rather than assuming uniform connectivity, they prioritize:

  • Local communication
  • Hierarchical reductions
  • Reduced cross-network traffic
  • Better bandwidth utilization

This significantly improves scalability for large AI clusters.

Hardware-Offloaded Collectives

Collective operations are increasingly executed inside networking hardware.

Examples include:

  • NVIDIA SHARP
  • Google’s Collectives Acceleration Engine
  • Vendor-specific network acceleration engines

These technologies reduce communication overhead while freeing compute resources for model execution.
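
The general idea of in-network reduction can be sketched as a tree of switches that sum what their children send and forward a single result upward. This toy model is not the SHARP or CAE implementation. The class and function names are hypothetical, and the traffic formula is an idealized comparison against a host-driven ring AllReduce.

```python
class AggregatingSwitch:
    """Toy switch that sums the vectors of its children (hosts or other switches)."""

    def __init__(self, children):
        self.children = children

    def reduce(self):
        partials = [
            child.reduce() if isinstance(child, AggregatingSwitch) else child
            for child in self.children
        ]
        return [sum(column) for column in zip(*partials)]


def host_send_bytes(num_hosts, buffer_bytes, in_network):
    """Idealized bytes each host transmits for one AllReduce."""
    if in_network:
        return buffer_bytes  # send the buffer up once; the reduced result comes back down
    return 2 * (num_hosts - 1) * buffer_bytes // num_hosts  # ring reduce-scatter + all-gather


leaf_a = AggregatingSwitch([[1, 2], [3, 4]])
leaf_b = AggregatingSwitch([[5, 6], [7, 8]])
root = AggregatingSwitch([leaf_a, leaf_b])
total = root.reduce()
```

Under this model each host transmits its buffer once instead of almost twice, and the summation itself no longer consumes accelerator cycles.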

Dynamically Reconfigurable Networks

Optical Circuit Switches are introducing unprecedented flexibility into AI infrastructure.

Instead of relying on fixed network topologies, future AI clusters may dynamically reconfigure physical connectivity according to workload requirements.

Potential capabilities include:

  • Failure isolation
  • Congestion avoidance
  • Adaptive routing
  • Training-specific topologies
  • Inference-specific topologies

Dynamic optical networking represents a significant step toward software-defined AI infrastructure.
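
A toy model of the idea, assuming an OCS can be treated as a software-editable table of port-to-port circuits. The class, the `"in"`/`"out"` port naming, and the spare block `S` are all hypothetical; real OCS control planes are not described in the article.

```python
class OpticalCircuitSwitch:
    """Toy OCS: a set of bidirectional port-to-port circuits that software can rewrite."""

    def __init__(self):
        self.peer = {}

    def disconnect(self, port):
        other = self.peer.pop(port, None)
        if other is not None:
            self.peer.pop(other, None)

    def connect(self, a, b):
        # A port carries at most one circuit, so tear down whatever was there first.
        self.disconnect(a)
        self.disconnect(b)
        self.peer[a] = b
        self.peer[b] = a


def wire_ring(ocs, blocks):
    """Connect compute blocks into a ring: each block's 'out' feeds the next block's 'in'."""
    for i, block in enumerate(blocks):
        ocs.connect((block, "out"), (blocks[(i + 1) % len(blocks)], "in"))


def ring_order(ocs, start, length):
    """Follow the circuits from `start` to list the blocks in ring order."""
    order = [start]
    while len(order) < length:
        order.append(ocs.peer[(order[-1], "out")][0])
    return order


ocs = OpticalCircuitSwitch()
wire_ring(ocs, ["A", "B", "C", "D"])
wire_ring(ocs, ["A", "S", "C", "D"])  # block B failed: splice in spare S, no recabling
after_repair = ring_order(ocs, "A", 4)
```

The same mechanism that isolates a failed block could, in principle, also reshape the fabric between training-oriented and inference-oriented layouts.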

Outlook

The evolution of AI networking demonstrates that compute performance alone is no longer sufficient to scale modern machine learning systems.

Communication patterns increasingly dictate cluster architecture, influencing everything from topology selection and hardware acceleration to software scheduling and optical networking.

NVIDIA continues to emphasize a universal infrastructure capable of supporting diverse workloads through tightly integrated hardware and software.

Google has instead embraced workload specialization, deploying separate topologies optimized for training and inference.

Meanwhile, Chinese hyperscalers are pursuing layered supernode architectures that balance performance with deployment cost by separating local high-bandwidth communication from large-scale RDMA networking.

As AI clusters continue growing in size and complexity, networking is evolving from a passive transport layer into an active participant in distributed computation. Future AI supercomputers will likely combine intelligent routing, topology-aware software, hardware-accelerated collectives, and dynamically reconfigurable optical fabrics to maximize efficiency across increasingly diverse workloads.
