AI Super-Cluster Interconnects: NVIDIA, Google, and China's Networking Strategies

As AI models continue to scale, networking has become just as important as compute performance. Modern training clusters routinely interconnect thousands of accelerators, while inference platforms increasingly support massive numbers of concurrent requests across geographically distributed infrastructure.

This shift has transformed network topology from an infrastructure concern into a core architectural decision.

Recent developments highlight a broader industry trend: network architectures are now designed around workload-specific communication patterns rather than a single universal topology. Google, for example, retained a 3D Torus topology for the training-oriented TPU 8t while introducing the Boardfly topology for the inference-focused TPU 8i.

Today, three distinct design philosophies have emerged:

NVIDIA’s universal networking platform
Google’s workload-specialized architecture
Layered supernode designs adopted by several Chinese hyperscalers

Each reflects a different approach to balancing bandwidth, latency, scalability, and deployment cost.

🌐 Collective Communication Drives Network Design

Large-scale AI training depends on collective communication operations that synchronize computation across thousands of processors.

Rather than optimizing for generic network throughput, modern AI fabrics are increasingly engineered around these communication primitives.

AllReduce

AllReduce is the foundation of data parallelism.

Each accelerator contributes locally computed gradients, which are aggregated and redistributed across the cluster.

Characteristics include:

Extremely high bandwidth requirements
Strict synchronization
Large data transfers
Sensitivity to tail latency

Even a single slow node can delay an entire training iteration.
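
To make the aggregate-and-redistribute step concrete, here is a minimal pure-Python simulation of the classic ring algorithm for a sum AllReduce. It is a sketch only: the function name `ring_allreduce` and the list-of-lists layout are illustrative assumptions, and production libraries pick among several algorithms rather than always using a ring.

```python
def ring_allreduce(grads):
    """Simulate a ring AllReduce (sum) over equal-length per-rank gradient lists."""
    n = len(grads)
    size = len(grads[0])
    assert size % n == 0, "sketch assumes the buffer splits evenly into n chunks"
    chunk = size // n
    bufs = [list(g) for g in grads]  # one working buffer per rank

    def span(c):
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1 (reduce-scatter): after n-1 steps, rank r holds the full sum
    # of chunk (r + 1) % n.
    for step in range(n - 1):
        # Snapshot all sends first, because every rank sends at the same time.
        sends = [((r - step) % n, None) for r in range(n)]
        sends = [(c, [bufs[r][i] for i in span(c)]) for r, (c, _) in enumerate(sends)]
        for r in range(n):
            c, data = sends[(r - 1) % n]  # receive from the left neighbor
            for i, v in zip(span(c), data):
                bufs[r][i] += v

    # Phase 2 (all-gather): pass the completed chunks around the ring.
    for step in range(n - 1):
        sends = [((r + 1 - step) % n, None) for r in range(n)]
        sends = [(c, [bufs[r][i] for i in span(c)]) for r, (c, _) in enumerate(sends)]
        for r in range(n):
            c, data = sends[(r - 1) % n]
            for i, v in zip(span(c), data):
                bufs[r][i] = v
    return bufs
```

Every step waits on a neighbor's message, so the whole ring advances at the pace of its slowest member, which is why tail latency matters so much for this collective.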

ReduceScatter and AllGather

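Tensor parallelism and other forms of model parallelism rely heavily on ReduceScatter and AllGather.

ReduceScatter sums a tensor across accelerators and leaves each one holding a single shard of the result. AllGather does the reverse: it reassembles the full tensor on every accelerator from the shards each one holds. Both can be pipelined with computation.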

Compared with AllReduce, they generally exhibit:

High sustained bandwidth
Moderate latency sensitivity
Efficient pipeline utilization
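
The semantics of the two operations can be stated in a few lines of Python. This is a functional sketch of what each rank ends up holding, not a model of how the data moves over the wire; the function names are illustrative.

```python
def reduce_scatter(inputs):
    """Rank r ends up with only shard r of the element-wise sum."""
    n = len(inputs)
    assert len(inputs[0]) % n == 0, "sketch assumes an even split into n shards"
    shard = len(inputs[0]) // n
    summed = [sum(col) for col in zip(*inputs)]
    return [summed[r * shard:(r + 1) * shard] for r in range(n)]


def all_gather(shards):
    """Every rank ends up with all shards concatenated in rank order."""
    full = [x for s in shards for x in s]
    return [list(full) for _ in shards]
```

Composing the two, `all_gather(reduce_scatter(x))`, leaves every rank with the full element-wise sum, which is exactly the result of an AllReduce.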

AllToAll

The rapid adoption of Mixture-of-Experts (MoE) models has dramatically increased the importance of AllToAll communication.

Unlike the reduction-based collectives above, AllToAll sends a distinct block of data from every processor to every other processor, so no traffic can be combined along the way.

Its requirements include:

Very low latency
Uniform network bandwidth
Low hop counts
High bisection bandwidth

Token routing in modern MoE models depends heavily on efficient AllToAll performance, making network topology a critical factor in inference scalability.
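
A small sketch shows why AllToAll stresses the whole fabric: logically it is a transpose of per-rank send buffers, and MoE dispatch fills those buffers according to where each token's expert lives. The names `all_to_all`, `dispatch_tokens`, and `rank_of_expert_for` are hypothetical, and the routing function in the usage note stands in for a learned router.

```python
def all_to_all(send):
    """send[src][dst] is what src delivers to dst; returns recv[dst][src]."""
    n = len(send)
    return [[send[src][dst] for src in range(n)] for dst in range(n)]


def dispatch_tokens(tokens_per_rank, rank_of_expert_for):
    """Bucket each rank's tokens by destination rank, then exchange them."""
    n = len(tokens_per_rank)
    send = [[[] for _ in range(n)] for _ in range(n)]
    for src, tokens in enumerate(tokens_per_rank):
        for tok in tokens:
            # Assumption: the router already resolved the token to a target rank.
            send[src][rank_of_expert_for(tok)].append(tok)
    return all_to_all(send)
```

For example, `dispatch_tokens([[0, 1, 2, 3], [4, 5, 6, 7]], lambda t: t % 2)` sends the even tokens to rank 0 and the odd tokens to rank 1. Every rank talks to every other rank in the same step, which is why uniform bandwidth and low hop counts matter here.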

📊 Comparing Three AI Networking Strategies

The industry’s leading AI infrastructure providers have adopted different networking philosophies based on their workloads and deployment goals.

Category	NVIDIA	Google	Chinese Supernode Architecture
Scale-Up Network	NVLink / NVSwitch	3D Torus (Training), Boardfly (Inference)	High-bandwidth intra-cabinet interconnect
Scale-Out Fabric	Rail-Optimized InfiniBand / RoCE	Virgo Network	RDMA fabric (RoCE or InfiniBand)
Collective Offload	SHARP, NVLS	Collectives Acceleration Engine (CAE)	Custom hardware acceleration
Design Goal	Universal infrastructure	Workload specialization	Performance-to-cost optimization

Although the strategies differ significantly, all three seek to minimize communication overhead as AI workloads continue to expand.

🚀 NVIDIA: A Universal Networking Platform

NVIDIA’s architecture emphasizes flexibility.

Rather than designing separate infrastructures for training and inference, NVIDIA provides a unified networking platform capable of supporting diverse AI workloads.

NVLink and NVSwitch

Within a server or tightly coupled compute domain, GPUs communicate through NVLink and NVSwitch.

These technologies provide:

Extremely high bandwidth
Low communication latency
Unified memory access
Efficient tensor exchange

Recent generations significantly expand the number of GPUs that can participate in a single NVLink domain, from the 8 GPUs of a typical HGX server to 72 GPUs in a GB200 NVL72 rack.

Rail-Optimized Scale-Out Networking

Beyond individual compute nodes, NVIDIA employs a Rail-Optimized topology.

In this design, GPUs with the same local position across servers (for example, GPU 0 of every server) connect through their network adapters to the same leaf switch, forming a "rail."

Benefits include:

Reduced spine congestion
Predictable communication paths
Improved locality
Lower latency for synchronized workloads

This organization aligns naturally with many distributed training algorithms.
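
The rail assignment can be sketched as a simple mapping. The code assumes a single pod in which each rail is served by one leaf switch and GPUs are identified by a hypothetical `(server, local_index)` pair; larger deployments spread a rail over several switches.

```python
from collections import defaultdict


def rail_of(server, local_index):
    """The rail id depends only on the GPU's position inside its server."""
    return local_index


def build_rails(n_servers, gpus_per_server):
    """Group every (server, local_index) GPU under the leaf switch of its rail."""
    rails = defaultdict(list)
    for server in range(n_servers):
        for local_index in range(gpus_per_server):
            rails[rail_of(server, local_index)].append((server, local_index))
    return dict(rails)
```

With 4 servers of 8 GPUs each, `build_rails(4, 8)` yields 8 rails, and rail 3 contains GPU 3 of every server. Collectives that pair up same-index GPUs across servers therefore stay under one leaf switch.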

Topology-Aware Communication
#

NVIDIA’s NCCL library automatically selects communication paths based on hardware topology.

Depending on where the communicating GPUs sit, traffic may:

Remain entirely within NVLink
Traverse a single network rail
Route through the broader InfiniBand or RoCE fabric
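
The three cases above amount to a small decision rule. The sketch below is a toy classifier, not NCCL's actual topology detection or path search, which is considerably more involved; the `(server, local_index)` GPU identifiers and the returned labels are assumptions made for illustration.

```python
def classify_path(src, dst):
    """Return which tier of the network a src -> dst transfer would use."""
    src_server, src_index = src
    dst_server, dst_index = dst
    if src_server == dst_server:
        return "nvlink"        # both GPUs share one NVLink domain
    if src_index == dst_index:
        return "single-rail"   # same rail: one leaf switch, no spine hop
    return "cross-rail"        # different rails: traverses the wider fabric
```

In practice a cross-rail transfer can often be turned into a same-rail one by first moving the data over NVLink to the local GPU that sits on the destination's rail.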

Additional acceleration technologies such as SHARP move portions of reduction operations into the network itself, reducing CPU and GPU overhead while improving scaling efficiency.

🧠 Google: Separate Architectures for Training and Inference
#

Google has adopted a different philosophy.

Instead of pursuing one universal network, the company designs distinct architectures for fundamentally different workloads.

TPU 8t: Optimized for Training

Training workloads emphasize repeated synchronization among neighboring processors.

Accordingly, TPU 8t retains a 3D Torus topology.

Characteristics include:

Neighbor-to-neighbor communication
Predictable routing
Efficient synchronization
Excellent scalability for dense model training

Although the network diameter grows as clusters expand, deterministic communication patterns help amortize latency across training iterations.
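
How quickly the diameter grows is easy to quantify for an idealized torus. The sketch below assumes minimal routing with wraparound links in every dimension; the dimension sizes in the usage note are arbitrary examples, not TPU pod shapes.

```python
def torus_hops(a, b, dims):
    """Minimal hop count between coordinates a and b on a torus of shape dims."""
    return sum(min((x - y) % k, (y - x) % k) for x, y, k in zip(a, b, dims))


def torus_diameter(dims):
    """Worst-case hop count: half-way around the ring in every dimension."""
    return sum(k // 2 for k in dims)
```

A 4x4x4 torus (64 nodes) has a diameter of 6 hops, while a 16x16x16 torus (4,096 nodes) has a diameter of 24. Neighbor-to-neighbor traffic, by contrast, is always a single hop regardless of size.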

Large TPU deployments are connected through Google’s Virgo network, a high-radix, non-blocking fabric designed to support extremely large accelerator clusters.

TPU 8i: Boardfly for Inference
#

Inference workloads exhibit very different communication behavior.

MoE routing requires tokens to travel dynamically among experts located throughout the cluster.

To reduce communication latency, Google introduced Boardfly, a hierarchical topology inspired by Dragonfly networks.

The architecture organizes hardware into multiple layers:

Small groups of chips connected locally
Boards interconnected with high-bandwidth backplanes
Larger board groups linked through Optical Circuit Switches (OCS)

Compared with large Torus networks, this structure significantly reduces the maximum number of communication hops while improving routing flexibility for inference workloads.
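
Boardfly's wiring is not described here in enough detail to model, so the following sketch uses a textbook Dragonfly instead, the family it is said to be inspired by. It assumes fully connected groups, one direct global link between every pair of groups, and a made-up rule for which router hosts each global link. None of these parameters are Boardfly figures.

```python
def dragonfly_hops(src, dst, routers_per_group):
    """Minimal router-to-router hops between (group, router) addresses."""
    (g, r), (h, s) = src, dst
    if g == h:
        return 0 if r == s else 1  # groups are fully connected internally
    # Hypothetical wiring: in group g, the link to group h sits on router h % R.
    exit_router = h % routers_per_group
    entry_router = g % routers_per_group
    return (r != exit_router) + 1 + (s != entry_router)


def torus_diameter(dims):
    """Worst-case hop count on a torus with wraparound links."""
    return sum(k // 2 for k in dims)
```

Under these assumptions, 16 groups of 4 routers (64 routers) never need more than 3 hops: one local, one global, one local. A 4x4x4 torus with the same 64 nodes has a diameter of 6, and the gap widens as the system grows.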

Dedicated Collective Acceleration
#

Google further accelerates communication using a Collectives Acceleration Engine (CAE) integrated directly into the hardware.

By offloading collective communication operations from the TPU's compute units, CAE reduces latency while allowing compute resources to remain focused on AI inference.

🏢 Chinese Supernode Architectures
#

Several Chinese hyperscale cloud providers have adopted a layered networking strategy that separates local communication from cluster-wide communication.

Rather than treating every server equally, these architectures divide infrastructure into two distinct domains.

High-Bandwidth Local Domain

Within each cabinet or enclosure, processors communicate through proprietary high-bandwidth interconnects.

These tightly integrated compute pools function as large “supernodes” capable of executing communication-intensive workloads locally.

Operations such as:

Tensor Parallelism
Expert Parallelism
Local reductions

remain inside these high-speed domains whenever possible.
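
One way to picture this placement is as a rank-to-group mapping in which tensor-parallel groups are packed into consecutive ranks, so they never straddle a supernode boundary. The sketch assumes ranks are numbered contiguously within each supernode; `build_groups` and `supernode_of` are hypothetical helpers, not any vendor's scheduler API.

```python
def supernode_of(rank, supernode_size):
    """Ranks are assumed to be numbered contiguously inside each supernode."""
    return rank // supernode_size


def build_groups(world_size, supernode_size, tp_size):
    """Return (tensor-parallel groups, data-parallel groups) as lists of ranks."""
    assert world_size % supernode_size == 0, "sketch assumes full supernodes"
    assert supernode_size % tp_size == 0, "TP groups must tile a supernode"
    # Consecutive ranks form a TP group, so each group stays on local links.
    tp_groups = [list(range(s, s + tp_size)) for s in range(0, world_size, tp_size)]
    # Ranks with the same offset in their TP group form a data-parallel group.
    dp_groups = [list(range(o, world_size, tp_size)) for o in range(tp_size)]
    return tp_groups, dp_groups
```

With 16 ranks, supernodes of 8, and a tensor-parallel size of 4, every tensor-parallel group sits inside one supernode, while each data-parallel group spans both and carries the less frequent gradient traffic over the global network.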

Cost-Optimized Global Network

Communication between supernodes occurs through conventional high-performance RDMA fabrics such as RoCE or InfiniBand.

This layered design provides several advantages:

Lower infrastructure costs
Smaller fault domains
Simplified expansion
Efficient utilization of premium networking resources

By containing the most demanding communication within local hardware pools, overall network complexity can be significantly reduced.

🔄 Networking Becomes an Active Compute Layer

Perhaps the most important industry trend is that AI networks are evolving beyond passive data transport.

Increasingly, networking hardware participates directly in distributed computation.

Three major developments illustrate this transformation.

Topology-Aware Communication

Communication libraries such as NVIDIA's NCCL and Huawei's HCCL now optimize message routing according to physical topology.

Rather than assuming uniform connectivity, they prioritize:

Local communication
Hierarchical reductions
Reduced cross-network traffic
Better bandwidth utilization

This significantly improves scalability for large AI clusters.
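
A hierarchical reduction can be sketched in three stages. This is a simplified illustration of the idea, using one scalar per GPU, and is not the actual algorithm used by NCCL or HCCL; the function name and return shape are assumptions.

```python
def hierarchical_allreduce(values, gpus_per_node):
    """Sum one scalar per GPU; returns (per-GPU results, fabric participants)."""
    assert len(values) % gpus_per_node == 0, "sketch assumes fully populated nodes"
    nodes = [values[i:i + gpus_per_node]
             for i in range(0, len(values), gpus_per_node)]
    # Stage 1: reduce inside each node over the fast local interconnect.
    node_sums = [sum(node) for node in nodes]
    # Stage 2: only one leader per node joins the reduction across the fabric.
    total = sum(node_sums)
    # Stage 3: each leader broadcasts the result back to its local GPUs.
    results = [total for _ in values]
    return results, len(node_sums)
```

For 16 GPUs in nodes of 8, only 2 participants touch the slower cross-node fabric instead of 16, which is where the reduction in cross-network traffic comes from.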

Hardware-Offloaded Collectives
#

Collective operations are increasingly executed inside networking hardware.

Examples include:

NVIDIA SHARP
Google’s Collectives Acceleration Engine
Vendor-specific network acceleration engines

These technologies reduce communication overhead while freeing compute resources for model execution.
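
A back-of-the-envelope model shows where the savings come from. It compares the bytes each endpoint must send for a ring AllReduce with an idealized in-network reduction, where each endpoint sends its buffer to the switch once and receives the result once. The model ignores protocol overhead, chunking, and switch resource limits, and the function names are illustrative.

```python
def ring_bytes_sent_per_rank(message_bytes, n_ranks):
    """Ring AllReduce: 2 * (N - 1) steps, each sending 1/N of the buffer."""
    return 2 * (n_ranks - 1) * message_bytes / n_ranks


def ring_steps(n_ranks):
    """Sequential communication steps in a ring AllReduce."""
    return 2 * (n_ranks - 1)


def in_network_bytes_sent_per_rank(message_bytes):
    """Idealized switch-side reduction: each endpoint sends its buffer once."""
    return message_bytes
```

For a 1,000-byte buffer on 8 ranks, the ring sends 1,750 bytes per rank over 14 sequential steps, approaching twice the buffer size as the ring grows, while the in-network version sends 1,000 bytes per rank regardless of scale.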

Dynamically Reconfigurable Networks

Optical Circuit Switches are introducing unprecedented flexibility into AI infrastructure.

Instead of relying on fixed network topologies, future AI clusters may dynamically reconfigure physical connectivity according to workload requirements.

Potential capabilities include:

Failure isolation
Congestion avoidance
Adaptive routing
Training-specific topologies
Inference-specific topologies

Dynamic optical networking represents a significant step toward software-defined AI infrastructure.
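
Conceptually, an optical circuit switch is a reprogrammable one-to-one map between ports, so changing the topology is a matter of rewriting that map. The toy model below illustrates failure isolation by rewiring a ring around a failed node with a spare. The class, port naming, and `wire_ring` helper are hypothetical and do not correspond to any real OCS control API.

```python
class CircuitSwitch:
    """Toy optical circuit switch: a one-to-one map between ports."""

    def __init__(self):
        self.peer = {}

    def disconnect(self, port):
        """Tear down the circuit on this port, if any."""
        other = self.peer.pop(port, None)
        if other is not None:
            self.peer.pop(other, None)

    def connect(self, a, b):
        """Create a circuit between a and b, replacing any existing ones."""
        self.disconnect(a)
        self.disconnect(b)
        self.peer[a] = b
        self.peer[b] = a


def wire_ring(switch, nodes):
    """Connect each node's 'out' port to the next node's 'in' port."""
    for i, node in enumerate(nodes):
        nxt = nodes[(i + 1) % len(nodes)]
        switch.connect((node, "out"), (nxt, "in"))
```

Calling `wire_ring(switch, ["A", "B", "C"])` and later `wire_ring(switch, ["A", "D", "C"])` swaps spare node D in for B without touching a cable. B is left with no circuits and is isolated from the ring.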

🔍 Outlook

The evolution of AI networking demonstrates that compute performance alone is no longer sufficient to scale modern machine learning systems.

Communication patterns increasingly dictate cluster architecture, influencing everything from topology selection and hardware acceleration to software scheduling and optical networking.

NVIDIA continues to emphasize a universal infrastructure capable of supporting diverse workloads through tightly integrated hardware and software.

Google has instead embraced workload specialization, deploying separate topologies optimized for training and inference.

Meanwhile, Chinese hyperscalers are pursuing layered supernode architectures that balance performance with deployment cost by separating local high-bandwidth communication from large-scale RDMA networking.

As AI clusters continue growing in size and complexity, networking is evolving from a passive transport layer into an active participant in distributed computation. Future AI supercomputers will likely combine intelligent routing, topology-aware software, hardware-accelerated collectives, and dynamically reconfigurable optical fabrics to maximize efficiency across increasingly diverse workloads.