Recent Mixture-of-Experts (MoE) inference benchmarks have made one point unmistakably clear: in 2026, systems outperform silicon. NVIDIA’s Blackwell-based GB200 NVL72 rack-scale platform has demonstrated up to 28× higher throughput than AMD’s Instinct MI355X in high-concurrency MoE workloads such as DeepSeek-R1.
This gap is not explained by raw FLOPs. It is explained by architecture above the GPU.
🧠 Why MoE Is a System Problem #
MoE models behave fundamentally differently from dense transformers. Each generated token dynamically selects a subset of experts, which creates a communication-heavy execution pattern.
Key characteristics of MoE inference:
- Sparse but dynamic parameter access: Active weights change per token.
- Frequent cross-GPU hops: Tokens must move between experts.
- Synchronization sensitivity: Compute stalls if interconnect latency is high.
In practice, this means MoE performance is dominated by interconnect bandwidth, latency, and memory locality, not peak accelerator specs.
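To make the routing pattern concrete, here is a minimal sketch of top-k expert routing in Python. The expert count, top-k value, hidden size, and the static expert-to-GPU placement are illustrative assumptions, not DeepSeek-R1's actual configuration; the point is simply that each token selects its own experts, so the set of GPUs it must exchange activations with changes at every step.

```python
import numpy as np

# Minimal sketch of top-k expert routing (hypothetical sizes, not any real model's
# configuration). Because each token picks its own experts, the GPUs a token must
# talk to change per step -- the root of MoE's communication cost.

NUM_EXPERTS = 64        # experts in one MoE layer (assumed)
TOP_K = 4               # experts activated per token (assumed)
EXPERTS_PER_GPU = 8     # static placement: expert e lives on GPU e // 8 (assumed)

def route(hidden_states: np.ndarray, gate_weights: np.ndarray):
    """Return, per token, the chosen expert ids and the GPUs that host them."""
    logits = hidden_states @ gate_weights                # [tokens, NUM_EXPERTS]
    top_experts = np.argsort(-logits, axis=1)[:, :TOP_K] # highest-scoring experts
    owning_gpus = top_experts // EXPERTS_PER_GPU         # which GPU holds each expert
    return top_experts, owning_gpus

rng = np.random.default_rng(0)
hidden = rng.standard_normal((5, 1024))                  # 5 tokens, hidden size 1024 (assumed)
gate = rng.standard_normal((1024, NUM_EXPERTS))
experts, gpus = route(hidden, gate)
for t, (e, g) in enumerate(zip(experts, gpus)):
    print(f"token {t}: experts {e.tolist()} -> GPUs {sorted(set(g.tolist()))}")
```

Run over a real batch, the same logic produces a different GPU fan-out for every token, which is why the interconnect, not the matrix multiply, sits on the critical path.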
🏗️ GB200 NVL72: The System-on-a-Rack Model #
The GB200 NVL72 is not a cluster of GPUs. It is designed and scheduled as one coherent system.
Architectural advantages #
- NVLink Switch System: Seventy-two Blackwell GPUs are connected through a fully non-blocking NVLink fabric, delivering massive bisection bandwidth inside the rack.
- ~30 TB of rack-wide fast memory: Expert parameters can be served from a shared pool of HBM3e and Grace-attached LPDDR5X without traversing external InfiniBand or Ethernet.
- Intra-rack expert locality: Most MoE “expert hops” stay inside the NVLink domain, avoiding the highest-latency communication paths.
This design compresses the MoE critical path, keeping compute units busy instead of waiting on data.
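A back-of-envelope model shows why. The bandwidth and latency figures below are illustrative assumptions for an NVLink-class link versus a NIC-class link, not measured values for GB200 NVL72 or any other platform; they only demonstrate how per-hop latency dominates once expert traffic leaves a rack-scale fabric.

```python
# Back-of-envelope sketch of expert-hop cost. All figures are illustrative
# assumptions, not vendor specifications or measurements.

def hop_time_us(bytes_moved: int, bw_gb_per_s: float, latency_us: float) -> float:
    """Time to ship one token's activations to a remote expert over one link."""
    transfer_us = (bytes_moved / (bw_gb_per_s * 1e9)) * 1e6
    return latency_us + transfer_us

TOKEN_ACT_BYTES = 7168 * 2   # one token's hidden state in FP16 (assumed size)
HOPS_PER_LAYER = 2           # dispatch to expert + combine back

intra_rack = hop_time_us(TOKEN_ACT_BYTES, bw_gb_per_s=900, latency_us=2)   # NVLink-class link (assumed)
cross_node = hop_time_us(TOKEN_ACT_BYTES, bw_gb_per_s=50, latency_us=10)   # NIC-class link (assumed)

for name, per_hop in [("intra-rack", intra_rack), ("cross-node", cross_node)]:
    print(f"{name}: {per_hop:.2f} us/hop, "
          f"{per_hop * HOPS_PER_LAYER:.2f} us of communication per MoE layer")
```

Under these assumptions the transfer time is tiny either way; it is the fixed per-hop latency, multiplied across every MoE layer of every token, that separates an in-rack fabric from a cross-node network.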
📊 MoE Inference Results: DeepSeek-R1 #
Signal65 benchmarks using DeepSeek-R1 (FP4) illustrate the scale of the advantage. At a target of 75 tokens/s per GPU, the GB200 NVL72 running NVIDIA Dynamo achieved 7,707 tokens/s total throughput.
| Platform | Throughput (tokens/s) | Relative (MI355X = 1×) |
|---|---|---|
| GB200 NVL72 (Dynamo) | 7,707 | 28× |
| B200 (standard TensorRT) | Lower | ~4–6× |
| AMD MI355X (vLLM) | Baseline | 1× |
This delta is structural. Software alone cannot compensate for missing rack-scale coherency and bandwidth.
💰 Performance per Dollar and TCO #
The throughput advantage compounds directly into Total Cost of Ownership (TCO) benefits.
Based on public cloud pricing and measured throughput:
- Cost per generated token: GB200 NVL72 is approximately 15× cheaper than MI355X-class deployments in MoE-heavy scenarios.
- Performance per dollar:
  - ~3.1× advantage at 25 tokens/s
  - ~15× advantage at 75 tokens/s
Higher utilization, fewer racks, and lower networking overhead all contribute to the gap.
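The underlying arithmetic is simple enough to sketch. The hourly prices and the second system's throughput below are placeholders, not quotes; plugging in real cloud pricing and measured throughput reproduces a cost-per-token comparison of the kind summarized above.

```python
# Sketch of the cost-per-token arithmetic behind a TCO comparison.
# Prices and the second system's throughput are placeholders, not real data.

def dollars_per_million_tokens(hourly_price_usd: float, tokens_per_s: float) -> float:
    tokens_per_hour = tokens_per_s * 3600
    return hourly_price_usd / tokens_per_hour * 1e6

systems = {
    # name: (assumed hourly price for the whole deployment, tokens/s throughput)
    "system_a": (500.0, 7707.0),   # placeholder price; throughput from the benchmark above
    "system_b": (300.0, 500.0),    # placeholder price and throughput
}

costs = {name: dollars_per_million_tokens(price, tps) for name, (price, tps) in systems.items()}
baseline = max(costs.values())
for name, cost in costs.items():
    print(f"{name}: ${cost:.2f} per 1M tokens ({baseline / cost:.1f}x cheaper than baseline)")
```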
⚖️ Where MI355X Still Competes #
AMD’s MI355X remains strong for dense models:
- Large HBM3e capacity benefits monolithic parameter access.
- Dense inference and training workloads are less sensitive to inter-node latency.
However, AMD currently lacks an equivalent to NVIDIA’s NVLink Switch–based rack fabric, which limits MoE scaling efficiency at high concurrency.
🧭 What This Means for 2026 AI Infrastructure #
The GB200 NVL72 validates a major shift in AI system design:
- The rack, not the GPU, is the new unit of compute.
- Interconnect topology is now a first-order performance metric.
- MoE workloads amplify architectural differences that dense benchmarks hide.
🧩 Conclusion #
In MoE inference, system-level design dominates accelerator specifications. NVIDIA’s Blackwell NVL72 demonstrates that tightly integrated rack-scale architectures can unlock orders-of-magnitude gains that no single GPU upgrade can match. For MoE-driven AI deployment in 2026, the competitive edge belongs to platforms built as systems first, chips second.