Open Scale-Up Ethernet: The New Battleground for AI Infrastructure
By 2026, a fundamental shift has taken hold in AI system design: the move away from proprietary, closed interconnects (such as NVLink-style architectures) toward Open Scale-Up Ethernet.
This transition is not incremental—it is structural.
As AI models push beyond the trillion-parameter scale, the traditional distinction between compute and communication is breaking down. The network inside the server—connecting GPUs, CPUs, and accelerators—has become just as critical as the network between servers.
In this new reality, Ethernet is no longer just a transport layer.
It is becoming the fabric of the AI supercomputer.
⚖️ The Architectural Divide: Compatibility vs. Performance #
At the heart of today’s Scale-up Ethernet landscape lies a clear philosophical split. Every protocol is essentially answering the same question:
Should we evolve Ethernet—or replace it?
Option 1: The Compatible Path #
These designs extend traditional Ethernet rather than reinvent it.
- Preserve standard Layer 2 headers
- Use the EtherType field to signal AI-specific metadata
- Add reliability mechanisms at the link layer
Advantages:
- Seamless integration with existing switch infrastructure
- Lower engineering risk and faster deployment
- Mature tooling and observability
Trade-offs:
- Higher protocol overhead
- Reduced effective bandwidth (“goodput”)
- Incremental—not radical—performance gains
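The compatible path is easy to make concrete. The sketch below packs an illustrative frame with a standard 14-byte Ethernet header followed by a 4-byte extension; the IEEE "local experimental" EtherType `0x88B5` is used as a stand-in for whatever value a real protocol would register, and the extension carrying a sequence number is an assumption for illustration:

```python
import struct

# Illustrative compatible-path frame: a standard 14-byte Ethernet header,
# a custom EtherType signalling AI-specific metadata, then a 4-byte
# extension (here, a sequence number for link-layer retry).
ETHERTYPE_AI = 0x88B5  # IEEE local-experimental EtherType, a stand-in value

def build_frame(dst: bytes, src: bytes, seq: int, payload: bytes) -> bytes:
    eth_hdr = struct.pack("!6s6sH", dst, src, ETHERTYPE_AI)  # 14 bytes
    ext_hdr = struct.pack("!I", seq & 0xFFFFFFFF)            # 4-byte extension
    return eth_hdr + ext_hdr + payload

frame = build_frame(b"\x02" * 6, b"\x04" * 6, 7, b"\x00" * 128)
print(len(frame) - 128)  # 18 bytes of header ahead of the payload
```

Because the frame still begins with ordinary destination/source MACs and an EtherType, any existing switch can forward it unmodified; only endpoints that recognize the EtherType parse the extension.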
Option 2: The Optimized Path #
These protocols take a far more aggressive stance—removing legacy constraints entirely.
- Replace Ethernet headers with fused L2/L3 formats
- Encode routing, addressing, and control in compact frames
- Design for deterministic latency and maximum payload efficiency
Advantages:
- Near wire-speed efficiency
- Ultra-low latency and jitter
- Better scaling for tightly coupled GPU clusters
Trade-offs:
- Requires new switch ASICs
- Breaks compatibility with traditional Ethernet tooling
- Higher upfront ecosystem cost
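To see what a fused L2/L3 format buys, here is a hypothetical 12-byte header in which flat device IDs replace MAC addresses and routing/control bits share the same words. The field widths are illustrative assumptions, not taken from any published spec:

```python
import struct

# Hypothetical 12-byte fused L2/L3 header: flat device addresses replace
# 6-byte MACs, and virtual-channel, opcode, and sequence fields ride in
# one control word. Field layout is illustrative only.
def pack_fused_header(dst_dev: int, src_dev: int,
                      vc: int, opcode: int, seq: int) -> bytes:
    word0 = ((dst_dev & 0xFFFF) << 16) | (src_dev & 0xFFFF)              # addressing
    word1 = ((vc & 0xF) << 28) | ((opcode & 0xF) << 24) | (seq & 0xFFFFFF)  # control
    word2 = 0                                                            # reserved
    return struct.pack("!III", word0, word1, word2)

hdr = pack_fused_header(dst_dev=42, src_dev=7, vc=1, opcode=3, seq=1001)
print(len(hdr))  # 12
```

Collapsing addressing into 16-bit device IDs is what makes the header this small, and it is also why such formats need new switch ASICs: nothing in the installed base can route on these fields.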
🔬 The Five Contenders: Design Philosophies in Action #
1. ESUN v1.0 (OCP / Meta / Microsoft) #
ESUN represents the pragmatic baseline for Scale-up Ethernet adoption.
- Header Model: 14-byte Ethernet + 4-byte extension
- Core Feature: Link Layer Retry (LLR) for fast, localized retransmissions
- Design Goal: Balance compatibility with improved reliability
Why it matters:
ESUN is currently the default choice for hyperscalers in North America. It proves that meaningful gains can be achieved without breaking the Ethernet model.
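The mechanics of Link Layer Retry can be sketched in a few lines: the sender holds every frame in a replay buffer until the peer ACKs it, and on a NACK replays from the failed sequence number onward. Recovery stays local to one link, with no end-to-end retransmission. The class and method names below are illustrative, not from the ESUN specification:

```python
from collections import OrderedDict

# Minimal Link Layer Retry (LLR) sender sketch: frames wait in a replay
# buffer until cumulatively ACKed; a NACK triggers localized replay.
class LLRSender:
    def __init__(self):
        self.next_seq = 0
        self.replay = OrderedDict()  # seq -> frame, awaiting ACK

    def send(self, frame: bytes, tx) -> int:
        seq = self.next_seq
        self.replay[seq] = frame
        self.next_seq += 1
        tx(seq, frame)
        return seq

    def on_ack(self, seq: int):
        # Cumulative ACK: everything up to and including seq is delivered.
        for s in [s for s in self.replay if s <= seq]:
            del self.replay[s]

    def on_nack(self, seq: int, tx):
        # Replay every unACKed frame from the failed sequence onward.
        for s, frame in self.replay.items():
            if s >= seq:
                tx(s, frame)
```

Because the retry window only has to cover one link's round trip, the replay buffer stays small and recovery completes in nanoseconds-to-microseconds rather than the millisecond timescales of transport-layer retransmission.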
2. AFH Gen1 & Gen2 (Broadcom SUE) #
Broadcom’s Scale-Up Ethernet (SUE) strategy is deliberately dual-track.
- Gen1 (SUE-Lite):
  - Compatible approach similar to ESUN
  - Credit-based flow control for predictable latency
- Gen2 (Full SUE):
  - 12-byte fused header
  - Introduces SLAP (Structured Local Address Plan)
  - Eliminates traditional MAC learning
Why it matters:
Gen2 represents one of the most aggressive pushes toward hardware-optimized Ethernet, enabling near wire-speed switching with minimal jitter.
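Credit-based flow control, the Gen1 mechanism above, is simple to model: the receiver advertises buffer credits, the sender spends one per frame and stalls at zero, so receive queues can never overflow and queueing latency stays bounded. This is a generic sketch of the technique, not Broadcom's implementation:

```python
# Generic credit-based flow control sketch: a frame may only be sent
# while the sender holds a credit; the receiver returns credits as it
# drains its buffers, so overflow (and drop-based loss) is impossible.
class CreditLink:
    def __init__(self, credits: int):
        self.credits = credits  # initial value = receiver's buffer depth

    def try_send(self, frame) -> bool:
        if self.credits == 0:
            return False        # back-pressure: sender must wait
        self.credits -= 1
        return True

    def on_credit_return(self, n: int = 1):
        self.credits += n       # receiver freed n buffers
```

The predictability comes from sizing credits to the receiver's actual buffer depth: worst-case occupancy is known at design time, which is what keeps latency deterministic.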
3. ETH-X (Tencent / ODCC) #
ETH-X shifts the focus upward—to the transaction layer.
- Key Innovation: PAXI (Peer-to-Peer AXI)
  - Directly maps the GPU’s internal AXI bus onto the network
- Frame Format: 12-byte PRI header
- Efficiency: ~90% payload utilization for 128B transfers
Why it matters:
ETH-X blurs the boundary between on-chip communication and network transport, effectively extending the GPU’s internal fabric across nodes.
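The efficiency claim follows directly from the header size. A PAXI-style encoding might carry an on-chip AXI write (target address plus data) straight into a fabric frame, as in this sketch; the 64-bit-address/32-bit-length layout is an assumption chosen to fill the 12-byte header the text describes:

```python
import struct

# Hypothetical PAXI-style encoding: a GPU's native AXI write transaction
# (target address + data) packed directly into a fabric frame behind a
# 12-byte header. The field layout is illustrative only.
def pack_axi_write(addr: int, data: bytes) -> bytes:
    hdr = struct.pack("!QI", addr, len(data))  # 8B address + 4B length = 12B
    return hdr + data

frame = pack_axi_write(0x4000_0000, b"\x00" * 128)
print(len(frame))  # 140 = 12-byte header + 128B payload
```

For a 128B transfer, 128 / 140 ≈ 91.4% of the frame is payload, consistent with the ~90% figure above.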
4. OISA (China Mobile) #
OISA is built around hardware-software co-design for dense clusters.
- Header: Flat 16-byte format
- Core Mechanism: Tag-ID-based direct memory access
- Topology Focus: Symmetric CPU↔GPU and GPU↔GPU traffic
Scaling Model:
Optimized for up to 1,024 GPUs within a single scale-up domain.
Why it matters:
OISA prioritizes deterministic behavior and tight coupling, making it ideal for controlled, high-density deployments.
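Tag-ID-based direct memory access works by attaching a small tag to each outstanding request; the responder echoes the tag, so the initiator can land out-of-order completions at the right address. The sketch below is a generic model of the technique, with names and sizes that are assumptions rather than OISA specifics:

```python
# Generic tag-ID DMA sketch: each outstanding read carries a small tag;
# the response echoes it, letting completions land at the right buffer
# address regardless of arrival order.
class TagTable:
    def __init__(self, num_tags: int = 256):
        self.free = list(range(num_tags))
        self.pending = {}  # tag -> destination buffer address

    def issue_read(self, dest_addr: int) -> int:
        tag = self.free.pop()
        self.pending[tag] = dest_addr
        return tag         # tag travels in the request header

    def complete(self, tag: int, data: bytes, memory: dict):
        addr = self.pending.pop(tag)
        memory[addr] = data  # responses may arrive in any order
        self.free.append(tag)
```

The tag-table depth bounds the number of in-flight requests per link, which is one of the knobs a hardware-software co-design can tune for a fixed 1,024-GPU domain.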
5. ETH+ (Alibaba / HTE Alliance) #
ETH+ is arguably the most ambitious—and forward-looking—approach.
- Unified Fabric Vision: Handles both Scale-up and Scale-out
- Link-Bypass Mode: Removes traditional headers entirely in homogeneous environments
- IFEC (In-Fabric Extended Computation):
  - Performs operations like All-Reduce inside the switch
  - Reduces synchronization bottlenecks
Why it matters:
ETH+ is not just a transport protocol—it’s a distributed compute fabric, directly competing with proprietary solutions like in-network reduction engines.
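A toy model shows why in-fabric reduction helps: the switch sums one contribution per port and multicasts the result back, so each GPU sends and receives its gradient vector once, instead of circulating partial sums around a ring. This is purely illustrative, not the IFEC design:

```python
# Toy in-fabric All-Reduce: element-wise sum across all ingress ports,
# then multicast of the identical result, as a reduction ASIC would do.
def switch_allreduce(contributions: list[list[float]]) -> list[float]:
    return [sum(vals) for vals in zip(*contributions)]

gpus = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(switch_allreduce(gpus))  # [9.0, 12.0]
```

Per GPU, traffic drops from the ~2× vector size of a ring All-Reduce to 1× in each direction, and the synchronization point moves into the switch, off the GPUs' critical path.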
📊 Performance Snapshot: 2026 Comparison #
| Protocol | Header Size | Effective Payload (128B) | Core Advantage |
|---|---|---|---|
| ESUN v1.0 | 18 Bytes | 81.0% | Ecosystem Compatibility |
| AFH Gen2 | 12 Bytes | 84.2% | Ultra-low ASIC Latency |
| ETH-X | 16 Bytes | 88.9% | Native AXI Mapping |
| OISA 2.0 | 16 Bytes | 82.0% | Hardware Co-Design |
| ETH+ | 16 Bytes | 85.9% | Unified Fabric + In-Network Compute |
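The effective-payload column can be reproduced for the rows where the arithmetic is unambiguous, assuming overhead = header + 8B preamble/SFD + 4B FCS (inter-packet gap excluded); the ETH-X and ETH+ rows appear to count overhead differently and are omitted here:

```python
# Goodput for a 128B transfer, assuming per-frame overhead of the listed
# header plus 8B preamble/SFD and 4B FCS (no inter-packet gap).
PAYLOAD = 128
PHY_OVERHEAD = 8 + 4  # preamble/SFD + FCS

def goodput(header_bytes: int) -> float:
    return PAYLOAD / (PAYLOAD + header_bytes + PHY_OVERHEAD)

print(f"ESUN v1.0: {goodput(18):.1%}")  # 81.0%
print(f"AFH Gen2:  {goodput(12):.1%}")  # 84.2%
```

This makes the trade-off quantitative: shaving 6 bytes of header is worth about 3 points of goodput at 128B transfers, and proportionally more as payloads shrink.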
🔮 Future Trajectory: Toward “Everything over Ethernet” #
While fragmentation defines 2026, convergence is already underway. Several clear trends are shaping the next phase:
1. Memory Semantics Over Messaging #
Protocols are evolving toward load/store semantics, where remote GPU memory behaves like local memory.
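In load/store semantics, a remote GPU's memory is exposed as an address window, and ordinary reads and writes become fabric transactions instead of explicit send/receive messages. A minimal sketch, with all names invented for illustration:

```python
# Sketch of load/store semantics over a fabric: indexing the window
# issues remote reads/writes transparently, like local memory access.
class RemoteWindow:
    def __init__(self, fabric, base: int):
        self.fabric, self.base = fabric, base

    def __getitem__(self, offset: int):         # remote load
        return self.fabric.read(self.base + offset)

    def __setitem__(self, offset: int, value):  # remote store
        self.fabric.write(self.base + offset, value)

# Dict-backed stand-in for the fabric, for illustration only.
class FakeFabric:
    def __init__(self):
        self.mem = {}
    def read(self, addr):
        return self.mem[addr]
    def write(self, addr, value):
        self.mem[addr] = value

win = RemoteWindow(FakeFabric(), base=0x1000)
win[0] = 42       # looks like a local store; becomes a fabric write
print(win[0])     # 42
```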
2. Ultra-Compact Headers #
The industry is targeting sub-10-byte headers by 2028, driven by bandwidth pressure from Mixture-of-Experts (MoE) models.
3. In-Network Computing Becomes Mandatory #
Offloading collectives (e.g., All-Reduce) into switches is no longer optional—it’s essential to overcome the communication wall.
4. Scale-Up Meets Scale-Out #
The boundary between intra-node and inter-node networking is disappearing, pushing toward fabric unification.
5. Standardization vs. Reality #
Despite efforts toward open standards, the ecosystem remains fragmented.
In practice, hyperscalers are choosing protocols based on GPU vendor alignment and supply chain constraints, not ideology.
🧠 Final Take: What Actually Determines the Winner? #
The winner of the Scale-up Ethernet race will not simply be the protocol with the smallest header or highest theoretical throughput.
It will be the one that answers a far more difficult challenge:
Can 50,000 GPUs behave like a single coherent system?
- ESUN / SUE lead in real-world deployment and ecosystem readiness
- ETH+ leads in architectural ambition and long-term flexibility
The endgame is clear:
Ethernet is no longer just the network.
It is becoming the operating fabric of AI infrastructure.