
AWS Resilient Network Graphs: Reinventing Data Center Networking


For more than a decade, Clos and Fat-tree topologies have served as the foundation of hyperscale cloud networking. From enterprise data centers to the largest public cloud providers, hierarchical network fabrics have become the industry standard for delivering predictable bandwidth and operational simplicity.

However, traditional Fat-tree architectures face a fundamental tradeoff: achieving non-blocking performance requires significant overprovisioning of switches, optical transceivers, and fiber infrastructure, while cost-optimized deployments often suffer from congestion and inefficient utilization.

AWS’s Resilient Network Graphs (RNG) architecture introduces a fundamentally different approach. Rather than relying on hierarchical layers of aggregation and spine switches, RNG adopts a flat, expander-inspired topology that distributes connectivity across the network fabric. The result is a system capable of delivering equivalent, or in many scenarios superior, performance while reducing infrastructure costs by as much as 45% and lowering router requirements by up to 69%.

🌐 The Evolution Beyond Fat-Tree Networking

Modern cloud infrastructure demands massive east-west traffic capacity. Distributed storage systems, container orchestration platforms, microservices, AI workloads, and large-scale databases all require highly efficient server-to-server communication.

Traditional Fat-tree networks address this requirement through a layered design:

Servers
   │
   ▼
Top-of-Rack (ToR)
   │
   ▼
Aggregation Layer
   │
   ▼
Core / Spine Layer

Traffic between racks must climb the hierarchy, moving upward through aggregation switches and, when the destination sits in another part of the fabric, spine switches, before descending to its destination.

While this structure simplifies routing and operations, it introduces several limitations.

Limited Capacity Flexibility

Traffic flows are constrained to a predefined set of paths determined by the network hierarchy.

When certain links become congested:

  • Alternative paths may remain underutilized.
  • Available bandwidth cannot be fully exploited.
  • Traffic hotspots emerge despite unused network capacity elsewhere.

This phenomenon leads to inefficient resource utilization and forces operators to provision excess infrastructure.

Cost Escalation

To maintain predictable performance under bursty workloads, cloud providers frequently deploy:

  • Additional spine switches
  • Redundant uplinks
  • Extra optical transceivers
  • Larger fiber footprints

As network scale increases, these costs grow rapidly.
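For a rough sense of that growth, consider the textbook three-tier k-ary fat-tree built from identical k-port switches, which supports k³/4 servers and needs 5k²/4 switches. This is the standard academic construction, not a description of any AWS fabric, and `fat_tree_size` is an illustrative helper name. A minimal sketch of the arithmetic:

```python
def fat_tree_size(k: int) -> dict:
    """Size of a textbook three-tier k-ary fat-tree built from k-port switches."""
    if k < 2 or k % 2:
        raise ValueError("k must be an even port count of at least 2")
    edge = k * k // 2          # k pods, each with k/2 edge (ToR) switches
    aggregation = k * k // 2   # k pods, each with k/2 aggregation switches
    core = (k // 2) ** 2       # (k/2)^2 core / spine switches
    return {
        "servers": k ** 3 // 4,
        "switches": edge + aggregation + core,
        "edge": edge,
        "aggregation": aggregation,
        "core": core,
    }


for k in (16, 32, 64):
    size = fat_tree_size(k)
    print(f"k={k}: {size['servers']} servers, {size['switches']} switches")
```

In this model, only the edge switches attach servers; the aggregation and core tiers, three fifths of all switches, exist purely to interconnect them.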

Concentrated Failure Domains

Although Fat-tree architectures offer redundancy, upper-layer failures can still impact large portions of the network.

For example:

  • A failed spine switch may significantly reduce available bandwidth.
  • Maintenance operations often affect entire sections of the hierarchy.
  • Capacity reductions tend to be localized and highly visible.

These limitations motivated researchers to explore alternative topologies.

🔬 Why Expander Graphs Have Long Been Considered Ideal

For years, networking researchers have viewed expander graphs as one of the most promising architectures for large-scale distributed systems.

Unlike hierarchical networks, expander graphs distribute connectivity more uniformly across the fabric.

Key Advantages of Expander Networks

High Connectivity

Every subset of switches maintains extensive connectivity to the rest of the network.

Benefits include:

  • Better bandwidth utilization
  • Improved traffic distribution
  • Reduced hotspot formation
  • Greater path diversity
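To get a feel for such a fabric, the sketch below wires switches together at random until every switch has used its uplinks, a common way to approximate an expander, and then measures hop counts with a breadth-first search. The greedy pairing rule, the function names, and the sizes are all assumptions chosen for illustration; this is not how AWS constructs RNG.

```python
import random
from collections import deque


def random_fabric(n_switches, uplinks, seed=0):
    """Join random pairs of switches that both have a free port until none remain."""
    rng = random.Random(seed)
    adj = {s: set() for s in range(n_switches)}
    while True:
        free = [s for s in adj if len(adj[s]) < uplinks]
        # Candidate links: two distinct switches with free ports, not yet connected.
        pairs = [(a, b) for i, a in enumerate(free)
                 for b in free[i + 1:] if b not in adj[a]]
        if not pairs:
            return adj
        a, b = rng.choice(pairs)
        adj[a].add(b)
        adj[b].add(a)


def hops_from(adj, src):
    """Breadth-first search: hop count from src to every reachable switch."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


fabric = random_fabric(n_switches=40, uplinks=6)
dist = hops_from(fabric, 0)
print("reachable switches:", len(dist), "max hops:", max(dist.values()))
```

The greedy rule can leave a few ports unused, and it does not guarantee a connected graph, so a real design would need a repair step and a connectivity check.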

Strong Fault Tolerance

Failures have a proportional impact rather than a concentrated one.

For example:

  • Losing 1% of devices typically removes approximately 1% of capacity.
  • No single switch becomes a critical bottleneck.
  • Network degradation remains predictable.

Lower Infrastructure Costs

Expander designs can achieve similar throughput levels while eliminating significant portions of traditional aggregation infrastructure.

Potential savings include:

  • Fewer switches
  • Fewer optical links
  • Reduced rack space
  • Lower power consumption

Despite these advantages, expander-based networks remained largely confined to academic research for over a decade.

🚧 The Three Barriers Preventing Commercial Adoption

Although concepts such as the Jellyfish topology demonstrated impressive theoretical performance, practical deployment faced three major challenges.

Routing Complexity

Fat-tree networks benefit from simple routing mechanisms based largely on shortest-path forwarding.

Expander networks introduce:

  • Large numbers of possible paths
  • Dynamic traffic patterns
  • Complex forwarding decisions

Many proposed solutions required enormous routing tables that exceeded the capabilities of commodity switching hardware.

Physical Cabling Challenges

Randomized topologies create highly irregular connectivity patterns.

In a real data center this can result in:

  • Long-distance fiber runs
  • Complex cable management
  • Difficult expansion procedures
  • Increased deployment risk

Adding new racks often requires rewiring existing infrastructure.

Lack of Predictable Design Models

Network architects require deterministic planning tools.

Traditional expander proposals often relied heavily on simulations rather than analytical models, making it difficult to answer practical questions such as:

  • How many switches are required?
  • How many uplinks should each switch have?
  • What oversubscription ratio will result?

Without these answers, standardization becomes difficult.
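The last of those questions has a simple port-level definition: the oversubscription ratio of a switch is its server-facing bandwidth divided by its fabric-facing bandwidth. The sketch below computes it for a single ToR; the function name and the port counts are illustrative assumptions, not figures from AWS.

```python
def oversubscription(server_ports, server_gbps, uplink_ports, uplink_gbps):
    """Ratio of server-facing to fabric-facing bandwidth on one ToR switch."""
    downlink = server_ports * server_gbps
    uplink = uplink_ports * uplink_gbps
    if uplink <= 0:
        raise ValueError("a switch needs some uplink bandwidth")
    return downlink / uplink


# 48 servers at 25 Gb/s behind 6 uplinks at 100 Gb/s
print(oversubscription(48, 25, 6, 100))
# 32 servers at 100 Gb/s behind 8 uplinks at 400 Gb/s
print(oversubscription(32, 100, 8, 400))
```

In a flat fabric, where ToR switches may also carry transit traffic for other racks, this per-switch ratio is only a starting point, which is part of why whole-network models matter.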

🚀 AWS RNG: Three Innovations That Changed the Equation

AWS addressed each of these historical limitations through a combination of routing innovation, optical infrastructure design, and mathematical modeling.

🔀 Spraypoint Routing

At the heart of RNG is Spraypoint, a distributed routing algorithm designed specifically for large-scale expander-style fabrics.

Rather than maintaining extensive path information, Spraypoint uses a two-stage forwarding process.

Source Spraying

When traffic enters the network, the ingress switch distributes packets across multiple uplinks using Equal-Cost Multi-Path (ECMP) hashing.

Source ToR
     │
     ▼
Randomized ECMP Distribution
     │
     ▼
Multiple Intermediate Paths

This immediately disperses traffic across the fabric.
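As a generic illustration of ECMP hashing (not Spraypoint's actual hash, which is not described here), the sketch below maps a flow's 5-tuple to one of several uplinks. CRC32 stands in for whatever hash a switch ASIC uses, and the uplink names, addresses, and per-flow granularity are assumptions.

```python
import zlib


def ecmp_uplink(flow, uplinks):
    """Pick an uplink by hashing the flow's 5-tuple (generic per-flow ECMP)."""
    key = "|".join(str(field) for field in flow).encode()
    return uplinks[zlib.crc32(key) % len(uplinks)]


uplinks = ["up0", "up1", "up2", "up3"]
# (src IP, dst IP, protocol, src port, dst port); only the source port varies.
flows = [("10.0.0.1", "10.0.9.7", 6, 40000 + i, 443) for i in range(8)]
for flow in flows:
    print(flow[3], "->", ecmp_uplink(flow, uplinks))
```

Because the choice depends only on the flow's header fields, packets of one flow stay on one uplink and arrive in order, while different flows spread across the available uplinks.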

Destination Convergence

To prevent uncontrolled path wandering, AWS introduces destination-oriented waypoint guidance.

Source
   │
   ▼
Randomized Distribution
   │
   ▼
Waypoint Guidance
   │
   ▼
Destination

Packets gradually converge toward predefined waypoint tiers near the destination.

The result is:

  • High path diversity
  • Low routing-table requirements
  • Efficient load balancing
  • Compatibility with commodity switching hardware

Topology Hiding

A particularly elegant aspect of Spraypoint is its topology abstraction.

Switches do not need complete knowledge of the entire network.

Instead, they only determine:

  • Relative destination tier
  • Appropriate ECMP forwarding decision

This keeps routing state requirements comparable to conventional Fat-tree deployments.
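The spray-then-converge idea, and the point that a switch only needs a destination's tier, can be sketched in a few lines. This is a toy model under stated assumptions, not AWS's Spraypoint: "tier" is taken to mean hop distance from the destination, the first hop is chosen uniformly at random, every later hop picks at random among neighbors in a lower tier, and the fabric is assumed to be connected.

```python
import random
from collections import deque


def tiers_to(adj, dst):
    """Hop distance of every switch from dst (tier 0 is dst itself)."""
    tier = {dst: 0}
    queue = deque([dst])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in tier:
                tier[v] = tier[u] + 1
                queue.append(v)
    return tier


def spray_then_converge(adj, src, dst, rng):
    """Stage 1: random first hop. Stage 2: always step to a lower tier."""
    tier = tiers_to(adj, dst)
    path = [src]
    if src == dst:
        return path
    node = rng.choice(sorted(adj[src]))      # spray over any uplink
    path.append(node)
    while node != dst:
        closer = [n for n in sorted(adj[node]) if tier[n] < tier[node]]
        node = rng.choice(closer)            # ECMP among "closer" neighbors
        path.append(node)
    return path


# Toy 8-switch fabric: switches are adjacent when their IDs differ in one bit.
cube = {s: {s ^ 1, s ^ 2, s ^ 4} for s in range(8)}
rng = random.Random(7)
for _ in range(3):
    print(spray_then_converge(cube, 0, 7, rng))
```

In this model each forwarding decision uses only the tiers of the current switch and its neighbors, and a path is never more than two hops longer than the shortest one, because the sprayed first hop can move at most one tier away from the destination.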

🔌 ShuffleBox: Solving the Cabling Problem

Routing alone cannot solve physical deployment challenges.

To address infrastructure complexity, AWS developed ShuffleBox, a passive optical interconnection system.

Unlike active switching equipment, ShuffleBox contains:

  • No processors
  • No software
  • No power requirements

Its purpose is purely organizational: it fixes how fibers are interconnected and makes no forwarding decisions of its own.

Core Architecture

Each ShuffleBox includes:

  • Local switch-facing ports (R-ports)
  • Backbone connectivity ports (C-ports)
  • Internal passive optical interconnections

Combined with companion ShuffleBack adapters, the system creates controlled randomized connectivity without requiring chaotic physical cabling.
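No internal wiring details are given here, so the following is only a mental model: a passive box can be thought of as a fixed one-to-one mapping from R-ports to C-ports, chosen once and never changed. The equal port counts, the seeded shuffle, and the name `make_shufflebox` are all hypothetical stand-ins for whatever pattern AWS actually uses.

```python
import random


def make_shufflebox(n_ports, seed):
    """Fixed one-to-one wiring from R-port index to C-port index."""
    c_ports = list(range(n_ports))
    random.Random(seed).shuffle(c_ports)   # decided once, then baked into the box
    return dict(enumerate(c_ports))


box = make_shufflebox(n_ports=8, seed=42)
for r_port, c_port in box.items():
    print(f"R{r_port} -> C{c_port}")
```

The point of the model is that the randomness lives inside the box, so the cables plugged into it can follow a regular, repeatable layout.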

Simplified Fabric Deployment

Traditional random topologies often resemble a wiring nightmare.

ShuffleBox changes the model.

Local ToR Switches
         │
         ▼
     ShuffleBox
         │
         ▼
 Backbone Links
         │
         ▼
 Remote ShuffleBox

Benefits include:

  • Short local cable runs
  • Simplified deployment
  • Predictable infrastructure layouts
  • Easier expansion

When new halls are added, only a limited number of backbone connections require modification.

Existing switch wiring remains untouched.

📊 Mathematical Models Replace Guesswork

One of the most significant contributions of RNG is the introduction of analytical performance models.

Rather than relying exclusively on simulation, AWS developed formulas capable of predicting network behavior.

Predictable Engineering Design

Using switch count, port allocations, and routing parameters, architects can estimate:

  • Independent path counts
  • Path length distributions
  • Oversubscription ratios
  • Network capacity

This transforms RNG from a research concept into a repeatable engineering framework.

Path Length Efficiency

Despite its randomized nature, AWS reports that most traffic traverses only 3 or 4 hops.

In many scenarios, average path lengths are actually shorter than those found in traditional multi-tier Fat-tree fabrics.
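One reason short paths are plausible in a flat fabric is the Moore bound, a standard graph-theory limit rather than one of AWS's models: in a graph where every switch has d links, at most 1 + d + d(d-1) + d(d-1)² + ... switches can lie within a given number of hops of any one switch. The sketch below evaluates that bound; the function names and the degree of 32 are illustrative assumptions.

```python
def moore_reach(degree, hops):
    """Upper bound on switches within `hops` of one switch in a degree-regular graph."""
    reach, frontier = 1, degree
    for _ in range(hops):
        reach += frontier
        frontier *= degree - 1
    return reach


def min_hops(n_switches, degree):
    """Fewest hops at which the bound could cover n_switches."""
    if degree < 2:
        raise ValueError("degree must be at least 2")
    hops = 0
    while moore_reach(degree, hops) < n_switches:
        hops += 1
    return hops


for hops in range(1, 4):
    print(hops, "hops:", moore_reach(32, hops), "switches at most")
```

The bound is optimistic, since real graphs waste some links on switches that were already reachable, but it shows that a few hops can in principle span tens of thousands of switches once each switch has a few dozen fabric links.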

📈 Production Deployment Results

AWS deployed RNG in production environments across two primary categories:

  • Server Mesh networks
  • Edge Mesh infrastructures

The company then compared RNG directly against equivalent Fat-tree deployments using identical hardware resources.

⚡ Performance Under Real Workloads

Testing focused on three common traffic patterns.

Clique Traffic

Every node communicates with every other node.

Typical examples include:

  • Distributed storage
  • Data replication
  • Analytics frameworks

Hub Traffic

Many devices communicate with a small set of central systems.

Examples include:

  • Databases
  • API gateways
  • Metadata services

Matching Traffic

Nodes communicate in one-to-one pairs, a pattern designed to expose congestion bottlenecks.
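The three patterns are easy to state as lists of (source, destination) pairs. The generators below are a minimal sketch with assumed names and an assumed reading of each pattern (for example, that hub nodes do not send to each other); they are not AWS's test harness.

```python
import random


def clique(nodes):
    """Every node sends to every other node."""
    return [(s, d) for s in nodes for d in nodes if s != d]


def hub(nodes, hubs):
    """Every non-hub node sends to each hub."""
    return [(s, h) for s in nodes if s not in hubs for h in hubs]


def matching(nodes, seed=0):
    """Random one-to-one pairing; with an odd count one node sits out."""
    order = list(nodes)
    random.Random(seed).shuffle(order)
    return list(zip(order[0::2], order[1::2]))


racks = list(range(6))
print(len(clique(racks)), "clique flows")
print(len(hub(racks, hubs={0, 1})), "hub flows")
print(matching(racks))
```

Clique spreads load evenly, hub concentrates it on a few destinations, and matching gives each pair a single path-sensitive flow, which is why the three together exercise a fabric in different ways.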

Performance Findings

Under light utilization levels:

  • Fat-tree maintained a small advantage.
  • Performance differences ranged between 5% and 10%.

Under moderate and heavy workloads:

  • RNG delivered up to 33% higher throughput.
  • Bandwidth utilization improved significantly.
  • Congestion handling became more efficient.

For cloud environments characterized by bursty multi-tenant traffic, RNG demonstrated clear advantages.

💰 Cost and Infrastructure Savings

One of RNG’s most compelling benefits is infrastructure efficiency.

Non-Blocking Deployments

For fully non-blocking designs, AWS reported an infrastructure cost reduction of approximately 9%.

Commercial Oversubscribed Deployments

For practical production environments using moderate oversubscription, hardware costs were up to 45% lower.

Savings come primarily from reductions in:

  • Aggregation switches
  • Spine switches
  • Optical transceivers
  • Supporting infrastructure

Router Reduction

AWS also reported up to 69% fewer routers compared with equivalent Fat-tree architectures.

🛠 Addressing Random Topology Challenges

Randomized networks can potentially increase latency due to longer physical paths.

AWS mitigated this through two optimizations.

Localized Waypoint Preference

Routing algorithms prioritize nearby waypoints whenever possible.

Benefits include:

  • Reduced hop distance
  • Lower propagation delay
  • Improved traffic locality

Controlled Backbone Connectivity

AWS carefully limits long-distance inter-hall links while increasing local connectivity density.

This balances:

  • Capacity
  • Cost
  • Latency

The result is latency performance that closely matches traditional Fat-tree networks.

📦 Scaling During Incremental Deployment

Another challenge involves partially populated facilities.

Early deployment phases often suffer from limited connectivity density.

AWS addressed this by introducing staged onboarding procedures.

Rather than connecting every rack immediately:

  1. Initial rack groups are deployed.
  2. Connectivity stabilizes.
  3. Additional racks are gradually integrated.

This strategy maintains high uplink utilization even when facilities are only partially occupied.

🔧 Operational Considerations

Moving from a hierarchical architecture to a flat mesh introduces operational changes.

Tooling Adaptation

Many cloud management platforms assume:

  • Spine layers
  • Aggregation layers
  • Hierarchical fault domains

RNG requires updates to:

  • Monitoring systems
  • Troubleshooting tools
  • Capacity planning software

Maintenance Workflows

Traditional maintenance procedures often isolate entire rows or layers.

In a mesh network:

  • Maintenance becomes more distributed.
  • Upgrade strategies must adapt to graph-based connectivity.
  • Operational automation becomes increasingly important.

These changes primarily affect software and operational processes rather than physical infrastructure.

🎯 Strategic Significance

AWS’s RNG architecture represents more than an incremental networking improvement.

It demonstrates that large-scale expander networks can finally move beyond academic research and into commercial production.

Commercializing Expander Graphs

For over a decade, expander topologies remained theoretically attractive but operationally impractical.

RNG establishes a viable path toward large-scale deployment.

Redefining Cloud Economics

By simultaneously improving throughput and reducing infrastructure costs, RNG fundamentally changes the cost-performance equation for hyperscale cloud providers.

A Foundation for Future Compute Clusters

Although AWS currently targets general-purpose cloud workloads, RNG introduces concepts that could influence future architectures for:

  • AI infrastructure
  • Large-scale inference clusters
  • Distributed training environments
  • High-performance computing systems

As compute clusters continue growing in size and complexity, flat graph-based networking may become an increasingly attractive alternative to traditional hierarchical fabrics.

🔮 Conclusion

AWS’s Resilient Network Graphs architecture represents one of the most significant data center networking innovations in recent years. By combining the theoretical strengths of expander graphs with practical solutions such as Spraypoint routing, ShuffleBox optical infrastructure, and analytical design models, AWS has transformed a long-standing academic concept into a production-ready cloud networking platform.

The results are compelling: lower infrastructure costs, fewer networking devices, improved resilience, and stronger performance under real-world cloud traffic patterns. More importantly, RNG challenges the assumption that hierarchical Clos and Fat-tree fabrics are the inevitable end state for hyperscale networking.

As cloud providers continue searching for more efficient ways to scale infrastructure, RNG may prove to be the beginning of a broader shift toward flat, graph-based network architectures capable of supporting the next generation of distributed computing systems.
