AWS Resilient Network Graphs: Reinventing Data Center Networking
For more than a decade, Clos and Fat-tree topologies have served as the foundation of hyperscale cloud networking. From enterprise data centers to the largest public cloud providers, hierarchical network fabrics have become the industry standard for delivering predictable bandwidth and operational simplicity.
However, traditional Fat-tree architectures face a fundamental tradeoff: achieving non-blocking performance requires significant overprovisioning of switches, optical transceivers, and fiber infrastructure, while cost-optimized deployments often suffer from congestion and inefficient utilization.
AWS’s Resilient Network Graphs (RNG) architecture introduces a fundamentally different approach. Rather than relying on hierarchical layers of aggregation and spine switches, RNG adopts a flat, expander-inspired topology that distributes connectivity across the network fabric. The result is a system capable of delivering equivalent, or in many scenarios superior, performance while reducing infrastructure costs by as much as 45% and router counts by as much as 69%.
The Evolution Beyond Fat-Tree Networking #
Modern cloud infrastructure demands massive east-west traffic capacity. Distributed storage systems, container orchestration platforms, microservices, AI workloads, and large-scale databases all require highly efficient server-to-server communication.
Traditional Fat-tree networks address this requirement through a layered design:
Servers
│
▼
Top-of-Rack (ToR)
│
▼
Aggregation Layer
│
▼
Core / Spine Layer
Traffic between servers in different racks must climb the hierarchy, passing through the aggregation layer and, for destinations outside the local pod, the core/spine layer, before descending to its destination.
While this structure simplifies routing and operations, it introduces several limitations.
Limited Capacity Flexibility #
Traffic flows are constrained to a predefined set of paths determined by the network hierarchy.
When certain links become congested:
- Alternative paths may remain underutilized.
- Available bandwidth cannot be fully exploited.
- Traffic hotspots emerge despite unused network capacity elsewhere.
This phenomenon leads to inefficient resource utilization and forces operators to provision excess infrastructure.
Cost Escalation #
To maintain predictable performance under bursty workloads, cloud providers frequently deploy:
- Additional spine switches
- Redundant uplinks
- Extra optical transceivers
- Larger fiber footprints
As network scale increases, these costs grow rapidly.
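To make the scaling concrete, the textbook three-tier k-ary fat-tree (built entirely from k-port switches) can be sized in a few lines. This is a sketch of the standard construction only, not a description of any AWS fabric, and the function name `fat_tree_size` is made up for illustration.

```python
def fat_tree_size(k: int) -> dict[str, int]:
    """Component counts for a classic three-tier k-ary fat-tree of k-port switches."""
    if k <= 0 or k % 2:
        raise ValueError("k must be a positive even port count")
    half = k // 2
    edge = k * half          # k pods, k/2 edge (ToR) switches per pod
    aggregation = k * half   # k pods, k/2 aggregation switches per pod
    core = half * half       # (k/2)^2 core switches
    return {
        "hosts": k * half * half,             # k/2 hosts per edge switch
        "switches": edge + aggregation + core,
        # edge-to-aggregation links plus aggregation-to-core links
        "fabric_links": 2 * k * half * half,
    }


for ports in (4, 48):
    print(ports, fat_tree_size(ports))
```

For k = 48 this works out to 27,648 hosts behind 2,880 switches and 55,296 switch-to-switch links, each of which needs optics at both ends.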
Concentrated Failure Domains #
Although Fat-tree architectures offer redundancy, upper-layer failures can still impact large portions of the network.
For example:
- A failed spine switch may significantly reduce available bandwidth.
- Maintenance operations often affect entire sections of the hierarchy.
- Capacity reductions tend to be localized and highly visible.
These limitations motivated researchers to explore alternative topologies.
Why Expander Graphs Have Long Been Considered Ideal #
For years, networking researchers have viewed expander graphs as one of the most promising architectures for large-scale distributed systems.
Unlike hierarchical networks, expander graphs distribute connectivity more uniformly across the fabric.
Key Advantages of Expander Networks #
High Connectivity #
Every subset of switches maintains extensive connectivity to the rest of the network.
Benefits include:
- Better bandwidth utilization
- Improved traffic distribution
- Reduced hotspot formation
- Greater path diversity
Strong Fault Tolerance #
Failures have a proportional impact rather than a concentrated one.
For example:
- A 1% device failure typically results in approximately 1% capacity loss.
- No single switch becomes a critical bottleneck.
- Network degradation remains predictable.
Lower Infrastructure Costs #
Expander designs can achieve similar throughput levels while eliminating significant portions of traditional aggregation infrastructure.
Potential savings include:
- Fewer switches
- Fewer optical links
- Reduced rack space
- Lower power consumption
Despite these advantages, expander-based networks remained largely confined to academic research for over a decade.
The Three Barriers Preventing Commercial Adoption #
Although concepts such as the Jellyfish topology demonstrated impressive theoretical performance, practical deployment faced three major challenges.
Routing Complexity #
Fat-tree networks benefit from simple routing mechanisms based largely on shortest-path forwarding.
Expander networks introduce:
- Large numbers of possible paths
- Dynamic traffic patterns
- Complex forwarding decisions
Many proposed solutions required enormous routing tables that exceeded the capabilities of commodity switching hardware.
Physical Cabling Challenges #
Randomized topologies create highly irregular connectivity patterns.
In a real data center this can result in:
- Long-distance fiber runs
- Complex cable management
- Difficult expansion procedures
- Increased deployment risk
Adding new racks often requires rewiring existing infrastructure.
Lack of Predictable Design Models #
Network architects require deterministic planning tools.
Traditional expander proposals often relied heavily on simulations rather than analytical models, making it difficult to answer practical questions such as:
- How many switches are required?
- How many uplinks should each switch have?
- What oversubscription ratio will result?
Without these answers, standardization becomes difficult.
AWS RNG: Three Innovations That Changed the Equation #
AWS addressed each of these historical limitations through a combination of routing innovation, optical infrastructure design, and mathematical modeling.
Spraypoint Routing #
At the heart of RNG is Spraypoint, a distributed routing algorithm designed specifically for large-scale expander-style fabrics.
Rather than maintaining extensive path information, Spraypoint uses a two-stage forwarding process.
Source Spraying #
When traffic enters the network, the ingress switch distributes packets across multiple uplinks using Equal-Cost Multi-Path (ECMP) hashing.
Source ToR
│
▼
Randomized ECMP Distribution
│
▼
Multiple Intermediate Paths
This immediately disperses traffic across the fabric.
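The spraying step can be illustrated with a generic ECMP-style selector. The sketch below hashes a flow's 5-tuple and uses the result to choose an uplink. Real switches do this in hardware with their own hash functions; SHA-256, the function name `ecmp_pick`, and the port names are stand-ins chosen for this example, not details from AWS.

```python
import hashlib


def ecmp_pick(flow: tuple, uplinks: list[str]) -> str:
    """Choose an uplink for a flow by hashing its 5-tuple (illustrative only)."""
    if not uplinks:
        raise ValueError("at least one uplink is required")
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    # Reduce the first 8 bytes of the digest to an index into the uplink list.
    return uplinks[int.from_bytes(digest[:8], "big") % len(uplinks)]


uplinks = ["up0", "up1", "up2", "up3"]
# (source IP, destination IP, protocol, source port, destination port)
flow = ("10.0.0.5", "10.0.9.7", 6, 49152, 443)
print(ecmp_pick(flow, uplinks))
```

Because the choice depends only on the flow fields, packets of one flow stay on one uplink while different flows spread across all of them.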
Destination Convergence #
To prevent uncontrolled path wandering, AWS introduces destination-oriented waypoint guidance.
Source
│
▼
Randomized Distribution
│
▼
Waypoint Guidance
│
▼
Destination
Packets gradually converge toward predefined waypoint tiers near the destination.
The result is:
- High path diversity
- Low routing-table requirements
- Efficient load balancing
- Compatibility with commodity switching hardware
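The spray-then-converge idea can be sketched on a toy graph. The code below is not Spraypoint itself: it takes one random hop, then repeatedly forwards to a random neighbor that is strictly closer to the destination, using BFS hop distance as a stand-in for the waypoint tiers described above. The circulant test graph and the names `circulant`, `hop_counts`, and `route` are all assumptions made for illustration.

```python
import random
from collections import deque


def circulant(n: int, offsets: tuple) -> dict[int, list[int]]:
    """Toy flat topology: switch v links to v +/- each offset (mod n)."""
    return {v: sorted({(v + s) % n for o in offsets for s in (o, -o)})
            for v in range(n)}


def hop_counts(graph: dict, dst: int) -> dict[int, int]:
    """BFS hop distance from every switch to dst."""
    dist = {dst: 0}
    queue = deque([dst])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def route(graph: dict, src: int, dst: int, rng: random.Random,
          spray_hops: int = 1) -> list[int]:
    path, node = [src], src
    # Stage 1: spray. Take random hops to disperse load.
    for _ in range(spray_hops):
        if node == dst:
            break
        node = rng.choice(graph[node])
        path.append(node)
    # Stage 2: converge. Only move to neighbors strictly closer to dst.
    dist = hop_counts(graph, dst)
    while node != dst:
        node = rng.choice([w for w in graph[node] if dist[w] < dist[node]])
        path.append(node)
    return path


fabric = circulant(16, (1, 4))
print(route(fabric, 0, 9, random.Random(7)))
```

Each switch in this sketch needs only a distance value per destination and a set of equally good next hops, which is the kind of small, ECMP-friendly state the section describes.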
Topology Hiding #
A particularly elegant aspect of Spraypoint is its topology abstraction.
Switches do not need complete knowledge of the entire network.
Instead, they only determine:
- Relative destination tier
- Appropriate ECMP forwarding decision
This keeps routing state requirements comparable to conventional Fat-tree deployments.
ShuffleBox: Solving the Cabling Problem #
Routing alone cannot solve physical deployment challenges.
To address infrastructure complexity, AWS developed ShuffleBox, a passive optical interconnection system.
Unlike active switching equipment, ShuffleBox contains:
- No processors
- No software
- No power requirements
Its purpose is purely organizational.
Core Architecture #
Each ShuffleBox includes:
- Local switch-facing ports (R-ports)
- Backbone connectivity ports (C-ports)
- Internal passive optical interconnections
Combined with companion ShuffleBack adapters, the system creates controlled randomized connectivity without requiring chaotic physical cabling.
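One way to picture a passive shuffle is as a fixed, one-to-one wiring table between the two port groups. The source does not describe ShuffleBox's internal wiring pattern, so the sketch below simply assumes equal numbers of R-ports and C-ports and a permutation fixed at manufacture (modelled here with a seeded shuffle). `make_shuffle_map` and the port labels are hypothetical.

```python
import random


def make_shuffle_map(num_ports: int, seed: int) -> dict[str, str]:
    """Fixed R-port to C-port wiring, modelled as a seeded permutation."""
    rng = random.Random(seed)      # the seed stands in for a fixed factory layout
    c_order = list(range(num_ports))
    rng.shuffle(c_order)
    return {f"R{r}": f"C{c}" for r, c in enumerate(c_order)}


wiring = make_shuffle_map(8, seed=2024)
for r_port, c_port in wiring.items():
    print(r_port, "->", c_port)
```

Because the table never changes after it is built, the randomness lives inside the box while the cables plugged into it can follow a regular, predictable layout.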
Simplified Fabric Deployment #
Traditional random topologies often resemble a wiring nightmare.
ShuffleBox changes the model.
Local ToR Switches
│
▼
ShuffleBox
│
▼
Backbone Links
│
▼
Remote ShuffleBox
Benefits include:
- Short local cable runs
- Simplified deployment
- Predictable infrastructure layouts
- Easier expansion
When new halls are added, only a limited number of backbone connections require modification.
Existing switch wiring remains untouched.
Mathematical Models Replace Guesswork #
One of the most significant contributions of RNG is the introduction of analytical performance models.
Rather than relying exclusively on simulation, AWS developed formulas capable of predicting network behavior.
Predictable Engineering Design #
Using switch count, port allocations, and routing parameters, architects can estimate:
- Independent path counts
- Path length distributions
- Oversubscription ratios
- Network capacity
This transforms RNG from a research concept into a repeatable engineering framework.
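AWS's own formulas are not reproduced here, but two textbook calculations show the flavor of closed-form planning: the oversubscription ratio of a single switch, and the Moore bound, which gives the minimum number of hops any topology with a given switch count and fabric degree needs in order to reach every switch. Function and parameter names below are illustrative assumptions.

```python
def oversubscription(server_ports: int, server_gbps: float,
                     uplink_ports: int, uplink_gbps: float) -> float:
    """Ratio of server-facing bandwidth to fabric-facing bandwidth on one switch."""
    return (server_ports * server_gbps) / (uplink_ports * uplink_gbps)


def moore_min_diameter(num_switches: int, degree: int) -> int:
    """Smallest hop count h at which a degree-regular graph could reach every switch.

    Within h hops a switch can reach at most
    1 + degree * sum((degree - 1) ** i for i in range(h)) switches.
    """
    if degree < 3:
        raise ValueError("this bound is only meaningful for degree >= 3")
    reachable, frontier, hops = 1, degree, 0
    while reachable < num_switches:
        hops += 1
        reachable += frontier
        frontier *= degree - 1
    return hops


print(oversubscription(48, 25, 4, 100))   # 3.0, i.e. 3:1 oversubscribed
print(moore_min_diameter(1000, 32))       # 2
```

The Moore bound is only a lower limit. Real topologies usually need more hops, but it indicates how quickly reach grows with fabric degree.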
Path Length Efficiency #
Despite its randomized nature, AWS reports that most traffic traverses only 3 or 4 hops.
In many scenarios, average path lengths are actually shorter than those found in traditional multi-tier Fat-tree fabrics.
Production Deployment Results #
AWS deployed RNG in production environments across two primary categories:
- Server Mesh networks
- Edge Mesh infrastructures
The company then compared RNG directly against equivalent Fat-tree deployments using identical hardware resources.
Performance Under Real Workloads #
Testing focused on three common traffic patterns.
Clique Traffic #
Every node communicates with every other node.
Typical examples include:
- Distributed storage
- Data replication
- Analytics frameworks
Hub Traffic #
Many devices communicate with a small set of central systems.
Examples include:
- Databases
- API gateways
- Metadata services
Matching Traffic #
One-to-one communication patterns designed to expose congestion bottlenecks.
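The three patterns are easy to express as lists of (source, destination) pairs. The generators below are a sketch for readers who want to build their own test workloads. They are not AWS's test harness, and the function names are made up.

```python
import random


def clique_pairs(nodes: list) -> list[tuple]:
    """Every node sends to every other node."""
    return [(s, d) for s in nodes for d in nodes if s != d]


def hub_pairs(nodes: list, hubs: list) -> list[tuple]:
    """Every non-hub node sends to each of a small set of hubs."""
    hub_set = set(hubs)
    return [(s, h) for s in nodes if s not in hub_set for h in hubs]


def matching_pairs(nodes: list, rng: random.Random) -> list[tuple]:
    """Random one-to-one pairing; with an odd count, one node stays idle."""
    shuffled = list(nodes)
    rng.shuffle(shuffled)
    return list(zip(shuffled[0::2], shuffled[1::2]))


servers = list(range(8))
print(len(clique_pairs(servers)))                 # 56 ordered pairs
print(hub_pairs(servers, hubs=[0, 1])[:4])
print(matching_pairs(servers, random.Random(1)))
```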
Performance Findings #
Under light utilization levels:
- Fat-tree maintained a small advantage.
- Performance differences ranged between 5% and 10%.
Under moderate and heavy workloads:
- RNG delivered up to 33% higher throughput.
- Bandwidth utilization improved significantly.
- Congestion handling became more efficient.
For cloud environments characterized by bursty multi-tenant traffic, RNG demonstrated clear advantages.
Cost and Infrastructure Savings #
One of RNG’s most compelling benefits is infrastructure efficiency.
Non-Blocking Deployments #
For fully non-blocking designs, AWS reported an infrastructure cost reduction of approximately 9%.
Commercial Oversubscribed Deployments #
For practical production environments using moderate oversubscription, AWS reported up to 45% lower hardware costs.
Savings come primarily from reductions in:
- Aggregation switches
- Spine switches
- Optical transceivers
- Supporting infrastructure
Router Reduction #
AWS also reported up to 69% fewer routers compared with equivalent Fat-tree architectures.
Addressing Random Topology Challenges #
Randomized networks can potentially increase latency due to longer physical paths.
AWS mitigated this through two optimizations.
Localized Waypoint Preference #
Routing algorithms prioritize nearby waypoints whenever possible.
Benefits include:
- Reduced hop distance
- Lower propagation delay
- Improved traffic locality
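A nearby-first preference can be sketched as a filter applied before the usual hash-based choice. The source does not give the selection rule or the distance metric, so everything below is assumed: cable length in metres as the metric, an optional slack window, and the name `pick_waypoint`.

```python
def pick_waypoint(candidates: dict[str, float], flow_hash: int,
                  slack_m: float = 0.0) -> str:
    """Choose among the closest waypoints, spreading flows by hash.

    candidates maps a waypoint name to its cable distance in metres.
    Waypoints within slack_m of the shortest distance count as "nearby".
    """
    if not candidates:
        raise ValueError("no candidate waypoints")
    shortest = min(candidates.values())
    nearby = sorted(name for name, metres in candidates.items()
                    if metres <= shortest + slack_m)
    return nearby[flow_hash % len(nearby)]


options = {"wp-a": 15.0, "wp-b": 15.0, "wp-c": 120.0}
print(pick_waypoint(options, flow_hash=1))                 # wp-b
print(pick_waypoint(options, flow_hash=2, slack_m=200.0))  # wp-c
```

Widening the slack trades a little extra propagation delay for more path diversity, which is the same balance the section describes.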
Controlled Backbone Connectivity #
AWS carefully limits long-distance inter-hall links while increasing local connectivity density.
This balances:
- Capacity
- Cost
- Latency
The result is latency performance that closely matches traditional Fat-tree networks.
Scaling During Incremental Deployment #
Another challenge involves partially populated facilities.
Early deployment phases often suffer from limited connectivity density.
AWS addressed this by introducing staged onboarding procedures.
Rather than connecting every rack immediately:
- Initial rack groups are deployed.
- Connectivity stabilizes.
- Additional racks are gradually integrated.
This strategy maintains high uplink utilization even when facilities are only partially occupied.
Operational Considerations #
Moving from a hierarchical architecture to a flat mesh introduces operational changes.
Tooling Adaptation #
Many cloud management platforms assume:
- Spine layers
- Aggregation layers
- Hierarchical fault domains
RNG requires updates to:
- Monitoring systems
- Troubleshooting tools
- Capacity planning software
Maintenance Workflows #
Traditional maintenance procedures often isolate entire rows or layers.
In a mesh network:
- Maintenance becomes more distributed.
- Upgrade strategies must adapt to graph-based connectivity.
- Operational automation becomes increasingly important.
These changes primarily affect software and operational processes rather than physical infrastructure.
Strategic Significance #
AWS’s RNG architecture represents more than an incremental networking improvement.
It demonstrates that large-scale expander networks can finally move beyond academic research and into commercial production.
Commercializing Expander Graphs #
For over a decade, expander topologies remained theoretically attractive but operationally impractical.
RNG establishes a viable path toward large-scale deployment.
Redefining Cloud Economics #
By simultaneously improving throughput and reducing infrastructure costs, RNG fundamentally changes the cost-performance equation for hyperscale cloud providers.
A Foundation for Future Compute Clusters #
Although AWS currently targets general-purpose cloud workloads, RNG introduces concepts that could influence future architectures for:
- AI infrastructure
- Large-scale inference clusters
- Distributed training environments
- High-performance computing systems
As compute clusters continue growing in size and complexity, flat graph-based networking may become an increasingly attractive alternative to traditional hierarchical fabrics.
Conclusion #
AWS’s Resilient Network Graphs architecture represents one of the most significant data center networking innovations in recent years. By combining the theoretical strengths of expander graphs with practical solutions such as Spraypoint routing, ShuffleBox optical infrastructure, and analytical design models, AWS has transformed a long-standing academic concept into a production-ready cloud networking platform.
The results are compelling: lower infrastructure costs, fewer networking devices, improved resilience, and stronger performance under real-world cloud traffic patterns. More importantly, RNG challenges the assumption that hierarchical Clos and Fat-tree fabrics are the inevitable end state for hyperscale networking.
As cloud providers continue searching for more efficient ways to scale infrastructure, RNG may prove to be the beginning of a broader shift toward flat, graph-based network architectures capable of supporting the next generation of distributed computing systems.