AWS quietly retired the fat tree. Fifty-year-old graph theory took its place.

By April 2026, Amazon's random-graph fabric had become the default for most new AWS datacenters. The efficiency claims behind it are still Amazon's own, with no independent benchmark yet.

Figure: Decades of random-graph theory become physical infrastructure. Conceptual image of a flat mesh of light standing in for the quasi-random optical fabric at the heart of AWS's RNG design. (AI-generated / Supercomputing News)

Reported analysis. Design details, deployment dates, and performance figures are drawn from the RNG preprint (arXiv:2604.15261), Amazon's own corporate channels, and the cited lineage papers, with every number attributed to its source. Where a figure comes only from Amazon or from an AWS-affiliated commentator and has no independent benchmark behind it, this piece says so. The "theory-to-production" framing is our interpretation, labeled as such.

The topology that organizes a hyperscale network has had the same shape for forty years. You build a tree, fatten the links as you climb toward the root so the upper tiers do not starve, and route every cross-machine flow up to a common ancestor and back down. Charles Leiserson formalized the fat tree in 1985, and it descends from the multistage Clos fabric that Bell Labs' Charles Clos described for non-blocking telephone switching in 1953. The modern folded-Clos datacenter network is a direct heir of that 1950s switching theory. When datacenter operators went looking for a way to wire thousands of commodity switches into one machine, they reached for the fat tree, and they have been building variations on it ever since.

Amazon took a different path. A preprint posted to arXiv on April 16, "RNG: Flat Datacenter Networks at Scale," opens with a claim the authors state plainly: "We design and deploy in production the first flat datacenter networks." That "first" is the paper's own superlative, not an independently established fact. The design is called RNG, it is built on quasi-random graphs, and per the abstract it "is now the default datacenter network for most workloads at Amazon." The authors are an AWS team working with academic collaborators, including an expander-graph theorist. The mathematics is not incidental to the result.

The flat network is not a clever piece of new engineering so much as the moment a fifty-year-old line of theoretical computer science reached hyperscale production. And it goes to exactly the question this publication is committed to tracking: which emerging architectures move from frontier to foundation of the supercomputing stack, and which stay interesting research that never reaches scale.

What a flat network buys you

Strip away the hierarchy and a fat-tree's logic inverts. Instead of routers arranged in tiers, RNG wires them into something close to a random regular graph: every router connected to a near-uniform spread of others, no root, no levels. The appeal is structural. In a tree, the switch count balloons as you add the aggregation and spine tiers needed to keep the upper links from oversubscribing. A flat graph reaches the same endpoints with far less hardware in between.
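To make "something close to a random regular graph" concrete, here is a minimal Python sketch of the textbook pairing construction, followed by a breadth-first search that counts hops. It is not RNG's wiring procedure, which the paper describes as quasi-random and which this piece does not reproduce. The function names, the 64-router size, and the 4 network ports per router are all assumptions chosen for illustration.

```python
import random
from collections import deque


def random_regular_graph(n, d, seed=0, max_tries=100_000):
    """Wire n routers with d network ports each into a simple random d-regular graph.

    Pairing-model sketch: shuffle all n*d port stubs, pair them off, and
    retry whenever the pairing contains a self-loop or a duplicate link.
    """
    if (n * d) % 2 or d >= n:
        raise ValueError("need n*d even and d < n")
    rng = random.Random(seed)
    stubs = [router for router in range(n) for _ in range(d)]
    for _ in range(max_tries):
        rng.shuffle(stubs)
        adj = {router: set() for router in range(n)}
        for u, v in zip(stubs[0::2], stubs[1::2]):
            if u == v or v in adj[u]:
                break  # self-loop or duplicate link: discard this attempt
            adj[u].add(v)
            adj[v].add(u)
        else:
            return adj  # every stub paired cleanly
    raise RuntimeError("no simple graph found; try a smaller d")


def hops_from(adj, start):
    """Breadth-first search: hop count from start to every reachable router."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


# Hypothetical toy fabric: 64 routers, 4 network ports each.
fabric = random_regular_graph(n=64, d=4, seed=1)
dist = hops_from(fabric, 0)
print("ports used per router:", {len(peers) for peers in fabric.values()})
print("routers reached from router 0:", len(dist), "max hops:", max(dist.values()))
```

The retry loop is only practical for small port counts, since the chance of a clean pairing falls quickly as d grows. It is here to show the shape of the object, with no root and no tiers, not to suggest how a production fabric is generated.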

The paper's defensible cost claim is a modeled topology claim, not a full TCO result: using switch count as the cost proxy, §9.4 reports RNG topologies are between 9% and 45% cheaper than fat trees at equivalent oversubscription. The saving varies roughly fivefold across configurations, reaching "45% fewer switches" at the worst-case 3:1 oversubscription comparison. The abstract states the same headline: "RNG matches or exceeds the performance of fat trees for a range of traffic patterns, despite being up to 45% cheaper." It is a switch-count result rather than a measured capex or full operating-cost figure, and the paper itself flags limits around the passive optical components. It remains the most rigorous number attached to RNG, because it sits in the evaluation rather than the marketing.
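Why switch count works as a cost proxy is easier to see with the arithmetic written out. The sketch below counts switches in a standard three-tier k-ary fat tree (5k²/4 switches serving k³/4 hosts) and in a flat graph where each k-port switch gives some of its ports to servers and the rest to other switches. The server-port splits are hypothetical, and the sketch does not hold oversubscription equal, which is the hard part the paper's §9.4 handles. Its percentages are therefore not comparable to the paper's 9% to 45% range.

```python
import math


def fat_tree(k):
    """Three-tier k-ary fat tree of k-port switches: returns (hosts, switches)."""
    if k % 2:
        raise ValueError("k must be even")
    hosts = k ** 3 // 4
    switches = 5 * k ** 2 // 4  # k^2/2 edge + k^2/2 aggregation + k^2/4 core
    return hosts, switches


def flat_switches(hosts, k, servers_per_switch):
    """Switches needed when every k-port switch hosts servers directly."""
    if not 0 < servers_per_switch < k:
        raise ValueError("leave at least one port for the network")
    return math.ceil(hosts / servers_per_switch)


k = 32
hosts, tree = fat_tree(k)
for s in (8, 12, 16):  # hypothetical server-port splits, not the paper's
    flat = flat_switches(hosts, k, s)
    saving = 100 * (tree - flat) / tree
    print(f"s={s}: {flat} flat vs {tree} fat-tree switches ({saving:.0f}% fewer)")
```

Giving more ports to servers cuts the switch count but leaves fewer ports for the fabric, which raises oversubscription. That trade-off is why a fair comparison has to fix oversubscription first, as the paper says it does.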

Two engineering problems had long kept random graphs out of production, and RNG names a fix for each. The first is cabling. A truly random wiring pattern is a nightmare to physically install and maintain at the scale of a datacenter hall. RNG's answer is the ShuffleBox, which the paper describes as "a novel passive optical device that internally shuffles cables, which makes its cabling complexity similar to that of fat trees." Being passive, it draws no power of its own; the shuffle happens inside the glass. That places it squarely in the optical-interconnect supply chain that has been consolidating fast, the same supply chain SCN covered when Credo and Molex locked down silicon photonics in 48 hours.

The second problem is routing. A tree tells you where packets go; a random graph does not, and naïve routing wastes the path diversity that makes the topology worth building. RNG's routing protocol, Spraypoint, is, per the paper, "a new fully-distributed routing protocol that exploits the properties of random graphs to find a large number of edge-disjoint paths between pairs of endpoints." Edge-disjoint is the operative phrase: many independent paths between any two machines, computed from the graph's expander properties, which is what lets a flat fabric soak up the all-to-all traffic of a large training run without a hierarchy to funnel it through.
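"Edge-disjoint" has a precise meaning that a short sketch can pin down: the number of paths between two routers that share no link equals the maximum flow between them when every link has capacity one. The Python below computes that count with a centralized breadth-first max-flow (Edmonds-Karp). It is a stand-in for illustration, not Spraypoint, which the paper describes as fully distributed. The test graph is the 10-node, 3-regular Petersen graph, used here as a small fixed substitute for a random regular graph. Both function names are assumptions.

```python
from collections import deque


def petersen():
    """Adjacency sets for the Petersen graph: 10 routers, 3 links each."""
    adj = {i: set() for i in range(10)}
    for i in range(5):
        links = (
            (i, (i + 1) % 5),          # outer ring
            (i, i + 5),                # spoke to the inner star
            (i + 5, (i + 2) % 5 + 5),  # inner five-point star
        )
        for u, v in links:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def edge_disjoint_paths(adj, src, dst):
    """Count src-to-dst paths that share no link, via unit-capacity max flow."""
    # Each undirected link becomes two arcs with one unit of capacity each.
    cap = {(u, v): 1 for u in adj for v in adj[u]}
    count = 0
    while True:
        # Breadth-first search for a path with spare capacity.
        parent = {src: None}
        queue = deque([src])
        while queue and dst not in parent:
            u = queue.popleft()
            for v in adj[u]:
                if v not in parent and cap[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if dst not in parent:
            return count
        # Push one unit of flow back along the path that was found.
        v = dst
        while parent[v] is not None:
            u = parent[v]
            cap[(u, v)] -= 1
            cap[(v, u)] += 1
            v = u
        count += 1


print(edge_disjoint_paths(petersen(), 0, 7))  # prints 3
```

Three is the ceiling here, because each router has only three links. Any symmetric adjacency dictionary can be passed in place of the Petersen graph, including a randomly wired one.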

Figure: Left: the tiered fat-tree/Clos fabric. Right: RNG's flat quasi-random graph at the switch layer. Schematic, not to scale; efficiency claims are Amazon's, not independently verified. (AI-generated / Supercomputing News)

The fifty-year line behind it

The reason a random graph can match a carefully engineered tree comes out of expander-graph theory, and the relevant results are old. Random regular graphs are excellent expanders: sparse, yet so well connected that any two nodes sit only a few hops apart and the bisection bandwidth stays high. The mathematical guarantee was later sharpened by Joel Friedman's proof of Alon's second eigenvalue conjecture, published in full in 2008, showing that random regular graphs are almost Ramanujan with high probability. That spectral gap is, in effect, the promise that the network will not have a hidden bottleneck.
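For readers who want the statement behind "almost Ramanujan," the standard formulation can be written compactly. The notation below is the conventional one from spectral graph theory, not taken from the RNG paper.

```latex
% G is a random d-regular graph on n vertices, with d >= 3 fixed.
% Its adjacency eigenvalues are d = \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n.

% Largest nontrivial eigenvalue in absolute value:
\[
  \lambda(G) = \max\bigl(|\lambda_2|,\ |\lambda_n|\bigr)
\]

% Ramanujan condition:
\[
  \lambda(G) \le 2\sqrt{d-1}
\]

% Alon-Boppana bound: no family of d-regular graphs does asymptotically better.
\[
  \lambda_2 \ge 2\sqrt{d-1} - o(1) \qquad (n \to \infty)
\]

% Friedman's theorem: random d-regular graphs come within any epsilon of that bound.
\[
  \Pr\bigl[\lambda(G) \le 2\sqrt{d-1} + \varepsilon\bigr] \to 1
  \quad \text{as } n \to \infty, \ \text{for every } \varepsilon > 0
\]
```

The spectral gap is the difference d − λ(G). With d = 4, for example, the bound 2√3 is about 3.46 against a top eigenvalue of 4, so the gap stays bounded away from zero however large the graph grows.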

The idea of wiring a datacenter this way is not new either. In 2012, researchers presented "Jellyfish: Networking Data Centers Randomly" at NSDI, connecting switches as a random regular graph and showing it could support more servers at the same cost than a fat tree. The RNG paper credits it directly. Jellyfish was a strong result that stalled on the two practical walls above, cabling and routing, which is precisely the gap RNG's ShuffleBox and Spraypoint are built to close. Further back still sits the randomized load-balancing that VL2 brought to datacenter networks at SIGCOMM in 2009.

So the arc runs from 1970s expander theory through Friedman's 2008 proof to Jellyfish's 2012 proposal and, finally, to a production fabric in 2024. That synthesis is our framing, not Amazon's, but each step in it is a real, citable result.

Whose numbers are whose

This is where the piece has to be careful, because RNG's performance claims come from three sources that say different things, and only one of them is the preprint itself.

The 45%-cheaper / 45%-fewer-switches figure is the paper's, as above. A second, more eye-catching set of numbers comes from Amazon's own corporate channels. The Amazon Science blog, "How 'flat' is replacing 'fat' in AWS data center networks," states that RNG "uses 69% fewer routers, delivers up to 33% better throughput, and projects a 40% reduction in network equipment electricity consumption." Those are Amazon's claims for a specific production comparison, and they are not interchangeable with the paper's. "69% fewer routers" is a different metric than "45% fewer switches," against a different baseline; blending the two into a single statistic would misstate both. The throughput figure is a corporate "up to 33%," where the paper says only that RNG matches or exceeds fat-trees across a range of patterns, with no percentage. The power figure is a projection about network-equipment electricity, not a measured facility-wide saving. Taken at Amazon's word, though, it is the kind of claim that makes RNG a Power Question as much as a topology one.

A third number should be handled with tongs. A widely circulated "27% lower operating costs" appears only in a blog post by James Hamilton, who calls RNG arguably the biggest datacenter-network shift since the fat-tree era of the 1980s. Two disclosures are owed here. Hamilton is a co-author of the 2009 VL2 paper and a longtime AWS engineer: an insider commenting on an AWS result, not a neutral party. And the 27% opex figure is not in the preprint and not in the Amazon Science blog; it is, as far as the primary documents show, unverified. His "biggest shift since the 1980s" line is his analysis, and a credible practitioner's read carries weight, but it is a judgment, not a benchmark.

The honest summary: we found no published independent, third-party benchmark of RNG. Every performance number traces back to Amazon or to an AWS-affiliated commentator, and Network World, covering the design, likewise notes that the claimed efficiencies "have not been independently verified." Amazon says it validated its models with "530 processor-years of simulation," that the first quasi-random fabric went live near Dublin at the end of 2024, and that by April 2026 the design had become the default for most new AWS data centers globally. Those are Amazon's figures, reported as Amazon's. The paper's 45% is the one number with a methods section behind it.

The substrate question

Set the marketing expansion aside, too. The paper never spells "RNG" out as "Resilient Network Graphs." That gloss lives only in Amazon's blogs and Hamilton's post. Spraypoint and ShuffleBox are the paper's own terms; the friendly acronym expansion is brand language.

What is left, stripped of the corporate numbers, is still a significant fact about the supercomputing stack. A fabric is the substrate that turns tens of thousands of accelerators into a single training machine, and many large AI training workloads are sensitive to collective and all-to-all communication performance. That makes RNG relevant to the broader AI infrastructure question, though Amazon's public RNG claims describe general-purpose AWS data centers and new builds, not GPU training-cluster fabrics, and should not be read as an independent benchmark of those clusters. The dominant answer to the interconnect problem for four decades has been a hierarchy. Most of the industry is still building hierarchical Clos and fat-tree fabrics today. RNG is the scale-out counterpart to the scale-up interconnect fight playing out one tier down, inside the rack, where co-packaged optics is contesting the scale-up bottleneck; flatten the network between racks and you still have to feed the GPUs within one.

Whether RNG is a genuine inflection or one hyperscaler's well-optimized bet is not yet answerable from the outside, and pretending otherwise would be the easy mistake. What is answerable is narrower and more interesting: a body of theory that spent fifty years as elegant mathematics is now carrying production traffic inside one of the world's largest cloud networks. By the test this publication applies to the substrate question, frontier or foundation, expander graphs just moved a long way toward foundation, on the strength of one company's deployment and one preprint's evaluation. The rest of the industry's response over the next year is the part worth watching.

About the contributor

The SCN Staff is a small AI editorial squad working under human direction. Each agent owns one job.

Scout does the research. It runs down primary sources and checks what's already been published, on SCN and everywhere else, before a story gets written. If a claim can't be traced back to a real document, Scout flags it.

Forge writes. It takes what Scout found and turns it into a draft, argument and sentences and all. Every SCN piece starts here, then gets sharpened.

Cipher handles search: the titles, descriptions, and keyphrase work that decides whether a good article ever gets found. Least glamorous job on the squad. Also one that matters more than it looks.

Pixel makes the visuals. Images, charts, the occasional diagram, all built to SCN's brand instead of pulled from a stock library. When something's easier to see than to read, it goes to Pixel.

Editorial judgment and the final call stay with the humans. So does the fact-checking.
