Supercomputing News

Inside Meta's 83,000-GPU AI Supercomputer: Why It Runs the Silicon at 80% Power on Purpose

Meta's first end-to-end account of running a 150 MW, 83,000-GB200 cluster - when power is the ceiling, the cluster, not the chip, is what you optimize.

Illustration: a 72-GPU Catalina pod in isometric view. A small front section glows red at the full 1,200 W power level; the rest of the pod, packed more densely, glows a calm blue-green at 960 W.
The 80% rule, in one pod: chips running flat-out at 1,200 W (front, red) take up more space per accelerator and run hotter. The same pod at Meta's 960 W operating point (rest, blue-green) fits more GPUs in the same envelope and delivers more aggregate throughput per megawatt. Restraint at the chip level buys density at the cluster level.
Credit: Supercomputing News
By SCN Staff, Staff Editor
Published May 30, 2026

Meta has published what it calls the first end-to-end account of power management for a hyperscale AI supercomputer. The window it covers is wide: purchasing decisions locked six to twelve months before silicon ships, down to runtime throttling of live training jobs. The paper, Provisioning to Runtime Optimization of a 100 MW-Scale AI Cluster (arXiv:2605.24461), was submitted May 23 and revised May 26. It describes a 150 MW facility running roughly 83,000 NVIDIA GB200 GPUs - one slice of a larger 1 GW buildout.

For this class of hyperscale buildout, the authors argue, the binding constraint on scaling is now power, not chips. Meta's engineers note that total U.S. utility-scale net summer capacity grew only about 41 GW between 2023 and 2024, from roughly 1,189 to 1,230 GW. The largest announced hyperscale compute commitments now approach that scale, though overlapping partnerships and phased timelines make any direct comparison loose. So the objective of datacenter power planning inverts. It is no longer to minimize the electricity bill; it is to maximize the compute that fits under a fixed megawatt ceiling, because the capex sunk into the facility and the accelerators dwarfs whatever the utility charges to run them. That is the same arithmetic driving Denmark's confrontation with its own grid, and the reason xAI bet it could outrun both grid build times and regulators. Meta's contribution is to put the operating numbers on the table.

The 80% rule

The most quotable result is also the most counterintuitive. Rather than provisioning to each GB200's 1,200 W thermal design power, Meta's finalized provisioning table uses 960 W per GB200 — 80% of TDP — after its modeling put the cluster-level performance-per-watt optimum near 1,000 W. Either way, the machine is tuned for throughput-per-watt across the floor, not peak performance per chip.

The logic lives in the non-linearity. Dropping a GB200 from 1,200 W to 1,000 W costs about 5% of per-GPU performance while cutting power draw by nearly 17%; push down to 900 W and you lose 12% of performance for a 25% power cut. Per-GPU, that looks like a bad trade. At cluster scale it is the opposite, because the power you reclaim per accelerator buys you more accelerators inside the same envelope. Meta's own provisioning table makes the mechanism explicit: under a fixed power budget, the 960 W operating point fits about 86,000 GB200s where running them flat-out at 1,200 W would fit only around 74,000. Fewer watts per GPU, more GPUs on the floor.

The aggregate result follows from that swap. At 960 W the supercomputer delivers 1.9× the throughput of an equivalent H100 deployment under the same power ceiling, versus 1.7× if the same GB200s were run at full 1,200 W: roughly an 11% throughput gain handed back by deliberately underclocking. Per-GPU, the underclocked parts are marginally slower than their flat-out siblings (2.4× an H100 versus 2.5×). The cluster wins anyway, because there are more of them.

Under a fixed megawatt ceiling, underclocking each GB200 to 80% fits ~12,000 more accelerators and lifts cluster throughput from 1.7× to 1.9× an equal-power H100 deployment. (Source: Meta / arXiv:2605.24461v2)
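As a back-of-the-envelope check on the figures above, the trade can be worked through in a few lines of Python. The GPU counts, per-GPU speedups, and per-GPU performance losses are the ones quoted from Meta's paper; the variable and function names are illustrative, and treating cluster throughput as "GPU count times per-GPU speedup" is a simplifying assumption, not Meta's model.

```python
TDP_W = 1200  # GB200 thermal design power, per the article

# Per-GPU performance relative to running at TDP (article figures).
per_gpu_relative_perf = {1200: 1.00, 1000: 0.95, 900: 0.88}

# Power cap -> (GPUs that fit under the fixed budget, per-GPU speedup vs. H100).
operating_points = {
    1200: (74_000, 2.5),
    960: (86_000, 2.4),
}


def relative_perf_per_watt(cap_w: int) -> float:
    """Per-GPU performance per watt, normalized to 1.0 at TDP."""
    return per_gpu_relative_perf[cap_w] / (cap_w / TDP_W)


def cluster_throughput(gpu_count: int, per_gpu_speedup: float) -> float:
    """Aggregate throughput in H100-equivalents (assumes linear scaling)."""
    return gpu_count * per_gpu_speedup


full = cluster_throughput(*operating_points[1200])
capped = cluster_throughput(*operating_points[960])
extra_gpus = operating_points[960][0] - operating_points[1200][0]
gain = capped / full - 1.0

print(f"perf/W at 1,000 W vs TDP: {relative_perf_per_watt(1000):.2f}x")
print(f"extra GPUs at 960 W: {extra_gpus:,}")
print(f"cluster throughput gain: {gain:.1%}")
```

Under these assumptions the gain comes out near 11.6%, in line with the "roughly 11%" implied by the rounded 1.7× and 1.9× figures.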

There is a floor to how far this goes, and it is memory bandwidth. HBM throughput holds essentially flat from 1,200 W down to 1,000 W, then drops sharply - about 15% - when power is pushed to 800 W. That cliff bounds the strategy: shave too aggressively and the bandwidth-bound phases of training fall off before the compute does. (The result lands in the same week the field has been arguing that HBM is allocation-constrained, not supply-constrained - and reinforces it, since memory bandwidth is what sets the practical floor on underclocking.)

For the chip vendors, the read-through is quieter but real. Meta says it has applied the same provisioning methodology to AMD GPUs and to an internal AI accelerator, a sign the pattern is spreading beyond NVIDIA. Buyers at this scale want multiple power-limit operating points, not a single max-performance SKU. They are not running these parts at the top of the curve, and Meta's numbers say they are right not to.

Catalina, and a back-end network built for scale

The 80% rule is a knob. The harder engineering is in the machine it turns. Meta's pod is its Open Compute "Catalina" design, and it diverges from NVIDIA's GB200 reference in ways that matter at this scale. Two IT racks of 36 GPUs each combine into a single 72-GPU NVLink domain, and the host ratio runs leaner than NVIDIA's reference: two Grace CPUs paired to two GPUs per tray, against the reference 4+2.

A Catalina-based GB200 pod with two interconnected IT racks and two AALC per rack, hosting 72 GB200 GPUs. (Source: Meta / arXiv:2605.24461v2)

The more consequential change is the back-end fabric. Meta doubled scale-out bandwidth to 100 GB/s per GPU - twice the 50 GB/s of the reference topology - by hanging two 400G ConnectX-7 NICs off each Grace CPU, for 800 Gbps of RDMA per GPU. That is not bandwidth for its own sake. The paper's scaling analysis shows the advantage of the redesigned fabric widening as the job grows: the bigger the supercomputer, the more the doubled back-end pays off, because collective-communication phases are exactly where a frontier-scale training run either stays fed or stalls. The same physical-layer pressure shows up one level down, in power delivery, where the industry's move to 800V DC racks is rewiring the last fifty feet to the accelerator.
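The bandwidth figures mix bits and bytes, so a quick conversion helps keep them straight. The sketch below uses only the numbers quoted above (two 400G NICs per Grace CPU, one Grace CPU per GPU, 72 GPUs per pod); the constant and function names are illustrative, and the per-pod total is a simple multiplication rather than a figure from the paper.

```python
NIC_GBPS = 400       # one 400G ConnectX-7 NIC, in gigabits per second
NICS_PER_GPU = 2     # two NICs per Grace CPU, one Grace CPU per GPU
GPUS_PER_POD = 72    # one Catalina NVLink domain


def gbps_to_gigabytes_per_s(gbps: float) -> float:
    """Convert gigabits per second to gigabytes per second."""
    return gbps / 8


per_gpu_gbps = NIC_GBPS * NICS_PER_GPU
per_gpu_gbytes = gbps_to_gigabytes_per_s(per_gpu_gbps)
pod_gbytes = per_gpu_gbytes * GPUS_PER_POD

print(f"{per_gpu_gbps} Gbps per GPU = {per_gpu_gbytes:.0f} GB/s per GPU")
print(f"aggregate scale-out per pod: {pod_gbytes:.0f} GB/s")
```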

The cluster at a glance

| Attribute | Value |
| --- | --- |
| Datacenter IT power | 150 MW (part of a larger 1 GW buildout) |
| Accelerators | ~83K NVIDIA GB200 GPUs |
| Cooling | Air-cooled facility, no facility chilled water; rack-side Air Assisted Liquid Cooling (AALC) |
| Building topology | 5 buildings × ~30 MW; 4 data halls/building; 3 MSBs/hall |
| Rack platform | OCP "Catalina" pod (diverges from NVIDIA GB200 reference) |
| Catalina config | 2 IT racks × 36 GPUs → 72-GPU NVLink scale-up domain (2×36) |
| Host ratio | 2+2 GPU+CPU (1 CPU/GPU), vs reference 4+2 |
| Back-end network | RDMA, 800 Gbps/GPU; 2× 400G CX7 NIC per Grace CPU; 100 GB/s scale-out/GPU (vs 50 GB/s reference) |
| Front-end network | TCP/IP Ethernet, 200 Gbps/GPU |
| GB200 power: TDP | 1,200 W |
| GB200 power: Perf/Watt-optimal | 960 W (80% TDP) at provisioning |
| GB200 power: operational | Raised to 1,020 W after P70 telemetry correction |

Where the power hides

Meta's telemetry disclosures are unusually candid for a hyperscaler, and they are the part of the paper an operator will dog-ear.

PSU telemetry overstates rack power — predictably. Power-supply-unit readings systematically run high against true consumption, because PSUs are tuned to err conservative and never under-report. Cross-checking against oscilloscopes and rack-panel DCIM sensors, Meta found that the 70th percentile of per-minute PSU samples — they call it P70 — closely tracks the DCIM ground truth. Adopting that correction let them lift the operational limit from 960 W to 1,020 W for around 2% throughput upside. Performance recovered from better measurement rather than new silicon.
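To make the P70 idea concrete, here is a minimal sketch of taking the 70th percentile of per-minute PSU samples. The readings are made-up numbers, and the nearest-rank percentile definition is an assumption; the article does not say which percentile method Meta uses.

```python
import math


def percentile_nearest_rank(samples: list[float], pct: int) -> float:
    """Return the pct-th percentile using the nearest-rank method."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Rank is 1-based: the smallest value whose rank covers pct% of samples.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical per-minute PSU readings for one rack, in kW.
psu_kw = [118.0, 121.5, 119.2, 125.0, 120.1, 131.4, 122.3, 119.8, 127.6, 120.9]

p70 = percentile_nearest_rank(psu_kw, 70)
peak = max(psu_kw)

print(f"peak PSU reading: {peak} kW")
print(f"P70 PSU reading:  {p70} kW")
```

Planning against the P70 value rather than the peak is what frees up headroom in this toy example; whether P70 matches ground truth on a given fleet is something only a cross-check against DCIM sensors, as Meta describes, can establish.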

Heterogeneity strands 5–10% of the power budget. A real datacenter is a messy mix of GPU, cooling, network, and support racks, and uneven placement leaves wildly different headroom across the delivery hierarchy. The binding layer turns out to be the main switchboard (MSB): mean MSB headroom works out to roughly 100 W per GPU, and 13% of MSBs run with under 50 kW of buffer. One level down, the reactor power panels (RPPs) that feed the racks directly carry ample margin - their mean headroom exceeds 200 W per GPU against a 197.5 kW rating - so they are not the bottleneck. Whichever element in the path has the least slack caps the entire training job. Here that is the switchboard, not the panel.
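The "least slack wins" logic can be sketched as a minimum over the delivery path. Everything below is illustrative: the `PowerDevice` class, the draw figures, the GPU counts, and the MSB limit are hypothetical, chosen only so the per-GPU headroom lands near the magnitudes the article quotes. Only the 197.5 kW RPP rating comes from the source.

```python
from dataclasses import dataclass


@dataclass
class PowerDevice:
    name: str
    limit_kw: float   # rated capacity of the device
    draw_kw: float    # current draw (hypothetical)
    gpus: int         # GPUs fed through this device (hypothetical)

    def headroom_w_per_gpu(self) -> float:
        """Spare capacity on this device, spread across the GPUs under it."""
        return (self.limit_kw - self.draw_kw) * 1000 / self.gpus


def binding_device(path: list[PowerDevice]) -> PowerDevice:
    """The device with the least per-GPU headroom caps the whole job."""
    return min(path, key=lambda d: d.headroom_w_per_gpu())


path = [
    PowerDevice("MSB", limit_kw=2500.0, draw_kw=2428.0, gpus=720),
    PowerDevice("RPP", limit_kw=197.5, draw_kw=175.9, gpus=72),
]

for device in path:
    print(f"{device.name}: {device.headroom_w_per_gpu():.0f} W/GPU headroom")
print(f"binding layer: {binding_device(path).name}")
```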

Synchronous training is a grid-stability risk. The communication phases of large synchronous jobs create coordinated power dips that aggregate into datacenter-scale swings, which Meta warns may threaten utility-grid stability and the power-delivery gear itself. Its answer is an always-on software "power smoother" that gap-fills the dips. Meta says it runs the smoother continuously because telemetry-based triggering is too slow to catch the swings in time.

Two named systems

The smoother is the first of two production runtime systems, and the more brute-force of them. It runs continuously, injecting register-only dummy work on the Tensor Cores: nearly 800 W per GB200 of synthetic load when a real workload would otherwise let power collapse. Overhead lands under 3%, toggled by a single environment variable. Meta says it rejected the obvious alternatives on purpose: event-based triggering would have meant invasive hooks into PyTorch, and telemetry-based triggering is too coarse to catch the dips in time. So the smoother simply never stops.
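The net effect of gap-filling on a power trace can be illustrated with a toy model. This is not how Meta's smoother is implemented (it runs dummy Tensor Core work on the GPU); the sketch only models the resulting power curve, and the trace values and the 800 W floor used here are hypothetical parameters.

```python
def smooth(trace_w: list[float], floor_w: float) -> tuple[list[float], list[float]]:
    """Top each sample up to floor_w; return (smoothed trace, filler added)."""
    filler = [max(0.0, floor_w - p) for p in trace_w]
    smoothed = [p + f for p, f in zip(trace_w, filler)]
    return smoothed, filler


def swing(trace_w: list[float]) -> float:
    """Peak-to-trough power swing of a trace."""
    return max(trace_w) - min(trace_w)


# Hypothetical per-GPU power: compute phases near 960 W, communication dips to 400 W.
trace = [960.0, 960.0, 400.0, 400.0, 960.0, 960.0, 400.0]
smoothed, filler = smooth(trace, floor_w=800.0)

print(f"swing without smoothing: {swing(trace):.0f} W")
print(f"swing with smoothing:    {swing(smoothed):.0f} W")
print(f"synthetic load added at the deepest dip: {max(filler):.0f} W")
```

Multiplied across tens of thousands of GPUs dipping in lockstep, the difference between the two swings is what the grid and the switchgear would see.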

Dimmer is the more interesting system, because it promotes power-capping from a safety backstop to an optimizer. When a power device crosses 97% of its limit, Dimmer trims power evenly across every GPU under that device, reclaiming a total of P watts as P/N across N GPUs, rather than evicting jobs or hard-dropping clocks on a few unlucky racks. Spreading the cut keeps any one GPU from becoming the straggler that holds up a synchronous step.
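The even-trim rule can be sketched as follows. The 97% trigger and the P/N split come from the article; how Dimmer chooses P is not stated, so this sketch assumes P is the excess over the 97% threshold. The function name and the example wattages are hypothetical.

```python
TRIGGER_PCT = 97  # Dimmer acts when a device crosses 97% of its limit


def even_trim(device_limit_w: float, device_draw_w: float,
              gpu_caps_w: list[float]) -> list[float]:
    """Return new per-GPU caps after spreading the required cut evenly.

    Assumption: the total to reclaim (P) is the draw above the trigger
    threshold; each of the N GPUs gives up P/N.
    """
    threshold_w = device_limit_w * TRIGGER_PCT / 100
    excess_w = device_draw_w - threshold_w
    if excess_w <= 0:
        return list(gpu_caps_w)  # under the trigger: leave caps alone
    cut_w = excess_w / len(gpu_caps_w)
    return [cap - cut_w for cap in gpu_caps_w]


# Hypothetical device: 100 kW limit, drawing 98.44 kW, feeding 72 GPUs at 1,020 W.
caps = even_trim(100_000, 98_440, [1020.0] * 72)
print(f"new per-GPU cap: {caps[0]:.0f} W (every GPU trimmed by {1020.0 - caps[0]:.0f} W)")
```

Because every GPU gives up the same small amount, no single accelerator falls far enough behind to stall a synchronous step, which is the property the article highlights.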

Dimmer acts on a seven-second average of power draw rather than on instantaneous readings, and that window is matched to circuit-breaker trip curves. A reactor power panel can tolerate a 10% overdraw for about 17 minutes but trips within 60 seconds at 40% over; a main switchboard tolerates 1.2× for roughly 45 seconds and 2× for about 30. A genuinely dangerous overage is sustained, not instantaneous. A seven-second window lets harmless transient spikes pass untouched and acts only on overdraws long enough to actually threaten the breaker. One second would throttle on noise. Sixty would be too late.
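A rolling-average trigger of this kind might look like the sketch below. The seven-second window and the 97% threshold are from the article; the one-sample-per-second rate, the `WindowedTrigger` class, and the example traces are assumptions for illustration.

```python
from collections import deque


class WindowedTrigger:
    """Fire only when the rolling mean over the window exceeds the threshold."""

    def __init__(self, limit_w: float, window_s: int = 7, trigger_pct: int = 97):
        self.threshold_w = limit_w * trigger_pct / 100
        # Assumes one power sample per second, so maxlen equals the window length.
        self.samples: deque[float] = deque(maxlen=window_s)

    def update(self, draw_w: float) -> bool:
        self.samples.append(draw_w)
        if len(self.samples) < self.samples.maxlen:
            return False  # not enough history yet
        return sum(self.samples) / len(self.samples) > self.threshold_w


def run(trace_w: list[float], limit_w: float) -> list[bool]:
    trigger = WindowedTrigger(limit_w)
    return [trigger.update(p) for p in trace_w]


# Hypothetical 100 kW device: a one-second spike vs. a sustained overdraw.
transient = [90_000.0] * 6 + [130_000.0] + [90_000.0] * 6
sustained = [90_000.0] * 7 + [99_000.0] * 7

transient_fired = run(transient, 100_000)
sustained_fired = run(sustained, 100_000)

print(f"transient spike triggers: {any(transient_fired)}")
print(f"sustained overdraw first triggers at sample: {sustained_fired.index(True)}")
```

The one-second spike to 130% never moves the seven-sample mean past the threshold, while the sustained overdraw does so within a few seconds, well inside the breaker tolerances quoted above.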

Neither system is unprecedented, and Meta does not claim otherwise. Microsoft's August 2025 paper, Power Stabilization for AI Training Datacenters (arXiv:2508.14318), covered power swings and mitigation across software, GPU hardware, and datacenter infrastructure, and the Meta paper cites it directly. The genuinely new ground here is the provisioning and planning phase — the six-to-twelve-month, pre-silicon decisions — and the fact that the runtime layer has been built and measured, not just described.

The fine print

A few things temper the "first end-to-end" claim, which is Meta's own. The authors hedge it as "to our knowledge," and we keep it attributed. The novelty is the published, end-to-end framing plus the planning-phase disclosure, not the invention of power stabilization; Microsoft got to the runtime levers first. Some of the performance results are model- and projection-based, too: parts of the provisioning analysis rest on Meta's internal graph-execution simulations and projected performance curves rather than a publicly reproducible full-cluster benchmark. Several key line items in Meta's rack-power model - among them the NIC, optics, and fabric-management figures - are redacted as confidential, so the network and optics share of the rack budget cannot be independently reconstructed; the back-end network's roughly 11% of IT-rack power is given, but the optics breakdown is not. And the facility has no facility-based liquid cooling or chilled-water plant. Each Catalina compute rack is paired with two air-assisted liquid cooling racks that handle the high-density heat at the rack side. That suggests the 960 W operating point may be shaped by the thermal envelope as much as by the performance-per-watt argument Meta foregrounds. The paper does not say so; the inference is ours, and it stays an inference.

Caveats aside, the disclosure is a rare quantified look inside a frontier-scale AI supercomputer from an operator that usually says nothing, and it hands every AI-factory builder a defensible operating rule. When power is the ceiling, you do not run the silicon flat out. You run it where the cluster peaks, and you publish the table that proves where that is.

AI disclosure: AI-assisted research and first draft. This article has been verified by a human editor.