Meta's first end-to-end account of running a 150 MW, 83,000-GB200 cluster - when power is the ceiling, the cluster, not the chip, is what you optimize.

Meta has published what it calls the first end-to-end account of power management for a hyperscale AI supercomputer. The window it covers is wide: purchasing decisions locked six to twelve months before silicon ships, down to runtime throttling of live training jobs. The paper, Provisioning to Runtime Optimization of a 100 MW-Scale AI Cluster (arXiv:2605.24461), was submitted May 23 and revised May 26. It describes a 150 MW facility running roughly 83,000 NVIDIA GB200 GPUs - one slice of a larger 1 GW buildout.
For this class of hyperscale buildout, the authors argue, the binding constraint on scaling is now power, not chips. Meta's engineers note that total U.S. utility-scale net summer capacity grew only about 41 GW between 2023 and 2024, from roughly 1,189 to 1,230 GW. The largest announced hyperscale compute commitments now approach that scale, though overlapping partnerships and phased timelines make any direct comparison loose. So the objective of datacenter power planning inverts. It is no longer to minimize the electricity bill; it is to maximize the compute that fits under a fixed megawatt ceiling, because the capex sunk into the facility and the accelerators dwarfs whatever the utility charges to run them. That is the same arithmetic driving Denmark's confrontation with its own grid, and the reason xAI bet it could outrun grid build times before regulators could object. Meta's contribution is to put the operating numbers on the table.
The most quotable result is also the most counterintuitive. Rather than provisioning to each GB200's 1,200 W thermal design power, Meta's finalized provisioning table uses 960 W per GB200, or 80% of TDP, after its modeling put the cluster-level performance-per-watt optimum near 1,000 W. Whether the target is 960 W or 1,000 W, the machine is tuned for throughput-per-watt across the floor, not peak performance per chip.
The logic lives in the non-linearity. Dropping a GB200 from 1,200 W to 1,000 W costs about 5% of per-GPU performance while cutting power draw by nearly 17%; push down to 900 W and you lose 12% of performance for a 25% power cut. Per-GPU, that looks like a bad trade. At cluster scale it is the opposite, because the power you reclaim per accelerator buys you more accelerators inside the same envelope. Meta's own provisioning table makes the mechanism explicit: under a fixed power budget, the 960 W operating point fits about 86,000 GB200s where running them flat-out at 1,200 W would fit only around 74,000. Fewer watts per GPU, more GPUs on the floor.
The aggregate result follows from that swap. At 960 W the supercomputer delivers 1.9× the throughput of an equivalent H100 deployment under the same power ceiling, versus 1.7× if the same GB200s were run at the full 1,200 W: roughly an 11% throughput gain handed back by deliberate underclocking. Per-GPU, the underclocked parts are marginally slower than their flat-out siblings (2.4× an H100 versus 2.5×). The cluster wins anyway, because there are more of them.
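The trade can be checked with a few lines of arithmetic. The sketch below uses only figures quoted above (GPU counts from the provisioning table, per-GPU speedups over an H100); the function names are illustrative, and it assumes cluster throughput scales linearly with GPU count, which is a simplification rather than a claim from the paper.

```python
# Figures are those quoted in the article; names are illustrative.
TDP_W = 1200

def power_cut(cap_w: float, tdp_w: float = TDP_W) -> float:
    """Fraction of per-GPU power saved by capping below TDP."""
    return 1 - cap_w / tdp_w

def cluster_throughput(gpu_count: int, per_gpu_speedup: float) -> float:
    """Throughput in H100-equivalents, assuming linear scaling with GPU count."""
    return gpu_count * per_gpu_speedup

capped = cluster_throughput(86_000, 2.4)    # 960 W operating point
flat_out = cluster_throughput(74_000, 2.5)  # 1,200 W operating point
gain = capped / flat_out - 1

print(f"power cut at 1,000 W: {power_cut(1000):.1%}")
print(f"power cut at 900 W:   {power_cut(900):.1%}")
print(f"cluster gain from capping: {gain:.1%}")
```

Under that linear-scaling assumption the gain comes out a little under 12%, in line with the roughly 11% the article cites.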
There is a floor to how far this goes, and it is memory bandwidth. HBM throughput holds essentially flat from 1,200 W down to 1,000 W, then drops sharply - about 15% - when power is pushed to 800 W. That cliff bounds the strategy: shave too aggressively and the bandwidth-bound phases of training fall off before the compute does. (The result lands in the same week the field has been arguing that HBM is allocation-constrained, not supply-constrained - and reinforces it, since memory bandwidth is what sets the practical floor on underclocking.)
For the chip vendors, the read-through is quieter but real. Meta says it has applied the same provisioning methodology to AMD GPUs and to an internal AI accelerator, a sign the pattern is spreading beyond NVIDIA. Buyers at this scale want multiple power-limit operating points, not a single max-performance SKU. They are not running these parts at the top of the curve, and Meta's numbers say they are right not to.
The 80% rule is a knob. The harder engineering is in the machine it turns. Meta's pod is its Open Compute "Catalina" design, and it diverges from NVIDIA's GB200 reference in ways that matter at this scale. Two IT racks of 36 GPUs each combine into a single 72-GPU NVLink domain, and the host ratio is more CPU-rich than NVIDIA's reference: two Grace CPUs paired with two GPUs per tray, or one CPU per GPU, against the reference's four GPUs to two CPUs.

The more consequential change is the back-end fabric. Meta doubled scale-out bandwidth to 100 GB/s per GPU - twice the 50 GB/s of the reference topology - by hanging two 400G ConnectX-7 NICs off each Grace CPU, for 800 Gbps of RDMA per GPU. That is not bandwidth for its own sake. The paper's scaling analysis shows the advantage of the redesigned fabric widening as the job grows: the bigger the supercomputer, the more the doubled back-end pays off, because collective-communication phases are exactly where a frontier-scale training run either stays fed or stalls. The same physical-layer pressure shows up one level down, in power delivery, where the industry's move to 800V DC racks is rewiring the last fifty feet to the accelerator.

| Attribute | Value |
| --- | --- |
| Datacenter IT power | 150 MW (part of a larger 1 GW buildout) |
| Accelerators | ~83,000 NVIDIA GB200 GPUs |
| Cooling | Air-cooled facility, no facility chilled water; rack-side Air Assisted Liquid Cooling (AALC) |
| Building topology | 5 buildings × ~30 MW; 4 data halls per building; 3 MSBs per hall |
| Rack platform | OCP "Catalina" pod (diverges from NVIDIA GB200 reference) |
| Catalina config | 2 IT racks × 36 GPUs → 72-GPU NVLink scale-up domain |
| Host ratio (GPU + CPU per tray) | 2 + 2 (1 CPU per GPU), vs. reference 4 + 2 |
| Back-end network | RDMA, 800 Gbps per GPU; 2× 400G ConnectX-7 NICs per Grace CPU; 100 GB/s scale-out per GPU (vs. 50 GB/s reference) |
| Front-end network | TCP/IP Ethernet, 200 Gbps per GPU |
| GB200 power: TDP | 1,200 W |
| GB200 power: perf/watt-optimal | 960 W (80% of TDP) at provisioning |
| GB200 power: operational | Raised to 1,020 W after P70 telemetry correction |

Meta's telemetry disclosures are unusually candid for a hyperscaler, and they are the part of the paper an operator will dog-ear.
PSU telemetry overstates rack power — predictably. Power-supply-unit readings systematically run high against true consumption, because PSUs are tuned to err conservative and never under-report. Cross-checking against oscilloscopes and rack-panel DCIM sensors, Meta found that the 70th percentile of per-minute PSU samples — they call it P70 — closely tracks the DCIM ground truth. Adopting that correction let them lift the operational limit from 960 W to 1,020 W for around 2% throughput upside. Performance recovered from better measurement rather than new silicon.
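As a rough illustration of the P70 idea, the sketch below takes the 70th percentile of a set of per-minute PSU readings. The sample values are hypothetical, not data from the paper, and the nearest-rank method is an assumption; the article does not say which percentile definition Meta uses.

```python
import math

def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of the data at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Multiply before dividing so that e.g. 70 * 10 / 100 is exactly 7.0.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical per-minute PSU readings for one rack, in kW.
psu_kw = [118.0, 121.5, 119.2, 125.8, 120.1, 122.4, 131.0, 119.9, 123.3, 120.7]

p70 = percentile_nearest_rank(psu_kw, 70)
peak = max(psu_kw)
print(f"P70 = {p70} kW, peak = {peak} kW")
```

Budgeting against a percentile rather than the peak is what frees the headroom: the occasional high reading no longer sets the limit.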
Heterogeneity strands 5–10% of the power budget. A real datacenter is a messy mix of GPU, cooling, network, and support racks, and uneven placement leaves wildly different headroom across the delivery hierarchy. The binding layer turns out to be the main switchboard (MSB): mean MSB headroom works out to roughly 100 W per GPU, and 13% of MSBs run with under 50 kW of buffer. One level down, the reactor power panels (RPPs) that feed the racks directly carry ample margin - their mean headroom exceeds 200 W per GPU against a 197.5 kW rating - so they are not the bottleneck. Whichever element in the path has the least slack caps the entire training job. Here that is the switchboard, not the panel.
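The "least slack wins" rule is a minimum over the delivery path. A minimal sketch, using the approximate mean headroom figures quoted above as stand-in inputs (the dictionary layout and function name are illustrative):

```python
def binding_layer(headroom_w_per_gpu: dict[str, float]) -> tuple[str, float]:
    """Return the power-delivery layer with the least headroom, and that headroom."""
    if not headroom_w_per_gpu:
        raise ValueError("need at least one layer")
    return min(headroom_w_per_gpu.items(), key=lambda item: item[1])

# Approximate mean headroom per GPU at each layer, in watts (from the article).
headroom = {"MSB": 100.0, "RPP": 200.0}

layer, slack_w = binding_layer(headroom)
print(f"binding layer: {layer} ({slack_w:.0f} W per GPU of headroom)")
```

In practice the check would run per device rather than on layer-wide means, since the article notes that some MSBs have far less buffer than the average.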
Synchronous training is a grid-stability risk. The communication phases of large synchronous jobs create coordinated power dips that aggregate into datacenter-scale swings, which Meta warns may threaten both utility-grid stability and the power-delivery gear itself. Its answer is an always-on software "power smoother" that gap-fills the dips.
The smoother is the first of two production runtime systems, and the more brute-force of them. It runs continuously, injecting register-only dummy work on the Tensor Cores: nearly 800 W per GB200 of synthetic load when a real workload would otherwise let power collapse. Overhead lands under 3%, and the whole mechanism is toggled by a single environment variable. Meta says it rejected the obvious alternatives on purpose: event-based triggering would have meant invasive hooks into PyTorch, and telemetry-based triggering is too coarse to catch the dips in time. So the smoother simply never stops.
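The effect of gap-filling can be shown with a toy model. This is not Meta's implementation, which runs as dummy work on the GPU itself; it only models the resulting power trace. The floor value and the sample trace are hypothetical, and the 800 W injection ceiling is taken from the figure quoted above.

```python
def smooth(trace_w, floor_w, max_inject_w=800):
    """Model gap-filling: top each sample up toward floor_w, limited by max_inject_w."""
    smoothed = []
    for real_w in trace_w:
        inject_w = min(max(floor_w - real_w, 0), max_inject_w)
        smoothed.append(real_w + inject_w)
    return smoothed

# Hypothetical per-GPU draw in watts: compute phases near 950 W, communication dips near 300 W.
trace = [950, 960, 300, 280, 955, 965, 310, 950]
smoothed = smooth(trace, floor_w=900)  # floor_w is an assumed target, not a paper figure

swing_before = max(trace) - min(trace)
swing_after = max(smoothed) - min(smoothed)
print(f"swing before: {swing_before} W, after: {swing_after} W")
```

The swing shrinks because the dips are filled, at the cost of burning power on synthetic work during communication phases.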
Dimmer is the more interesting system, because it promotes power-capping from a safety backstop to an optimizer. When a power device crosses 97% of its limit, Dimmer trims power evenly across every GPU under that device, reclaiming a total of P watts as P/N across N GPUs, rather than evicting jobs or hard-dropping clocks on a few unlucky racks. Spreading the cut keeps any one GPU from becoming the straggler that holds up a synchronous step.
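A minimal sketch of the even-trim rule, assuming the amount to reclaim, P, is whatever the device draws above the 97% trigger. The article does not say how Dimmer chooses P, so that choice, the function name, and the example numbers are all assumptions.

```python
def dimmer_trim(limit_w, draw_w, gpu_caps_w, threshold=0.97):
    """Return new per-GPU power caps, trimming evenly once draw crosses threshold * limit."""
    if not gpu_caps_w:
        raise ValueError("need at least one GPU")
    trigger_w = threshold * limit_w
    if draw_w <= trigger_w:
        return list(gpu_caps_w)          # under the trigger: leave caps alone
    reclaim_w = draw_w - trigger_w       # P (assumed: the overage above the trigger)
    per_gpu_w = reclaim_w / len(gpu_caps_w)  # P / N
    return [cap - per_gpu_w for cap in gpu_caps_w]

# Hypothetical device: 100 kW limit, drawing 98 kW, feeding 8 GPUs capped at 1,020 W.
caps = dimmer_trim(100_000, 98_000, [1020.0] * 8)
print(caps)
```

Every GPU gives up the same share, so no single GPU is slowed more than its peers within a synchronous step.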
Dimmer acts on a seven-second averaging window, and that window is matched to circuit-breaker trip curves. A reactor power panel can tolerate a 10% overdraw for about 17 minutes but trips within 60 seconds at 40% over; a main switchboard tolerates 1.2× for roughly 45 seconds and 2× for about 30. A genuinely dangerous overage is sustained, not instantaneous. A seven-second window lets harmless transient spikes pass untouched and acts only on overdraws long enough to actually threaten the breaker. One second would throttle on noise. Sixty would be too late.
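The windowing logic can be sketched as a moving average over the last seven samples. The one-sample-per-second rate, the class name, and the traces are assumptions for illustration; only the seven-second window and the 97% trigger come from the article.

```python
from collections import deque

class WindowedTrigger:
    """Fire only when the mean draw over a full window exceeds threshold * limit."""

    def __init__(self, limit_w, window_s=7, threshold=0.97):
        self.trigger_w = threshold * limit_w
        self.samples = deque(maxlen=window_s)  # assumes one sample per second

    def update(self, draw_w):
        self.samples.append(draw_w)
        if len(self.samples) < self.samples.maxlen:
            return False  # not enough history yet
        return sum(self.samples) / len(self.samples) > self.trigger_w

# Hypothetical traces against a 100 kW device limit.
transient = [90_000] * 7 + [110_000] * 2 + [90_000] * 7  # two-second spike
sustained = [90_000] * 7 + [105_000] * 7                 # overdraw that persists

spike = WindowedTrigger(100_000)
spike_fired = [spike.update(w) for w in transient]

hold = WindowedTrigger(100_000)
hold_fired = [hold.update(w) for w in sustained]

print("transient fired:", any(spike_fired))
print("sustained fired:", any(hold_fired))
```

The two-second spike never lifts the seven-sample mean past the trigger, while the sustained overdraw does after a few seconds, which is the behavior the paragraph above describes.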
Neither system is unprecedented, and Meta does not claim otherwise. Microsoft's August 2025 paper, Power Stabilization for AI Training Datacenters (arXiv:2508.14318), covered power swings and mitigation across software, GPU hardware, and datacenter infrastructure, and the Meta paper cites it directly. The genuinely new ground here is the provisioning and planning phase — the six-to-twelve-month, pre-silicon decisions — and the fact that the runtime layer has been built and measured, not just described.
A few things temper the "first end-to-end" claim, which is Meta's own. The authors hedge it as "to our knowledge," and we keep it attributed. The novelty is the published, end-to-end framing plus the planning-phase disclosure, not the invention of power stabilization; Microsoft got to the runtime levers first. Some of the performance results are model- and projection-based, too: parts of the provisioning analysis rest on Meta's internal graph-execution simulations and projected performance curves rather than a publicly reproducible full-cluster benchmark. Several key line items in Meta's rack-power model - among them the NIC, optics, and fabric-management figures - are redacted as confidential, so the network and optics share of the rack budget cannot be independently reconstructed; the back-end network's roughly 11% of IT-rack power is given, but the optics breakdown is not. And the facility has no facility-based liquid cooling or chilled-water plant. Each Catalina compute rack is paired with two air-assisted liquid cooling racks that handle the high-density heat at the rack side. That suggests the 960 W operating point may be shaped by the thermal envelope as much as by the performance-per-watt argument Meta foregrounds. The paper does not say so; the inference is ours, and it stays an inference.
Caveats aside, the disclosure is a rare quantified look inside a frontier-scale AI supercomputer from an operator that usually says nothing, and it hands every AI-factory builder a defensible operating rule. When power is the ceiling, you do not run the silicon flat out. You run it where the cluster peaks, and you publish the table that proves where that is.