The Blackwell-to-Rubin transition is a forcing function for the entire data center industry.

When Jensen Huang took the stage at GTC 2026 on March 16, the centerpiece wasn’t a surprise. NVIDIA had already tipped its hand: the Vera Rubin platform, VR200 GPUs in NVL72 rack-scale configurations, is the company’s answer to an inference economy that’s outgrowing Blackwell before Blackwell has even fully saturated the market.
The specs tell one story. The economics tell a more important one.
Let’s start with what NVIDIA is claiming, because the numbers are staggering. Each Rubin GPU delivers roughly 50 petaflops of FP4 inference and 35 petaflops of FP4 training. Per chip. The NVL72 rack — 72 Rubin GPUs paired with 36 Vera CPUs over NVLink 6 — aggregates to 3.6 exaFLOPS of NVFP4 inference and 2.5 exaFLOPS of training in a single rack.
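The rack-level figures are the per-chip figures multiplied out, which is worth verifying. A quick back-of-envelope check in Python:

```python
# Sanity check: do the per-chip claims aggregate to the rack claims?
GPUS_PER_RACK = 72

inference_pflops_per_gpu = 50   # claimed NVFP4 inference per Rubin GPU
training_pflops_per_gpu = 35    # claimed FP4 training per Rubin GPU

print(f"inference: {GPUS_PER_RACK * inference_pflops_per_gpu / 1000:.1f} exaFLOPS")  # 3.6
print(f"training:  {GPUS_PER_RACK * training_pflops_per_gpu / 1000:.2f} exaFLOPS")   # 2.52, rounded to 2.5
```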
Memory is where it gets especially interesting for inference workloads: eight HBM4 stacks per GPU totaling approximately 288 GB at 22 TB/s of bandwidth. Across the full NVL72, that’s roughly 20.7 TB of HBM4 at 1.6 PB/s. NVIDIA claims this represents a 2.8x bandwidth improvement over Blackwell, though some developer documents cite 2.4x. That discrepancy matters when you’re modeling KV cache behavior at million-token context windows.
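To see why, consider the KV cache footprint of a single long-context sequence. A rough sizing sketch, using hypothetical model dimensions (the layer count, KV head count, and head size below are illustrative assumptions, not the specs of any shipping model):

```python
# Rough KV-cache sizing at long context. Model dimensions are
# illustrative assumptions, not any particular model's specs.
def kv_cache_gb(context_tokens: int,
                n_layers: int = 96,
                n_kv_heads: int = 8,      # grouped-query attention
                head_dim: int = 128,
                bytes_per_elem: int = 2): # FP16/BF16 cache
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # 2x: keys and values
    return context_tokens * per_token / 1e9

hbm_per_gpu_gb = 288
for ctx in (128_000, 1_000_000):
    cache = kv_cache_gb(ctx)
    print(f"{ctx:>9,} tokens -> {cache:6.1f} GB KV cache "
          f"({cache / hbm_per_gpu_gb:.0%} of one GPU's HBM)")
```

Under these assumptions, a single million-token sequence overflows one GPU's 288 GB on its own, which is exactly why bandwidth multipliers and cross-rack cache sharing (more on that below) matter so much.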
NVIDIA’s headline claim: up to 5x inference performance and 3.5x training performance versus Blackwell, with up to 10x lower cost per token for massive-scale inference and the ability to train mixture-of-experts models with a quarter as many GPUs.
The architecture itself is a six-chip codesign: Vera CPU, Rubin GPU, NVLink 6 switch, ConnectX-9 SuperNIC, BlueField-4 DPU, and Spectrum-6 Ethernet switch, all built on a 3nm process with HBM4. It’s a full-stack rethinking of what a compute node looks like.
What matters more than any FLOPS number: NVIDIA is actively shifting manufacturing capacity away from Blackwell toward Rubin. The company is reorienting its fab allocation and supply chain toward token-efficient inference hardware rather than concentrating solely on training throughput.
This is a strategic bet with massive implications. Huang said it plainly at CES: “Rubin arrives at exactly the right moment, as AI computing demand for both training and inference is going through the roof.” And more pointedly: “I can tell you that Vera Rubin is in full production.”
Full production. Shipping later this year. That timeline compresses every hyperscaler’s procurement decision window dramatically.
The shift reflects a change in how AI infrastructure generates revenue. Training still matters — nobody’s building frontier models on last-gen hardware — but inference is where the money flows. Every ChatGPT query, every enterprise copilot interaction, every agentic workflow that runs for hours autonomously: that’s inference. And inference economics are defined by cost-per-token, latency, and how much KV cache you can keep hot in memory.
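A minimal serving-cost model makes the dependency explicit. Every input below is an assumption for illustration (rack price, power draw, utilization, throughput); the point is the shape of the equation, not the specific numbers:

```python
# Minimal serving-cost model. All inputs are illustrative assumptions.
def cost_per_million_tokens(rack_capex_usd: float,
                            amortization_years: float,
                            rack_power_kw: float,
                            usd_per_kwh: float,
                            tokens_per_second: float,
                            utilization: float = 0.6) -> float:
    hours = amortization_years * 365 * 24
    capex_per_hour = rack_capex_usd / hours
    power_per_hour = rack_power_kw * usd_per_kwh
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return (capex_per_hour + power_per_hour) / tokens_per_hour * 1e6

# Hypothetical: an older rack versus one with 5x the throughput.
old = cost_per_million_tokens(3e6, 4, 120, 0.08, tokens_per_second=5e5)
new = cost_per_million_tokens(4e6, 4, 150, 0.08, tokens_per_second=2.5e6)
print(f"old: ${old:.3f}/M tokens, new: ${new:.3f}/M tokens "
      f"({old / new:.1f}x cheaper)")
```

Throughput sits in the denominator, so it dominates: in this sketch a 5x throughput gain nets out to roughly 3.8x cheaper tokens even after charging the newer rack's higher assumed capex and power against it.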
Rubin’s new Inference Context Memory Storage tier, powered by BlueField-4 DPUs, is purpose-built for this reality. It enables sharing and reusing key-value cache state across racks — a feature that becomes essential as context windows expand to millions of tokens and agentic AI systems maintain persistent reasoning state.
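NVIDIA hasn’t published a programming interface for this tier, so what follows is a purely conceptual sketch of the pattern it targets: prefix-keyed KV cache reuse, where the first node to prefill a long shared prefix publishes the result and every other node skips that work. The store and helper functions here are hypothetical stand-ins:

```python
# Conceptual sketch only: prefix-keyed KV-cache reuse across serving nodes.
# The dict stands in for the shared tier; the real system puts that state
# behind BlueField-4 DPUs, not a Python object.
import hashlib

shared_kv_store: dict[str, bytes] = {}

def prefix_key(token_ids: list[int]) -> str:
    return hashlib.sha256(repr(token_ids).encode()).hexdigest()

def run_prefill(token_ids: list[int]) -> bytes:
    # Stub for the expensive part: computing attention state over the prefix.
    return len(token_ids).to_bytes(8, "big")

def get_kv_cache(token_ids: list[int]) -> bytes:
    key = prefix_key(token_ids)
    kv = shared_kv_store.get(key)
    if kv is None:                  # miss: pay for prefill once...
        kv = run_prefill(token_ids)
        shared_kv_store[key] = kv   # ...then publish for every other node
    return kv                       # hit: skip prefill entirely

system_prompt = list(range(10_000))  # a long, frequently reused prefix
get_kv_cache(system_prompt)          # first caller pays the prefill cost
get_kv_cache(system_prompt)          # later callers reuse the shared state
```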
Here’s the uncomfortable question for every infrastructure buyer who just took delivery of Blackwell systems: what does the depreciation schedule look like now?
NVIDIA has historically maintained a roughly annual architecture cadence, but the Blackwell-to-Rubin jump represents a larger-than-usual performance discontinuity. A 5x inference improvement isn’t incremental. It’s the kind of gap that makes existing hardware economically obsolete for competitive inference workloads.
Cloud providers face the sharpest version of this calculus. If you’re selling inference-as-a-service and your competitor has Rubin racks delivering 10x lower cost-per-token, your Blackwell fleet isn’t a competitive asset anymore. It’s stranded capital.
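The arithmetic behind “stranded capital” is blunt. With assumed numbers (nothing here is from NVIDIA or any provider), matching a rival’s Rubin-era price on older hardware means selling below your own cost:

```python
# Hypothetical: can an older fleet match a rival's 10x-lower token cost?
# All numbers are assumed for illustration.
your_cost_per_m_tokens = 0.50               # all-in serving cost, older fleet
rival_cost_per_m_tokens = 0.05              # same workload on newer racks
rival_price = rival_cost_per_m_tokens * 1.5 # rival still keeps a 50% margin

margin_if_you_match = rival_price - your_cost_per_m_tokens
print(f"rival sells at ${rival_price:.3f}/M tokens")
print(f"your margin at that price: ${margin_if_you_match:.3f}/M tokens")
# -> -$0.425/M tokens: every matched token loses money, hence "stranded"
```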
The smart hyperscalers have been planning for this. Microsoft, Google, and Amazon all build their capex models around known NVIDIA roadmaps, and all three are among the first cloud providers to deploy Vera Rubin instances in 2026. But the mid-tier cloud providers and enterprise AI builders who committed to Blackwell clusters in Q4 2025 are facing an accelerated obsolescence curve they may not have fully priced in.
NVIDIA isn’t just selling Rubin at GTC. It’s previewing what comes next. Feynman, named after theoretical physicist Richard Feynman, is the next-generation architecture planned for 2028. It’s being designed specifically for the reasoning and long-term memory requirements of agentic AI systems, featuring advanced 3D die stacking, custom HBM memory, and a new Rosa CPU.
This matters because it signals NVIDIA’s view of where workloads are heading. Today’s inference is largely stateless request-response. Tomorrow’s inference is persistent, multi-turn, autonomous agents maintaining context over hours or days. That’s a different memory and compute profile, and NVIDIA is building silicon specifically for it.
Perhaps the clearest signal of Rubin’s strategic positioning came from an unexpected direction. NVIDIA and Thinking Machines Lab — the AI startup founded by Mira Murati after her departure from OpenAI — announced a multiyear partnership to deploy at least one gigawatt of Vera Rubin systems. NVIDIA has also made a significant direct investment in the company.
One gigawatt. For a single customer. That’s not a hardware deal. It’s an infrastructure partnership at a scale that would have been unthinkable three years ago. Deployment is targeted for early 2027 on the Vera Rubin platform.
“NVIDIA’s technology is the foundation on which the entire field is built,” Murati said. “This partnership accelerates our capacity to build AI that people can shape and make their own.”
The Thinking Machines deal validates two things at once: that Rubin production capacity is real and substantial, and that the frontier AI lab buildout isn’t slowing down. If anything, the race for compute is entering a new phase where gigawatt-scale commitments are table stakes for serious contenders.
The keynote formalized what the industry already knew: Rubin is the platform that defines 2027 data center builds. But the details that actually move procurement decisions are still outstanding:
Independent benchmark commitments. NVIDIA’s own claims need third-party validation, especially around the cost-per-token numbers.
Feynman architectural details, particularly how much silicon-level support there is for persistent agent state.
Supply chain timelines. “In full production” and “shipping later this year” need dates and volumes.
Cooling requirements. Rubin’s 100% liquid-cooled NVL72 racks will directly impact every facility’s thermal design.
Every hyperscaler’s 2027 capex plan is about to get rewritten. The only question is how fast.