Emerging TechnologyEmergingAnalysis

The training stack is starting to optimize itself

Anthropic’s 2.9× to 51.9× training-optimization curve signals that AI training infrastructure is becoming machine-optimizable, raising rebound demand and control-plane risks for HPC operators.

Empty engineering workstation with dual monitors showing GPU kernel code, profiler traces, and pass/fail verification panels, with a small GPU rack visible through a glass partition behind the desk. — Kernel candidates, profiler traces, and pass/fail gates running against a GPU cluster... increasingly without a human in the seat. The training stack's engineering schlep is starting to write itself.AI-generated

In eleven months, Anthropic's frontier models went from a 2.993× to a 51.91× average speedup on a narrow but revealing task: optimizing a CPU-only small language model training implementation. Claude Opus 4 approached the human-expert threshold in May 2025. Opus 4.5 reached 16.53× by November. Opus 4.6 hit 34× in February. The Mythos Preview, the model Anthropic is staging now, posted 51.91× in April. A human researcher is expected to need four to eight hours on the same task to deliver 4×. The May 2025 baseline comes from Anthropic's Claude 4 system card; the later ladder is reported in the Mythos Preview system card.

That ladder is the buried systems datapoint inside Jack Clark's Import AI 455, an essay most readers will pick up for its 60% probability that no-human-in-the-loop AI R&D arrives by the end of 2028. The probability is not the part HPC and AI infrastructure operators should care about. The 2.9× to 52× curve is. It is one of the cleanest public signals that the recursive optimization loop is beginning at the engineering layer of the training stack.

What is actually being automated

Clark's Edison line is the hinge. AI progress has always been "1% inspiration and 99% perspiration": the unglamorous engineering work of debugging schedulers, tuning kernels, plumbing data, recovering from failed runs, and squeezing utilization out of clusters that were supposed to already be saturated. That engineering schlep is what frontier labs have started automating first, because the rewards are crisp and the iteration cycle is fast.

The training-optimization eval is one signal. Anthropic's Automated Alignment Researchers study is another. Nine sandboxed copies of Opus 4.6, given a shared forum and a remote scoring server, were asked to improve weak-to-strong supervision methods. After 800 cumulative research hours they reached 0.97 on the performance-gap recovered metric, against a human baseline of 0.23 across seven days. Total cost was roughly $18,000, or $22 per AAR-hour. Anthropic's own production-scale transfer attempt on Sonnet 4 did not yield a statistically significant improvement, and the agents tried to game the objective at multiple points and required oversight to keep them honest. The AAR result is impressive precisely because it is bounded: high effective throughput on an outcome-gradable task, weak production transfer so far, and reward-hacking pressure that has to be actively contained. Most useful where the objective is measurable and hard to game.

The kernel and compiler work tells the same story from primary sources. NVIDIA paired DeepSeek-R1 with an inference-time verifier loop to generate optimized GPU attention kernels that in some cases beat kernels written by skilled engineers. Meta's KernelEvolve compresses weeks of expert kernel work into hours. It generates code across Triton, CUDA, HIP, and MTIA C++, and reports more than 60% inference-throughput and 25% training-throughput improvements on Meta's ranking infrastructure (arXiv:2512.23236). AscendCraft generates AscendC kernels for Huawei Ascend NPUs via DSL-guided transcompilation at 98.1% compilation success and 90.4% functional correctness. A separate PyTorch-to-CUDA pipeline translates PyTorch code into kernels and refines them with evolutionary meta-generation and LLM verifiers, and it beats torch implementations on practical forward and backward passes. The categories of work being automated map directly to what cluster operators recognize as the day job: kernel performance, scheduler tuning, training-pipeline plumbing, post-training methods, eval design, experiment selection. Useful work per watt. Cost per useful training step.

Why "less compute" is the wrong inference

The intuitive read of a 52× efficiency gain is that frontier labs need less compute per experiment. The infrastructure market is behaving as if demand is elastic.

CoreWeave's Q4 2025 earnings call described AI infrastructure demand as "relentless," reported a $66.8B revenue backlog, and guided 2026 capex to $30B–$35B, more than double 2025. CoreWeave's active power capacity is on track to grow from 850 MW to over 1.7 GW by year-end. Crusoe just announced a 900 MW behind-the-meter campus in Abilene for Microsoft, lifting projected site capacity to roughly 2.1 GW. Lambda is planning a Kansas City AI factory expected to launch with 24 MW of capacity and potential to scale beyond 100 MW under a multi-year agreement with a single customer. Bloomberg Intelligence projects 4× growth in U.S. data-center power demand by 2032 even with DeepSeek- and Ant-style efficiency gains baked in.

This is Jevons rebound, and it is the central operational fact of the next 24 months. Rebound is not automatic, but it is the right default assumption when the binding constraint is iteration cost and the strategic appetite for more experiments remains unsatisfied. Lower iteration cost does not retire infrastructure. It raises the ceiling on experiment count, compresses model generations, and makes compute-rich labs more strategically valuable. Recursive engineering wins compound on top of that: each generation of models tunes the stack a little better for the next, and the labs that own the loop extract more useful work from the same megawatts.

The procurement math is where this turns into a problem. JLL puts the average build time for a 50 MW data center at 18 months, with average grid-connection waits in primary markets running over four years, while CBRE reports that markets with power access inside an 18- to 36-month window are highly sought after. AI demand cycles run six to twelve months. Agent-driven training-stack optimization sits on the demand side of that gap.

The moat shifts to whoever owns the loop

For supercomputer operators outside the frontier labs, the implication is sharper than Jevons accounting. Competitive advantage at the AI frontier is moving from who has the most GPUs to who has the most useful agents tuning the stack on top of those GPUs. Anthropic is running internal optimization loops against production telemetry, in-house evaluations, and its own clusters; the system-card curve is the public artifact. OpenAI executives have publicly described internal targets during an October 2025 livestream, including an intern-level AI research assistant by September 2026 and what Sam Altman called a "legitimate AI researcher" by 2028, with Jakub Pachocki describing the latter as a system capable of autonomously delivering on larger research projects. The kernel-automation work running at NVIDIA, Meta, ByteDance, and across multiple academic groups extends the picture below the frontier-lab line: agents propose kernels, verifiers profile them, and the best implementations get promoted into production.

Public clouds, sovereign AI factories, and university HPC centers will get agentic tooling. What they usually will not have is the same closed loop of model internals, production training telemetry, private evals, and promotion gates that frontier labs can use to train and validate the agents tuning their own stacks. The capability-per-dollar gap between organizations renting capacity and organizations running their own recursive optimization layer will widen even as raw GPU access broadens.

Two operational warnings

The Mythos system card flags one of them directly. On at least one LLM-training evaluation the model exploited the timing wrapper by moving computation outside the timed call. The capability is real and so is the eval gaming. Operators running agents against cluster telemetry, including utilization, step time, checkpoint overhead, and energy per useful step, should expect the same class of behavior. Adversarial evaluations and out-of-distribution probes are now operational requirements, not research curiosities.

The second is alignment compounding. Clark's illustrative math translates cleanly to infrastructure: an optimization rule that is 99.9% accurate degrades to 95.12% after 50 generations and 60.5% after 500. Substitute "scheduler heuristic" or "kernel selection rule" for "alignment technique" and the warning carries: small inefficiencies, silent eval failures, and benchmark-gamed local optima compound when the system is optimizing its own successor.

That is the part of the loop HPC people are best positioned to guard. The training stack is a supercomputing artifact. The agents tuning it are not. Until the operational layer (observability, guardrails, adversarial benchmarks, power-aware scheduling) catches up, the recursive loop will be running on infrastructure the operators of that infrastructure can no longer fully see into.

Whether agents replace researchers is a question for the labs. For operators, the immediate question is which optimization loops to let close. Kernel candidates can enter CI with hardware-counter baselines, held-out shapes, and energy-per-token regressions. Scheduler changes should begin as advisory policies before they touch production queues. Training-pipeline optimizers need canary workloads, golden traces, rollback paths, and explicit checks against moving work outside measured regions. The eval harness has become a control-plane component, alongside Slurm, Kubernetes, DCGM, Prometheus, and the compiler stack. It decides which machine-generated change is allowed to become infrastructure.

The 2.9× to 52× curve is not proof of fully automated AI R&D. It is evidence that the engineering schlep that determined whether scale worked has started writing itself. For the AI infrastructure economy, that is the part of Clark's essay that already has a price tag attached.

AI Infrastructure Agentic AI AI-HPC Convergence