Reconstructing FP64: How Supercomputing's Establishment Is Adapting Science to AI Silicon

Two bodies of work, Matsuoka's FP8-emulation preprints and the Dongarra 'Ride the Wave' paper, point to a field adapting scientific computing to AI silicon it no longer controls.

Image caption: Double precision, rebuilt from smaller pieces. AI-era GPUs have cut native FP64 to a sliver, and software now reconstructs it from FP8 arithmetic. How much of the structure the rebuild can hold is the open argument. (Image: AI-generated / Supercomputing News)

Reported analysis. Technical figures are drawn from Satoshi Matsuoka's two-part "FP8 is All You Need" preprint and are attributed throughout as his projections and benchmarks, not established results; several are outputs of an analytic model rather than measured hardware runs. The strategic claims come from the Dongarra, Reed, and Gannon paper hosted at netlib. The convergence between the two bodies of work, the reading that they describe the same shift in where scientific computing's center of gravity sits, is this publication's interpretation, labeled as such. Neither Matsuoka nor Dongarra cites the other. Matsuoka and Dongarra both responded to questions from Supercomputing News by email on June 11; their comments are quoted below. Matsuoka's correction to the preprint's headline throughput figure was provided in that correspondence. The revised Part 1 (dated June 13) and a new Part 2 (dated June 15) had not yet appeared on arXiv at the time of writing; Matsuoka, citing arXiv's posting backlog, shared both with Supercomputing News directly and intends to circulate direct download links until the arXiv versions post. Disclosure: Supercomputing News is named in the acknowledgements of the revised preprint; this analysis was conducted independently, and the editorial reading below is the publication's own.

Two documents the supercomputing field produced this spring, both from some of its most credible figures, start from opposite corners of the discipline and arrive at compatible diagnoses. One is a hardware-and-algorithms preprint, dense with throughput tables, written by the RIKEN director responsible for Fugaku's development. The other is a strategy paper from the most decorated name in numerical computing and two longtime collaborators, written in the register of national policy. They do not cite each other. Read together, they point to the same shift: native double-precision silicon can no longer be assumed as the foundation of future scientific computing, and the response taking shape is algorithmic rather than architectural.

That reading is ours, not a claim either set of authors makes about the other. But the two halves fit with unusual precision. Satoshi Matsuoka, director of the RIKEN Center for Computational Science, supplies the how: a mechanism by which 64-bit accuracy is reconstructed on AI-optimized GPUs like the B300 that have sharply cut native FP64. Jack Dongarra, Daniel Reed, and Dennis Gannon supply the why and the what-next, an argument that the AI market now dictates the silicon, that peak FLOPS has stopped being the metric that matters, and that science needs a national program built around energy rather than speed. And RIKEN, Matsuoka's own institution, has already started building toward that future, a thread this publication covered when Japan's next flagship machine de-emphasized the Top500 chase.

Both men responded to this publication's questions by email on June 11, and their answers reshape the record. Matsuoka told Supercomputing News that the preprint's headline throughput figure was a mistake, and in the days since he has gone further than a correction: he has written a heavily revised Part 1 and a new Part 2, both shared with this publication ahead of their arXiv appearance, that cut the number and strengthen the claim it sits inside. Dongarra, asked whether he trusts software-reconstructed FP64 for production scientific codes, sent back a detailed, point-by-point assessment whose first substantive sentence calls the preprint's central claim "overstated." The shift both documents describe stands; the dispute over its reach is now sharper than it was a week ago, and on the record.

The cliff in the datasheet

Start with the number that triggered the argument. In FP8 is All You Need (Part 1), first posted to arXiv on May 28 (the originally posted version, whose headline throughput figure the revision below corrects), Matsuoka points to NVIDIA's Blackwell Ultra, the B300, and what its specification did to double precision. He writes that the part collapses native FP64 to roughly 1.3 TFLOPS, a regression of about 30× from the B200. Set against the table he assembles, the fall is starker than a one-generation dip. Matsuoka's figures put FP64 vector throughput at 34 TFLOPS on the H100, 40 on the B200, and back up near 33 on the coming Rubin R200, with the B300 sitting at about 1.3 in the middle of that line, a regression even against the 2022-era H100. These are his readings of vendor datasheets, restated in a preprint that has not been peer reviewed. NVIDIA's own HGX specifications corroborate the magnitude of the drop. Its published tables put the B300 near 1.25 TFLOPS of FP64 per GPU against roughly 37 on the B200, though NVIDIA's public datasheets do not frame the change as a deliberate regression, and the right of reply on its own silicon belongs to the company.
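The size of the drop can be checked directly from the figures quoted above. The short Python sketch below is only a restatement of that arithmetic; the dictionary and function names are illustrative, and the inputs are the article's quoted values, not independent measurements.

```python
# FP64 vector throughput in TFLOPS, exactly as quoted in the article.
matsuoka_table = {"H100": 34.0, "B200": 40.0, "B300": 1.3, "R200": 33.0}
nvidia_hgx_table = {"B200": 37.0, "B300": 1.25}

def drop_factor(table, older, newer):
    """How many times lower the newer part's FP64 rate is than the older part's."""
    return table[older] / table[newer]

# Matsuoka's figures: B200 -> B300 (the "about 30x" regression).
b200_to_b300 = drop_factor(matsuoka_table, "B200", "B300")
# Matsuoka's figures: H100 -> B300 (the regression against the 2022-era part).
h100_to_b300 = drop_factor(matsuoka_table, "H100", "B300")
# NVIDIA's HGX figures: B200 -> B300.
nvidia_b200_to_b300 = drop_factor(nvidia_hgx_table, "B200", "B300")

print(f"{b200_to_b300:.1f}x, {h100_to_b300:.1f}x, {nvidia_b200_to_b300:.1f}x")
```

On either set of numbers the B200-to-B300 factor lands close to 30, which is why the two sources agree on magnitude even though their absolute figures differ slightly.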

This is not a claim that FP64 disappears from future hardware. NVIDIA's Rubin specifications still list native FP64 at about 33 TFLOPS per GPU, while also publishing an emulated DGEMM path (double-precision matrix multiply reconstructed on tensor cores) as an explicit, supported performance number near 200 TFLOPS, a figure that Matsuoka notes exceeds his own model's dense-FP8 ceiling of roughly 108 TFLOPS and so likely reflects a different substrate or modulus choice. What the B300 does abruptly, Rubin formalizes: native double precision is no longer the assumed center of the design, and emulation is becoming a first-class path beside it. The distinction matters, and the two parts should not be collapsed into one. The B300 is the cliff; Rubin is the redesign that treats emulation as a feature, not a fallback.

The reason the cliff exists is not an engineering failure, and that is the part the strategic paper is built to explain. Double precision is expensive in transistors and power, and the buyers who set NVIDIA's roadmap, the AI training market, do not need it. Dongarra, Reed, and Gannon put the consequence plainly: scientific and technical computing, they write, "is increasingly a specialized, policy-driven niche riding atop hardware and software stacks optimized for other, much larger markets." The FP64 cliff is what that sentence looks like when it reaches a datasheet. The silicon is being designed for someone else, and the science has to find a way to run on it anyway.

Where the software steps in

Matsuoka's claim is that it can, and the revised paper makes it far more aggressive than reconstruction. The mechanism is the Ozaki Scheme II, an evolution of a method for reconstructing high-precision matrix products out of low-precision arithmetic. In his account it runs in three phases: integer scaling of the inputs, a set of modular matrix multiplications carried out through the chip's FP8 tensor path, and a reconstruction of the full double-precision result using the Chinese Remainder Theorem and Garner's algorithm. The arithmetic the AI market paid for, dense FP8 tensor throughput, becomes the substrate on which 64-bit physics is rebuilt.
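The three phases can be made concrete with a toy example. The Python sketch below is not Matsuoka's implementation and does not touch an FP8 tensor path: ordinary Python integers stand in for the low-precision hardware, the moduli are a hypothetical choice, and every function name is invented for illustration. It only shows the shape of the idea: scale to integers, multiply modulo several small coprime numbers, then rebuild the exact product with the Chinese Remainder Theorem in Garner's mixed-radix form.

```python
from math import prod

# Hypothetical moduli: distinct primes, hence pairwise coprime.
MODULI = [251, 241, 239, 233, 229, 227]

def matmul_mod(A, B, m):
    """Phase 2: integer matrix product with every entry reduced modulo m."""
    inner = len(B)
    return [[sum((A[i][t] % m) * (B[t][j] % m) for t in range(inner)) % m
             for j in range(len(B[0]))] for i in range(len(A))]

def garner(residues, moduli):
    """Phase 3: recover x in [0, prod(moduli)) from its residues (Garner's algorithm)."""
    digits = []
    for i, m in enumerate(moduli):
        t = residues[i] % m
        for j in range(i):
            # Peel off each known mixed-radix digit, then divide by its modulus mod m.
            t = (t - digits[j]) * pow(moduli[j], -1, m) % m
        digits.append(t)
    value, weight = 0, 1
    for d, m in zip(digits, moduli):
        value += d * weight
        weight *= m
    return value

def emulated_matmul(A, B, frac_bits=8):
    """Phase 1 (power-of-two scaling) plus phases 2 and 3 for small float matrices."""
    scale = 1 << frac_bits
    Ai = [[round(x * scale) for x in row] for row in A]
    Bi = [[round(x * scale) for x in row] for row in B]
    residue_products = [matmul_mod(Ai, Bi, m) for m in MODULI]
    M = prod(MODULI)
    C = []
    for i in range(len(A)):
        row = []
        for j in range(len(B[0])):
            x = garner([P[i][j] for P in residue_products], MODULI)
            if x > M // 2:          # map back to the signed range
                x -= M
            row.append(x / (scale * scale))
        C.append(row)
    return C

A = [[1.5, -2.25], [0.125, 4.0]]
B = [[2.0, 0.5], [-1.0, 3.0]]
C = emulated_matmul(A, B)   # [[5.25, -6.0], [-3.75, 12.0625]]
```

The result is exact here because the inputs are dyadic fractions that survive the scaling step and the product of the moduli comfortably exceeds the largest integer result. A real scheme has to choose the scaling and the number of moduli so that the same holds for full 53-bit inputs, which is where the cost discussed below comes from.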

What the revision sharpens is the thesis wrapped around that mechanism, and Matsuoka states it without hedging: the FP8 tensor-core matrix-multiply is the sole computational primitive double-precision science needs. Every canonical HPC kernel, and every application that composes them, reduces to sequences of FP8 matrix operations through Ozaki II. The only non-FP8 arithmetic anywhere in the stack is a bounded, fixed-width integer accumulation in the reconstruction step. Native FP64 silicon, on this account, is "not a hardware requirement but a derived accuracy guarantee" produced by composition over the FP8 primitive. The new version organizes the argument as a five-layer hierarchy, from the FP8 op at the base through Ozaki II, the "Berkeley dwarfs" of numerical computing, the composite solver kernels of real applications, and full codes at the top, and it leans on the dwarf taxonomy's completeness to argue this is coverage of the whole field rather than a sample of convenient kernels. The framing that once asked whether software could keep pace now asserts that one hardware operation is all the silicon must provide.

The throughput he projects is where the record has already moved. As originally posted, the preprint argued the scheme "vaults emulated FP64" to roughly 500 TFLOPS on the B300, "exceeding even B200's native FP64 ceiling by over an order of magnitude in the compute-bound regime." That figure is superseded. In email to Supercomputing News on June 11 Matsuoka first put the corrected B300 number at "about 150 TFLOPS"; the revised preprint settles it at roughly 135 TFLOPS, the value that follows directly from the corrected cost model: 5,000 TFLOPS of dense FP8 divided by a 37× emulation multiplier. The error, he explains in the revision's changelog, was using a per-modulus cost of α=r where the correct figure is α=3r+1: each residue product on the FP8 substrate expands into three FP8 multiplies (the Karatsuba-style structure used to emulate signed int8 on FP8) plus a max-magnitude pass. He credits NVIDIA's library team for flagging it. "Still a big number," he wrote, and one that "will not change the fact that it will be applicable to sparse codes."
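The corrected figure is simple enough to reproduce. In the Python sketch below, the value r = 12 is an inference rather than something the article states: it is the value that makes 3r + 1 equal the quoted 37× multiplier. The function names are illustrative, and the sketch makes no attempt to reproduce the superseded 500 TFLOPS figure.

```python
def emulation_multiplier(r):
    """Corrected per-modulus cost quoted from the revision: alpha = 3r + 1."""
    return 3 * r + 1

def emulated_fp64_tflops(dense_fp8_tflops, r):
    """Compute-bound ceiling: dense FP8 rate divided by the emulation multiplier."""
    return dense_fp8_tflops / emulation_multiplier(r)

R_ASSUMED = 12            # assumption: chosen so that 3r + 1 == 37
DENSE_FP8_TFLOPS = 5000.0  # dense FP8 figure quoted in the article

corrected = emulated_fp64_tflops(DENSE_FP8_TFLOPS, R_ASSUMED)
print(f"multiplier = {emulation_multiplier(R_ASSUMED)}, ceiling = {corrected:.1f} TFLOPS")
```

With those inputs the ceiling comes out just above 135 TFLOPS, matching the value the revised preprint settles on.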

The correction moves the multiplier without moving the argument's floor, because it is confined to the compute-bound ceiling. At 135 TFLOPS, emulated FP64 on the B300 still sits about 104× above the 1.3 TFLOPS the silicon offers natively. What it retires is the paper's most quotable line: 135 is roughly three and a half times the B200's native FP64 ceiling, not the "over an order of magnitude" the original version claimed. And because the kernels that dominate scientific codes, sparse products and stencils among them, are bound by memory bandwidth rather than arithmetic, Matsuoka can give up nearly three-quarters of his headline number without giving up the claim he cares about: those kernels still reach the memory roof at full FP64 accuracy. The memory-bound result, by his account and the structure of the model, is untouched by the correction.
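The comparisons in this paragraph follow from four of the quoted numbers. The Python sketch below simply recomputes them; the variable names are illustrative, and the B200 ceiling is shown under both Matsuoka's figure (40) and NVIDIA's (37), since the article quotes both.

```python
corrected_tflops = 135.0     # revised emulated FP64 ceiling on the B300
superseded_tflops = 500.0    # originally posted headline figure
b300_native_tflops = 1.3     # native FP64 on the B300 (Matsuoka's figure)
b200_native_tflops = {"matsuoka": 40.0, "nvidia": 37.0}

# Emulated ceiling relative to what the B300 offers natively.
gain_over_b300_native = corrected_tflops / b300_native_tflops

# Emulated ceiling relative to the B200's native FP64, under each source's figure.
ratio_to_b200 = {src: corrected_tflops / v for src, v in b200_native_tflops.items()}

# Share of the original headline number that the correction removes.
headline_reduction = 1.0 - corrected_tflops / superseded_tflops

print(f"{gain_over_b300_native:.0f}x over native B300")
print({src: round(v, 2) for src, v in ratio_to_b200.items()})
print(f"{headline_reduction:.0%} of the headline figure given up")
```

Either B200 figure puts the ratio between three and four, well short of an order of magnitude.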

That claim rests on a model Matsuoka calls Tensor-Memory Equilibrium, an extension of the familiar Roofline that adds terms for the extra compute the emulation spends, the bandwidth it costs, and the latency of reconstruction. His argument is that with register-level fusion the bandwidth penalty falls toward nothing, which makes the emulation "essentially free behind the memory wall." The distinction a reader has to hold onto is which numbers are which. The throughput headlines are outputs of his model, not measured hardware runs, and the originally posted version overstated them by the author's own account, while the accuracy results lean on implemented kernels.
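Why a memory-bound kernel is indifferent to the corrected ceiling is easiest to see in the plain Roofline that the model extends. The Python sketch below is the classic Roofline bound with one extra, hypothetical knob for emulation traffic; it is not the Tensor-Memory Equilibrium model itself, whose terms the article does not spell out. The 8 TB/s bandwidth and the 0.25 flop/byte intensity are illustrative assumptions, not figures from the preprint.

```python
def attainable_tflops(intensity, compute_peak_tflops, bandwidth_tbs, bw_penalty=0.0):
    """Roofline bound in TFLOPS.

    intensity:   useful FP64 flops per byte of memory traffic
    bw_penalty:  hypothetical extra fraction of traffic added by emulation
                 (0.0 models the perfectly fused case the article describes)
    """
    effective_bandwidth = bandwidth_tbs / (1.0 + bw_penalty)
    return min(compute_peak_tflops, effective_bandwidth * intensity)

HBM_TBS = 8.0           # assumption: memory bandwidth in TB/s
SPMV_INTENSITY = 0.25   # assumption: a low-intensity, SpMV-like kernel

# A memory-bound kernel sees the same bound under either compute ceiling.
sparse_old = attainable_tflops(SPMV_INTENSITY, 500.0, HBM_TBS)
sparse_new = attainable_tflops(SPMV_INTENSITY, 135.0, HBM_TBS)

# A compute-bound kernel is capped by the corrected ceiling.
dense_new = attainable_tflops(100.0, 135.0, HBM_TBS)

# Intensity above which the corrected ceiling, not bandwidth, is the limit.
ridge_point = 135.0 / HBM_TBS

print(sparse_old, sparse_new, dense_new, ridge_point)
```

Under these assumptions the sparse kernel is limited to the same rate whether the compute ceiling is 500 or 135 TFLOPS, while raising `bw_penalty` above zero lowers it. That is the part of the argument that depends on fusion actually working.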

Part 2, dated June 15 and likewise shared with Supercomputing News ahead of posting, supplies the kernel the first paper deferred: the three-dimensional FFT, the one canonical primitive whose small inner dimension makes reconstruction latency, rather than bandwidth, the binding cost. Its answer is an Ozaki-Bailey FFT paired with a "Kulisch escape route": a fixed-point accumulator that runs the exact double-precision reduction on the integer vector pipe the FP64 collapse never touched, projecting a 1024³ FFT at roughly 18 ms against a 12.9 ms memory roof, at full 53-bit accuracy. And it pushes the thesis one step past its own title. A further construction removes even the integer pipe, leaving FP8 tensor cores and HBM as the only silicon a full-FP64 FFT touches. Matsuoka calls this "true and total FP8 is all you need," and flags it as a theoretical upper bound pending implementation. FFT is also, not incidentally, one of the workloads Dongarra named as not mapping cleanly onto tensor-core matrix multiply; Part 2 is Matsuoka's direct answer on that kernel.
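The article does not say how the 12.9 ms memory roof was derived. One set of assumptions that lands close to it, offered here only as a plausibility check and not as the preprint's method, is a complex double-precision array (16 bytes per element), three passes over the data (one per dimension), each pass reading and writing the full array, and 8 TB/s of memory bandwidth. All four are assumptions of this sketch.

```python
def fft3d_memory_roof_ms(n, bytes_per_element=16, passes=3, bandwidth_bytes_per_s=8e12):
    """Lower bound on 3D FFT time if every pass streams the array in and out once."""
    array_bytes = n ** 3 * bytes_per_element
    traffic_bytes = passes * 2 * array_bytes   # one read plus one write per pass
    return traffic_bytes / bandwidth_bytes_per_s * 1e3

roof_ms = fft3d_memory_roof_ms(1024)

# Figures quoted in the article for the projected Ozaki-Bailey FFT.
projected_ms = 18.0
quoted_roof_ms = 12.9
slowdown_vs_roof = projected_ms / quoted_roof_ms

print(f"estimated roof = {roof_ms:.2f} ms, projection is {slowdown_vs_roof:.2f}x the quoted roof")
```

If those assumptions are roughly right, the projected 18 ms sits about 40 percent above the bandwidth limit, which is the gap reconstruction latency would account for.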

The editorial mandate this publication tracks as the Software Ceiling usually asks whether the software stack is keeping pace with the hardware. Matsuoka's papers invert the question. Here the hardware retreated and the numerical-software layer (mixed precision, modular arithmetic, iterative refinement) stepped forward to hold the line. If the corrected projections survive independent benchmarking, software is not the ceiling in this story. It is the floor that caught the fall.

"Its central claim is overstated"

Dongarra has read the preprint, and his assessment, sent to Supercomputing News on June 11, is the most direct engagement between the two halves of this story to date. It addresses the originally posted Part 1, the version that argued FP8 emulation could reconstruct FP64, before the revision sharpened the claim to FP8 as the sole primitive. "The paper has a catchy title and an interesting premise, but its central claim is overstated," he wrote. "It presents Ozaki-style FP8 emulation as if it were a general replacement for native FP64 across HPC. That conclusion is not justified by the evidence."

The substance of his objection is about where Ozaki methods naturally live. They are "most naturally suited to matrix products," he wrote, useful for GEMM-like operations when the matrices are large enough to amortize the decomposition, reconstruction, and data-movement overheads, "but that is a much narrower claim than saying FP8 is sufficient for HPC." Extending the approach to sparse matrix-vector products, stencils, FFTs, reductions, preconditioners, sparse direct solvers, communication-heavy solvers, and full multiphysics applications "is far more complicated," he argued, because those workloads carry irregular memory access, limited arithmetic intensity, and synchronization and communication costs "that do not map cleanly onto dense tensor-core matrix multiplication."

His deeper objection concerns what FP64 actually means in production. Even if an Ozaki-style scheme reproduces FP64-level accuracy for a matrix product, he wrote, "that does not imply IEEE FP64 behavior across an application": correct NaN and Inf handling, exception semantics, reproducibility, stable convergence, robustness on ill-conditioned inputs. "Those properties matter in production scientific computing, and they cannot be assumed from a roofline model." The performance projections likewise "rest on assumptions that need validation," among them low reconstruction overhead, tolerable register pressure, and kernel fusion that succeeds without spilling.
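The special-value point can be illustrated without any GPU. In the Python sketch below, `to_scaled_int` is a hypothetical stand-in for the integer-scaling step of an emulation scheme, not code from the preprint. It shows only that NaN and Inf have no fixed-point representation, so an emulated path must carry them through some explicit side mechanism if it is to match IEEE behavior; it says nothing about how any real implementation handles them.

```python
import math

def to_scaled_int(x, frac_bits=8):
    """Hypothetical scaling step: map a float to a fixed-point integer."""
    if not math.isfinite(x):
        raise ValueError(f"{x!r} has no fixed-point representation")
    return round(x * (1 << frac_bits))

def dot_ieee(a, b):
    """Plain IEEE double-precision dot product."""
    return sum(x * y for x, y in zip(a, b))

a = [1.0, math.inf, 2.0]
b = [3.0, 0.0, 4.0]

# IEEE semantics: inf * 0.0 is NaN, and the NaN propagates through the sum.
ieee_result = dot_ieee(a, b)

# The scaling step cannot encode the Inf at all.
try:
    scaled = [to_scaled_int(x) for x in a]
    scaling_succeeded = True
except ValueError:
    scaling_succeeded = False

print(math.isnan(ieee_result), scaling_succeeded)
```

A production scheme could detect such inputs up front and route them separately, but that is extra machinery a roofline model does not account for, which is the substance of the objection.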

The verdict he lands on is calibrated rather than dismissive. The paper "is therefore best read as a provocative roofline-style argument, not as proof that FP8 can replace FP64 for HPC," he wrote. His defensible version of the conclusion: FP8 emulation "may be valuable for selected FP64 kernels on FP64-starved architectures, provided substantial kernel engineering, numerical validation, and production-quality implementation work succeed. That is an interesting and useful direction. But it is not the same as demonstrating that native FP64 is no longer needed for scientific computing."

The sequence matters to how the two documents read together. The revision did not retreat under Dongarra's critique; it claimed more. Where the posted version presented emulation "as if it were a general replacement for native FP64," the revised version asserts the stronger structural claim outright (FP8 as the sole primitive, native FP64 demoted to a derived guarantee) and answers the "does not map cleanly" objection with the Berkeley-dwarfs coverage argument and, for FFT specifically, with Part 2. Dongarra has not responded to that strengthened thesis, and his comments should be read against the version he saw. On its face the revision widens the target his critique was aimed at.

It also, in one respect, narrows the distance. The revised paper draws its own boundary: it concedes a bounded set of corner cases it cannot yet cover (latency-bound small-kernel codes such as accelerator beam dynamics, and stochastic methods like FCI-QMC) and calls them algorithmic limits, not deficiencies of FP64 silicon that more FP64 units would cure. The technical disagreement is clearest around sparse codes. Matsuoka's position is that sparse kernels are memory-bound, so the emulation's compute cost hides behind the memory wall and his "applicable to sparse codes" claim survives the smaller number. Dongarra's is that Ozaki-style methods are best established for dense matrix products and have not yet been demonstrated convincingly for SpMV-like workloads, whatever the roofline says. Those are different objections aimed at the same kernel class, and benchmarking on real sparse workloads would narrow the gap between them. Settling the question would take more: multiple matrices and architectures, convergence tests, special-value behavior and reproducibility checks, and solver-level results in end-to-end applications.

Matsuoka, shown the critique, did not retreat. In a further email to Supercomputing News he argued the preprint should be read the way the field reads the Berkeley "dwarfs" paper, as theory that establishes an upper bound, deliberately separated from the engineering questions that follow; the two papers, he wrote, "establish the theoretical grounds saying that, if the premise is true, then FP8 is all you need." The fuller version of that argument, and the research program behind it, deserves its own treatment, and will get it in these pages. What belongs in this one is the spirit of his reply. "I welcome being proven (at least partially :-) wrong by my best friend Jack Dongarra," he wrote, emoticon included. "I hope this debate will arouse many interests among our peers."

A lineage, not a coincidence

It would overstate the case to call the two papers a coordinated front, and understate it to call their agreement an accident. The specific documents do not reference each other: the Dongarra paper contains no mention of Matsuoka or of the Ozaki scheme, and Matsuoka's preprints do not cite Ride the Wave. But Matsuoka is not arguing in a vacuum sealed off from Dongarra's camp. He quotes Dongarra directly, a remark from SC25 that "the 64-bit performance does not improve," as corroboration for the regression he is documenting, and he builds his numerics on the mixed-precision program Dongarra spent years developing with Nicholas Higham and others. The convergence is better understood as a research lineage than a coincidence. One of the field's foremost strategists has been saying for some time that precision would have to be spent more carefully; one of its foremost system builders has now tried to show exactly how far that idea can be pushed, and in the revision pushed it further than the version Dongarra critiqued. The friction that surfaced in their June 11 responses sits precisely where the lineage runs out: both men agree precision must be budgeted, and they disagree on how much of science can run on the budget Matsuoka proposes.

Joules, not FLOPS

If Matsuoka's papers are about throughput, Ride the Wave, Build the Future is about what we should be counting instead. It is a follow-on to the 2023 paper in which the same authors laid out five maxims for scientific computing in an AI world; the new version carries seven. The one that anchors this story is the second: that energy and data movement, not peak floating-point rate, are now the dominant constraints, and that the field should adopt "joules per trusted solution" as a primary metric. Peak FLOPS, the authors write, and even time-to-solution, "are no longer sufficient." The related claim that uniform 64-bit arithmetic will give way to solvers that partition work across FP64, FP32, BF16, FP8, and integer-emulated formats is made in the body of the paper, and reads almost as a description of Matsuoka's mechanism. Dongarra's critique and his paper are consistent on this point: precision partitioned by solver stage, with native FP64 reserved where it is needed, is the paper's own forecast. What he rejects is reading one dense-kernel result as the whole transition.

What the metric is built toward is a program. The paper culminates in a call for "a countervailing national program whose primary objective is not peak capability, but orders-of-magnitude reduction in joules per trusted solution," quantified in the abstract as roughly a hundredfold cut in energy per validated scientific outcome. The authors name the commercial trajectory science is now measured against: xAI-style "Colossus" builds, what the paper calls Oracle's "OCCI-class" deployments, and the zettascale-aspirational AI campuses whose economics scientific computing cannot match and increasingly has to borrow. That is the access question this publication keeps returning to, made concrete at the level of arithmetic. The resource is built for commercial AI: the pressure that led one AI lab to pre-purchase multiple gigawatts of capacity, capacity science will eventually need, is the same pressure now reshaping the FP64 line in NVIDIA's roadmap.
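The article quotes the metric but not a formula for it. One plausible reading, sketched below in Python, divides all the energy spent by the number of results that pass validation, so that failed or untrusted runs count against the total. The formula, the function name, and every number in the example are assumptions made for illustration, not definitions from the paper.

```python
def joules_per_trusted_solution(avg_power_watts, seconds_per_run, runs, validated_runs):
    """Total energy across all runs, divided by the runs that passed validation."""
    if validated_runs == 0:
        return float("inf")   # energy was spent, nothing trustworthy came out
    total_joules = avg_power_watts * seconds_per_run * runs
    return total_joules / validated_runs

# Hypothetical baseline: 1 MW for an hour per run, all 10 runs validated.
baseline = joules_per_trusted_solution(1e6, 3600, runs=10, validated_runs=10)

# Hypothetical candidate: same power, runs 100x faster, but 2 of 10 fail validation.
candidate = joules_per_trusted_solution(1e6, 36, runs=10, validated_runs=8)

improvement = baseline / candidate
print(f"{improvement:.0f}x fewer joules per trusted solution")
```

In this made-up case a hundredfold speedup yields only an eightyfold gain on the metric, because the energy spent on the two rejected runs is charged to the eight that survived. That is the sense in which "trusted" changes what gets optimized.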

A procurement decision that points the same way

The third leg of this is not a paper. It is a procurement decision. The peak-FLOPS metric that both documents argue is no longer sufficient is the metric Top500 was built to rank, and the erosion of that ranking's authority has been visible for a while: the private AI superclusters that now shadow the Top500 are, by their operators' own disclosures, larger than most of the machines on it, and are not submitted for ranking. RIKEN, the institution Matsuoka runs, has designed its Fugaku successor, FugakuNEXT, around application performance and energy inside a fixed power budget rather than a LINPACK score, pairing Fujitsu CPUs with NVIDIA GPUs for the first time in a Japanese flagship and targeting up to 100× application performance within roughly Fugaku's 40 MW envelope. This publication reported on that choice when Japan's flagship program de-emphasized the Top500 chase. Read against the two papers, it looks less like a one-off national preference than a working example of where the argument leads: the strategy paper's diagnosis and the algorithms paper's mechanism, showing up together in a machine that is actually being built.

The sovereignty seam here is less about which flag flies over the silicon than about how national programs respond to hardware they no longer dictate, and the response is starting to look collaborative rather than competitive. In June the U.S. DOE and the Japanese government, the one behind FugakuNEXT, committed a combined $1 billion to coordinate AI, quantum, and high-performance computing across all twelve U.S. national labs and a dozen Japanese institutes, an expansion of the Genesis Mission this publication covered at its $293M stage.

None of this is settled, and the days before publication made that concrete. The preprint's headline number has already fallen by the author's own hand, from roughly 500 TFLOPS to about 135, even as the thesis around it hardened into a claim that FP8 is the only compute primitive science needs. The field's most decorated numerical analyst has read the earlier version and called its central claim overstated, while calling the direction "interesting and useful." The Dongarra paper's moonshot remains a recommendation, not a budget line. What the documents and the correspondence around them establish is narrower and more durable than any single number: the field's most influential figures now treat native FP64 as a resource to be budgeted, emulated, or reserved rather than assumed, and as something the software stack will increasingly have to reconstruct. The open argument, the one Matsuoka and Dongarra are now having on the record, is how far emulation runs beyond the dense kernels where everyone agrees it works.

That argument has a venue booked. At ISC 2026 on June 23, Dongarra takes the stage to present the 67th Top500 list and its awards, the ranking built on the LINPACK benchmark he created and ordered by the peak-FLOPS metric his own paper argues is no longer sufficient. He then holds the conference's closing-keynote slot, under the title "HPC in Transition," and the published abstract reads like a précis of this story. The next era of scientific capability, it argues, "will be measured less by peak floating-point rates and more by time-energy-fidelity trade-offs across end-to-end pipelines," and the most plausible path to "effective zettascale" is "not brute-force FP64, but certified mixed-precision algorithms, communication-avoiding methods, AI-augmented reduced-order models, and hybrid AI+simulation workflows with rigorous error control and uncertainty quantification." Certified is the word that carries his position. The keynote does not mention emulation; it makes the case for mixed precision under rigorous error control, the broad program Dongarra has long championed, of which Matsuoka's scheme is one aggressive instance. His comments to this publication draw the boundary the same way: reconstructed FP64 earns a place in production scientific computing when it arrives with verification attached. He has not said it is there yet.

A follow-up piece after ISC will take up Matsuoka's fuller response to the critique: his case for reading the preprints as a theoretical upper bound, the Part 2 FFT and Kulisch results in detail, and the research program he says will test them, alongside the revised preprints once they post to arXiv.

About the contributor

Matt Walters is the founder and publisher of Supercomputing News. He also runs OmniScale Media, the marketing agency he co-founded in 2017 to serve AI, HPC, quantum, and deep tech companies.

He's spent 15+ years in this world. Seven of them at Tabor Communications as VP of Digital Strategy, where he grew audience and sponsorships for HPCwire and helped launch Datanami (now BigDATAwire) and EnterpriseTech (now AIwire). Along the way he's built hundreds of campaigns for tech companies of all sizes - from early-stage startups to NVIDIA, Intel, IBM, HPE, and others... including NVIDIA's early push to sell GPUs for AI, back when that was still a bet.

A builder at heart, he spent 15 years in the construction trades before any of it.