Project Prometheus Raised $12B to Train an "Artificial General Engineer." The Training Data Doesn't Exist Yet

Jeff Bezos and Vik Bajaj's startup now has $18.2 billion and roughly 150 employees. What it doesn't have is an internet of manufacturing data, so the corpus will have to be manufactured, much of it on supercomputers.

[Image: A CNC machining center with a GPU server stack as its spindle mills a glowing wireframe turbine blade, illustrating manufactured AI training data.] There is no internet of manufacturing data to scrape. The corpus for an "artificial general engineer" has to be machined: simulation tokens cut on GPU clusters, one expensive pass at a time. Credit: AI-generated / Supercomputing News

Project Prometheus, the AI startup co-led by Jeff Bezos and Vik Bajaj, announced a $12 billion Series B on June 11 at a valuation of roughly $41 billion, with JPMorgan, BlackRock, Goldman Sachs, DST Global, and Arch Venture Partners joining Bezos in the round, per Axios. Combined with its $6.2 billion Series A last November, that brings the company's total to $18.2 billion raised in seven months.

The round itself is the least interesting part of the announcement. In their first joint on-record interview, both co-CEOs conceded that the data their model needs does not exist in collectable form. In Axios's formulation, "there isn't an 'Internet of manufacturing data' they can ingest." Bezos drew the contrast himself: large language models "have been trained on a giant corpus of humanity's knowledge that already existed on the internet... What we're doing is very different," he told CNBC. Bajaj put it in one sentence: "You don't build a bridge or a jet engine through words."

Bezos also named the cost driver: "the cost of compute and of building the specialized training data." The verb "building" is doing real work in that sentence. Nobody gets to collect this corpus, because no one has ever assembled it; someone has to make it. That is what the $18.2 billion is for. Prometheus plans to manufacture its training data through simulation, through automated laboratories and, if a widely reported but unconfirmed plan is real, by buying the industrial companies that generate it.

What Prometheus has, and what pretraining needs

Start with what is confirmed, because it is thin. Prometheus employs roughly 150 people across San Francisco, London, and Zurich, per Axios and GeekWire, reportedly hired out of OpenAI, Google DeepMind, Meta, and xAI, according to New York Times reporting from November. It has operated for about 18 months. It has published no papers, announced no datasets, and named no compute partner. No cluster size or GPU count appears anywhere in the coverage to date, which is worth registering before any analysis of its training program.

The product direction is better documented. Investor materials reviewed by the Wall Street Journal describe an initial product of engineering simulation and design software: simulating airflow around a wing, predicting where metal will crack. In May, in his first public description of the company, Bezos said it has "nothing to do with robotics" and called the vision "a very, very modern version" of CAD, per GeekWire. On June 11 he told CNBC's David Faber the goal is for ten engineers to do in one year what a hundred engineers now do in ten.
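As a back-of-envelope reading of that goal, the quoted figures can be converted into engineer-years. This is a minimal sketch, not anything Prometheus has published: the engineer-year framing and the variable names are our assumptions, and only the four input numbers come from the quote.

```python
# Back-of-envelope reading of the stated goal. Only the four input numbers
# come from the quote; the engineer-year framing is our assumption.

def engineer_years(engineers: int, years: float) -> float:
    """Total labor input, measured in engineer-years."""
    return engineers * years

today = engineer_years(engineers=100, years=10)   # a hundred engineers, ten years
target = engineer_years(engineers=10, years=1)    # ten engineers, one year

labor_reduction = today / target    # same output from this many times less labor
calendar_speedup = 10 / 1           # the work also finishes this many times sooner

print(f"Today:  {today:.0f} engineer-years")
print(f"Target: {target:.0f} engineer-years")
print(f"Implied labor reduction: {labor_reduction:.0f}x")
print(f"Implied calendar speedup: {calendar_speedup:.0f}x")
```

Read this way, the goal implies roughly a hundredfold cut in engineering labor for the same output, delivered in a tenth of the calendar time.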

So what does the company already hold? Prometheus discloses nothing, which means what follows is our analysis. Three signals triangulate. The investor documents point to a simulation-first corpus. The Times's sources said Prometheus takes "a similar approach" to Periodic Labs, where robots run physical experiments and harvest the results, failures included. And Bajaj's pedigree (physicist and chemist, Google X, Verily co-founder, Foresite Labs, Xaira Therapeutics) is lab-in-the-loop science rather than fleet telemetry. All three point the same direction: simulation output plus automated-lab experimental data.

At 18 months and 150 people, that likely amounts to petabytes of simulation output and gigabytes to terabytes of physical experiment data; this is our estimate, since the company has disclosed no figures. Periodic Labs, the closest public comparable, describes its autonomous labs as generating "gigabytes of experimental data that exists nowhere else," per the a16z investment memo. Gigabytes. The same memo notes that the best language models trained on roughly 10 trillion tokens and calls the internet effectively exhausted. Against that yardstick, what Prometheus plausibly holds is enough to steer a model and nowhere near enough to pretrain one.

Here is how the physical-world data positions actually compare:

Holder	Corpus	Scale	Provenance
Tesla	Real-world driving video + telemetry	~9M vehicles globally (Feb 2026); 11.05B cumulative FSD miles, ~4.16B city miles (Tesla FSD safety page, June 2026); ~29M FSD miles/day (Electrek, May 2026); 8 cameras per vehicle	Customer fleet exhaust — customers pay Tesla to generate it
Waymo	Driverless operation data	~200M fully autonomous miles (2026)	Owned fleet; higher quality per mile, roughly 50x smaller than Tesla's by mileage
NVIDIA	Cosmos world-foundation-model corpus + synthetic	20M hours of video, ~9,000T tokens (Jan 2025); Physical AI Dataset >15M downloads; Isaac Lab simulation at data-center scale	Curated + simulation-generated; sells the substrate rather than keeping the moat
Siemens / Dassault / PTC	Decades of CAD, PLM, and simulation exhaust	Unquantified; plausibly the largest engineering corpus on earth (our analysis)	Contractually customer-owned and siloed; unusable for foundation training without consent
Periodic Labs	Autonomous-lab experimental data	"Gigabytes" per experiment cycle (a16z); $300M seed (TechCrunch, Sept 2025)	Self-generated, proprietary; Bezos is an investor
Physical Intelligence	Robot manipulation data	$400M round in 2024, Bezos an investor (New Atlas)	Robot fleet, early scale
Prometheus	Undisclosed	~150 staff, 18 months, $18.2B to spend	To be manufactured: simulation + labs + (reported) acquisitions

One caution on the table's largest number: fleet miles are not curated training tokens. Tesla does not disclose how much of its fleet video actually gets ingested into training, and raw exhaust needs heavy filtering before it becomes signal. The Tesla figures describe how much the fleet can collect; they say nothing about how much of it becomes usable training data. Even discounted steeply, though, the asymmetry against a 150-person startup with no fleet stands.
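The asymmetry in the table can be made concrete with a few ratios. The sketch below uses only the figures cited in the table; treating Tesla's supervised FSD miles and Waymo's driverless miles as comparable units is a simplifying assumption made for illustration, and the constant names are ours.

```python
# Ratios implied by the fleet figures in the table. Inputs are the article's
# cited numbers; comparing supervised FSD miles with driverless miles
# one-for-one is a simplifying assumption made only for this sketch.

TESLA_FSD_MILES = 11.05e9       # cumulative FSD miles
TESLA_CITY_MILES = 4.16e9       # of which city miles
TESLA_MILES_PER_DAY = 29e6      # reported daily FSD mileage
WAYMO_MILES = 200e6             # fully autonomous miles

mileage_ratio = TESLA_FSD_MILES / WAYMO_MILES
days_to_match_waymo = WAYMO_MILES / TESLA_MILES_PER_DAY
city_share = TESLA_CITY_MILES / TESLA_FSD_MILES

print(f"Tesla cumulative miles vs. Waymo: {mileage_ratio:.1f}x")
print(f"Days for Tesla's fleet to log Waymo's total: {days_to_match_waymo:.1f}")
print(f"City share of Tesla FSD miles: {city_share:.0%}")
```

On these inputs, Tesla's cumulative mileage is about 55 times Waymo's, and the fleet logs the equivalent of Waymo's entire total roughly every seven days. Neither figure says how many of those miles survive filtering.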

Three ways to manufacture a corpus

Ranked by the strength of the evidence that Prometheus will pursue them.

GPU simulation at cluster scale. This is what the investor documents describe, and it is the only physical-world data source that scales the way internet text did. The precedent is established. NVIDIA's Cosmos world foundation models trained on 20 million hours of video (about 9,000 trillion tokens) using 10,000 H100s over three months. Robotics has already normalized the pattern: NVIDIA's Grasp-MPC ran roughly two million simulated grasp trajectories before a robot ever touched a real object, a sim-first pipeline now standard across the field. If Prometheus follows the same path for structural mechanics, fluids, and thermals, its training corpus is itself a supercomputing output, manufactured by physics solvers on the same class of clusters that will later train the model. At that point the old boundary between simulation and AI stops mattering; the simulation is the feedstock.
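To give the Cosmos figures a sense of scale, the arithmetic can be sketched as follows. The inputs are the numbers cited in this article; the 30-day month and the assumption that every GPU ran around the clock are ours, so the GPU-hour figure is an upper-bound estimate and not a number NVIDIA has reported.

```python
# Scale arithmetic for the Cosmos figures cited in this article. The 30-day
# month and round-the-clock use of every GPU are our assumptions, not NVIDIA's.

VIDEO_HOURS = 20e6          # hours of video
VIDEO_TOKENS = 9_000e12     # about 9,000 trillion tokens
GPUS = 10_000               # H100s
MONTHS = 3
HOURS_PER_MONTH = 30 * 24   # assumption: 30-day months, no idle time

LLM_TEXT_TOKENS = 10e12     # rough text-pretraining yardstick (a16z memo)

tokens_per_video_hour = VIDEO_TOKENS / VIDEO_HOURS
gpu_hours = GPUS * MONTHS * HOURS_PER_MONTH
ratio_to_text_corpus = VIDEO_TOKENS / LLM_TEXT_TOKENS

print(f"Tokens per hour of video: {tokens_per_video_hour:,.0f}")
print(f"GPU-hours (upper bound): {gpu_hours:,}")
print(f"Video tokens vs. text yardstick: {ratio_to_text_corpus:,.0f}x")
```

That works out to about 450 million tokens per hour of video, roughly 21.6 million GPU-hours, and a token count some 900 times the text yardstick. Video tokens and text tokens are not interchangeable, so the last ratio indicates scale, not equivalence.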

Lab-in-the-loop experiments. The Times's Periodic Labs comparison points here, and Bezos is an investor in Periodic itself. The a16z memo describes the approach as nature becoming the reinforcement learning environment, which captures both the appeal and the limit. Experimental data carries enormous signal per byte, including the failure data that simulation tends to underrepresent, but the volumes are tiny. It is post-training fuel. The pretraining bulk has to come from somewhere else.

Acquisition. In March, the Journal reported that an affiliated fund of as much as $100 billion is being raised to buy industrial companies in chipmaking, defense, and aerospace, with the intent of feeding their data into Prometheus. CAD archives, test-stand telemetry, process logs, failure databases: decades of engineering exhaust that cannot be licensed on the open market. Both CEOs declined to discuss the fund on June 11. Bezos allowed only that Prometheus "may buy parts of companies," per GeekWire. A non-answer leaves the question where the Journal left it. This entire branch of the strategy rests on a single exclusive and should be read as reported rather than confirmed. If it is real, it would be vertical integration of a training-data supply chain at a scale no AI company has attempted.

Two thinner channels round out the picture: Blue Origin, which Bezos describes as "a case study for a customer" rather than a corporate affiliate, and licensed CAD or simulation corpora from incumbents, though no evidence of any licensing deal has surfaced.

The incumbents already have flywheels

Tesla's position clarifies what Prometheus is up against. The fleet widens its corpus by some 29 million miles every day at zero marginal cost, because the data is exhaust from a product nine million customers already paid for. Waymo runs the same loop with an owned fleet, trading volume for fidelity. NVIDIA, characteristically, has chosen to sell the data factory rather than own the data. Its positioning at GTC 2026 as the operating system for physical AI puts it upstream of every player in the table above, Prometheus included.

The strangest position belongs to the engineering-software incumbents. Siemens, Dassault, and PTC sit on what is plausibly the largest engineering corpus in existence, accumulated over decades of CAD, PLM, and simulation hosting. Our analysis: they cannot use it. The data is contractually their customers' property, siloed by design, and unusable for foundation-model training without consent that aerospace and defense customers are unlikely to grant. SCN made a related argument about VAST Data's $30 billion bet on the middle layer of AI: value accrues to whoever organizes the data layer, not to whoever happens to store it. Physical-world data sharpens that thesis. Sitting on the corpus is worth nothing if you cannot legally feed it to a model.

Step back and the structural difference is plain, though it is our read rather than the company's framing. Tesla collects passively; Prometheus has to manufacture actively, paying for every token in compute time, lab time, or acquisition capital. That inversion explains the otherwise baffling fundraising arithmetic: $18.2 billion raised, with a reported $100 billion more behind it, for a 150-person company with no shipped product. The capital is the data strategy.
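The fundraising arithmetic is easy to put on a per-head basis. A minimal sketch, assuming the headcount is exactly 150 (the reporting says "roughly") and, for the second figure, that the reported and unconfirmed $100 billion affiliated fund is counted alongside the company's own capital, which is our simplification:

```python
# Capital per employee, using the article's figures. Headcount is "roughly
# 150" in the reporting; the $100B fund is reported, unconfirmed, and a
# separate affiliated vehicle, so adding it here is our simplification.

RAISED_USD = 18.2e9
REPORTED_FUND_USD = 100e9
EMPLOYEES = 150

raised_per_employee = RAISED_USD / EMPLOYEES
with_fund_per_employee = (RAISED_USD + REPORTED_FUND_USD) / EMPLOYEES

print(f"Raised per employee: ${raised_per_employee / 1e6:,.0f}M")
print(f"Including reported fund: ${with_fund_per_employee / 1e6:,.0f}M")
```

That is about $121 million per employee on the capital raised so far, and about $788 million if the reported fund is included.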

The supercomputing reading

If the corpus must be manufactured rather than scraped, the binding constraint on Prometheus becomes the economics of compute rather than algorithmic novelty. Synthetic physics data at pretraining scale means physics solvers and world models running continuously on GPU clusters, and the company that converts capital into validated simulation tokens at the lowest cost wins. That makes the undisclosed compute plan the most consequential thing Prometheus has not said. A company whose stated bottleneck is "the cost of compute and of building the specialized training data" has told us where its money is going; it has not told us to whom.

The sovereignty dimension is material, and it runs through the money. The reported $100 billion affiliated fund targets chipmaking, defense, and aerospace: three sectors where data is treated as a controlled asset as much as a commercial one. The Journal reported that Bezos pitched Middle East and Singapore sovereign wealth funds on the vehicle, and JPMorgan's involvement runs through its Security and Resiliency Initiative, the bank's umbrella for defense-industrial and supply-chain investment. None of this is confirmed by the company. But the pattern it describes, sovereign capital flowing into defense-adjacent industrial data, lands in territory governments already treat as strategic, as the Pentagon's $46 billion sovereign AI infrastructure request for FY2027 makes plain. An American company buying defense and aerospace suppliers with foreign sovereign money to harvest their engineering data could draw the kind of national-security review usually aimed at foreign acquirers, and would create exactly the kind of concentrated industrial-data asset that export-control regimes were built to track.

A last observation, and it is our analysis rather than anything the company has said: Bezos now holds positions across the entire physical-world data spectrum. Co-CEO of Prometheus. Investor in Physical Intelligence and Periodic Labs. Owner of Blue Origin, which he describes as "a case study for a customer." And, if the Journal's reporting holds, the figure behind a $100 billion industrial acquisition vehicle. Read as a portfolio rather than a collection of bets, it is a single thesis: physical-world data is the next scarce resource in AI, and owning its generation beats renting its access.

Whether the thesis converts is genuinely open; nothing in eighteen months of operation demonstrates that capital turns into corpus at the required scale. Consider the benchmark. Driving a car is one engineering domain, and a decade of effort, nine million data-gathering vehicles, and eleven billion miles have brought Tesla close to solving it but not all the way there. Prometheus is testing whether $18 billion, and possibly $100 billion more, can buy that scale of data for every domain engineers touch.


About the contributor

The SCN Staff is a small AI editorial squad working under human direction. Each agent owns one job.

Scout does the research. It runs down primary sources and checks what's already been published, on SCN and everywhere else, before a story gets written. If a claim can't be traced back to a real document, Scout flags it.

Forge writes. It takes what Scout found and turns it into a draft, argument and sentences and all. Every SCN piece starts here, then gets sharpened.

Cipher handles search: the titles, descriptions, and keyphrase work that decides whether a good article ever gets found. Least glamorous job on the squad. Also one that matters more than it looks.

Pixel makes the visuals. Images, charts, the occasional diagram, all built to SCN's brand instead of pulled from a stock library. When something's easier to see than to read, it goes to Pixel.

Editorial judgment and the final call stay with the humans. So does the fact-checking.
