Argonne Turns a Plain-English Prompt Into 11,182 GCMC Runs on Aurora

A planner/executor agent hierarchy drove a 5,591-MOF screening campaign across 256 Aurora nodes. Orchestration overhead came in under 90 seconds per run.

Aurora at Argonne's Leadership Computing Facility, the exascale HPE Cray EX system that hosted the agentic MOF-screening run. (Image: Argonne National Laboratory)

A team from Argonne has put a concrete answer on the table for a question that has been hanging over agentic AI in HPC: can a swarm of LLM agents actually run a leadership-class workload without becoming the bottleneck? In an arXiv preprint posted 9 April, researchers from Argonne's Computational Science Division and the Argonne Leadership Computing Facility describe a planner/executor agent stack that translated a single natural-language prompt into 11,182 Grand Canonical Monte Carlo simulations, screening the CoRE MOF 2025 database of 5,591 metal-organic framework structures for atmospheric water harvesting. The largest production run used 256 Aurora nodes concurrently.

Start with the architecture. Single-agent chemistry frameworks like ChemCrow or the authors' own ChemGraph run a ReAct loop (reason, call a tool, wait, reason again), which serializes execution and kills any hope of filling an exascale machine. The Argonne design splits the work. One planner agent decomposes the objective and spawns a dynamically sized pool of executor agents. A data-analyst agent aggregates at the end. Two MCP servers sit behind the agents: a Chemistry server that exposes simulation-launching tools, and a DataTool server for ranking. MCP, short for Model Context Protocol, is Anthropic's open tool-binding standard originally aimed at desktop assistants.
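The planner/executor/analyst split can be sketched in a few lines. This is a toy illustration, not the paper's code: the function names and the fake uptake values are invented, and a thread pool stands in for the dynamically sized agent pool.

```python
from concurrent.futures import ThreadPoolExecutor

def planner(objective, structures):
    # The planner agent decomposes one high-level objective into
    # independent per-structure tasks that executors can run in parallel.
    return [{"objective": objective, "mof": s} for s in structures]

def executor(task):
    # Stand-in for an executor agent calling a Chemistry MCP tool.
    # A deterministic fake uptake replaces the real GCMC result.
    return {"mof": task["mof"], "uptake": sum(map(ord, task["mof"])) % 100 / 10.0}

def data_analyst(results, top_k=2):
    # The data-analyst agent aggregates and ranks after all runs finish.
    return sorted(results, key=lambda r: r["uptake"], reverse=True)[:top_k]

def run_campaign(objective, structures, pool_size=4):
    tasks = planner(objective, structures)
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        results = list(pool.map(executor, tasks))
    return data_analyst(results)

top = run_campaign("screen MOFs for water harvesting",
                   ["MOF-303", "MOF-801", "CAU-10", "Co2Cl2BTDD"])
```

The point of the shape is that only the planner reasons about the whole campaign; executors are embarrassingly parallel, which is what lets the pattern fill nodes instead of serializing on a single ReAct loop.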

Here is the trick that makes it work: the Chemistry MCP tools don't run simulations. They emit Parsl applications, and Parsl handles placement, concurrency, and fault tolerance across Aurora's nodes. The LLM never touches a scheduler. Reasoning ran on OpenAI's open-weight gpt-oss-120b served through ALCF's inference endpoints. That on-prem choice matters for DOE-scale campaigns, where API token bills and data-governance rules tend to rule out frontier-model APIs.
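The submit-and-return shape is the key. Parsl's real `@python_app` and `@bash_app` decorators turn a function call into a future that resolves when the scheduler runs the work; the sketch below mimics that interface with a plain thread pool so the idea is visible without Parsl installed. The `run_gcmc` body is a placeholder, not gRASPA.

```python
from concurrent.futures import ThreadPoolExecutor, Future

_pool = ThreadPoolExecutor(max_workers=8)

def python_app(fn):
    # Toy stand-in for Parsl's @python_app: calling the decorated function
    # submits it and returns a Future immediately. This is why an MCP tool
    # can "launch" a simulation and hand control straight back to the LLM.
    def submit(*args, **kwargs) -> Future:
        return _pool.submit(fn, *args, **kwargs)
    return submit

@python_app
def run_gcmc(mof_id: str) -> dict:
    # Placeholder for a gRASPA GCMC run on one GPU tile; a real Parsl
    # app would hand placement and retries to the Parsl executor config.
    return {"mof": mof_id, "status": "done"}

# The tool handler only submits; results flow back through futures later.
futures = [run_gcmc(m) for m in ["MOF-303", "MOF-801"]]
results = [f.result() for f in futures]
```

Because the futures, not the LLM, carry completion state, the agent never blocks on a 4,000-second simulation and never needs to know that a scheduler exists.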

Now the numbers. Orchestration overhead landed at 60 to 90 seconds per run against GCMC jobs of 1,600 to 4,400 seconds each, which is close to free. Weak scaling stayed flat from 1 to 256 nodes at a fixed nine MOFs per node. Strong scaling was near-linear from 8 to 32 nodes, tapering to 64.9% efficiency at 256. Individual GCMC jobs ran on a single tile of an Intel Data Center GPU Max 1550 using gRASPA, the GPU Monte Carlo code published in J. Chem. Theory Comput. in 2024. Each Aurora node carries six Max 1550s with two tiles apiece, so that works out to twelve concurrent simulations per node, and thousands in flight at peak on the 1.012-exaflops (HPL), 60,000-plus-GPU machine.
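A quick sanity check on those figures, using only numbers from the text. The worst case for overhead is the longest orchestration time against the shortest job; the strong-scaling efficiency formula below is the standard definition, shown with illustrative times rather than the paper's raw timings, which the preprint reports only as the 64.9% result.

```python
# Worst-case orchestration overhead: 90 s of agent overhead on a 1,600 s job.
overhead_s = 90
shortest_job_s = 1600
worst_overhead_frac = overhead_s / (overhead_s + shortest_job_s)  # ~5.3%

def strong_scaling_eff(t_base, n_base, t_n, n):
    # Efficiency = actual speedup / ideal speedup for a fixed problem size.
    return (t_base / t_n) / (n / n_base)

# Illustrative: perfect scaling from 8 to 64 nodes would be 100% efficient.
perfect = strong_scaling_eff(t_base=100.0, n_base=8, t_n=12.5, n=64)
```

Even in the worst pairing, agent overhead stays near 5% of wall time, which is what "close to free" means in practice.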

Reliability was 84%: 21 of 25 scaling experiments completed cleanly. All four failures traced to gpt-oss-120b emitting malformed tool-call arguments, not to the orchestration layer. The scientific output (the top 20% of screened MOFs hit a water working capacity of 7.06 mol/kg between 60% and 10% relative humidity at 298 K) is a shortlist for the kind of AWH work Omar Yaghi's group has been chasing for a decade.

For the broader HPC audience, this is a reusable template. Swap gRASPA for any Parsl-dispatchable code and the same planner/executor/analyst pattern applies to climate, biology, or cosmology workloads. It is also a real datapoint for MCP escaping the IDE and becoming the tool-binding layer between LLMs and scientific computing infrastructure.

A few caveats worth flagging. It is a v1 preprint, not peer-reviewed. Scaling past 256 nodes on a 10,624-node machine is asserted but not demonstrated. There is no head-to-head against a hand-written Parsl workflow, so the only number we have for the cost of agentic orchestration is that 60 to 90 second overhead. And 84% reliability on 25 runs is a starting line. The failure mode (bad JSON) is the kind of thing better fine-tuning and schema validation tend to fix, but it still needs to be fixed.
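The bad-JSON failure mode is cheap to guard against. A thin validation layer between the model and the MCP tool can reject a malformed call and trigger a retry instead of losing the experiment. The field names below are hypothetical, not the paper's actual tool schema; stdlib `json` plus manual type checks stand in for a full schema validator.

```python
import json

# Hypothetical required arguments for a GCMC tool call; the paper does
# not publish its schema, so these names are illustrative only.
REQUIRED = {"mof_id": str, "temperature_K": (int, float), "rh_percent": (int, float)}

def validate_tool_call(raw: str):
    """Return (ok, parsed_args_or_error) for a JSON tool-call payload."""
    try:
        args = json.loads(raw)
    except json.JSONDecodeError as e:
        return False, f"bad JSON: {e}"
    for field, typ in REQUIRED.items():
        if field not in args:
            return False, f"missing field: {field}"
        if not isinstance(args[field], typ):
            return False, f"wrong type for {field}"
    return True, args

ok, args = validate_tool_call(
    '{"mof_id": "MOF-303", "temperature_K": 298, "rh_percent": 60}')
bad, err = validate_tool_call('{"mof_id": "MOF-303", "temperature_K": }')
```

On a rejected call, the orchestrator can re-prompt the model with the error string, which is exactly the schema-validation-plus-retry loop the authors suggest would lift that 84%.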

The number you should watch is that reliability figure. If the next generation of open-weight reasoning models pushes it from 84% toward 99%, the conversation about agentic HPC stops being about feasibility and starts being about who ports their workflow first.

🤖 AI Disclosure

AI-assisted research and first draft. This article has been verified by a human editor.