Three Bets Against Nvidia's Inference Margin, One Shared Dependency

OpenAI, Qualcomm, and Etched are betting against Nvidia's inference margin. Escaping it means queuing for the same TSMC packaging and memory. Most ASIC challengers die on software, not silicon.

Image: Three dark custom AI chips mounted on one glowing substrate with memory stacks beneath it, three diverging paths behind them. Caption: Three routes away from Nvidia's GPUs, one layer under all of them: advanced packaging and high-bandwidth memory remain the foundation every custom-silicon escape stands on. Credit: AI-generated / Supercomputing News

In the last week of June, three companies placed the same bet against Nvidia in three different ways: that the margin Nvidia collects to run large language models has grown big enough to design around. One is a buyer building its own way off the GPU. The other two are sellers offering everyone else the exit. On June 24, Qualcomm formalized Dragonfly, a merchant data-center accelerator roadmap aimed at inference, around its investor day. The same day, OpenAI and Broadcom unveiled Jalapeño, OpenAI's first custom chip, built for nothing but LLM inference on OpenAI's own workloads. Six days after that, the startup Etched disclosed roughly $800 million raised and over $1 billion in customer contracts for Sohu, an ASIC specialized for transformer inference.

The shared bet is easy to state. Inference, not training, is where the recurring at-scale compute cost now lives, and Nvidia's gross margins on the GPUs doing that work have become a line item large enough to design around. For OpenAI, building its own silicon means keeping the margin Nvidia would otherwise take. For Qualcomm and Etched, the play is to undercut that margin and win the buyers who no longer want to pay it. The target is the same from either side of the invoice.

The complication is where the real story lives. Designing around Nvidia does not, on its own, escape the constraints that make an Nvidia GPU expensive in the first place. The margin you stop paying Nvidia does not vanish; a large share of it moves one layer down the supply chain, to TSMC's advanced-packaging lines and to the handful of vendors that make high-bandwidth memory. And the history of inference silicon says the chip itself is rarely what kills a challenger. Software maturity and volume economics are. Every announcement here is real, and not one ships in volume yet. Hold both facts at once and the week looks less like an escape than a renegotiation of who collects the toll.

The three bets, and what is confirmed

Jalapeño is the most consequential of the three because of who is buying it. OpenAI designed the chip; Broadcom provided the silicon implementation and networking, per OpenAI's own release, which also puts the design in lab testing after a roughly nine-month run from architecture to tape-out. What sits under the lid is mostly deduction rather than disclosure. Neither company has confirmed the die size, memory configuration, or process node (widely reported as TSMC 3nm-class, unconfirmed). Tom's Hardware reads the package as a reticle-sized ASIC with six HBM stacks visible; EE Times counts six or eight from the same images. That is analysis of photographs, not specification. The economics claim rests on a single source as well: Broadcom chief executive Hock Tan told reporters the chip targets roughly half the cost per inference token of the GPUs OpenAI currently rents. The chip is in lab testing, not shipping. Deployment is slated to begin in the second half of 2026 and run toward the end of the decade, inside the roughly 10-gigawatt accelerator-and-Ethernet commitment OpenAI and Broadcom announced in October 2025. A circulated claim that one hyperscaler reserved a large share of initial production is not supported by sourcing and is set aside here.
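To make the "half the cost per inference token" target concrete, here is a minimal Python sketch of the arithmetic. The function name, hourly costs, and throughput figures are hypothetical placeholders chosen for illustration; none of them comes from OpenAI, Broadcom, or Nvidia. The sketch shows only that a halved cost per token can come from a cheaper hour of compute, from more tokens per hour, or from some mix of the two.

```python
def cost_per_million_tokens(hourly_cost_usd: float, tokens_per_second: float) -> float:
    """USD per one million generated tokens for one accelerator."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000


# Hypothetical placeholder figures, not vendor or OpenAI data.
rented_gpu = cost_per_million_tokens(hourly_cost_usd=3.00, tokens_per_second=1500)

# Route 1: same throughput, half the all-in hourly cost.
custom_cheaper = cost_per_million_tokens(hourly_cost_usd=1.50, tokens_per_second=1500)

# Route 2: same hourly cost, double the throughput.
custom_faster = cost_per_million_tokens(hourly_cost_usd=3.00, tokens_per_second=3000)

for label, value in [
    ("rented GPU (hypothetical)", rented_gpu),
    ("custom chip, cheaper hour", custom_cheaper),
    ("custom chip, faster chip", custom_faster),
]:
    print(f"{label}: ${value:.3f} per million tokens")
```

Both routes arrive at the same cost per token, which is why a single headline ratio says little about the chip itself until throughput and all-in hourly cost are disclosed separately.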

Qualcomm's Dragonfly is a different kind of bet, a full roadmap rather than a single part: the AI200 rack-scale inference product sampling in 2026, the AI250 in 2027, the AI300 around 2028. The headline numbers are Qualcomm's own. The company claims effective memory-bandwidth gains of 18x and 54x for the later parts over the AI200, and performance-per-watt multiples of four to eight times over GPU-based architectures. These are investor-day estimates, not independent benchmarks, and should be read that way. One piece of the roadmap is contracted rather than claimed: Meta has signed a multi-generation agreement around Qualcomm's C1000 data-center CPU, with production starting in the second half of 2028.

Then there is Etched's Sohu, the boldest architectural wager in the group. Sohu began as the purest specialization bet in AI silicon, an ASIC built around the transformer. Etched says it now has A0 silicon back from TSMC's N4P process and over $1 billion in customer contracts against first racks the company says ship this summer; its reported investor roster includes Jane Street and a venture arm tied to TSMC. No customers are named. The performance figure that made Sohu famous, on the order of 500,000 tokens per second on Llama 70B in an eight-chip server, roughly 20 times an Nvidia H100, is Etched's own claim from internal testing and has never been independently benchmarked. The specialization has also softened at the edges: Etched's current materials describe systems targeting many-trillion-parameter mixture-of-experts models, long context, and agentic workloads, claims that are likewise vendor-reported. The shape of the bet is unchanged, though. This is silicon committed to one family of architectures, and a valuation reported at $5 billion rides on that family staying dominant long enough to amortize the design.

The counter-argument is the spine, not a footnote

Start with packaging, because that is where the real toll booth sits. The HBM-based challengers run straight into the supply chain that shapes Nvidia's own. TSMC's CoWoS advanced packaging, the step that marries logic to high-bandwidth memory, has been sold out through the end of 2026 and is structurally tight into 2027, and Nvidia holds roughly 60 percent or more of that capacity by industry estimates, with priority HBM access on top. Jalapeño's memory configuration puts it in that queue, and so does any challenger that needs HBM-class bandwidth the conventional way. The custom-silicon buyer does not route around Nvidia here; it queues behind Nvidia at the same fab, for the same scarce process, on terms Nvidia's volume helped set. This is an allocation problem, not a supply problem, and it is one Supercomputing News has argued is the defining shape of AI infrastructure in 2026: when the bottleneck is who gets the packaging slots rather than whether the parts exist, incumbency compounds.

The software problem is subtler and, historically, deadlier. Nvidia's share of inference silicon still sits somewhere in the 60-to-75-percent range, and CUDA is the reason a working GPU deployment is hard to walk away from. A general-purpose GPU absorbs whatever the model layer throws at it. A fixed-function ASIC does not, and a transformer-only part that cannot run the next architecture inherits a worse form of lock-in than the one it was built to escape, this time to a bet about model shape rather than to a vendor. When d-Matrix, one of the more credible inference-ASIC companies, needed to compete, it did not need a faster chip; it acquired a data-center interconnect business to make its silicon usable at rack scale. Inference at scale is a systems problem (memory, interconnect, scheduling, compiler maturity), and the silicon is only the part that photographs well. The startups that have failed mostly did not fail on transistors.

And the margin that gets "escaped" is partly just relocated. An ASIC's bill of materials is TSMC wafers, TSMC packaging, and HBM from SK Hynix, Samsung, or Micron, all of it constrained and priced accordingly. The memory piece is not abstract: the reorganization of supply around data-center AI has grown severe enough to ripple into consumer product shortages, which is what a binding constraint looks like from the outside. Stop paying Nvidia's margin and you start paying more of TSMC's and the memory oligopoly's. The toll booth moved. It did not close.
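The relocation argument can be sketched as simple arithmetic. In the Python below, every price, margin, and bill-of-materials entry is a hypothetical placeholder, not a reported figure for any vendor, and the helper name is invented for illustration. The sketch assumes the buyer pays a design partner's margin on top of the ASIC's cost of goods.

```python
def supplier_pass_through(price_usd: float, vendor_gross_margin: float) -> float:
    """Part of a purchase price that flows past the vendor to its suppliers (cost of goods)."""
    return price_usd * (1.0 - vendor_gross_margin)


# All figures below are hypothetical placeholders for illustration only.
gpu_price = 30_000.0
gpu_vendor_margin = 0.70

asic_bom = {"wafers": 3_000.0, "advanced packaging": 2_000.0, "HBM": 5_000.0}
design_partner_margin = 0.40

# Merchant GPU: most of the price stays with the GPU vendor.
upstream_via_gpu = supplier_pass_through(gpu_price, gpu_vendor_margin)
upstream_share_gpu = upstream_via_gpu / gpu_price

# Custom ASIC: the buyer pays cost of goods plus the design partner's margin.
asic_cogs = sum(asic_bom.values())
asic_price = asic_cogs / (1.0 - design_partner_margin)
upstream_share_asic = asic_cogs / asic_price

print(f"GPU:  ${gpu_price:,.0f} paid, {upstream_share_gpu:.0%} reaches foundry and memory suppliers")
print(f"ASIC: ${asic_price:,.0f} paid, {upstream_share_asic:.0%} reaches foundry and memory suppliers")
```

Under these assumed numbers the buyer pays less per part, but a larger share of each dollar lands with the foundry, packaging, and memory suppliers, which is the sense in which the toll booth moves rather than closes.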

The one design routing around the memory wall

Qualcomm's AI200 is the exception that makes the rule legible. Instead of HBM, each AI200 card carries 768 GB of LPDDR memory in a near-memory configuration. LPDDR is cheaper, far more available, and nowhere near the CoWoS-and-HBM chokepoint that constrains everyone else. Qualcomm is trading peak memory bandwidth for capacity, cost, and supply security, and for a large class of inference workloads, where fitting the model and its context in memory beats saturating bandwidth, that trade pays. It is the one design in this group that genuinely sidesteps the wall the others are queuing at. The later parts extend the wager rather than abandon it: the AI250 and AI300 move to what Qualcomm calls High-Bandwidth Compute memory, which the company positions as an alternative to HBM outright. That is a vendor claim until systems ship, but it is a second deliberate step around the chokepoint, not back into it.
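The capacity-for-bandwidth trade can be illustrated with a rough, memory-bound model of inference. In the Python sketch below, only the 768 GB card capacity comes from the reporting above; the bandwidth figures, the comparison HBM card, and the workload sizes are hypothetical placeholders, and the function names are invented for illustration. The model is deliberately crude: it ignores batching, sharding overhead, and compute limits.

```python
import math
from dataclasses import dataclass


@dataclass
class Card:
    name: str
    capacity_gb: float    # on-card memory capacity
    bandwidth_gbs: float  # memory bandwidth in GB/s


def cards_needed(card: Card, weights_gb: float, kv_cache_gb: float) -> int:
    """Minimum cards whose combined memory holds the weights plus the KV cache."""
    return math.ceil((weights_gb + kv_cache_gb) / card.capacity_gb)


def decode_ceiling_tokens_per_s(card: Card, gb_read_per_token: float) -> float:
    """Bandwidth-bound ceiling for one stream on one card: GB/s divided by GB read per token."""
    return card.bandwidth_gbs / gb_read_per_token


# Only the 768 GB capacity comes from the article; every other number is hypothetical.
lpddr_card = Card("LPDDR card (768 GB)", capacity_gb=768, bandwidth_gbs=1000)
hbm_card = Card("HBM card (hypothetical)", capacity_gb=192, bandwidth_gbs=8000)

weights_gb, kv_cache_gb = 400, 200  # hypothetical model weights and long-context cache

for card in (lpddr_card, hbm_card):
    n = cards_needed(card, weights_gb, kv_cache_gb)
    ceiling = decode_ceiling_tokens_per_s(card, weights_gb)
    print(f"{card.name}: {n} card(s) to fit, {ceiling:.1f} tokens/s ceiling per stream")
```

With these assumed figures the high-capacity card holds the whole workload on one device while the higher-bandwidth card needs several, at the price of a much lower per-stream speed ceiling. Which side of that trade wins depends on the real bandwidth numbers and on how much batching the workload allows, neither of which is settled by vendor claims alone.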

That routing-around is also where the sovereignty dimension enters, through a real deployment rather than a slogan. Qualcomm's named launch customer for the AI200 and AI250 is HUMAIN, the Saudi Arabian venture standing up 200 megawatts of Qualcomm racks from 2026 for what it calls global inferencing, with a Qualcomm AI Engineering Center in Riyadh alongside it. The two threads of the week cross here: the design that most cleanly escapes the shared supply-chain chokepoint is also the one anchoring the week's most explicitly sovereign deployment. The clearest technical escape route and the clearest sovereignty story are the same story.

Escaping Nvidia is not escaping Taiwan

The premise behind Jalapeño, Dragonfly, and Sohu is a form of compute independence: buyers refusing to route their most expensive recurring workload through a single vendor's margin. But independence from Nvidia is not independence from the supply chain Nvidia also depends on. Jalapeño's silicon (reported, though never confirmed, as 3nm-class) and Sohu's N4P both come off TSMC's leading-edge lines in Taiwan. The high-bandwidth memory the HBM-based designs depend on comes from the same three-vendor pool concentrated in Korea and the United States, the concentration that also puts Korean HBM inside Nvidia's own next-generation Vera Rubin platform and shows no structural sign of loosening.

For the research-computing and supercomputing community whose scientific workloads share these same fabs and memory lines, the second-order question matters as much as the first. If custom silicon pulls commercial inference demand off the merchant-GPU market, it could ease the packaging and memory pressure that crowds science out of the queue; if it adds three more high-priority buyers to the same TSMC and HBM lines, it tightens the squeeze. Only the AI200's LPDDR bet plausibly relieves that pressure rather than adding to it.

The industrial fact underneath all of it is slow to change. This week proved there is more than one way to build an inference chip. There is still, for now, essentially one place to manufacture the advanced version, and one narrow band of vendors to give it memory. Custom silicon changes who designs the accelerator. It has not yet changed who the world has to ask for permission to build it.

