HBM4 doubles bandwidth per package and still can't share capacity across a rack. The two architectural answers, pooled DRAM over CXL and compute pushed next to memory, are at very different stages of readiness.

At HPE Discover in Las Vegas this month, SK hynix set up a memory zone that read like a maturity gradient. On one side sat HBM4 and HBM3E stacks, the parts that feed today's accelerators. In the middle, two generations of its CXL memory module sat side by side: a first-generation 128GB CMM-DDR5 on CXL 2.0+, and a second-generation 256GB CMM-DDR5 built to CXL 3.2. One of those modules wasn't behind glass. It was running inside Liqid's CXL pooled-memory server, an interoperability demo of pooled capacity across a real host.
A few feet away sat the architecture SK hynix is proposing rather than shipping: processing-near-memory, presented as a concept that pairs CXL-pooled capacity with compute logic placed next to the DRAM. No product, no date, just a slide and an argument.
That contrast, a CXL module demonstrated in a real pooled-memory server configuration next to a near-memory compute architecture that exists only as a roadmap, is the cleanest illustration of the substrate question this pillar of Supercomputing News keeps returning to: which emerging architectures cross from frontier into the foundation of the stack, and which stall as interesting research. The same frontier-versus-foundation test we ran on what has actually flown versus what is slideware in orbital compute, and on the gap between chiplet architecture slides and production silicon, applies cleanly to memory. Pooled CXL memory is moving from demo into deployable infrastructure. Processing-near-memory is still earlier: technically promising, but gated by software, latency economics, and programming models. Both are aimed at the same wall.
The memory wall is old news. Wulf and McKee named it in 1995, when they observed that processor speed was pulling away from memory speed and that the gap would eventually dominate performance. Three decades later the gap hasn't closed; it has changed shape.
For large language model inference, the binding constraint often isn't floating-point throughput. For many transformer deployments, decode is memory-bandwidth-bound. Generating each token means reading a large slice of the KV cache out of HBM, and that working set grows with sequence length and model width, so the accelerator spends much of its time waiting on memory rather than doing math. Arithmetic intensity is low. The FLOPs sit idle while the bytes move.
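A back-of-envelope sketch in Python makes the bandwidth-bound claim concrete. The model shape (layer count, KV heads, head dimension), the 2-byte element size, and the 2 TB/s bandwidth figure are illustrative assumptions, not measurements of any particular model or accelerator, and the function names are hypothetical.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache for one sequence: keys plus values, for every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len


def min_decode_time_s(bytes_read_per_token, bandwidth_bytes_per_s):
    """Lower bound on per-token time if memory reads were the only cost."""
    return bytes_read_per_token / bandwidth_bytes_per_s


# Hypothetical model shape, chosen only for illustration.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 80, 8, 128
BANDWIDTH = 2e12  # assumed 2 TB/s of memory bandwidth

for seq_len in (4_096, 32_768, 131_072):
    kv = kv_cache_bytes(seq_len, N_LAYERS, N_KV_HEADS, HEAD_DIM)
    t = min_decode_time_s(kv, BANDWIDTH)
    print(f"{seq_len:>7} tokens: KV cache {kv / 2**30:6.2f} GiB, "
          f">= {t * 1e3:.3f} ms per token from KV reads alone")
```

The bound covers only KV-cache reads for a single sequence; model weights also have to be read for each token, so real per-token time is higher. The point of the sketch is the scaling: the bytes moved per token grow linearly with sequence length while the math per byte stays small.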
It helps to split the wall into two constraints that get conflated, because the two architectural responses map onto them differently. The first is bandwidth per GPU: how fast a single accelerator can pull bytes from its attached memory. That governs decode throughput and latency. The second is capacity per GPU: how much memory a single accelerator can reach at all. That governs context length, model size, and how many concurrent sessions you can hold. HBM4 is a direct, brutal answer to the first. It does almost nothing for the second.
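The capacity constraint can be sketched with equally simple arithmetic. Every figure below (memory per accelerator, weight footprint, KV bytes per token, context lengths) is a hypothetical input chosen for illustration, and `max_concurrent_sessions` is a made-up helper, not part of any serving framework.

```python
GIB = 2**30


def max_concurrent_sessions(capacity_bytes, weight_bytes, kv_bytes_per_token, context_len):
    """How many full-length sessions fit once the weights are resident."""
    free = capacity_bytes - weight_bytes
    if free <= 0:
        return 0  # the weights alone do not fit
    return free // (kv_bytes_per_token * context_len)


# Hypothetical accelerator and model, for illustration only.
CAPACITY = 192 * GIB      # assumed memory reachable by one accelerator
WEIGHTS = 140 * GIB       # assumed resident weight footprint
KV_PER_TOKEN = 327_680    # assumed KV-cache bytes per token

for context_len in (8_192, 32_768, 131_072):
    n = max_concurrent_sessions(CAPACITY, WEIGHTS, KV_PER_TOKEN, context_len)
    print(f"context {context_len:>7}: {n} concurrent sessions")
```

Bandwidth does not appear anywhere in the function. Under these assumptions, a faster stack changes how quickly each session decodes but not how many sessions fit, which is the sense in which the two constraints are separate.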
The HBM4 standard, JESD270-4, was finalized by JEDEC in 2025, and the numbers are a genuine generational jump. The interface widens to 2,048 bits per stack, double HBM3, running up to 8 Gb/s per pin, for as much as 2 TB/s per stack. Thirty-two channels. Stacks up to 16-high, up to 64GB each. (JEDEC's HBM4 announcement lists the parameters.) Micron has HBM4 on its production roadmap for 2026, and Nvidia's Rubin-generation accelerators are widely reported to be built around it. The premium is real: HBM4 is reported to carry a price increase of roughly 30% or more over HBM3E, and supply is allocated long before it's manufactured, the system-planning problem we covered when HBM scarcity became an allocation story rather than a supply one.
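The per-stack figure follows directly from the interface width and pin rate quoted above. A short Python check, with a hypothetical helper name; the 1,024-bit comparison uses the same pin rate purely to isolate the effect of width and is not a claim about shipping HBM3 speeds:

```python
def stack_bandwidth_gb_per_s(interface_bits, gbit_per_s_per_pin):
    """Peak bandwidth of one stack in decimal GB/s, from width and pin rate."""
    return interface_bits * gbit_per_s_per_pin / 8  # 8 bits per byte


hbm4 = stack_bandwidth_gb_per_s(2048, 8)
half_width = stack_bandwidth_gb_per_s(1024, 8)  # same pin rate, half the width

print(f"2,048-bit at 8 Gb/s: {hbm4:.0f} GB/s")
print(f"1,024-bit at 8 Gb/s: {half_width:.0f} GB/s")
```

The 2,048 GB/s result is the "as much as 2 TB/s per stack" figure; at a fixed pin rate, doubling the interface width doubles peak bandwidth.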
Here is what HBM4 isn't built to do: serve as rack-shareable pooled capacity. The stack is wide, short-reach, and bonded to one package over a silicon interposer. It is fast precisely because it is close, and being close means it belongs to one accelerator. There is no JEDEC standard for coherent HBM shared across packages, and the route JEDEC has defined for more per-package capacity, the stacked variants in the HBM4 family, keeps the memory inside the package. So HBM4 raises the ceiling on bandwidth per GPU while leaving capacity per GPU bounded by what you can physically stack on one part. If your problem is that eight accelerators each have memory the other seven can't touch, a faster stack doesn't help. You need a different topology.
The precision matters here, because the marketing around CXL has been muddy.
Compute Express Link did not invent memory pooling in its latest revision. Pooling arrived in CXL 2.0, which added switching and device partitioning, letting a single memory device be carved into disjoint segments that different hosts own. Coherent sharing, where multiple hosts work a common region, and true fabric routing came later. Those landed in CXL 3.0 (August 2022), which moved to the PCIe 6.0 PHY at 64 GT/s with PAM4 signaling, added multi-level switching, and defined port-based routing scaling toward 4,096 nodes. CXL 3.1 (November 2023) extended the switching and added trusted-execution security. Then CXL 3.2, released in December 2024 and the spec behind SK hynix's second-generation module, made its headline additions elsewhere. Its marquee features were not new pooling, sharing, or fabric semantics but the CXL Hotness Monitoring Unit (CHMU), a hardware mechanism for tracking which pages are hot so software can tier data between fast local memory and slower pooled memory, alongside device-management and security (TSP/IDE) enhancements.
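The revision history in that paragraph can be held as a small lookup table. The sketch below is a reading aid in Python, not a rendering of the specifications: the capability strings are informal labels summarizing the paragraph rather than spec terminology, the list is not exhaustive, and the names `CxlRevision` and `first_revision_with` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CxlRevision:
    version: str
    released: str            # as stated in the article; empty when not stated
    capabilities: frozenset  # informal labels, not spec terms


# Ordered oldest to newest, following the article's summary.
REVISIONS = (
    CxlRevision("2.0", "", frozenset(
        {"switching", "device partitioning", "pooling"})),
    CxlRevision("3.0", "August 2022", frozenset(
        {"coherent sharing", "fabric routing", "multi-level switching",
         "port-based routing", "PCIe 6.0 PHY"})),
    CxlRevision("3.1", "November 2023", frozenset(
        {"extended switching", "trusted-execution security"})),
    CxlRevision("3.2", "December 2024", frozenset(
        {"hotness monitoring (CHMU)", "device-management enhancements",
         "TSP/IDE security enhancements"})),
)


def first_revision_with(capability):
    """Return the earliest listed revision that introduced a capability."""
    for rev in REVISIONS:
        if capability in rev.capabilities:
            return rev.version
    return None


for cap in ("pooling", "coherent sharing", "hotness monitoring (CHMU)"):
    print(f"{cap}: CXL {first_revision_with(cap)}")
```

Reading the table this way keeps the question "which revision does this demo actually depend on" separate from "which revision is printed on the module".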
That distinction isn't pedantry. It tells you where the maturity actually sits. The capability doing the work in that Liqid demo, pooling disjoint capacity across a host, is a CXL 2.0 feature that has been available in shipping CXL 2.0 devices. The 3.2 module is newer and denser, but the architectural move it performs was specced years ago. Pooled and expansion-class CXL memory is the part closest to foundation.
The vendor evidence backs that up, and the clearest example isn't a memory maker at all. It's Marvell's Structera line, closer to a shipping foundation than anything else in this space. By Marvell's own figures, Structera A is a near-memory accelerator: up to 16 Arm Neoverse V2 cores on the controller, up to 200 GB/s of memory bandwidth, inline LZ4 compression, with the company citing about 5x the vector searches per second on memory-bound workloads. Structera X is the expansion play, a controller that lets a server reach far more DDR5 than its own channels allow, with DRAM-reuse economics that matter when you're redeploying older memory instead of buying HBM. Marvell has shown Structera A and X interoperating across AMD EPYC and Intel Xeon hosts and all three major DRAM suppliers. Its Structera S switches go further: the S 30260, a PCIe 6.0 / CXL 3.x part Marvell positions for up to 48TB of pooled memory at 4 TB/s across 260 lanes, is slated to sample in Q3 2026.
SK hynix and Samsung fill in the module side, SK hynix with CMM-DDR5 and Samsung with its own CXL memory expanders, and the controller ecosystem around them is real and growing, from Astera Labs' Leo and Taurus parts to Montage's switches. More than any single spec sheet, the interoperability story is what tells you pooling has arrived as something you can build on rather than something you demo once.
Pushing compute next to the memory is the more aggressive idea, and it's the one still stuck on the frontier side of the substrate question.
First, a distinction vendors routinely blur. Processing-in-memory (PIM), sometimes called compute-in-memory, puts arithmetic inside the memory array itself. Processing-near-memory (PNM) puts logic adjacent to the memory, on a logic die in the stack, in a buffer chip, or on the module, close enough to cut the data-movement tax without rebuilding the DRAM cell. Both often get branded "PIM" in press materials. They are not the same thing, and the near-memory variant is the one with a plausible near-term path, because it doesn't require reinventing how DRAM is manufactured.
The frontier-versus-foundation map across vendors is lopsided. SK hynix's AiM and AiMX accelerators sit at pilot and concept stage. Samsung's HBM-PIM (Aquabolt-XL), its AxDIMM, and its CXL-PNM work are prototypes and partner co-designs, the CXL-PNM piece demonstrated against Meta's DLRM recommendation workloads, none of them yet a broad, general-purpose server building block. Even Marvell's Structera A, the closest thing to a shipping near-memory part, is an accelerator with cores on a CXL controller. That is near-memory in the loosest sense, not the dense compute-beside-DRAM vision the PNM concept slides promise.
The constraint that keeps PNM frontier-side is the same one that limits all of CXL: latency. A pooled or near-memory access runs on the order of several hundred nanoseconds, against local DDR5 that often lands closer to 100 nanoseconds, and routing that access through a switch only widens the gap. The exact figures are workload- and platform-dependent, but the direction holds. Pooled and near-memory both trade latency for capacity, and whether that trade pays depends entirely on software that doesn't fully exist yet. OS and runtime support for tiered memory, deciding what lives in fast local DRAM versus slow pooled DRAM, is exactly what CXL 3.2's CHMU is built to feed, and it is still maturing. The near-memory programming model, where compute happens somewhere other than the CPU or GPU, is even less settled. The high-lane CXL 3.x switches that make rack-scale pooling practical are only slated to sample in Q3 2026. And the bill of materials for controllers, retimers, and switches makes this a hyperscale and high-memory-footprint play, not something that shows up in a general-purpose server next year.
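A toy model shows why page placement decides whether the latency trade pays. This is a minimal sketch in Python under stated assumptions: the 100 ns figure echoes the rough local-DDR5 number above, 300 ns is an assumed stand-in for "several hundred", the access counts are invented, and the per-page counts stand in for whatever hotness signal hardware or the OS provides. It is not the CHMU interface or any kernel API.

```python
def place_pages(access_counts, fast_capacity_pages):
    """Greedy placement: put the most-accessed pages in the fast tier."""
    ranked = sorted(access_counts, key=access_counts.get, reverse=True)
    return set(ranked[:fast_capacity_pages])


def mean_latency_ns(access_counts, fast_pages, fast_ns=100.0, slow_ns=300.0):
    """Access-weighted mean latency across the fast and slow tiers."""
    total = sum(access_counts.values())
    if total == 0:
        return 0.0
    fast_hits = sum(n for page, n in access_counts.items() if page in fast_pages)
    return (fast_hits * fast_ns + (total - fast_hits) * slow_ns) / total


# Hypothetical, heavily skewed access pattern: one page takes 90% of accesses.
counts = {"p0": 900, "p1": 50, "p2": 30, "p3": 20}

hot = place_pages(counts, fast_capacity_pages=1)
print(mean_latency_ns(counts, hot))    # hottest page kept local
print(mean_latency_ns(counts, set()))  # everything in the pooled tier
```

With this skewed pattern, keeping a single hot page local brings the mean from 300 ns down to 120 ns. With a flat access pattern the same placement would recover far less, which is why the payoff depends on both the workload and the quality of the hotness data.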
One sourcing point worth stating plainly: SK hynix attached no public timeline to its PNM concept, and inventing one would be dishonest. The shipping memory tier underneath all of this, SK hynix, Samsung, and Micron, is concentrated in Korea and the US, and those same suppliers have reorganized their output around data-center AI. That matters for anyone modeling supply risk, though it isn't the story here.
Pooled and expansion-class CXL memory is moving from demonstration into deployable infrastructure. The spec is years old, the silicon is interop-proven across CPU vendors and DRAM suppliers, and a 256GB module ran in a pooled server on a trade-show floor this month. Processing-near-memory isn't there yet. It has been demoed, but it is gated on latency, software, and cost, and there is no high-volume part you can order. SK hynix's booth showed both at once because both are true at once, a working module next to a concept slide.
That leaves the question neither the demo nor the spec sheets answer: whether the tiered-memory and near-memory software stack matures fast enough to make the latency trade worth taking outside the hyperscalers who can absorb the cost. That's less a silicon problem than a software-ceiling one, and the people writing the page-placement and near-memory runtimes will settle it well before the next module ships.