Slingshot Held Performance Under AI Traffic Patterns That Collapsed InfiniBand by 5x on Production Exascale
ISC 2026 research on LUMI, Leonardo, CRESCO8: Slingshot held performance; InfiniBand collapsed 5x under Incast, the AI gradient-sync traffic pattern.

Facility architects selecting interconnects for mixed AI and supercomputing workloads now have peer-reviewed quantification of a fabric vulnerability invisible to traditional benchmarks: HPE Slingshot maintained performance within 5% of baseline across all congestion scenarios tested on LUMI, while NVIDIA HDR InfiniBand on Leonardo collapsed to 20% of baseline (a 5x slowdown) under steady Incast congestion, the traffic pattern that AI training workloads create during gradient aggregation bursts. The findings, from research accepted to ISC High Performance 2026, represent the first side-by-side experimental comparison of modern InfiniBand, Slingshot, and Ethernet ecosystems under both steady-state and bursty congestion on production exascale systems.
The research team, led by Lorenzo Piarulli and Daniele De Sensi at Sapienza University of Rome with co-authors from ENEA, CINECA, the Open Ethernet Innovation Hub, and Huawei, tested two EuroHPC systems, one Italian national system, and a research testbed: LUMI (ranked 9th on the TOP500 as of June 2025, HPE Slingshot-11 interconnect), Leonardo (ranked 10th as of June 2025, NVIDIA HDR InfiniBand), CRESCO8 (NVIDIA NDR InfiniBand), and Huawei's Nanjing lab testbed (RoCE with Network Scale Load Balance). The tests characterized congestion behavior under scenarios ranging from 8 to 256 nodes, using both steady-state congestion and bursty traffic spikes designed to replicate production supercomputing and AI workloads.
The Congestion Characterization Gap
Prior fabric performance studies focused on single architectures in isolation or relied on simulation rather than production hardware. No published research had directly compared InfiniBand, Slingshot, and Ethernet congestion control behavior at exascale under the traffic patterns that mixed AI and supercomputing workloads generate in practice. Traditional benchmarks measure peak bandwidth and latency under ideal conditions, not fabric resilience when competing flows saturate network resources. The research team designed a victim-aggressor methodology to expose how congestion control mechanisms respond when communication rate exceeds what the fabric can sustain without performance collapse.
Systems Tested
| System | Operator | Interconnect | Topology | Link Rate | Nodes | TOP500 Rank |
|---|---|---|---|---|---|---|
| LUMI | CSC (Finland) / EuroHPC | HPE Slingshot-11 | Dragonfly | 800 Gb/s (4x200 Gb/s) | 2,978 | 9 (June 2025) |
| Leonardo | CINECA (Italy) / EuroHPC | NVIDIA HDR InfiniBand | Dragonfly+ | 400 Gb/s (2x dual-port HDR100) | 3,456 | 10 (June 2025) |
| CRESCO8 | ENEA (Italy) | NVIDIA NDR InfiniBand | 1.67:1 blocking fat-tree | 200 Gb/s (dual-port ConnectX-7) | 760 | Ranked |
| Nanjing Lab | Huawei Research | RoCE with NSLB (CE9855) | 2-spine/2-leaf | 200 GE | 8 | N/A (research testbed) |
Four co-authors are Huawei employees, and the Nanjing testbed evaluates Huawei's CE9855 switch with Network Scale Load Balance. The headline findings involve LUMI, Leonardo, and CRESCO8 (non-Huawei systems).
Methodology
The team tested two congestion patterns and two traffic types. Steady-state congestion maintains constant traffic load; bursty congestion alternates between high-intensity communication bursts and idle periods. AlltoAll congestion distributes traffic across many paths, creating intermediate-switch bottlenecks. Incast congestion converges many senders to few receivers, creating edge-localized congestion at the destination.
The victim-aggressor model ran communication benchmarks (victim flows) while injecting congestion traffic (aggressor flows) on separate node sets sharing the same fabric. Tests measured the ratio of congested performance to uncongested baseline performance for victim flows. A ratio near 1.0 indicates the fabric protected victim flows from congestion. Ratios significantly below 1.0 indicate congestion control failure.
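As a concrete illustration of that setup, the sketch below splits MPI ranks into victim and aggressor groups sharing one fabric: victims time an Allreduce while aggressors drive Incast traffic at a single receiver. It assumes mpi4py is available; the group split, message size, and iteration count are illustrative choices, not the paper's benchmark configuration.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Split ranks into a victim group and an aggressor group on the same fabric.
is_victim = rank < size // 2
group = comm.Split(color=0 if is_victim else 1, key=rank)

payload = np.ones(1 << 22, dtype=np.float32)   # ~16 MiB per message (illustrative)
result = np.empty_like(payload)

comm.Barrier()
start = MPI.Wtime()
for _ in range(20):
    if is_victim:
        # Victim flow: a collective whose runtime is compared against an
        # uncongested baseline run (same loop with the aggressors idle).
        group.Allreduce(payload, result)
    else:
        # Aggressor flow: many senders converge on one receiver (Incast).
        if group.Get_rank() == 0:
            for src in range(1, group.Get_size()):
                group.Recv(result, source=src)
        else:
            group.Send(payload, dest=0)
elapsed = MPI.Wtime() - start

if is_victim and group.Get_rank() == 0:
    # Ratio = baseline_time / congested_time: near 1.0 means the fabric
    # protected the victim; well below 1.0 means congestion control failed.
    print(f"victim Allreduce time under congestion: {elapsed:.3f} s")
```

Running the victim loop once with the aggressors idle gives the uncongested baseline, and the congested/uncongested ratio follows directly from the two timings.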
Quantified Degradation by System and Pattern
LUMI maintained congested/uncongested performance ratios within 5% of 1.0 across all scenarios tested: steady and bursty congestion, AlltoAll and Incast patterns, 8 to 256 nodes. According to Sections V.C and VI.C of the paper, Slingshot's fine-grained flow tracking and per-flow congestion control effectively protected victim flows under both intermediate-switch and edge-localized congestion.
Leonardo handled AlltoAll congestion well, maintaining near-baseline performance when congestion occurred at intermediate switches. Under steady Incast congestion at 32 to 64 nodes, Leonardo's performance collapsed to 20% of baseline, a 5x slowdown. The paper attributes this to edge-localized congestion overwhelming InfiniBand's congestion control mechanisms. When many senders target few receivers, the destination network interface becomes the bottleneck, and path diversity (InfiniBand's strength for handling intermediate congestion) cannot mitigate the problem. The paper's bursty Incast tests (Section VI) showed degradation qualitatively, without reporting a comparable quantified ratio, confirming that InfiniBand's congestion control struggles under both steady-state and bursty edge congestion.
CRESCO8, running newer NDR InfiniBand than Leonardo's HDR generation, still dropped to 45% of baseline under AlltoAll congestion at 256 nodes and 60% of baseline under Incast. The paper notes that CRESCO8's 1.67:1 blocking fat-tree topology compounds congestion effects at scale compared to Leonardo's Dragonfly+ topology, and that newer network generation alone does not overcome topology and tuning constraints.
The Nanjing testbed demonstrated that RoCE maintained baseline performance under congestion when Network Scale Load Balance was enabled. With NSLB disabled, performance dropped to 67% of baseline, validating that Ethernet congestion control requires load balancing mechanisms beyond standard DCQCN.
The paper summarizes five observations. First, systems can exhibit congestion effects even without external congestion injection if the communication rate exceeds what congestion control can sustain. Second, CRESCO8 and Leonardo use similar network technology yet respond differently, which the authors attribute to differences in network generation, topology, and tuning. Third, bursty congestion exposes the limits of reactive congestion handling more than steady-state congestion does. Fourth, edge-intensive congestion (Incast) remains the dominant challenge across fabrics. Fifth, physical topology alone does not dictate saturation behavior; it emerges from the combined effect of technology generation, topology, congestion control algorithms, and adaptive routing tuning.
Why Incast Is the AI Pattern
The paper explicitly connects bursty Incast to AI workloads: "This pattern closely resembles communication phases in production HPC and AI workloads, such as distributed deep learning, where gradient aggregation follows each optimization step and produces periodic spikes in network utilization." In distributed training, worker nodes compute gradients independently, then synchronize by sending gradients to parameter servers or using AllReduce collectives. This creates periodic traffic bursts converging on receivers (the Incast pattern). Between synchronization steps, the network is idle or lightly loaded (the bursty characteristic). The steady-state Incast test isolates the fabric's fundamental congestion control capability under sustained edge congestion, the prerequisite for handling the bursty real-world case.
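As a minimal sketch of that rhythm, the loop below alternates a local backward pass (network idle) with an explicit gradient all-reduce (synchronization burst). It assumes torch.distributed launched via torchrun with a Gloo or NCCL backend; the model, batch size, and hand-rolled synchronization are illustrative, not a recipe for production training.

```python
import torch
import torch.distributed as dist
import torch.nn as nn

dist.init_process_group(backend="gloo")    # NCCL on GPU clusters; Gloo runs on CPU
world = dist.get_world_size()

model = nn.Linear(4096, 4096)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(10):
    x = torch.randn(32, 4096)
    loss = model(x).pow(2).mean()
    loss.backward()                        # compute phase: the network sits mostly idle

    # Synchronization burst: every worker ships its gradients at the same moment,
    # and the traffic converges on receivers -- the Incast-shaped spike.
    for p in model.parameters():
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        p.grad /= world

    optimizer.step()
    optimizer.zero_grad()                  # then the fabric goes quiet until the next step
```

With thousands of workers repeating this loop in lockstep, the all-reduce lines produce exactly the periodic, edge-converging spikes the paper's Incast tests emulate.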
The Meta Validation
Meta's SIGCOMM 2024 paper validates that AI training creates the exact congestion pattern this research identifies as the fabric killer. Meta reported pivoting away from standard DCQCN congestion control for their 24,000-GPU RoCE training clusters, instead using deep-buffer switches and application-level flow control. Meta's paper states that standard Ethernet congestion control cannot handle AI's bursty gradient aggregation patterns without severe performance degradation. Meta's operational experience at production scale confirms what the ISC research quantified in controlled experiments: bursty Incast is the AI workload signature, and standard fabric congestion control fails under that load.
What This Means for Facility Architects
Facility architects designing infrastructure for mixed AI and supercomputing workloads face a procurement decision backed by quantified failure modes. Traditional supercomputing workloads tend toward AlltoAll communication patterns (simulations exchange boundary data with neighbors, creating distributed traffic). AI training workloads create Incast patterns during gradient synchronization. A fabric that handles AlltoAll congestion well but collapses under Incast will perform acceptably for traditional supercomputing and fail for AI training.
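In MPI terms, the contrast can be sketched as follows (assuming mpi4py and illustrative buffer sizes): Alltoall spreads each rank's data across every other rank, while Gather funnels every rank's buffer to a single root, the Incast shape.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
size, rank = comm.Get_size(), comm.Get_rank()
chunk = np.full(1 << 20, rank, dtype=np.float32)            # ~4 MiB per rank

# Simulation-style exchange: every rank sends a slice to every other rank,
# spreading load across many paths (intermediate-switch pressure).
spread = np.empty(size * (1 << 20), dtype=np.float32)
comm.Alltoall(np.tile(chunk, size), spread)

# AI-style aggregation: every rank's buffer converges on rank 0,
# concentrating load at one edge (Incast pressure).
gathered = np.empty(size * (1 << 20), dtype=np.float32) if rank == 0 else None
comm.Gather(chunk, gathered, root=0)
```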
The research demonstrates that InfiniBand's congestion control handles AlltoAll scenarios effectively but cannot protect performance when edge congestion dominates. Slingshot's per-flow congestion control maintained performance across both patterns. For facilities running only traditional supercomputing workloads, InfiniBand's demonstrated AlltoAll resilience may suffice. For facilities planning mixed workloads or future AI adoption, the Incast failure mode is a blocking risk.
For facilities with existing InfiniBand deployments, the migration calculus involves switching costs, staff retraining, and application compatibility testing. Existing InfiniBand shops with workloads that remain AlltoAll-dominant may defer fabric transitions until natural refresh cycles. For greenfield builds planning to run AI training workloads alongside traditional supercomputing, fabric selection can optimize for edge congestion resilience from day one without migration overhead.
Multi-tenant environments compound the problem. If one tenant runs AI training workloads that create Incast congestion, victim flows from other tenants experience degraded performance even if those tenants are running traditional supercomputing applications. The research's victim-aggressor methodology directly models this scenario. Operations teams setting service level expectations for multi-tenant systems can use the quantified degradation ratios (5x slowdown at 32 to 64 nodes on Leonardo) to estimate performance variance under congestion.
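As a back-of-envelope illustration of that estimate, the snippet below combines Leonardo's measured 0.20 steady-Incast ratio with a hypothetical fraction of time a victim job overlaps an aggressor's bursts; both inputs are placeholders to replace with facility-specific values.

```python
# Illustrative inputs: the 0.20 ratio is the paper's Leonardo steady-Incast
# figure; the congested fraction is a made-up planning assumption.
baseline_step_s = 1.2          # victim job's uncongested step time
congested_ratio = 0.20         # congested/uncongested performance under Incast
congested_fraction = 0.30      # share of steps overlapping an aggressor burst

congested_step_s = baseline_step_s / congested_ratio
expected_step_s = ((1 - congested_fraction) * baseline_step_s
                   + congested_fraction * congested_step_s)
print(f"expected step time: {expected_step_s:.2f} s "
      f"({expected_step_s / baseline_step_s:.1f}x baseline on average)")
```

With these illustrative numbers, a job that overlaps congestion 30% of the time averages roughly 2.2x its baseline step time even though 70% of its steps run unimpeded.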
The UEC Connection
HPE contributed Slingshot's congestion management approach to the Ultra Ethernet Consortium, which shipped the UEC 1.0 specification in June 2025. UEC 1.0 explicitly includes Receiver Credit Congestion Control (RCCC) for Incast scenarios (the pattern where InfiniBand collapsed in this study). The research provides empirical validation that the congestion control mechanism UEC standardized works at production exascale under the exact traffic pattern that breaks competing fabrics.
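To illustrate the receiver-credit idea in the abstract (a toy model only, not the UEC 1.0 RCCC protocol or its wire format), the simulation below compares the receive-side backlog when every sender transmits freely versus when the receiver grants a bounded pool of credits; all constants are invented for the illustration.

```python
def peak_backlog(credit_limited: bool, ticks: int = 50) -> int:
    SENDERS, DRAIN, CREDITS = 32, 4, 8    # senders per tick, pkts drained per tick, credit pool
    backlog, credits, peak = 0, CREDITS, 0
    for _ in range(ticks):
        if credit_limited:
            granted = min(credits, SENDERS)   # receiver grants only what it can absorb
            credits -= granted
        else:
            granted = SENDERS                 # uncontrolled Incast: everyone sends at once
        backlog += granted
        peak = max(peak, backlog)
        drained = min(DRAIN, backlog)         # receiver drains its single link
        backlog -= drained
        if credit_limited:
            credits += drained                # credits return as packets are delivered
    return peak

print("peak receiver backlog, uncontrolled:  ", peak_backlog(False))
print("peak receiver backlog, credit-limited:", peak_backlog(True))
```

The uncontrolled case grows its backlog without bound, while the credited case stays pinned at the size of the credit pool, which is the intuition behind receiver-driven handling of Incast.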
UEC's goal is to bring supercomputing-grade fabric performance to commodity Ethernet, enabling facilities to use Ethernet switching silicon and optics while achieving congestion resilience comparable to proprietary fabrics. The ISC research demonstrates that Slingshot, which uses Ethernet as its physical layer, achieved that resilience on LUMI. If UEC-compliant hardware implementations deliver equivalent congestion control, the standardization thesis holds. If they do not, UEC becomes a specification without demonstrated production performance.
The NVIDIA Response
NVIDIA is responding to AI fabric demands with Spectrum-X Ethernet, bringing InfiniBand-derived innovations to Ethernet rather than addressing InfiniBand's documented congestion control weaknesses. According to TrendForce's October 2025 report "InfiniBand vs Ethernet: Broadcom and NVIDIA Scale-Out Tech War," NVIDIA's Spectrum-X1600 (102.4 Tbps switching capacity) will not ship until the second half of 2026, leaving NVIDIA a year behind Broadcom on high-bandwidth Ethernet silicon availability.
NVIDIA has not publicly claimed to have fixed InfiniBand's Incast congestion vulnerability. The research demonstrates that even NDR InfiniBand, NVIDIA's newer generation, exhibited significant performance degradation under Incast on CRESCO8. NVIDIA's product roadmap signals a strategic hedge: continue InfiniBand for customers with established deployments and AlltoAll-dominant workloads, while building Spectrum-X for customers prioritizing AI training and Ethernet commodity ecosystems.
What to Watch
UEC-compliant hardware demonstrations at SC26 in November 2026 will test whether standardized Ethernet can deliver Slingshot-equivalent congestion performance. UEC 1.0 shipped as a specification; hardware implementations are the validation milestone. If UEC Ethernet matches Slingshot's measured resilience under Incast, it confirms that the congestion control approach can be implemented on commodity Ethernet silicon. If demonstrations show significant performance gaps, it indicates that proprietary fabric advantages remain despite standardization.
NVIDIA's response to the documented Incast vulnerability will signal their strategic direction. NVIDIA could announce InfiniBand XDR generation improvements addressing edge congestion control, confirming they view InfiniBand as their long-term supercomputing fabric. Alternatively, NVIDIA could accelerate Spectrum-X positioning and marketing while leaving InfiniBand congestion control unchanged, signaling that Ethernet is their AI fabric answer and InfiniBand serves legacy and AlltoAll-dominant deployments. GTC 2027 or NDR/XDR product updates are the likely disclosure points.
HPE now has peer-reviewed evidence of Slingshot's congestion-resilience advantage over InfiniBand on EuroHPC systems. Whether HPE explicitly markets that advantage against InfiniBand in sales materials and competitive positioning will indicate how aggressively it intends to use the research findings. HPE Discover 2026 or subsequent product announcements are the venues to monitor.
The research team notes follow-on work extending this methodology to application-driven traces rather than synthetic benchmarks. Production workload validation would strengthen the findings by demonstrating that real applications experience the same degradation patterns the controlled experiments measured. ISC 2027 or SC27 are likely publication targets.
The Bottom Line
Facility architects selecting interconnects for mixed AI and supercomputing workloads now have quantified evidence that fabric congestion control behavior under Incast traffic (not peak bandwidth or baseline latency) determines whether AI training workloads will experience predictable performance or 5x slowdowns at scale. InfiniBand's path diversity handles intermediate-switch congestion effectively but does not protect against edge-localized congestion when many senders target few receivers. Slingshot's per-flow congestion control maintained performance across both congestion modes on production exascale hardware. The strategic question is whether UEC-standardized Ethernet can deliver equivalent resilience on commodity silicon, making Slingshot's demonstrated advantage available outside HPE's ecosystem, or whether congestion control at this level remains a proprietary fabric capability that standardization cannot replicate.
🤖 AI Disclosure
AI-assisted research and first draft. This article has been verified by a human editor.