This week at SC23, Microsoft Azure’s “Eagle” made its debut on the semi-annual Top500 record. Microsoft has a machine that has extra HPC compute energy than all however two techniques on the record. Maybe the best half: you need to use it. Whereas conventional HPC clusters are in-built locations like nationwide labs and entry is restricted to those that are accepted by the establishments that run them, Microsoft’s is a cloud resolution.
Microsoft Azure Eagle is a Paradigm Shifting Cloud Supercomputer
Eagle edged out the previous #1 Supercomputer Fugaku by Fujitsu and RIKEN that debuted in 2020. With 14,400 NVIDIA H100 GPUs (14400/8 = 1800 nodes?), InfiniBand, and Intel Xeon Sapphire Rapids CPUs, the brand new system turned in a #3 consequence solely round 4% lower than #2 on the November 2023 record.
At SC23 we bought to see a Microsoft Azure NDv5 node on show on the present ground.

This method is very large, however at its coronary heart is a block of 8x NVIDIA H100 GPUs on a HGX H100 8-GPU platform.

We had been advised that these are utilizing Intel Xeon Platinum processors from the 4th Gen Intel Xeon Scalable “Sapphire Rapids” line so they’re DDR5 processors.

In addition they haven’t simply Infiniband networking but in addition areas for Microsoft’s NIC resolution.

Placing collectively effectively over a thousand of those techniques with the entire networking, storage, and the entire different bits makes a #3 supercomputer.
Ultimate Phrases
A enjoyable X or Twitter nugget is that Microsoft’s benchmarking workforce solely bought three days with the system to show in a Linpack run for a Top500 consequence earlier than the system was then made accessible to prospects. Distinction that with the Aurora, the #2 system on this record, the place the run occurred on about half of the cluster and even that half was not operating at its ultimate anticipated efficiency. That is an superior consequence by Microsoft and actually exhibits a paradigm shift.

Microsoft has not only a succesful system, however a prime 3 system for conventional HPC workloads. The “soiled” secret is that Microsoft shouldn’t be utilizing the normal 4x GPU node. As an alternative, it’s utilizing the NVIDIA HGX H100 platform. That makes the Eagle supercomputer additionally a significant AI system utilizing at this time’s hottest GPUs and AI platforms. Whereas Microsoft’s HPC workforce was targeted on offering cloud options to huge FP64 on-prem supercomputers, it additionally constructed a large AI supercomputer given its structure competing face to face with NVIDIA Eos in MLPerf Coaching v3.1 outcomes on solely part of the Azure machine.

The dimensions and efficiency of Microsoft’s providing is plain at this level. Eagle is utilizing its Azure taste of a typical NVIDIA HGX H100 platform making it additionally a mainstream AI supercomputer, as a substitute of some type of HPC-only unique system structure. Maybe the perfect, and the paradigm-shifting nature of the brand new machine is the entry. One doesn’t must undergo a website choice, 6+ month set up and bring-up, and so forth to make use of the system increasing the accessibility of HPC. Eagle is usable by many shoppers in a cloud mannequin as a substitute of getting restricted entry by institutional gatekeepers (past Microsoft) which once more will increase the attain of HPC, as long as prospects pays for his or her use.