Lenovo became a powerhouse in the HPC and supercomputer spaces back in 2014, when it bought IBM’s System x server division for $2.1 billion in a deal that also saw it license storage and system management software from Big Blue. The acquisition vaulted the company into a highly competitive field that includes the likes of Hewlett Packard Enterprise, Dell EMC, IBM (with its Power systems), Fujitsu, and Atos.
The HPC space has only become more competitive in the intervening years. HPE bolstered its capabilities by acquiring first SGI two years later for $275 million and then supercomputer pioneer Cray for $1.3 billion in 2019, while Chinese vendors like Inspur and Sugon have been on the rise.
Still, Lenovo has been able to build on the foundation of the IBM deal to grow its HPC business. In May the company said that during its fiscal fourth quarter, its Data Center Group saw revenue jump 32 percent year-over-year to $1.6 billion, with record revenue in a number of areas, including HPC and artificial intelligence. In addition, on the latest Top500 list of the world’s fastest supercomputers, released in November 2020, 182 of the systems were built by Lenovo, accounting for 36.4 percent of the supercomputers on the list. Coming in at number two was Inspur, with 66 systems.
Number 15 on the list is SuperMUC-NG, a water-cooled supercomputer housed at the Leibniz Supercomputing Centre (LRZ) (in the feature image above) at the Bavarian Academy of Sciences and Humanities in Germany. Work on the system started in 2017 and was completed a year later. It is powered by 6,500 two-socket ThinkSystem SD650 “thin nodes” with 305,856 Intel 3.1 GHz Xeon Platinum 8174 cores, delivering almost 26.9 petaflops of performance.
Workloads running on the supercomputer have ranged from simulation and modeling to newer compute- and memory-intensive tasks, including automating image and pattern recognition in planet observations, processing climate data, and working with medical visuals, health records and demographic data.
Now SuperMUC-NG is about to undergo a round of upgrades that will enable the supercomputer to better leverage AI in running the advanced simulation, modeling, machine learning and data analysis jobs that are becoming more commonplace, and to do so in a more power-efficient way. Lenovo this week announced the launch of Phase Two of the system. The work will not only create a more powerful system to handle these advanced workloads but also help accelerate the push to make AI more accessible to organizations outside of the traditional HPC realm, according to Scott Tease, vice president and general manager of HPC and AI at Lenovo.
“AI is increasingly being seen as a tool in HPC workloads, large and small,” Tease tells The Next Platform. “Researchers use AI to do a greater depth of data analysis and spot anomalies or variations in those mega-large data sets. This applies to large-scale research work on bioinformatics, climate or space as well as in CAE and CFD workloads used in engineering and manufacturing around the world. We’re entering an era where compute power itself may no longer be the gating factor to innovation and research. The exascale-driven innovations we are seeing will usher in an era where more people than ever have access to large HPC performance capabilities, both at a peta- and exascale level.”
At the same time, the industry is “entering an era where adjectives like ‘sustainable’, ‘green’ and ‘carbon neutral’ are being associated with everything. HPC is no different. As long as there are problems to solve that require compute power, customers like LRZ are going to move innovation alongside of energy efficiency and the use of efficient liquid cooling will grow in demand,” he says.
Phase Two will rely heavily on the latest technology from Intel. It will include 240 compute nodes, which Tease calls ThinkSystem SD6450 “fat nodes.” Each will house two Intel “Sapphire Rapids” Xeon Scalable processors (due out later this year) and four of Intel’s upcoming “Ponte Vecchio” Xe-HPC GPUs, which are designed for supercomputers. The system will deliver more than 13 petaflops of performance, he says. Overall, the compute nodes in the second phase will deliver four times the performance-per-watt of those in Phase One.
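To put those figures side by side, here is a minimal back-of-the-envelope sketch in Python. It uses only the node counts and petaflops numbers quoted in this article, treats "more than 13 petaflops" as a lower bound, and assumes the two performance figures are measured on a comparable basis, which the article does not confirm. The function and variable names are illustrative only.

```python
# Figures quoted in the article (vendor-stated and approximate).
PHASE_ONE_PFLOPS = 26.9   # "almost 26.9 petaflops"
PHASE_ONE_NODES = 6_500   # two-socket CPU-only thin nodes
PHASE_TWO_PFLOPS = 13.0   # "more than 13 petaflops", used as a lower bound
PHASE_TWO_NODES = 240     # CPU + GPU nodes
PERF_PER_WATT_GAIN = 4.0  # Phase Two vs. Phase One, as stated by Lenovo


def teraflops_per_node(petaflops, nodes):
    """Average per-node throughput in teraflops (1 petaflop = 1,000 teraflops)."""
    return petaflops * 1_000 / nodes


phase_one = teraflops_per_node(PHASE_ONE_PFLOPS, PHASE_ONE_NODES)
phase_two = teraflops_per_node(PHASE_TWO_PFLOPS, PHASE_TWO_NODES)
per_node_ratio = phase_two / phase_one

# Assumption: if each node is ~13x faster but only 4x more efficient,
# the remainder has to come from drawing more power per node.
implied_power_ratio = per_node_ratio / PERF_PER_WATT_GAIN

print(f"Phase One: {phase_one:.1f} teraflops per node")
print(f"Phase Two: {phase_two:.1f} teraflops per node")
print(f"Per-node performance ratio: {per_node_ratio:.1f}x")
print(f"Implied per-node power ratio: {implied_power_ratio:.1f}x")
```

On these numbers, each Phase Two node delivers roughly 13 times the throughput of a Phase One node. Combined with the stated fourfold gain in performance-per-watt, that would imply each new node draws a little over three times the power of an old one, which is a rough inference rather than a figure Lenovo has published.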
A key feature of the upgrade will be Lenovo’s Distributed Asynchronous Object Storage (DAOS) system, which runs on high-performance solid-state technologies like SSDs and NVMe rather than on spinning disks. This enables DAOS to bypass the OS to drive ultra-low latency, which is “crucial for the mega-large datasets used in modelling and simulation HPC workloads,” Tease says.
DAOS will use Intel’s “Ice Lake” Xeon SP CPUs integrated into Lenovo’s ThinkSystem 1U SR630 V2 platform. It will provide a petabyte of data storage and deliver fast throughput for large data volumes.
To drive power efficiency, Lenovo will also bring in its Neptune direct warm-water cooling solution (below) that will be connected to DAOS through a high-speed network. Liquid cooling has been used in datacenters in limited fashion for years, but the rise of AI, the expansion of HPC workloads into traditional enterprise IT environments and the increased density in datacenters have renewed interest in the technology.
Liquid cooling is more efficient and less expensive than air cooling, and the Neptune system can remove about 90 percent of the heat from a compute system, which reduces overall power consumption and enables processors to run at peak performance. The benefits are particularly pronounced for warm-water cooling, Tease says.
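As a rough illustration of what removing about 90 percent of the heat means for a datacenter's air handling, the following Python sketch splits a rack's heat load between the water loop and the room air. The 40 kW rack power is a hypothetical value chosen for the example, not a Lenovo or LRZ figure, and the function name is made up for illustration.

```python
def split_heat(it_power_kw, liquid_fraction=0.90):
    """Return (heat_to_water_kw, heat_to_air_kw) for a given IT load.

    liquid_fraction is the share of heat captured by the water loop;
    0.90 reflects the "about 90 percent" quoted for Neptune.
    """
    if not 0.0 <= liquid_fraction <= 1.0:
        raise ValueError("liquid_fraction must be between 0 and 1")
    to_water = it_power_kw * liquid_fraction
    return to_water, it_power_kw - to_water


RACK_POWER_KW = 40.0  # hypothetical rack load, for illustration only

to_water, to_air = split_heat(RACK_POWER_KW)
print(f"Heat carried away by water: {to_water:.1f} kW")
print(f"Heat left for air cooling:  {to_air:.1f} kW")
```

Under that assumption, the room's air conditioning has to handle only about a tenth of the rack's heat, which is where much of the claimed energy saving would come from.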
“Warm water cooling inherently saves energy and cost because it does not require chillers to cool the water before it is pumped through the system,” he says. “The water can be reused for things like building heat or sent to an absorption chiller, where the stored energy can be recycled to create cold water for other purposes. In either case, the warmer the water the better – reuse of this energy source takes what used to be a waste product (heat output) and turns it into a valuable commodity. In addition to the cost, [ecological] and operational benefits, warm water cooling allows our systems to support higher power/performance CPUs and GPUs beyond what air cooling allows.”
HPC organizations take notice when a high-profile site like LRZ uses liquid cooling for x86 servers, according to Tease. Lenovo now has Neptune systems running in North America, Europe, Asia and Australia. Many are able to reduce the number of racks needed to run the same workloads. Now a single rack of Lenovo’s SD650 systems with GPUs can deliver the performance of a supercomputer that would rank among the fastest 300 on the Top500 list, a trend that will expand access for researchers needing such supercomputing capabilities, he says.
A hurdle to liquid cooling for corporate datacenters is the expense and difficulty of installing the necessary plumbing. But solutions like Lenovo’s ThinkSystem SR670 V2 use liquid cooling technology that is contained inside the server itself, eliminating the need for plumbing. Such designs show HPC and enterprise organizations that liquid cooling can be used inside air-cooled datacenters.
The DAOS system for Phase Two will arrive in the last quarter of this year, with the compute system being delivered in the second quarter of 2022.