Last week, we introduced the Perlmutter supercomputer, the next-gen system at NERSC that will likely secure the #5 spot on the Top 500 list of the world’s most powerful machines. In that piece we kept the conversation to compute and capabilities, but the real star of the show is in the storage wings.
With 35 petabytes, the system will be the largest all-flash storage system we’ve seen to date, but scale is only one part of the story. Instead of using one of the NVMe-first file systems (from WekaIO or Vast Data, for example), the NERSC team, in conjunction with system partners Cray/HPE, will keep beefing up Lustre to meet demands that include more mixed workloads than previous systems, with AI/ML increasingly in the mix. This means getting Lustre to perform beyond its bandwidth-oriented roots and provide sufficient IOPS and metadata handling.
So far, with that file system on the floor, Glenn Lockwood, NERSC storage and HPC architect, says Lustre is humming along and taking great advantage of all-flash, with “good enough” IOPS and metadata performance and outstanding results for the high-bandwidth work so often at the heart of HPC. He explains that while they did spend time evaluating flash-oriented file systems, every test pointed to the efficacy of Lustre, and with the help of HPE/Cray and a new Lustre research center, they could fill in any gaps.
When making the choice, Lockwood says it helped that HPE/Cray offered Lustre as a supported file system option, but “in 2018, that was a fair question: why put Lustre on this shiny new NVMe? We partnered as part of the contract for Perlmutter for a center of excellence around Lustre to take full advantage of NVMe and that’s paid off. It’s on the floor today and it’s fast. Really fast.”
As for the flash-first file system vendors, their products hadn’t been tested at anything near Perlmutter scale; most at that point had only a few petabytes, max, under their belts. “Very few all-flash, single namespace parallel file systems have been deployed at 30 petabytes. That, and taking a big risk on a technology that’s only gone up to a few, was part of the reason we chose Lustre, along with the prospect of integrating into a complex environment.”
This is not NERSC’s first foray into the world of flash on large supercomputers, but its entry point was more experimental. On the previous-generation Cori supercomputer, NERSC, along with Los Alamos National Lab, was blazing trails for the burst buffer. Each had over a petabyte of all-flash burst buffer that let them test the concept and the requirements for flash at scale. However, for all the talk over the last several years about what burst buffers might do for storage performance and efficiency, the burst buffer was functional but took some extra effort from users to deliver real benefit.
“In the years that followed [that burst buffer installation] we were able to see usage for the burst buffer and although it was proven fast and could enable new science for certain users, the fact that it’s ephemeral meant users had to explicitly manage data in and out. That proved to be enough of a barrier, especially since the Lustre file system for that machine was all disk-based and didn’t require moving data with each job.” Lockwood adds that at the time they couldn’t afford all-flash for Cori, but as the planning for Perlmutter began, the equation changed rapidly.
Lockwood and team kept close track of the evolving costs of flash in 2018 as they planned the Perlmutter machine with Cray/HPE.
“We used the best available industry information about commodity pricing for flash and followed that on a quarterly basis through the ups and downs of the shift from 2D to 3D NAND and established a good sense of projected cost. Then we did a bit of gambling. We shared the risk with HPE and agreed we’d set a price for flash we’d pay based on what we thought 2020 would bring and if it was off, we’d revisit.” As it turned out, they were right on the money, and the cost of that all-flash system is now just 10-15% of the total system acquisition cost, right in line with historic cost breakdowns for other big machines at NERSC like Cori and Edison.
“The tradeoff is that the capacity is not as big, relatively at least, as Cori, which had 30PB of disk, while Perlmutter is 3-4X more capable but only has 35 petabytes of flash. But to make ourselves comfortable, we took a look at our workloads and found 30 petabytes was sufficient.”
Right now, those tradeoffs, from the file system (a more standard parallel one versus one designed with NVMe in mind) to the capacity and performance, appear sound. Lockwood says he continues to be impressed with Lustre’s performance on the all-flash system, and surprised that even out of the gate an unoptimized test run with Lustre on flash sang. But there is still quite a bit of work to do to get the most out of NERSC’s big flash investment.
“Software will continue to be the challenge. Lustre is optimized for bandwidth and the emerging contingent of workloads is IOPS- and metadata-intensive. There are tradeoffs in software the flash-first file system makers made but also in Lustre to get maximum bandwidth. There is no way to get those three aspects of performance (bandwidth, IOPS, and metadata) without a lot of work in software to reconfigure the underlying flash.”