William Thigpen, assistant division chief of HPC operations for NASA’s Advanced Supercomputing Division, discusses the role the facility plays in helping the organisation to meet its goals.
NASA requires high-performance computing to further its missions of exploration, research and scientific discovery.
To help meet those goals, NASA’s High-End Computing Capability (HECC) Project provides high-end computing, storage, and associated services to enable scientists and engineers supporting NASA missions to employ large-scale modelling, simulation, and analysis.
But meeting such varied user requirements is not an easy task. To deliver as much computing capability and performance as possible, HECC has developed a modular supercomputing facility that houses its latest on-site supercomputer, Aitken.
Aitken is a petascale supercomputer housed in a modular supercomputing facility at NASA’s Ames Research Center.
Following the success of the prototype modular facility housing the Electra supercomputer, the initial module and racks of the Aitken supercomputer were installed in summer 2019, the first part of a one-acre expansion site that has the potential to hold 16 modules for computing and data storage.
The aim for this facility, and the system housed within it, is to reduce power consumption and cooling costs so the money saved can be better spent on delivering computing performance, storage and ancillary services to the scientists and engineers using Aitken.
The success of the project means that this system is now being scaled up and will soon surpass the Electra system housed at the NAS facilities. William Thigpen, assistant division chief of HPC operations for NASA’s Advanced Supercomputing Division, states: ‘Very soon, within a couple of months, Aitken will surpass Electra in terms of work being done, and will also pass Pleiades’ theoretical peak performance.’
‘We are in the process of expanding Aitken; it will grow to more than eight petaflops. That system is in this new modular facility. Electra was in our prototype facility, Aitken is going into the larger facility, and we are still finding it very beneficial.’
It has previously been reported that the power usage effectiveness (PUE) of this new facility was in the range of 1.03 to 1.05. Thigpen stressed that the saving has dropped slightly now that the facility is production-ready.
‘It is still a very large reduction, but it is not in the 90 per cent range; it is more of an 80 per cent saving.’
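To make these figures concrete: PUE is the ratio of total facility energy to the energy consumed by the computing equipment itself, so the overhead spent on cooling and power distribution is PUE minus one. The Python sketch below shows how a modular facility at a PUE of roughly 1.03 to 1.07 corresponds to overhead savings in the 80 to 90 per cent range Thigpen describes; the conventional baseline PUE of 1.35 is an illustrative assumption of ours, not a figure from NASA.

```python
# Illustrative PUE arithmetic; the 1.35 baseline is an assumption,
# not a number quoted by NASA or by Thigpen.
def overhead_saving(baseline_pue: float, modular_pue: float) -> float:
    """Fraction of non-IT (cooling/distribution) energy saved vs a baseline facility."""
    return ((baseline_pue - 1.0) - (modular_pue - 1.0)) / (baseline_pue - 1.0)

print(f"{overhead_saving(1.35, 1.03):.0%}")  # ~91% of the overhead saved
print(f"{overhead_saving(1.35, 1.07):.0%}")  # ~80% at a slightly higher production PUE
```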
‘I think one of the interesting things about the Aitken expansion is that, for the first time in a very long time, we are going to be putting in AMD processors instead of Intel processors. We are looking at this as a very large increase in compute capability.’
One of the primary missions of Thigpen and his colleagues at NAS and the HECC project is to provide computing resources for NASA users. Ultimately, this means selecting whichever technologies can deliver the most work or scientific results over a given time.
As Thigpen mentioned, for this latest upgrade Aitken will use AMD processors, and this switch is expected to deliver a large boost in performance. The current system delivers around one petaflop and, once upgraded, will reach in the region of eight petaflops of peak performance.
‘When we look at the type of processors that we are going to put into a system we look at how much work is going to be done by the hardware we are installing. That is where AMD really won out.
‘We looked at the work that was going to be done against the total cost of ownership, which includes the cost of electricity and all those things. But the big win is the performance that we are seeing per node with the AMD processor.
‘Unlike many computing centres, we are actually billed for our power so the money we save can go directly towards more computing.’
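Thigpen’s selection criterion, work delivered per unit of total cost of ownership with electricity included, can be sketched as a simple comparison. All figures and the `work_per_dollar` helper below are hypothetical, illustrating the shape of the calculation rather than NASA’s actual procurement model:

```python
# Hypothetical TCO comparison between two candidate compute nodes.
# Every number here is an illustrative assumption, not a vendor or NASA figure.
def work_per_dollar(perf: float, node_cost: float, power_kw: float,
                    price_per_kwh: float, years: float = 5.0) -> float:
    """Work delivered per dollar of TCO: hardware cost plus lifetime electricity."""
    hours = years * 365 * 24
    electricity = power_kw * hours * price_per_kwh
    return (perf * hours) / (node_cost + electricity)  # e.g. teraflop-hours per dollar

# A faster, hungrier 128-core node versus a cheaper, slower alternative.
dense = work_per_dollar(perf=4.0, node_cost=12_000, power_kw=0.5, price_per_kwh=0.10)
lean = work_per_dollar(perf=2.5, node_cost=8_000, power_kw=0.3, price_per_kwh=0.10)
print(dense > lean)  # True: the per-node performance win outweighs the extra power
```

Because NAS is billed directly for its power, the electricity term feeds straight into a comparison of this kind rather than being absorbed by a host institution.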
The changing needs of scientists and researchers
Providing supercomputing resources is one of the primary roles of the HECC but increasingly Thigpen and his colleagues are also exploring the benefits of other computing technologies for AI and ML workloads and quantum computing. As we are seeing with many HPC centres, the workloads and needs of scientists and researchers are shifting and resource providers must adapt to meet the demands of their users.
‘This [Aitken] is still a CPU-based system, but we are looking at new technologies. We have also been growing our GPGPU environment. There is another increase in that environment coming in the next few weeks.
‘We have found, across the agency, a lot more interest in doing machine learning, and using GPUs to do that,’ stated Thigpen. ‘The AMD processors give us 128 cores on a single node, and that is something our users really want. The other thing that we are hearing from our users is that they really want to run larger jobs, and that is being addressed with Aitken, but with the GPU enhancements we are also trying to expand the number and size of our GPUs.
‘We are looking at the NEC vector machine and we are evaluating the Arm processors from Fujitsu. There are things coming onto the marketplace that will provide big differences in the amount of work they are able to do,’ added Thigpen.
Data-driven research
HECC is becoming increasingly involved with the research community, to ensure not only that computing systems are made available, but also that an ecosystem exists to support future work and the changing needs of NAS facility users.
‘As we look at instruments that are being put into space, one thing that is true across the board is that these instruments will be pushing more and more data down to earth. It needs to be processed,’ stated Thigpen.
‘We also have to keep in mind as all of this is occurring, the role that citizen science is going to be taking in helping NASA meet its mission. There is a big drive to handle data much better and in a more open means,’ commented Thigpen.
The knock-on effect of these changing requirements is that people running advanced computing facilities must continue to look further ahead to meet demands as they appear.
Thigpen added: ‘In the past we have not been looking as far out into the future.’
The reason is that, as these environments change to accommodate AI or increased GPU usage, codes must also change to take advantage of the new hardware. Adapting and editing a tried-and-tested code base is no easy task, so this takes time and resources. ‘There needs to be a preparation of the algorithms that are going to be used so they can operate well in the new environments that are coming out.’
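The adaptation Thigpen alludes to can range from a small array-library swap to a full algorithmic redesign. As a minimal sketch of the simpler end, assuming a NumPy-based kernel and the CuPy library as a GPU back end (the example is ours, not drawn from any NASA code):

```python
# Minimal sketch of moving an array kernel from CPU to GPU.
# Assumes the CuPy library and a CUDA-capable GPU are available;
# this is an illustration, not code from any NASA application.
import numpy as np

try:
    import cupy as xp   # GPU arrays with a NumPy-compatible API
except ImportError:
    xp = np             # fall back to the CPU if no GPU stack is present

def jacobi_step(u):
    """One Jacobi relaxation sweep on a 2D grid; identical code on CPU or GPU."""
    return 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] + u[1:-1, :-2] + u[1:-1, 2:])

u = xp.zeros((1024, 1024))
u[0, :] = 1.0           # fixed boundary condition along one edge
for _ in range(100):
    u[1:-1, 1:-1] = jacobi_step(u)
```

Production codes rarely port this cleanly; data movement, solver libraries and decades of accumulated Fortran make the algorithm preparation Thigpen describes a genuinely resource-intensive effort.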
In order to meet these challenges, facility managers must work with teams of researchers to identify where resources should be used to best support the user community and help it to transition.
Thigpen explained: ‘We are doing a combination of things, from running user workshops, to continuing our five-year plans, and working with users who are not our historical user base, to make sure that we can really address NASA’s advanced computing needs.’
Overcoming challenges
Delivering research services to a wide array of different types of users requires managers and providers to balance several competing challenges.
In order to provide a balanced service that meets the requirements of as many people as possible, those competing demands have to be weighed against each other. ‘The challenge is being able to address the daily, routine-type jobs and still facilitate users who are trying to scale up and really want to be at the edge of high-end computing for NASA,’ said Thigpen.
‘That is a challenge and it is one of the areas that we have had feedback from the user community that they would like us to be able to better serve people who are trying to run at scale.’
‘There are also challenges in how people want to use the system. People are coming on board who are used to environments like containers and other tools that have been available to them, whereas your traditional users just want to run some Fortran codes.
‘We have a broad spectrum of types of jobs running: from all forms of aeronautics, to very engineering-focused work being done for Artemis and for the International Space Station, and on the other side we have research- and discovery-focused things like astrophysics and heliophysics.’
‘How much data they use, how they want to share that data, the size of jobs they run and how many jobs they run create a wide spectrum, and our job is to try and meet as many of the agency’s requirements as we can,’ Thigpen concluded.