How has software been impacted by the Exascale Computing Project (ECP)?
Our Department of Energy sponsors, the Advanced Scientific Computing Research (ASCR) Programme Office in the Office of Science and the Advanced Simulation and Computing Office in the National Nuclear Security Administration, realised that we [the HPC community] needed to make a concerted investment in software. That's something that, as application and software developers, we couldn't agree with more. Often these sorts of investments are driven by the passion and commitment of the scientists. In other words, at that time we had not developed the value system as much as we should have, meaning the value placed on outstanding software developers, although I think we are developing it now.
At ECP, our sponsors realised that they needed to make a heavy investment in software many years before the arrival of a new system. That has always happened in bits and pieces in the past. But this is a massive investment in software that is not boutique software for one system. It is the software tools and technology that will be the scientific and engineering tools for our nation and, in many cases, the world for decades to come. A good application can live for decades through many systems. And so this isn't an ephemeral, one-and-done exercise. It's like constructing a large scientific instrument. In this case, our applications are the beginning of a new app store for the nation. To draw on that Apple analogy, our software stack is a new dynamic iOS. This stuff is going to be around long after I retire.
During my time at the Department of Energy, there have always been investments in software development, but generally, at least in my career, never in such a concerted, integrated way. By concerted, I mean ample investment for innovation, agility, and trial and error. We're all about agile software development. Sometimes that means “fail early and often”. The other thing is to bring all the activities together under one roof. And there were some growing pains there and some culture clashes initially, but that has given us a significant return on investment.
How can an organisation encourage innovative software development?
The integration has borne more fruit than I ever thought possible. With ECP, we were afforded the opportunity to build integrated teams. And in some cases, we forced it because some domain scientists work in their bubbles. Application developers are often left to their own devices. If I am building a simulation tool to simulate a nuclear reactor, I am probably comfortable looking for other nuclear engineers and away we go. Well, it's not second nature to go after mathematicians, computer scientists, other computational scientists, statisticians, data scientists and so on to help us.
That doesn't imply that [traditional approaches] have not been successful. But using ECP to bring people together has paid substantial dividends. For example, we have built abstraction layers, one out of Sandia called Kokkos, and one out of Lawrence Livermore called RAJA, that demystify and, to some extent, hide the complexity of heterogeneous hardware.
Many of our application teams didn't know about that development because they were at other labs or institutions, or in some cases, they didn't care because they didn't think it would help them. But you put together a bunch of world-leading scientists and Principal Investigators, and they're competitive. They see other teams doing things that are perceived as moving ahead, or maybe, in reality, they are moving ahead, so they're one-upping each other and competing. It's like building a superstar soccer team. After a few years, they gel and realise that the whole is greater than the sum of the parts.
The fascinating thing with ECP is that we put together this huge project, and we've got 85 different teams working with each other. We have developed an inherent codependency that I've never seen before. We are working with sponsors right now to ensure that this codependent ecosystem is sustained well beyond ECP, and I'm very confident that will happen.
How do accelerators impact the development of software to support scientific computing?
When we deployed Titan, the first system at Oak Ridge to use GPUs from Nvidia, it was a leap of faith. It was a high-risk proposition. At the time, the GPUs from Nvidia had no error correction and were only 32-bit. We worked with them to add 64-bit support, error correction, and so on. Well, that's what everybody has now, as well as mixed-precision performance. The point is, accelerators are here to stay. I call it an accelerator, not a GPU, because what we're seeing in hardware, which is very exciting, is hardware designed to accelerate certain operations.
The point about ECP is recognising that accelerated node computing is here to stay. And whether it's your laptop, desktop, a cluster down the hall, the cloud, or Frontier, a node will be an eclectic mix of hardware. If you don't have software that recognises, “This is a special piece of hardware. I know what to do here. I know how to lay out my data. I know how to utilise it to exploit all those floating point operations (FLOPs),” you're going to be in trouble.
The point of ECP wasn't just to say, “Hey, look at us. We've got 24 cool applications, come one, come all.” It was more about first-mover applications: applications of mission-critical interest to the DOE that needed to exploit the hardware, but that also show the way for the hundreds of other applications, so that we can lower the barrier to entry.
Douglas Kothe, Director of DOE's Exascale Computing Project (ECP), has more than three decades of experience in conducting and leading applied research in computational applications designed to simulate complex physical phenomena in the energy, defense, and manufacturing sectors.