Each year the Association for Computing Machinery awards the Gordon Bell Prize to recognise outstanding achievement in high performance computing (HPC). This year, five of the six finalists are running on the GPU-based Summit system at Oak Ridge National Laboratory, or on the Sierra system at Lawrence Livermore National Laboratory.
The finalists’ research ranges from AI to mixed-precision workloads, with some taking advantage of the Tensor Cores available in the latest generation of Nvidia GPUs. This highlights the impact of AI and GPU technologies, which are opening up not only new applications to HPC users but also the opportunity to accelerate mixed-precision workloads on large-scale HPC systems.
Jack Wells, director of science at the Oak Ridge Leadership Computing Facility (OLCF), comments: ‘All five of those projects have already demonstrated that they can use the system at full scale. That is pretty remarkable, because we are still kicking the bugs out of the system software and they were only using our test and development file system, because we are still in final acceptance of the full 250 petabyte file system.’
Deep learning and AI research has been adopted relatively quickly by the HPC and scientific communities, and this is demonstrated by the ACM Gordon Bell Prize finalists. This relatively new opportunity in computing – HPC systems loaded with GPUs, high-bandwidth interconnects and Tensor Cores that can accelerate mixed-precision or deep learning workloads – is already changing the way research is carried out on large-scale HPC clusters.
In 2017 the Gordon Bell prize was awarded to a team from China that worked on the Sunway TaihuLight system to simulate the most devastating earthquake of the 20th century.
Using the Sunway TaihuLight, which at the time was ranked as the world’s fastest supercomputer, the team developed software that sustained 18.9 petaflops while creating 3D visualisations of the devastating earthquake that occurred in Tangshan, China, in 1976.
The team was awarded the prize for their application, which included innovations that helped the researchers achieve greater efficiency than had previously been possible running software on Titan, the predecessor to the Summit system, and on the Chinese TaihuLight supercomputer.
The recipients of the prize noted that they had worked on several innovations. These included a customised parallelisation scheme that employs 10 million cores efficiently; a memory scheme that integrates on-chip halo exchange through register communication, an optimised blocking configuration guided by an analytic model, and coalesced DMA access with array fusion; and on-the-fly compression that doubles the maximum problem size and improves performance by a further 24 per cent.
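The team’s compression scheme is tied to the Sunway architecture, but the underlying idea – trading numerical precision for memory footprint so that a larger problem fits on the machine – can be sketched in a few lines of Python. The snippet below is purely illustrative and is not the team’s implementation; the field size and the precisions used are assumptions.

```python
import numpy as np

# Illustrative sketch only: keeping a single-precision field in half precision
# halves its memory footprint, which is the basic idea behind an on-the-fly
# compression scheme doubling the maximum problem size. The Gordon Bell team's
# scheme is architecture-specific; this is not their implementation.

field = np.random.rand(2048, 2048).astype(np.float32)   # working-precision field (16 MB)
stored = field.astype(np.float16)                        # compressed copy (8 MB)
print(field.nbytes // 2**20, stored.nbytes // 2**20)     # prints: 16 8

# Decompress a tile on the fly, just before it is needed for a stencil update.
tile = stored[:256, :256].astype(np.float32)
```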
The Tangshan earthquake was one of the most damaging natural disasters of the 20th century. The earthquake in Hebei province resulted in between 242,000 and 700,000 deaths.
Understanding the phenomenon and how it might impact the local population is hugely important to preventing extensive damage and loss of life during future disasters. In order to create as accurate a model as possible, the team developing the simulations of the Tangshan earthquake used input data covering the entire spatial area of the quake, a surface region of 320 km by 312 km, extending 40 km below the earth’s surface.
In 2018 several of the nominees for the award are employing deep learning, AI or mixed-precision computation that runs quickly on the Tensor Cores – a new hardware addition included with the Nvidia V100 GPUs to accelerate deep learning applications.
The finalists from 2018 include researchers working on weather and climate, earthquake simulation, genomics, electron microscopy and research that aims to quantify the lifespan of neutrons.
One of the finalist projects, ‘Development of genomics algorithm to attain exascale speeds’, was developed by a team from Oak Ridge National Laboratory led by Dan Jacobson.
The team achieved a peak throughput of 2.31 exaops, the fastest science application ever reported. Their work compares genetic variations within a population to uncover hidden networks of genes that contribute to complex traits.
‘This is a very compute intensive task, because they are basically comparing the DNA of all of the samples for a given population. In the past it was thought to be this heroic thing that no one had enough compute to run but it’s intrinsically an integer code, it’s not really even a floating point code because you are comparing DNA sequences,’ added Wells.
Wells noted that, when Jacobson and his team converted the integer code to mixed-precision FP16 code that could run on the Tensor Cores, they were able to achieve a four-times speedup over the integer code running on the Summit system.
Overall, the speedup from both algorithmic and hardware improvements has enabled the application not only to reach an unprecedented 2.31 exaops, but also to begin uncovering hidden networks of genes that contribute to complex traits by comparing genetic variations across whole populations.
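The full application is far more involved, but the core trick Wells describes – turning an integer comparison of genotypes into a half-precision matrix multiply of the kind Tensor Cores are built for – can be illustrated with a short NumPy sketch. The encoding, the sample counts and the use of NumPy on a CPU are assumptions for illustration, not the Oak Ridge team’s actual kernel.

```python
import numpy as np

# Hedged sketch: pairwise comparison of genetic variants across samples,
# recast as a matrix product so it maps onto the FP16 maths that Tensor Cores
# accelerate. Encoding and sizes are assumptions; this is not the team's code.

n_samples, n_sites = 8, 1000
rng = np.random.default_rng(0)

# Assumed encoding: 0, 1 or 2 copies of the minor allele at each site.
genotypes = rng.integers(0, 3, size=(n_samples, n_sites))

# One-hot encode each genotype so a dot product counts exact matches.
onehot = np.eye(3, dtype=np.float16)[genotypes]     # (samples, sites, 3)
flat = onehot.reshape(n_samples, -1)                # (samples, 3 * sites)

# All pairwise match counts in a single half-precision matrix multiply.
# FP16 represents integers exactly up to 2048, so counts up to 1000 are exact.
matches = flat @ flat.T                             # (samples, samples)
print(matches.astype(int))
```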
Geetika Gupta, product lead for HPC and AI at Nvidia, said: ‘Accelerators enable multi-precision computing that fuses the highly precise calculations to tackle the challenges of high-performance computing with the efficient processing required for deep learning.
‘There are many scientific applications that will be driven forward by the convergence of AI and deep learning. Something that started on the consumer side has been picked up by the scientific community, which is a great example of two different use cases coming together,’ commented Gupta.
Another Summit project has already broken the exaflop barrier running a deep learning application using FP16 computation. The team from Lawrence Berkeley National Laboratory, led by Prabhat, is working on a project entitled ‘Identification of extreme weather patterns from high-resolution climate simulations’, which aims to analyse how extreme weather is likely to change in the future.
The team has already achieved a performance of 1.13 exaflops, the fastest deep-learning performance yet reported.
‘That is a more traditional deep learning neural network application of feature classification but they were able to use tensor cores as they were intended to be used for deep learning and we were able to scale up to an exaflop,’ said Wells.
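For readers unfamiliar with how FP16 Tensor Core arithmetic is typically switched on in a deep learning workload, the sketch below shows a generic mixed-precision training step in PyTorch. It assumes a CUDA-capable GPU and is not the Berkeley team’s climate-segmentation code; the toy model and data shapes are invented for illustration.

```python
import torch
from torch import nn

# Generic mixed-precision training step: convolutions run in FP16 on Tensor
# Cores inside the autocast region, while a GradScaler keeps small FP16
# gradients from underflowing. Toy model and data, not the finalists' code.

model = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(32, 3, 3, padding=1)).cuda()
optimiser = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

images = torch.randn(8, 16, 128, 128, device='cuda')        # stand-in for climate fields
labels = torch.randint(0, 3, (8, 128, 128), device='cuda')  # per-pixel weather classes

with torch.cuda.amp.autocast():
    logits = model(images)
    loss = nn.functional.cross_entropy(logits, labels)

scaler.scale(loss).backward()
scaler.step(optimiser)
scaler.update()
```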
Wells also explained that mixed precision was not something that had been discussed much in the early development of the Summit system, or more widely by the HPC community. There have been examples of mixed-precision HPC workloads in the past, but the approach has not been particularly widespread within the community.
Wells added: ‘When we said that Summit was a 200 petaflop machine, it was left moot how much precision you have’.
Wells also stressed that these kinds of techniques were likely to become more widespread, as HPC users look for ways to increase performance: ‘With the end of Dennard scaling, getting performance out of computers is more difficult and people are going to start looking in the cracks and corners of the room for more opportunity to get performance.’
‘The earthquake research team from the University of Tokyo, which is also a Gordon Bell Prize finalist, described a trans-precision algorithm but they are actually not using the Tensor Cores yet because their problem is really more communication-bound’, added Wells.
Their project, ‘Use of AI and transprecision computing to accelerate earthquake simulation’, used Summit to expand on an existing algorithm. The result was a four-fold speedup, enabling the coupling of shaking ground and urban structures within an earthquake simulation. The team started its GPU work with OpenACC, later introducing CUDA and AI algorithms to improve performance.
‘They are doing an unstructured finite element analysis of earthquakes coupled to buildings, so they can develop a greater understanding of how Tokyo station, for example, might react in the event of a serious earthquake.
‘They use it to reduce their communication burden. They will probably start to use the Tensor Cores, but they have not got around to it yet.
‘Right now, they are going from FP64 to FP32, then to FP21, which they just do in software, and down to FP16 in order to actually move the numbers around and complete their communication, and then they iteratively refine towards the end of the calculation. For them, it’s about reducing the communication burden,’ concluded Wells.
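Wells’ description amounts to shipping data across the machine in a reduced-precision format and then recovering full accuracy through iterative refinement. The toy NumPy sketch below illustrates that idea only; it is not the Tokyo team’s solver, and since FP21 is a custom format with no NumPy equivalent, FP16 stands in for the reduced precision used during communication.

```python
import numpy as np

# Toy model of trans-precision communication with iterative refinement:
# residuals are 'sent' in FP16 (a quarter of the bytes of FP64), and repeated
# correction steps still recover a high-accuracy answer.
# Not the Tokyo team's solver; FP16 stands in for their FP21 format.

def send_reduced(values_fp64):
    """Model the communication step: 8-byte values travel as 2-byte FP16."""
    return values_fp64.astype(np.float16).astype(np.float64)

rng = np.random.default_rng(1)
n = 512
A = np.diag(np.full(n, 4.0)) + rng.normal(scale=0.1, size=(n, n))  # well-conditioned system
b = rng.normal(size=n)

x = np.zeros(n)
for _ in range(6):                      # iterative refinement loop
    r = b - A @ x                       # residual computed locally in FP64
    r = send_reduced(r)                 # communicated at reduced precision
    x += np.linalg.solve(A, r)          # correction from the reduced-precision residual
    print(np.linalg.norm(b - A @ x))    # residual drops by orders of magnitude
```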