Fugaku is the fastest supercomputer in the world as of 2021, developed with both performance and generality in mind under an ‘applications first’ philosophy. The system was meticulously planned over 10 years, with hardware and software co-designed to make the most efficient use of the available resources.
The resulting system has topped the major benchmarks for the last 12 months, but Satoshi Matsuoka, head of the RIKEN Center for Computational Science, explains that applications, not benchmarks, were the target. ‘Fugaku in absolute terms is the fastest and largest supercomputer ever built,’ states Matsuoka. ‘It has been built with an applications-first strategy, and as a result it had to be very general-purpose and highly performant at the same time, which created a very hard challenge.’
The Fugaku supercomputer is housed at RIKEN, a network of research centres across Japan with main campuses in Wako, Tsukuba, Yokohama, Kobe and Harima. RIKEN’s activities can be divided into four main categories: strategic research centres, research infrastructure centres, the Cluster for Pioneering Research, and the Cluster for Science, Technology and Innovation Hub. Fugaku will support researchers across these disciplines, as well as industrial and academic partners across the world.
‘We had to build a machine that was very fast, but it also had to be general-purpose. It could not use a special-purpose accelerator; because of the broad range of applications and users, we had to make it very general-purpose – and these can be quite contradictory goals,’ notes Matsuoka. However, ‘We achieved this through the participation of all the interested parties here in Japan, as a national project taking high risks. As a result, we came up with the A64FX chip, co-developed between Fujitsu and RIKEN, which is the fastest chip built for HPC in that it is two to three times faster than the latest Intel offerings in HPC across many applications. At the same time, it is several times more power-efficient, while also being a general-purpose Arm processor.’
‘To put Fugaku in context, it is equivalent to about 20 million smartphones that run the same code,’ states Matsuoka. ‘This means that one or two Fugakus would be equivalent to the entire IT infrastructure of Japan.’
While topping benchmarks was not a goal of the teams behind Fugaku, they did want to ensure wide-ranging application performance and a significant speed-up over the previous-generation RIKEN supercomputer, the K computer. ‘Of course, achieving peak flops or some benchmark was not our priority,’ stressed Matsuoka. ‘In fact, we identified applications important for the sustainability of the nation and the world – in areas like healthcare, environmental disasters, materials and manufacturing – which had been running on the K computer. The goal was to achieve a large speed-up over the performance of the K computer, and some applications have achieved this magnitude, running more than 200 times faster.’
General purpose performance
The development of this new HPC processor is a huge achievement, but the success did not stop there: several other innovations help this supercomputer deliver world-beating performance. Chief among them is the Tofu Interconnect, developed to support the bandwidth and sustained performance of this hugely complex machine.
‘While the A64FX chip is a huge success story, the interconnect in the Fugaku system also plays a large role in delivering sustained performance in real-world applications,’ said Matsuoka. ‘The interconnect has both the network interfaces and the switches embedded. That is to say, there are no external switches on Fugaku. Rather, each CPU has a 10-port switch, so there are 160,000 switches in Fugaku – equal to the number of nodes – or 1.6 million ports.’
‘The sheer injection bandwidth of the machine is six petabytes per second, which is of the order of ten times the internal traffic of all the world’s data centres, according to Cisco,’ Matsuoka continued. ‘In terms of system architecture, it is the world’s first ultra-scale disaggregated architecture. The cores, memory and everything else can act independently. For example, memory from any part of the system can be injected into the L2 cache of any processor without intervention from any other processor.’
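As a rough sanity check on those numbers, the short sketch below simply multiplies out the figures quoted above – 160,000 nodes, a 10-port switch per CPU, and roughly six petabytes per second of total injection bandwidth. The per-node injection figure it derives is an estimate from the article’s numbers, not an official specification.

```cpp
#include <cstdio>

int main() {
    // Figures quoted in the article (approximate, not official specifications).
    const double nodes = 160000.0;            // one A64FX CPU per node
    const double ports_per_node = 10.0;       // embedded 10-port Tofu switch per CPU
    const double total_injection_pb_s = 6.0;  // ~6 PB/s aggregate network injection

    // Total switch ports across the machine: 160,000 x 10 = 1.6 million.
    const double total_ports = nodes * ports_per_node;

    // Implied per-node injection bandwidth (derived estimate, PB/s -> GB/s).
    const double per_node_gb_s = total_injection_pb_s * 1.0e6 / nodes;

    std::printf("switch ports       : %.1f million\n", total_ports / 1.0e6);
    std::printf("per-node injection : ~%.1f GB/s (derived)\n", per_node_gb_s);
    return 0;
}
```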
Define exascale
Exascale has been a long-sought goal for HPC because it represents the next order of magnitude in supercomputer performance since the first petaflop systems were announced more than ten years ago. While there have been significant and wide-ranging advances in technology since that time, the elusive exaflop has remained out of reach. The types of computation conducted on these systems have also changed, however. Ten years ago, before the advent of AI and ML technologies, FP64 or double precision was the ubiquitous standard for many HPC applications, but single-precision FP32 and half-precision FP16 are increasingly used, particularly for AI. This means that some systems can deliver an exaflop of reduced-precision performance in certain applications, even if their FP64 figure has not yet reached an exaflop.
‘People sometimes contest us when we say that we offer the first exascale supercomputer,’ notes Matsuoka. ‘But what do we mean by exascale? Well, there are several definitions. If you think exascale is FP64 performance, then an exaflop would be represented by the peak or achieved LINPACK performance, and of course for Fugaku this is not the case. Its Rmax is 0.44 exaflops.’
Matsuoka continued: ‘However, very few applications correlate with FP64 dense matrix linear algebra performance in this context. So this may not actually be a valid definition when you think about the capability of a supercomputer.’
‘The second possible definition is any floating-point precision performance that exceeds an exaflop on some credible benchmark or application. In that respect, Fugaku is an exaflop machine because, for example, in HPL-AI we achieved two exaflops. However, Oak Ridge National Laboratory’s Summit machine had already achieved two exaflops in its Gordon Bell-winning application. So, although Fugaku is an exascale machine by this definition, it was not the first,’ added Matsuoka. ‘I think the most important definition, when we started thinking about these exascale machines, was to achieve almost two orders of magnitude speed-up compared with the state of the art in the 2011/2012 timeframe, when we had 10 to 20 petaflop supercomputers.
‘As I have demonstrated, Fugaku is about 70 times faster across these applications than the K computer, which was an 11-petaflop Rmax machine. Because of the “application first” nature of the machine, we believe this is the most important metric. We have achieved a speed-up of almost two orders of magnitude over our last-generation machine, which you would call a 10- to 20-petaflop machine. Being application first, this was the most important definition, and in this context we have achieved what was expected of an exascale machine,’ Matsuoka concluded.
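Taking only the figures quoted above – a 0.44-exaflop FP64 Rmax, roughly two exaflops on the mixed-precision HPL-AI benchmark, an 11-petaflop K computer and a target of roughly two orders of magnitude application speed-up – the illustrative sketch below works through how Fugaku measures up against each of the three definitions Matsuoka describes. The arithmetic is purely for illustration.

```cpp
#include <cstdio>

int main() {
    // Figures quoted in the article, in exaflops unless noted.
    const double fugaku_fp64_rmax = 0.44;   // definition 1: FP64 LINPACK Rmax
    const double fugaku_hpl_ai    = 2.0;    // definition 2: mixed-precision HPL-AI
    const double k_computer_rmax  = 0.011;  // K computer: 11 petaflops Rmax
    const double target_speedup   = 100.0;  // definition 3: ~two orders of magnitude

    // Definition 1: an FP64 exaflop. 0.44 < 1.0, so Fugaku falls short here.
    std::printf("FP64 Rmax            : %.2f EF (exascale? %s)\n",
                fugaku_fp64_rmax, fugaku_fp64_rmax >= 1.0 ? "yes" : "no");

    // Definition 2: more than an exaflop on some credible benchmark.
    std::printf("HPL-AI (mixed prec.) : %.2f EF (exascale? %s)\n",
                fugaku_hpl_ai, fugaku_hpl_ai >= 1.0 ? "yes" : "no");

    // Definition 3: ~100x speed-up over the 2011/2012 state of the art.
    // In raw Rmax terms Fugaku is 0.44 / 0.011 = 40x the K computer; the
    // 'two orders of magnitude' claim refers to measured application speed-ups.
    std::printf("Rmax ratio vs K      : %.0fx (application target ~%.0fx)\n",
                fugaku_fp64_rmax / k_computer_rmax, target_speedup);
    return 0;
}
```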
In a presentation at the recent ISC High Performance conference, Lori Diachin, deputy director of the US Exascale Computing Project (ECP), discussed the work going on in the US to prepare for the first wave of exascale-class machines. Diachin also discussed some of the challenges the ECP faces in preparing applications and software for the three exascale systems at the US national labs.
‘The Exascale Computing Project has three primary technical focus areas. The first is application development. In this area, we have selected six co-design centres, which are focused on the computational motifs that many applications can take advantage of – so, for example, adaptive mesh refinement kernels, high-order discretisation kernels, particle methods, and so on,’ said Diachin. ‘The second major technical area is software technologies. In this area, we are working very hard to develop an integrated software stack so that applications can achieve their full potential on the exascale computing platforms. We have 71 different products that are focused on runtime systems, programming models, math libraries, data visualisation tools and the software ecosystem writ large.
‘Finally, we have our hardware and integration area, which is focused on the delivery of our products at the DOE facilities. It has also been focused on partnerships with six of the US HPC vendors, looking at the research and development needed for exascale node and system design,’ added Diachin.
Specialised challenges
Whereas Fugaku was designed around CPU-only hardware to ensure the general-purpose nature of its application suite, the US exascale systems rely on large numbers of GPUs to deliver performance. This has thrown up additional challenges for the ECP – not only because it must cater for GPU technology, but because several vendors and competing technologies are being used across the DOE systems. Between them, these machines will use GPUs from Nvidia, AMD and Intel, which means the ECP has to optimise for all three technologies.
‘One of the things that’s very interesting in these systems, and that’s really been driving a lot of the work within the ECP, is the fact that the accelerator node architectures the United States has been focused on are changing from an Nvidia-only ecosystem to an ecosystem with a wider variety of GPUs – in particular, the exascale systems will have AMD and Intel GPUs … and that’s driving a lot of our work on performance portability,’ explained Diachin. ‘In our application portfolio we have some 24 applications that were chosen because they are of strategic importance to the Department of Energy,’ Diachin continued.
‘They range in topics from national security applications, such as next-generation stockpile stewardship codes, to energy and economic security, scientific discovery, Earth system modelling, and healthcare in partnership with the National Institutes of Health (NIH). With these 24 applications and the six co-design projects, we have more than 50 separate codes comprising over 10 million lines of code collectively. When the project began, many of these codes were focused on MPI, or MPI plus OpenMP, and were largely CPU-based, with a small number of them starting their work with GPU-accelerated computing. So, since the beginning of the project in 2016, each code has had to develop a unique plan to make progress towards deployment on the exascale systems, moving away from CPU-only implementations to performance-portable GPU implementations.’
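To illustrate the kind of migration Diachin describes, the sketch below shows the same simple kernel written first as a conventional CPU OpenMP loop and then as an OpenMP target-offload loop that can run on a GPU. This is only one of several possible portability routes – ECP codes variously use CUDA, HIP, SYCL, Kokkos, RAJA and other approaches – and the kernel and function names here are invented for illustration.

```cpp
#include <vector>
#include <cstdio>

// An axpy-style kernel, first in the traditional CPU form and then in a
// directive-based offload form.

void axpy_cpu(double a, const double* x, double* y, int n) {
    // Classic MPI+OpenMP style: threads across CPU cores.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
}

void axpy_offload(double a, const double* x, double* y, int n) {
    // OpenMP target offload: the same loop mapped to an accelerator.
    // On a system without a GPU, compilers typically fall back to the host.
    #pragma omp target teams distribute parallel for \
            map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
}

int main() {
    const int n = 1 << 20;
    std::vector<double> x(n, 1.0), y(n, 2.0);
    axpy_cpu(3.0, x.data(), y.data(), n);      // y = 2 + 3*1 = 5
    axpy_offload(3.0, x.data(), y.data(), n);  // y = 5 + 3*1 = 8
    std::printf("y[0] = %.1f\n", y[0]);        // expect 8.0
    return 0;
}
```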
‘When we talk about preparing applications for the exascale systems, there is a hierarchy of adaptations that has needed to happen – what do we need to do at the lowest levels to rewrite and optimise our codes?’ states Diachin. ‘There has been a lot of data layout work, loop reordering, kernel flattening, and so on. It’s really focused on those lower-level operations that can really improve the performance of different kernels in the application.’
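To make the data-layout point concrete, here is a hypothetical before-and-after sketch: an array-of-structures particle layout is converted to a structure-of-arrays layout so that a kernel touching only one field reads contiguous memory, which generally helps both CPU vectorisation and GPU memory coalescing. The struct and field names are invented for illustration.

```cpp
#include <vector>
#include <cstdio>

// "Before": array of structures (AoS). A loop over only the x field strides
// through memory, pulling in unused y, z and mass values with every cache line.
struct ParticleAoS { double x, y, z, mass; };

double sum_x_aos(const std::vector<ParticleAoS>& p) {
    double s = 0.0;
    for (const auto& q : p) s += q.x;   // strided access: 8 of every 32 bytes used
    return s;
}

// "After": structure of arrays (SoA). The same loop now reads a contiguous
// array, which vectorises well on CPUs and coalesces well on GPUs.
struct ParticlesSoA {
    std::vector<double> x, y, z, mass;
};

double sum_x_soa(const ParticlesSoA& p) {
    double s = 0.0;
    for (double xi : p.x) s += xi;      // unit-stride, fully contiguous access
    return s;
}

int main() {
    const int n = 1000;
    std::vector<ParticleAoS> aos(n, {1.0, 2.0, 3.0, 4.0});
    ParticlesSoA soa{std::vector<double>(n, 1.0),
                     std::vector<double>(n, 2.0),
                     std::vector<double>(n, 3.0),
                     std::vector<double>(n, 4.0)};
    std::printf("AoS sum: %.1f, SoA sum: %.1f\n", sum_x_aos(aos), sum_x_soa(soa));
    return 0;
}
```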