In his second analysis of Government aid for exascale, Tom Wilkie reviews the role of research on system software, drawing on discussions at the International Supercomputing Conference (ISC’14), held in Leipzig at the end of June.
Around the developed world, Governments are pushing the development of ever-faster computers, as they believe that the wider dissemination of compute power will boost industrial and economic development.
Among the mechanisms being employed are international collaboration on research; direct contracts to industry; and indirect levers such as Government procurement policy, as reported here yesterday. All three strands came together at a day-long session on international cooperation in developing exascale technologies, held the day before the official opening of ISC’14.
Bill Harrod, programme manager for Advanced Scientific Computing Research at the US Department of Energy (DoE) and thus the man primarily responsible for driving forward the US exascale programme, noted that ‘the greatest area of such cooperation is system software’: developing novel operating systems; software tools for performance monitoring (particularly for energy-efficient computation); and system-management software that can cope with hardware failures, both in processor nodes and in memory and storage.
But this raises the question of how to ensure that the fruits of such international collaboration are taken up by the commercial vendors of HPC systems and used to foster the development of exascale machines.
Other aspects of HPC technology were on show at ISC’14, including an emphasis on cooling and energy efficiency. The commitment of Chinese companies to high-performance computing was very evident in Leipzig, while at the later Teratec meeting details began to emerge of the French national strategy.
At the session on international cooperation, Markus Geimer, from the Juelich Research Centre in Germany, discussed international collaboration on improving performance monitoring tools. Every doubling of scale reveals new bottlenecks in the performance of applications, he told the meeting.
A variety of software toolkits have been developed to monitor different aspects of the performance of an HPC machine, but they do not all talk to each other in a straightforward way. ‘From the user perspective, the whole tools business looks like a mess,’ he said. This has driven the realisation that research is needed into how the non-vendor toolkits -- software packages such as Vampir (originally from the Technical University of Dresden), Paraver (from Barcelona), and Oregon’s Tuning and Analysis Utilities (Tau) -- could be integrated.
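To give a sense of the kind of measurement these toolkits automate, the minimal sketch below hand-instruments an MPI program with a timer and reports the slowest rank, which is often where a scaling bottleneck first shows up. It is illustrative only: the compute_kernel routine and its workload are invented for the example, only standard MPI calls are used, and real tools such as Vampir, Paraver, and Tau instrument codes automatically and record far richer event traces than a single timing.

    /* Minimal sketch: time a compute region on every MPI rank and
       report the slowest one, the simplest form of the measurements
       that performance toolkits gather automatically. */
    #include <mpi.h>
    #include <stdio.h>

    /* Stand-in for the application's real work (invented for this example). */
    static void compute_kernel(void)
    {
        volatile double x = 0.0;
        for (long i = 0; i < 10000000L; ++i)
            x += (double)i * 1.0e-9;
    }

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Time the region of interest on this rank. */
        double t0 = MPI_Wtime();
        compute_kernel();
        double local = MPI_Wtime() - t0;

        /* The maximum over all ranks identifies the straggler that limits scaling. */
        double slowest = 0.0;
        MPI_Reduce(&local, &slowest, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("slowest rank took %.3f seconds\n", slowest);

        MPI_Finalize();
        return 0;
    }

In practice such hand-written timers quickly become unmanageable at scale, which is why the non-vendor toolkits discussed at the session exist, and why making them work together matters.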
But research to improve the tools had to be international, he warned, because otherwise ‘a single tools group will struggle to keep up with what is going on.’ One such international collaboration that he described was the ‘Virtual Institute - High Productivity Supercomputing’ (VI-HPS), which brings together nine partners from the EU, a team at the University of Oregon and one at the University of Tennessee in the United States, as well as collaborators from Lawrence Livermore National Laboratory.
The idea is to improve the quality and accelerate the development of complex simulation codes in science and engineering that run on highly parallel computer systems. The collaboration is therefore developing software tools to assist HPC programmers in diagnosing programming errors and optimising the performance of their applications.
Because the collaboration is supported by public funds, it has focused on non-vendor software that can be ported across different types of machines and different architectures. Although this was not explicitly part of the international collaboration, he said that some vendors had incorporated parts of the tools into their own software stacks. However, he warned, ‘Portability is not necessarily in the interests of the vendors.’ Intel would not be interested in tools that worked on IBM’s BlueGene, and Cray would be interested only in tools that applied to its own machines.
The involvement of the vendors is important because, the meeting was told, their software is well supported and well documented; whereas, as Geimer pointed out, documentation on the open-source side tends to be written by the developers themselves and may be too complex for the end-user.
Performance toolkits to aid the development of application codes on exascale machines thus become a case study – a particular instance – of the question posed more generally at the start of this series: ‘How do we make exascale happen?’
This is the point at which the third strand of Government policy weaves in with the first. James Ang, Manager of Scalable Computer Architectures at the US Sandia National Laboratories, reminded the session: ‘The Department of Energy has in the past required that some software tools be supported when the DoE does procurements of hardware systems.’
In an interview with Scientific Computing World during ISC’14, Pete Beckman, Director of the Exascale Technology and Computing Institute at the Argonne National Laboratory, returned to the point. If the DoE wanted supercomputers to be designed in a particular way, then it would tailor its procurement policy explicitly to ensure a chosen outcome: ‘The DoE will say “We are only going to buy machines that have this capability”.’ Coupled with the placing of R&D contracts with vendors, he said, ‘These two mechanisms have worked well for us.’
One particular example of the US Government’s procurement policy in action – the so-called CORAL programme – will be discussed in the final instalment of this review of how we can get to exascale.