Two key projects in the US and Europe are attempting to solve the power challenge of exascale high-performance computing (HPC) systems by focusing predominantly on the energy efficiency of software. This approach contrasts with that of many other projects, which aim to address efficiency at the hardware level through cooling technologies or accelerators such as Intel’s Xeon Phi and Nvidia GPUs.
Led by Argonne National Laboratory’s Peter Beckman, the three-year US project, dubbed Argo, is a multi-institutional effort to design and develop a platform-neutral prototype of an exascale operating system and runtime software. Researchers in Argonne’s Mathematics and Computer Science Division will collaborate with scientists from Pacific Northwest National Laboratory and Lawrence Livermore National Laboratory, as well as with several universities across the US.
Four key innovations are at the heart of the project: dynamic reconfiguration of node resources in response to workload, support for massive concurrency, a hierarchical framework for power and fault management, and a ‘beacon’ mechanism that allows resource managers and optimisers to communicate and control the platform. These innovations will result in an open-source prototype system that runs on several architectures and is expected to form the basis of production exascale systems deployed in the 2018–2020 timeframe.
The design is based on a hierarchical approach. A global view enables Argo to control resources such as power or interconnect bandwidth across the entire system, respond to system faults, or tune application performance. A local view is essential for scalability, enabling compute nodes to manage and optimise massive intranode thread and task parallelism and adapt to new memory technologies.
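To make the global/local split concrete, the sketch below shows how a global controller might apportion a system-wide power budget across nodes while each node enforces its own share locally. It is a minimal illustration only: all class and method names here are hypothetical, and Argo’s actual interfaces have not been published in this form.

```python
# Illustrative sketch of a hierarchical power controller, loosely modelled
# on Argo's global/local split. All names are hypothetical, not Argo's API.

class Node:
    """Local view: each node enforces its own share of the power budget."""
    def __init__(self, name, max_watts):
        self.name = name
        self.max_watts = max_watts
        self.cap_watts = max_watts

    def set_power_cap(self, watts):
        # A real runtime would programme RAPL or a similar hardware limiter.
        self.cap_watts = min(watts, self.max_watts)

class GlobalController:
    """Global view: apportions a system-wide budget across all nodes."""
    def __init__(self, nodes, system_budget_watts):
        self.nodes = nodes
        self.budget = system_budget_watts

    def rebalance(self, demand):
        # 'demand' maps node name -> relative workload weight reported
        # upward by the local runtimes (in the spirit of Argo's 'beacons').
        total = sum(demand.values()) or 1.0
        for node in self.nodes:
            share = demand.get(node.name, 0.0) / total
            node.set_power_cap(share * self.budget)

nodes = [Node(f"n{i}", max_watts=400) for i in range(4)]
ctl = GlobalController(nodes, system_budget_watts=1200)
ctl.rebalance({"n0": 3.0, "n1": 1.0, "n2": 1.0, "n3": 1.0})
print([(n.name, round(n.cap_watts)) for n in nodes])
# -> [('n0', 400), ('n1', 200), ('n2', 200), ('n3', 200)]
```

The point of the split is that only the coarse budget decision crosses the whole machine; everything finer-grained stays on the node, which is what makes the scheme scalable.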
To achieve such a whole-system perspective, the Argo team introduced the idea of ‘enclaves’: a set of resources dedicated to a particular service, and capable of introspection and autonomic response. Enclaves will be able to change the system configuration of nodes and the allocation of power to different nodes, or to migrate data or computations from one node to another. The enclaves will be used to demonstrate support for different levels of fault tolerance – a key concern of exascale systems – with some enclaves handling node failures by means of global restart and others supporting finer-grained recovery. The team will use Department of Energy science applications to evaluate the correctness, scalability, resilience, and completeness of Argo on both homogeneous and heterogeneous computer architectures.
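The enclave idea can be sketched as follows: each enclave owns a set of nodes and its own recovery policy, so a failure is handled at whatever granularity the enclosed application needs. Again, this is purely illustrative and does not reflect Argo’s actual implementation.

```python
# Hypothetical sketch of enclaves with per-enclave fault-tolerance policies.

from enum import Enum

class Recovery(Enum):
    GLOBAL_RESTART = 1   # restart every node in the enclave
    FINE_GRAINED = 2     # recover only the failed node

class Enclave:
    def __init__(self, name, nodes, policy):
        self.name = name
        self.nodes = set(nodes)
        self.policy = policy

    def on_node_failure(self, failed):
        # Autonomic response: the enclave introspects and reacts locally,
        # without involving the rest of the machine.
        if self.policy is Recovery.GLOBAL_RESTART:
            return sorted(self.nodes)   # restart the whole enclave
        return [failed]                 # restart just the failed node

climate = Enclave("climate", ["n0", "n1", "n2"], Recovery.GLOBAL_RESTART)
md = Enclave("molecular-dynamics", ["n3", "n4"], Recovery.FINE_GRAINED)

print(climate.on_node_failure("n1"))   # ['n0', 'n1', 'n2']
print(md.on_node_failure("n4"))        # ['n4']
```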
Peter Beckman commented that what distinguishes the team’s strategy from existing designs is its whole-system approach, which brings together multiple views – global optimisation, power management, code integration, lightweight threads, and interconnection fabrics – along with the corresponding software components. ‘We believe it is essential for addressing the key exascale challenges of power, parallelism, memory hierarchy, and resilience,’ he said. Version 1.0 of the Argo exascale operating system and runtime is expected at the end of the three years. Funded by the Department of Energy’s Office of Science to the tune of $9.75 million, the Argo project was foreshadowed in the article ‘Software speeds supercomputers?’.
The EU-funded project ADEPT (Addressing Energy in Parallel Technologies) is tackling energy consumption by developing a tool that lets users model and predict the performance and power usage of their code. ADEPT will investigate the implications of parallelism in programming models and algorithms, as well as the choice of hardware, for energy and power consumption. Much like Argo, the project will examine a variety of computing architectures, including CPUs, GPUs, ARM processors, and FPGAs. On the software side, it will look at how different implementations of the same algorithms behave in terms of power usage. EPCC, the supercomputing centre at the University of Edinburgh that coordinates the project, currently has two specific applications in mind.
Dr Michèle Weiland, a project manager at EPCC who put forward the initial project proposal, describes the first code as quite simple, but one that performs a very important mathematical operation for seismology modelling. EPCC currently has many versions of this code running on different hardware architectures. The second, developed at EPCC, is a much larger Lattice Boltzmann fluid dynamics code.
‘These are typical scientific applications and if we can show that we can model their performance and power usage for any type of hardware we’d like to simulate, we will be closer to understanding the trade-off between performance loss and energy efficiency gain,’ said Weiland. She added that more applications may be selected during the course of the project.
The ADEPT project aims to develop, within the next two to three years, a prototype tool that can quantify and predict how individual codes will perform, and what energy efficiency gains can be made, on any specified hardware. This could open up new possibilities in hardware procurement and provide educated estimates of how software will perform, enabling developers to modify their codes to reduce power consumption.
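As a rough illustration of the kind of trade-off ADEPT is targeting, a toy energy-to-solution model can compare the same code on different hardware from a predicted runtime and an average power draw. This is a deliberately simplified sketch, not ADEPT’s tool, and the hardware figures below are invented for illustration rather than measured.

```python
# Toy energy-to-solution model in the spirit of ADEPT's goal: predict how the
# same code trades performance for energy on different hardware targets.

def energy_to_solution(work_flops, sustained_flops, avg_power_watts):
    """E = P * t, with t estimated from the work and a sustained rate."""
    runtime_s = work_flops / sustained_flops
    return runtime_s, runtime_s * avg_power_watts

targets = {
    "fast-cpu":  (2.0e11, 150.0),   # (sustained flop/s, average watts)
    "low-power": (5.0e10, 25.0),
}

work = 1.0e13  # total floating-point operations in the kernel
for name, (rate, watts) in targets.items():
    t, e = energy_to_solution(work, rate, watts)
    print(f"{name:10s} runtime {t:7.1f} s  energy {e/1000:6.1f} kJ")
```

In this toy model the low-power target takes four times as long (200 s versus 50 s) but uses roughly a third less energy (5.0 kJ versus 7.5 kJ) – exactly the performance-loss-versus-efficiency-gain trade-off Weiland describes.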
Coordinated by EPCC, ADEPT has brought together partners from Uppsala University, Sweden; Alpha Data, UK; Ericsson AB, Sweden; and Ghent University, Belgium, in a bid to capitalise on the embedded industry’s skills in power management and on classic HPC techniques for parallelising code. The project is scheduled to run for three years from 1 September 2013.