Chris Gottbrath, principal product manager at Rogue Wave Software, presents a case study on debugging with TotalView on Beacon
With the launch of the Intel Xeon Phi coprocessor, developers have gained exciting opportunities to exploit many-core processor technology. Because the Intel Xeon Phi coprocessor shares many architectural features and much of the development tool chain with multi-core Intel Xeon processors, it is fairly simple to port a program to the new coprocessor. However, taking full advantage of the power offered by the Intel Xeon Phi coprocessor requires expressing a level of parallelism that demands a re-thinking of algorithms. This is exactly the challenge that the National Institute for Computational Sciences (NICS) at the University of Tennessee, US, is working to overcome with its Beacon Project.
The Beacon Project is an ongoing research project funded by the US National Science Foundation and the University of Tennessee to explore the impact of emerging computer architectures on computational science and engineering. Currently, nine teams associated with the Beacon Project are exploring the impact of the Intel Xeon Phi coprocessor on scientific codes and libraries, with approximately two dozen more open-call applicants about to begin work. The programs being optimised as part of the project include magnetohydrodynamics, plasma physics, cosmology, chemistry, quantum chromodynamics, and bioinformatics applications.
The Beacon system, which received the number one ranking on the November 2012 Green500 list, offers access to 48 compute nodes and six I/O nodes joined by an FDR InfiniBand interconnect providing 56 Gb/s of bi-directional bandwidth. Each compute node is equipped with two Intel Xeon E5-2670 processors, four Intel Xeon Phi 5110P coprocessors, 256 GB of RAM, and 960 GB of SSD storage. In total, Beacon provides 768 conventional cores and 11,520 accelerator cores, which together offer 210 Tflops of combined computational performance, along with 12 TB of system memory, 1.5 TB of coprocessor memory, and more than 73 TB of SSD storage.
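The aggregate figures above can be cross-checked from the per-node numbers. The short Python sketch below does so under stated assumptions: the per-chip core counts and the coprocessor memory size are taken from Intel's published specifications for the E5-2670 and the 5110P rather than from this article, and the variable names are purely illustrative.

```python
# Per-node figures quoted in the article.
COMPUTE_NODES = 48
XEONS_PER_NODE = 2
PHIS_PER_NODE = 4
RAM_GB_PER_NODE = 256
SSD_GB_PER_NODE = 960

# Assumptions from Intel's published specifications, not from the article.
CORES_PER_XEON_E5_2670 = 8
CORES_PER_PHI_5110P = 60
MEMORY_GB_PER_PHI_5110P = 8

conventional_cores = COMPUTE_NODES * XEONS_PER_NODE * CORES_PER_XEON_E5_2670
accelerator_cores = COMPUTE_NODES * PHIS_PER_NODE * CORES_PER_PHI_5110P

# Memory totals, converting GB to TB with a factor of 1024.
system_memory_tb = COMPUTE_NODES * RAM_GB_PER_NODE / 1024
coprocessor_memory_tb = (
    COMPUTE_NODES * PHIS_PER_NODE * MEMORY_GB_PER_PHI_5110P / 1024
)

# SSD capacity on the compute nodes only, converting GB to TB with 1000.
compute_node_ssd_tb = COMPUTE_NODES * SSD_GB_PER_NODE / 1000

print("conventional cores:", conventional_cores)
print("accelerator cores:", accelerator_cores)
print("system memory (TB):", system_memory_tb)
print("coprocessor memory (TB):", coprocessor_memory_tb)
print("compute-node SSD (TB):", compute_node_ssd_tb)
```

Under these assumptions the core and memory totals match the quoted figures. The compute-node SSDs alone come to roughly 46 TB, so the quoted total of more than 73 TB presumably includes storage elsewhere in the system, such as on the I/O nodes; the article does not give that breakdown.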
The typical strategy for developers participating in the Beacon Project is to first port and then optimise code for the Intel Xeon Phi coprocessor. An example of this is the Boltzmann-BGK Solver, which uses a kinetic model for computational fluid dynamics. With hundreds of thousands of state variables that need to be solved at each grid point, the BGK model Boltzmann equation can directly benefit from vectorisation and acceleration on the Intel Xeon Phi coprocessor. As part of its optimisation process for this solver, the team used the early-access version of TotalView to debug its native Intel Xeon Phi code, drilling down to the thread level to investigate issues that came up during porting. In particular, the team had found that the OpenMP version of the code produced wrong answers, and needed to track down the subtle problem behind them.
Using TotalView, the team analysed the operations occurring on each OpenMP thread. Being able to compare the data from each thread with the ultimate result clarified what was happening with the code and allowed the team to work with the vendor to get the problem resolved. After porting the code, the team was able to quickly identify and correct initial performance problems, enabling positive speedup on the Intel Xeon Phi coprocessor relative to the Intel Xeon processor.
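The idea of comparing each thread's data with the final result can be illustrated outside a debugger. The following Python sketch is not the Boltzmann-BGK code and does not use TotalView; it is a minimal, hypothetical example in which every worker thread computes a partial sum, and the partial results are checked one by one against a serial reference so that a wrong contribution can be traced to a specific thread. All function names are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def slice_bounds(n_items, n_threads):
    """Contiguous (start, stop) index pairs, one per hypothetical thread."""
    return [
        (t * n_items // n_threads, (t + 1) * n_items // n_threads)
        for t in range(n_threads)
    ]


def threaded_partials(values, n_threads):
    """Each worker thread sums its own slice; return the partial sums in order."""
    bounds = slice_bounds(len(values), n_threads)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [pool.submit(sum, values[lo:hi]) for lo, hi in bounds]
        return [f.result() for f in futures]


def serial_partials(values, n_threads):
    """Reference values for the same slices, computed without threads."""
    return [sum(values[lo:hi]) for lo, hi in slice_bounds(len(values), n_threads)]


def mismatched_threads(threaded, reference):
    """Ids of the threads whose partial result disagrees with the reference."""
    return [t for t, (a, b) in enumerate(zip(threaded, reference)) if a != b]


values = list(range(1, 101))
threaded = threaded_partials(values, n_threads=4)
reference = serial_partials(values, n_threads=4)

print("per-thread partial sums:", threaded)
print("suspect threads:", mismatched_threads(threaded, reference))
```

In a correct run the list of suspect threads is empty. If one thread's contribution were wrong, its id would appear in that list, which narrows the search in the same spirit as inspecting per-thread data in a debugger.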
Another example of a successful Beacon Project port to the Intel Xeon Phi coprocessor is the Gyro tokamak plasma simulation code from General Atomics. Gyro numerically simulates tokamak plasma microturbulence and computes the turbulent radial transport of particles and energy in tokamak plasmas. To do so, it solves 5-D coupled time-dependent nonlinear gyrokinetic Maxwell equations with gyrokinetic ions and electrons. The code is structured around MPI to express multi-node parallelism and OpenMP to express parallelism across a number of threads, in order to take advantage of multi-core compute nodes, and it was first ported by adding the ‘-mmic’ compiler flag. The team porting Gyro then faced a major problem, which turned out to originate from quite a different source than the error messages had initially suggested.
Using TotalView, the team tracked down the issue that was causing some runs to complete and others to fail in a strange way. In the many-core environment of the Intel Xeon Phi coprocessor, the number of threads created per MPI process had increased from single digits to 50 or 100. The work distribution scheme relied on an assumption that no longer held, and as a result work was not being distributed to most of the threads. Had this not been fixed, performance would have been limited because many cores would have been underutilised. Moreover, in this case the mistake also had a cascading effect that ultimately caused the MPI processes to run out of memory. Fixing the issue also made the program more balanced, which resulted in better performance.
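The article does not say what the invalid assumption in Gyro's work distribution actually was. As a purely hypothetical illustration of this class of bug, the Python sketch below models a scheme that cuts the work into a fixed number of coarse blocks and deals them out round-robin, which silently assumes there are never more threads than blocks. The block count of 8, the item count, and all function names are assumptions made for this example only.

```python
def coarse_distribution(n_items, n_threads, n_blocks=8):
    """Items per thread when work is cut into n_blocks coarse blocks
    that are dealt out round-robin (hypothetical scheme)."""
    counts = [0] * n_threads
    for b in range(n_blocks):
        start = b * n_items // n_blocks
        stop = (b + 1) * n_items // n_blocks
        counts[b % n_threads] += stop - start
    return counts


def fine_distribution(n_items, n_threads):
    """Items per thread when individual items are split as evenly as possible."""
    base, extra = divmod(n_items, n_threads)
    return [base + (1 if t < extra else 0) for t in range(n_threads)]


def idle_threads(counts):
    """Number of threads that received no work at all."""
    return sum(1 for c in counts if c == 0)


N_ITEMS = 1000
for n_threads in (8, 100):
    coarse = coarse_distribution(N_ITEMS, n_threads)
    fine = fine_distribution(N_ITEMS, n_threads)
    print(
        f"{n_threads} threads: coarse scheme leaves {idle_threads(coarse)} idle, "
        f"fine scheme leaves {idle_threads(fine)} idle"
    )
```

With 8 threads both schemes keep every thread busy, which is why such an assumption can go unnoticed on conventional multi-core nodes. With 100 threads the coarse scheme still hands out all the work, but only to the first 8 threads, so the remaining 92 sit idle while the busy ones carry far more than their share.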
The Beacon Project has experienced initial success with porting and optimising code for the Intel Xeon Phi coprocessor. The optimisation process exposed the need for advanced tools that help scientists debug and optimise parallel applications so that they can effectively support hundreds of threads per node. Applications now need to have large numbers of threads, or else they will be unable to utilise more than a fraction of the Intel Xeon Phi coprocessor’s computational power.
The biggest future challenge is that there are still vast numbers of MPI-based applications that will need to be ported to MPI/OpenMP hybrid parallelism. When this is undertaken, the structure of the code is changed in fundamental and extensive ways, and these changes often break the code. TotalView has proven to be critical in alleviating these growing pains by making it easier and quicker to analyse and resolve defects uncovered or created during the porting process.