When it comes to heterogeneous computing, in the past we have thought of it primarily as adding accelerators (GPUs, FPGAs, vector processors) to an x86 system. And while there is clearly some important news on the GPU front that I’ll be reporting on in this story, another trend has emerged in the past year: efforts to include the low-power processors found in mobile phones and tablet PCs in HPC systems.
Most interest thus far has surrounded ARM devices, but there is also some activity with the Intel Atom. The primary driver for adding such devices is their low power consumption, a major concern in efforts to build larger, more powerful systems. Such devices are attractive, says Sumit Gupta, manager of Nvidia’s Tesla high-performance GPU computing business unit, because ‘ARM designs its chips starting out with the assumption that they have zero Watts available.’
SoC now also means ‘server on a chip’
How will it be possible to integrate these chips into an HPC system? One answer comes from Calxeda, whose EnergyCore architecture is based on an ARM core. The company is taking the established acronym SoC, which to most people means ‘System on a Chip’, and redubbing it ‘Server on a Chip’, because the device contains all the functions, except memory, needed to work as a server.
The EnergyCore ECX-1000 comes with up to four ARM Cortex-A9 cores, Neon extensions for multimedia and SIMD processing, an integrated floating-point unit, and 4MB of shared L2 cache with ECC, which cuts energy consumption by reducing cache misses. Its server-class I/O controllers support standard interfaces such as SATA and PCI Express, and each SoC contains five fabric links that operate between 1 and 10 Gbps per channel and support a variety of fabric topologies. With node-to-node latency under 200ns, network round-trip times are considerably faster than through a traditional top-of-rack switch.
To make it easier to integrate EnergyCore products, Calxeda sells the Quad-Node Reference Design. It holds 16 cores on four EnergyCore SoCs, integrated through a network fabric, that collectively form a complete cluster that can easily be expanded with additional cards. Leveraging the integrated I/O capabilities of each processor, the reference card exposes four (of five possible) SATA ports per SoC. Each SoC also has a dedicated slot for a microSD memory card, enabling diskless system designs that boot from the card.
According to Calxeda, EnergyCore is an architecture intended to dramatically cut power and space requirements for hyperscale computing environments such as scalable analytics, web serving, media streaming, infrastructure and cloud storage.
One company very enthusiastic about the EnergyCore is Hewlett-Packard, which is using the device in the first development board for its Project Moonshot. Furthermore, notes Ed Turkel, manager of business development for HP’s service providers and HPC business, such devices are redefining what a ‘server’ is: we traditionally think of a server as something packaged in a rack or even larger, but now there is a 5W server on a chip in which each core is independently bootable and can run a different software stack. He adds that Moonshot is not tied exclusively to the ARM, and that HP expects to offer different low-power processor types, eventually including Intel’s Atom.
Purpose-built servers
How heterogeneous server configurations built from such flexible boards are set up will vary with the workload. Some scientific applications, for instance, might work best on a system with just a few CPUs to handle housekeeping, where the ‘heavy lifting’ is done primarily by GPUs. At the Moonshot launch, the design focus was decidedly not HPC; the first implementation is better tailored for memory caching, Hadoop and Java applications. ‘However,’ adds Turkel, ‘when we showed our reference design to our HPC users, they showed significant interest. Financial applications, for instance, don’t involve large parallel jobs – as is the case in much science – but rather lots of small jobs that do millions of small calculations. This first effort is not intended for big parallel MPI jobs and Top500-type systems, but it is ideal for embarrassingly parallel workloads such as genetics or for security agencies looking for specific words in text strings. There will eventually be a mix of options concerning the CPU and the accelerator, and users will have to examine their applications closely when making the choice. There will be purpose-built servers for different workloads.’
Calxeda has attracted the attention of other companies active in the server space, and it now has business agreements with the hardware manufacturer and systems integrator Boston, with suppliers of support software such as MapR Technologies and uCirrus, and with the data-storage company ScaleIO. Most notably, the Boston Viridis Project packs 48 nodes into a 2U enclosure, so the platform can provide 900 servers in a standard 42U rack and deliver up to 10 times the performance per Watt of existing processor technologies.
While the ARM is attractive on the power front, what HPC really needs, says Nvidia’s Gupta, is a 64-bit version and ECC memory protection. ARM has announced that it is working on a 64-bit design based on the ARMv8 architecture and expects volume production by 2014. But there are already systems combining ARM cores and GPUs, such as Nvidia’s Tegra 3, which incorporates a quad-core ARM Cortex-A9 CPU with a GeForce GPU. This device is currently used in mobile phones and tablet computers. To aid developers, Nvidia also offers the CARMA (Cuda on ARM) development kit.
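What the kit means in practice is that Cuda source code does not change with the host processor. A minimal vector-add sketch such as the one below (a generic illustration, not code from the CARMA kit itself) would be compiled with nvcc whether the host CPU is an x86 or an ARM Cortex-A9:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Generic vector-add kernel: identical source whether the host CPU
// is an x86 chip or an ARM core, as on a kit like CARMA.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Allocate and initialise host arrays.
    float *ha = (float*)malloc(bytes), *hb = (float*)malloc(bytes), *hc = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    // Allocate device arrays and copy the inputs over.
    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    // Launch one thread per element, 256 threads per block.
    vecAdd<<<(n + 255) / 256, 256>>>(da, db, dc, n);
    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);

    printf("c[0] = %f\n", hc[0]);  // expect 3.0
    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}
```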
In the future, then, we can certainly expect GPUs working as accelerators for ARM processors in HPC systems. We are getting a glimpse of what such a supercomputer could look like thanks to a project at the Barcelona Supercomputing Center, which is developing a system based on the Tegra 3. The system is now entering the prototype-building stage, and its operators hope to start running benchmarks this summer. It will be interesting to see exactly what the machine accomplishes when complete.
Making petaflops affordable
A device that got the scientific community excited about GPU computing was Nvidia’s Fermi chip, with its ECC and double-precision processing. Now the company is introducing the Kepler architecture, which Gupta calls ‘a revolution that will change the face of HPC and put petaflop power in the budget of mid-sized enterprises.’ Available now on the Tesla K10 board, the first Kepler chip is a single-precision accelerator that offers three times the performance per Watt of the Fermi.
This comes primarily from a redesign of the chip’s basic building block, the streaming multiprocessor (SM): where a Fermi SM contains 32 cores, Kepler’s new SMX contains 192. Gupta says this means you can get 1 petaflop in 10 racks drawing 400kW, whereas with Intel Sandy Bridge chips you would need 100 racks and 2 to 3MW.
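Taking Gupta’s figures at face value (and the midpoint of the 2 to 3MW range), the implied efficiency gap is easy to work out:

$$
\frac{1\,\mathrm{PFLOPS}}{400\,\mathrm{kW}} = 2.5\,\mathrm{GFLOPS/W}
\qquad\text{versus}\qquad
\frac{1\,\mathrm{PFLOPS}}{2.5\,\mathrm{MW}} = 0.4\,\mathrm{GFLOPS/W},
$$

a factor of roughly six in performance per Watt.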
The most exciting news for the HPC community, though, will come later this year with the K20 chip. Not only does it have all the features of the K10, it also performs double-precision math and introduces several architectural capabilities not available in the K10. The first of these is Hyper-Q: whereas Fermi can service only one MPI task at a time, Kepler can run 32 simultaneous MPI tasks, which greatly increases GPU utilisation and cuts CPU idle time. Next is dynamic parallelism: with Fermi, the CPU sends jobs to the GPU; with Kepler, the GPU adapts to the data and dynamically launches new threads on its own. The CPU is no longer in charge of every action, which makes the GPU more autonomous and also makes GPU programming easier. ‘Because of dynamic parallelism, Kepler can now accelerate almost any application,’ claims Gupta.
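In Cuda terms, dynamic parallelism means a kernel can launch another kernel directly from the device. A minimal sketch of the pattern (the `parent`/`child` names are illustrative; it needs K20-class hardware and compilation with nvcc -arch=sm_35 -rdc=true) might look like this:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Child kernel: refines the work for one region of the data.
__global__ void child(int region)
{
    printf("child kernel: region %d, thread %d\n", region, threadIdx.x);
}

// Parent kernel: inspects its data and, where the Fermi generation
// would have had to hand control back to the CPU, launches child
// grids directly from the GPU.
__global__ void parent(const int *needsRefinement, int nRegions)
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r < nRegions && needsRefinement[r]) {
        // Device-side launch: the GPU adapts to the data on its own.
        child<<<1, 32>>>(r);
    }
}

int main()
{
    int flags[4] = {1, 0, 1, 0};
    int *dFlags;
    cudaMalloc(&dFlags, sizeof(flags));
    cudaMemcpy(dFlags, flags, sizeof(flags), cudaMemcpyHostToDevice);

    parent<<<1, 4>>>(dFlags, 4);   // one host-side launch; the GPU does the rest
    cudaDeviceSynchronize();
    cudaFree(dFlags);
    return 0;
}
```

The point of the idiom is that the decision to spawn more work is made from data already resident on the GPU, without a round trip through the host.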
On the software front, Nvidia has contributed its Cuda compiler to the LLVM open-source project. This means that programmers and developers can build front-ends for Java, Python, R and domain-specific languages, and can target other processors such as the ARM, FPGAs, AMD GPUs and Intel’s MIC (Many Integrated Core) architecture. In addition, Cuda 5 supports full GPUDirect, meaning that GPUs can communicate directly among themselves, not only on the same card but also across different cards. Finally, in this latest version of Cuda, Nvidia has taken Nsight (previously a plug-in for Visual Studio) and ported it to Linux and the Mac, and the toolchain now allows linking of third-party GPU library objects.
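That card-to-card capability rests on the peer-to-peer calls in the Cuda runtime. A sketch of the idiom, assuming two GPUs in the system that report mutual peer access, looks like this:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);  // can GPU 0 reach GPU 1?
    cudaDeviceCanAccessPeer(&can10, 1, 0);  // and vice versa?
    if (!can01 || !can10) {
        printf("peer access not supported between devices 0 and 1\n");
        return 1;
    }

    // Enable direct access in both directions.
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);

    // Allocate a buffer on each GPU.
    const size_t bytes = 1 << 20;
    float *buf0, *buf1;
    cudaSetDevice(0); cudaMalloc(&buf0, bytes);
    cudaSetDevice(1); cudaMalloc(&buf1, bytes);

    // Copy directly from GPU 0's memory to GPU 1's memory: the data
    // moves over PCI Express without being staged through host DRAM.
    cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);

    cudaFree(buf1);
    cudaSetDevice(0); cudaFree(buf0);
    return 0;
}
```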
Experience with both x86 and GPUs
AMD has also been active in the GPU space and today has APUs (Accelerated Processing Units, which are combination CPU/GPU chips), ‘but they currently have no ECC and are not ideal for HPC,’ notes John Fruehe, director of server product marketing at AMD. These are client-side APUs, with the Llano being the closest to a low-end server APU.
‘However, people are developing larger systems with these chips, and these APUs will start appearing in the server space once we have ECC in 2013,’ adds Fruehe. What will make AMD attractive in this market? ‘For Intel, the world is an x86 problem and every problem has an x86 solution; for Nvidia, every problem has a GPU solution. We have experience with both x86 and GPUs – we’re the only people talking seriously about both technologies.’
As for the advantages of ARM, Fruehe likewise points out that it is not yet 64-bit, and that its software ecosystem is nowhere near as developed as that of the x86: ‘It will be much easier for us to drive down the power curve and have an x86 ecosystem than for ARM to go up the power curve as they add cores and bandwidth. It will be a better option to go with AMD devices targeted at that space and have the x86 software ecosystem.’
It’s also interesting that AMD recently acquired SeaMicro, whose low-power supercompute fabric links thousands of processor cores, memory, storage and I/O in heterogeneous systems. SeaMicro’s patented technology, called CPU input/output virtualisation (IOVT), reduces the power draw of the non-CPU portion of a server by eliminating 90 per cent of the components from the motherboard, leaving only three: the CPU, DRAM and the Freedom ASIC. This last device is needed because SeaMicro currently uses off-the-shelf Atom chips and thus cannot design its own SoC devices with integrated Atom cores; the separate ASIC handles the storage and networking virtualisation.
This I/O virtualisation allows SeaMicro to shrink the motherboard to the size of a credit card. For example, the SM10000 family of servers integrates 768 Atom x86 cores packaged on 64 compute cards, plus top-of-rack switching, load balancing and server management, in a single 10U system. It is unusual for AMD to be selling a system containing Intel chips, but the company plans to offer the first Opteron-based solutions combining AMD and SeaMicro technology in the second half of this year.
If the trend of taking high-volume chips from the mobile phone and tablet space and adapting them to HPC systems continues, as it likely will, we may also see some interesting developments ahead. Just two examples are ARM, which has its own GPU called Mali, and Qualcomm, whose Snapdragon processors include the Adreno GPU. These devices are currently optimised for graphics applications, but remember that Nvidia got its start in the graphics acceleration business as well. If the HPC market grows large enough, it could entice these chip vendors to add the features needed for HPC.
A common code base
It’s worth mentioning that developing software for these heterogeneous systems will not get easier in the short term. In the past, explains Wen-Mei Hwu of the Illinois Microarchitecture Project utilising Advanced Compiler Technology (IMPACT), people had to write a different kernel for each GPU type for a given application. It thus became very costly for software developers to produce multiple versions, especially as maintenance costs ballooned.
He adds that there is an entire movement to eliminate the need for developers to replicate their code base. Towards this goal, the IMPACT project developed the Multicore Cross-platform Architecture (MXPA), an OpenCL runtime and compiler that enables cross-architecture performance from a single, unified code base. The technology was recently sold to MulticoreWare for commercialisation, and Dr Hwu is now that firm’s CTO. He notes three advantages of MXPA: it delivers multicore x86 performance comparable or superior to existing implementations of the OpenCL language; it extends the performance of OpenCL applications to multicore platforms without depending on a client-installed OpenCL runtime or exposing uncompiled source code; and it can retarget arbitrary hardware platforms through a programmable specification that transforms C-language intermediate representation for the targeted C compilers and threading libraries.
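The single-codebase idea MXPA builds on is visible even in plain OpenCL (the sketch below uses the standard OpenCL 1.x C API, not MulticoreWare’s own interface): the same kernel source targets a multicore CPU or a GPU simply by changing the requested device type.

```c
#include <stdio.h>
#include <CL/cl.h>

/* One kernel source, usable on any OpenCL device. */
const char *src =
    "__kernel void scale(__global float *x, float a) {"
    "    size_t i = get_global_id(0);"
    "    x[i] *= a;"
    "}";

int main(void)
{
    /* Switch CL_DEVICE_TYPE_CPU to CL_DEVICE_TYPE_GPU and the same
       kernel source targets the other architecture, unchanged. */
    cl_device_type type = CL_DEVICE_TYPE_CPU;

    cl_platform_id plat; cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, type, 1, &dev, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

    /* Build the kernel for whichever device was selected. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "scale", NULL);

    float data[4] = {1, 2, 3, 4};
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                sizeof(data), data, NULL);
    float a = 10.0f;
    clSetKernelArg(k, 0, sizeof(buf), &buf);
    clSetKernelArg(k, 1, sizeof(a), &a);

    size_t n = 4;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, buf, CL_TRUE, 0, sizeof(data), data, 0, NULL, NULL);
    printf("%f %f %f %f\n", data[0], data[1], data[2], data[3]);

    /* Release resources in reverse order of creation. */
    clReleaseMemObject(buf); clReleaseKernel(k); clReleaseProgram(prog);
    clReleaseCommandQueue(q); clReleaseContext(ctx);
    return 0;
}
```

MXPA’s contribution, in Hwu’s telling, is to make this kind of portable code run fast on multicore CPUs without requiring an OpenCL runtime on the client machine at all.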