Andrew Jones explains why science/dollar is not the same as Flops/dollar
A casual observer from outside the high-performance computing (HPC) community watching our events, news sites and discussions might easily conclude that HPC is about getting as much money as possible from your funding agency or board and then buying the most Flops capacity (crudely ‘calculations per second’) possible. Even better if this is done by choosing a computer system that is in some way unique – we like ‘serial number 1’. We then proudly issue press releases declaring it as the biggest supercomputer in [xyz] – where the category [xyz] is carefully chosen and defined such that your supercomputer is at the top of the pile in [xyz].
This game of getting the most Flops possible has been gifted a performance leap in recent years by the emergence of new classes of processor with a greater proportion of their silicon devoted to calculating units: graphics processing units (GPUs), especially from Nvidia; and Intel’s Xeon Phi (which my brain still defaults to calling MIC because I’ve known it as that for so long). GPUs and Phi (is the plural Phis?) promise maybe an order of magnitude more Flops for a given dollar or power budget than traditional processors.
So, big budget, a data centre full of racks with as many cores as possible, and plenty of GPU/Phi cards wedged in to get that Flops score as high as possible. Now what?
Well, this pile of silicon, copper, optical fibre, pipework, and other heavy hardware makes an imposing monument that politicians can cut ribbons in front of and eager supercomputer managers can give tours around. It is great for pointing to as proof that all that money has bought a really big and nice ‘thing’.
But it turns out that something else is needed to make that pile of processed sand, metal and supporting gubbins into the powerful multi-science instrument that the funding agency sought, or the engineering design capability that the company management was convinced by.
That something else is a complex ecosystem of system architecture, software, and people. And business processes too, but that is a discussion for another time.
A well designed and implemented system architecture is required to make sure that all of those pretty Flops engines (whether GPU, Phi or CPU) can do useful work. I’m not going to delve into that here, only to say it is the art of balancing the desires of capacity, performance and resilience against the frustrations of power, cooling, dollars, space, and so on. Characteristics such as having most of the Flops promise residing in GPUs or Phi co-processors, or larger than average scale, or ‘serial number 1’ all make this more interesting [synonym for difficult].
But even a perfectly architected hardware system is powerless without software. Software is the magic that enables the supercomputer to do scientific and engineering simulations. Of course, it is not really magic, even if it sometimes seems that way. Software is a complex collection of applications (maths, science and engineering knowledge crafted into bits), middleware (to make sure the entire ecosystem can chug along smoothly) and tools (to fix it when it doesn’t chug along so smoothly). In fact, whisper it loudly, software is infrastructure – yes, infrastructure. Software requires investment to create and maintain, it takes time to build, and it usually provides capability for a multitude of use cases and hardware platforms. Software can [should] be a highly engineered asset that in many cases is worth far more than the lump of tin that usually attracts the ‘infrastructure’ label.
Application software encapsulates some existing understanding of the relevant maths, science and engineering of a problem or system. This virtual knowledge engine is combined with an understanding of the hardware and cooperating software resources (e.g. communication libraries) into a set of methods and processes that enable a user to study and predict the behaviour of the [science/engineering] problem or system, or to test that encapsulated understanding.
Hopefully, the keen-eyed reader will have noticed the critical word in that preceding paragraph. It was ‘user’. Without people, the whole ecosystem so far (hardware and software) is like Concorde – a thing of great potential performance but just sat there looking pretty now.
Delivering science insight or engineering results from this powerful tool of hardware and software requires users. In fact, it requires a whole ecosystem of people. It needs the scientists/engineers who understand how to apply the tool effectively. It requires computational scientists and HPC software engineers who develop and optimise the application software. It needs HPC experts to design, deploy and operate the hardware and software systems. It requires HPC professionals to develop an HPC strategy, match requirements with solutions, procure capabilities, and ensure a productive service.
And, just as we need to have a roadmap for the hardware technology so that we can plan ahead, and a clear recognition that software needs long-term investment to thrive and deliver the promise of HPC, we also need a long-term plan for the people. We need to invest in the development of this part of the ecosystem just like the hardware and software, because the component units (that’s us lot) have a long preparation time (education) together with a plethora of exits-from-useful-service (from the predictable, such as retirement, to the unpredictable and fast-acting, such as a better job offer). Thus, we need continual renewal. And, because the demand for HPC and the complexity of HPC are growing, we need more people of varying skill sets. It is also worth theorising that, as with hardware and software, quality commands a premium – if we want the best capability of people with sufficient capacity, then we will have to invest in developing them and funding them appropriately when in place.
Getting an HPC capability that can deliver the best science or engineering is harder than just Flops/dollar or Flops/Watt – or to put it another way, science/dollar is not the same as Flops/dollar. But when the ecosystem of hardware, software and people is properly resourced and balanced, our casual outside observer might not see HPC at all – just an incredibly powerful scientific instrument or a capability-defining engineering design and validation tool.
Andrew Jones is VP of The Numerical Algorithms Group's HPC Expertise, Services and Consulting Business. He is active on Twitter as @hpcnotes.