Simon McIntosh-Smith discusses the role of the ExCALIBUR project in ensuring that UK research is at the forefront of HPC
The Exascale Computing ALgorithms and Infrastructures Benefiting UK Research (ExCALIBUR) project is a UK research programme that aims to deliver the next generation of high-performance simulation software for the highest-priority fields in UK research. ExCALIBUR began in October 2019 and will run until March 2025, redesigning high-priority computer codes and algorithms to meet the demands of both advancing technology and UK research.
To continue to make scientific advances on some of the most challenging physical problems facing the world today, such as weather forecasting, engine design, astrophysics, particle physics and fusion energy, it is essential that the UK fully harnesses the power of the world’s most powerful supercomputers as they move into the exascale era. However, this cannot be achieved without appropriate software; existing simulation codes are not expected to be able to fully exploit the next generation of supercomputers.
The ExCALIBUR programme aims to address this challenge by redesigning high-priority computer codes and algorithms, keeping UK research and development at the forefront of high-performance simulation science. The challenge spans many disciplines, and research software engineers and scientists will work together to future-proof the UK against the fast-moving changes in supercomputer designs. This combined scientific expertise will push the boundaries of science across a wide range of fields, delivering transformational change at the cutting edge of scientific supercomputing.
What is the project about and what do you hope to achieve?
ExCALIBUR has been developed by the Met Office and UK Research and Innovation (UKRI) as the primary exascale software and algorithms project for the UK, initially for five years, but hopefully it will carry on for longer than that.
ExCALIBUR is led by the Met Office and includes UKRI councils and the UK Atomic Energy Authority (UKAEA). The programme is primarily focused on software and algorithms. ExCALIBUR is analogous to the exascale initiatives around the world, such as the Exascale Computing Project (ECP) in the US, EuroHPC in Europe, and similar schemes in Japan. The ExCALIBUR programme will include collaborations with these other international exascale projects.
But ExCALIBUR is primarily focused on getting the UK’s scientific codes ready for exascale. It’s a stand-alone programme for the UK, with a significant investment of around £46 million, a substantial effort focused on software and algorithms.
The UK needed to make sure that its own codes were being invested in and developed in line with several key objectives. We want to make sure the UK’s science codes are going to be efficient so that they will run well on exascale machines.
We want to make sure that the UK’s science codes are also capable. In other words, they’ve got the ability to really push the envelope and take advantage of exascale machines to do things that just weren’t possible before – and to build expertise in the UK as well. We need to ensure we’ve got the right people with the right skills in the UK, who can design and develop codes, build and run cutting-edge facilities and supercomputers and so on – this is why ExCALIBUR is very much a UK project. There are other programmes in the UK that will be procuring and building exascale supercomputers. In the UK we have got ARCHER2 being brought up now, and there’s DiRAC 3 and all sorts of big things going on in the UK. ExCALIBUR’s role is to deliver the software which can effectively exploit this coming generation of exciting hardware.
Getting ready for exascale is a long-term programme. This is something we’ll be working on for the next five to ten years. There are going to be new technologies coming out in that time, some of which might be quite revolutionary and have a huge impact on what we can achieve.
Therefore, we need to do a little bit of horizon scanning. When there are interesting new technologies coming out, you don’t tend to go from nothing to all of a sudden buying a huge system based on a brand new technology that you’ve not tried out before. So within ExCALIBUR, about 10 per cent of the overall budget, roughly four and a half million pounds, has been reserved to try out new technologies that might become significant for us in the long term.
This subproject within ExCALIBUR, called Hardware and Enabling Software, has been running for two years now, and we’ve been trying all sorts of interesting technologies. These include FPGA-based systems, and new kinds of storage and network technology. The Bluefield programmable network technologies are being evaluated, for example. There are quite a few processors coming out optimised for AI and machine learning, such as those from Graphcore and Cerebras, the latter of which builds a wafer-scale chip – it’s one wafer, that’s a single gigantic system. And we then make these new technologies available to the whole ExCALIBUR programme, so that everyone can try them out and discover how they might change the way they’re developing their algorithms.
How do you evaluate new technologies?
Most of these technologies are so unique, they almost need to come with their own set of success criteria. Where possible, we evaluate new technologies using the benchmarks that are most relevant to that space.
So, if there’s a new kind of processor, and it can run a particular benchmark 10 times faster than anything else, that will look really exciting. If it runs, but at a similar speed to existing technology such as a GPU, then you might say it was good to try, but it looks like that technology won’t really be a breakthrough for ExCALIBUR. If a vendor has a new kind of network which is programmable in some way, then we’ll work with them to try to demonstrate that the programmability yields a significant benefit of some kind. For example, it might enable a new kind of highly optimised collective communication operation, which might mean that some codes are able to go a lot faster.
If we can’t think of a really good use case and get a significant benefit from that new feature, then maybe that was interesting to try, but it’s not going to revolutionise what we’re doing in the future. In each case what we’re looking for is a route to a significant benefit from a new technology for exascale computing and exascale science codes in the UK. That’s ultimately what it comes down to.
For FPGAs, if they could get you a 10 times speed-up, but you’ve got to rewrite all your code at register-transfer level (RTL) and you’re basically generating hardware, that’s probably a non-starter.
For new kinds of processors, it is especially important to evaluate ease of use. So on an FPGA, you can now actually write code in a high-level language that you might want to use anyway – for example, there are some people looking at whether you can use SYCL, which is a sort of high level C++ parallel abstraction, which you can also use with GPUs and multi-core CPUs.
If you could use an approach like SYCL, and from that generate efficient code for FPGAs as well as CPUs and GPUs, that would be powerful. But if you have to rewrite all your code for a specific platform, that’s probably not going to happen. Ease of use and ease of porting are key metrics for any kind of new technology we’re considering within ExCALIBUR.
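As a rough illustration of how these criteria combine, the short Python sketch below classifies a candidate technology by its measured benchmark speed-up and by whether existing codes would need a platform-specific rewrite. It is not ExCALIBUR's actual evaluation procedure: the `Candidate` and `assess` names are hypothetical, the 10x threshold is taken from the example in the interview, and the `parity_margin` cut-off for "similar speed" is purely an assumption.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A hypothetical record for one technology under evaluation."""
    name: str
    speedup: float            # speed-up over existing technology on a relevant benchmark
    needs_full_rewrite: bool  # True if codes must be rewritten for this platform (e.g. in RTL)


def assess(candidate, breakthrough_speedup=10.0, parity_margin=1.5):
    """Classify a candidate using the two criteria described in the interview.

    breakthrough_speedup follows the "10 times faster" example; parity_margin
    is an assumed cut-off for "similar speed to existing technology".
    """
    # Ease of porting comes first: a platform-specific rewrite rules a
    # technology out regardless of how fast it is.
    if candidate.needs_full_rewrite:
        return "non-starter"
    if candidate.speedup >= breakthrough_speedup:
        return "exciting"
    if candidate.speedup <= parity_margin:
        return "no breakthrough"
    return "needs a closer look"


if __name__ == "__main__":
    for c in (
        Candidate("FPGA programmed in RTL", 10.0, True),
        Candidate("FPGA via a portable model", 10.0, False),
        Candidate("new processor at GPU-like speed", 1.1, False),
    ):
        print(f"{c.name}: {assess(c)}")
```

Checking the rewrite requirement before the speed-up reflects the point made above: a 10 times gain does not help if every code has to be rewritten for one platform.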
Are there any success stories from these early evaluations of new hardware?
We have got some FPGAs in the ‘Hardware and Enabling Software’ programme, but those are some of the more recent projects that are starting up. We have some projects evaluating the Bluefield technology from Mellanox, now Nvidia, and that’s been useful to get a feel for what these programmable network technologies can and can’t do.
We’ve had several projects evaluating different kinds of GPUs. Over the last 10 years, Nvidia has owned most of the GPU market in HPC, but other GPUs are becoming important now too. AMD GPUs in particular are being used in many of the first wave of exascale machines in the USA. Intel has some exciting GPU technology coming to HPC soon in the form of their Ponte Vecchio GPUs.
So we’ve had a couple of projects looking at AMD GPUs, and getting things running on those as well as Nvidia GPUs, and that’s been quite successful. We’ll evaluate Intel’s Ponte Vecchio GPUs when they become available too. These efforts will ensure we have much more agility in UK science codes, so that they can use whichever technologies turn out to be successful in the future.
This means that we need UK science codes to run on whichever GPUs turn out to give us the most science per pound, whether those are AMD, Intel or Nvidia GPUs, and ideally, our codes will be able to run well on all of them. This is a key goal that ExCALIBUR is trying to achieve.
We have got some of the AI and machine learning hardware available to the project as well. For example, we have some of the Graphcore technologies available both in Bristol and at UCL. This is a nice story because the Graphcore processor was designed here in Bristol, so it’s nice to have a local link there.
Then we also have the Cerebras technology, which is from the US. One of these systems has recently been installed in Edinburgh as part of the ExCALIBUR Hardware and Enabling Software programme. This is one of the first of its kind anywhere in the world, so it’s quite exciting to be able to make these new technologies available to AI and machine learning users in the UK.