Architectural specialisation is one option for continuing to improve performance beyond the limits imposed by the slowdown in Moore’s Law. Using application-specific hardware to accelerate an application, or part of one, allows the use of hardware that can be much more efficient in terms of both power usage and performance.
As discussed in the ISC content on page 4, this is not a tactic that can be used for all applications, because of the inherent cost of building computing hardware for a single application or workflow. However, combining challenges into groups, or identifying key workloads or code that could benefit from acceleration, is likely to become an important part of increasing application performance.
Some applications are well suited to accelerators such as graphics processing units (GPUs) or field-programmable gate arrays (FPGAs), which can boost performance by taking over the parts of a workload that suit them.
GPU acceleration and architectural specialisation are not new concepts, but some experts predict they will become increasingly common as a way to improve performance and lower the energy costs of future systems.
Researchers at Lawrence Berkeley National Laboratory’s Computer Architecture Group are using an FPGA to demonstrate the potential improvement with a code based on Density Functional Theory (DFT). The project, ‘DFT Beyond Moore’s Law: Extreme Hardware Specialization for the Future of HPC’, aims to demonstrate purpose-built architectures as a potential future for HPC applications in the absence of continued scaling of Moore’s Law.
The final intention is to develop plans for a custom application-specific integrated circuit (ASIC), but the initial development will be carried out on an FPGA. While this project is still in progress, it demonstrates how particular codes, or sections of code suitable for highly parallel execution, can be implemented on FPGA technology, which could supplement a CPU or CPU/GPU-based computing system in the future.
The search for dark matter
CERN is using Xilinx FPGAs to accelerate inferencing and sensor pre-processing workloads in its search for dark matter. The researchers behind the project are using the FPGAs in combination with CERN’s other computing resources to process massive quantities of high-energy particle physics data at extremely fast rates to find clues to the origins of the universe. This requires filtering sensor data in real time to identify novel particle substructures that could contain evidence of the existence of dark matter and other physical phenomena.
A growing team of physicists and engineers from CERN, Fermilab (Fermi National Accelerator Laboratory), Massachusetts Institute of Technology (MIT), The University of Illinois at Chicago (UIC), and University of Florida (UF) led by Philip Harris, MIT, and Nhan Tran, a Wilson Fellow at Fermilab, wanted to have a flexible way to optimise custom-event filters in the Compact Muon Solenoid (CMS) detector they are working on at CERN. The very high data rates of up to 150 Terabytes/second in the CMS detector require event processing in real-time, but trigger filter algorithm development hindered the team’s ability to make progress.
Harris explained the idea behind the project: ‘We were inspired after talking to a few people who had been working on machine learning with FPGAs from the Microsoft Brainwave team, and seeing on GitHub some very simple machine learning inference code written by EJ Kreinar using Xilinx’s Vivado HLS tool. The combination of those two got us very excited, because we could actually see the potential to do this hls4ml project to enable fast ML-based event triggers.’
The team set out to develop and benchmark a tool flow, based around Xilinx Vivado HLS, that would shorten the time needed to create machine learning algorithms for the CMS level one trigger. The hls4ml tool has a number of configurable parameters that enable users to customise the space of latency, initiation interval, and resource usage tradeoffs for their application.
Prior to the team’s work to create hls4ml, physicists would have to manually create simple trigger algorithms, and engineers would then program the FPGAs in Verilog or VHDL. This was a very time-consuming process that could take several man-months of work by expert physicists and engineers.
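The trade-off between latency, initiation interval and resource usage mentioned above can be illustrated with a toy model. The sketch below assumes a single fully connected layer and a hypothetical `reuse_factor` setting, where each hardware multiplier is reused that many times per inference. The function name, the cost model and the layer sizes are illustrative assumptions and are not taken from hls4ml itself.

```python
import math


def dense_layer_cost(n_in, n_out, reuse_factor):
    """Toy cost model for one fully connected layer on an FPGA.

    Assumption: the layer needs n_in * n_out multiplications per inference,
    and each physical multiplier is reused `reuse_factor` times.
    """
    if reuse_factor < 1:
        raise ValueError("reuse_factor must be >= 1")
    total_mults = n_in * n_out
    # Fewer physical multipliers are needed when each one is reused more.
    multipliers = math.ceil(total_mults / reuse_factor)
    # A new input can only be accepted once the multipliers are free again,
    # so the initiation interval (in cycles) grows with the reuse factor.
    initiation_interval = reuse_factor
    return {"multipliers": multipliers, "initiation_interval": initiation_interval}


for rf in (1, 4, 16):
    print(rf, dense_layer_cost(16, 64, rf))
```

In this simplified picture, a higher reuse factor saves multipliers at the cost of accepting new inputs less often, which is the kind of choice a user would make against a fixed latency and resource budget.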
Tran said: ‘We envisioned at a very high level putting neural networks into the level one trigger. Nobody had really considered the possibility of generically putting neural networks of all different types there. Once you give that capability to the community, then it can be everywhere. We’re seeing it in muon identification, tau leptons, photons, electrons – all the particles that we see – we can improve the performance using these more sophisticated techniques.’
Raising the level of abstraction with hls4ml allows the physicists to perform model optimisation with big data industry-standard open source frameworks such as Keras, TensorFlow or PyTorch. The output of these frameworks is used by hls4ml to generate the FPGA acceleration firmware. This automation was a big time saver. Tran said: ‘Electrical engineers are a scarce resource in physics and they’re quite expensive. The more we can use physicists to develop the algorithms and electrical engineers to design the systems, the better off we are. Making machine learning algorithms more accessible to the physicist helps a lot. That’s the beauty of why we started with HLS and not the Verilog or VHDL level. Now, we can do the whole chain from training to testing on an FPGA in a day.’
Harris explained how physicists can search for dark matter using machine learning algorithms, and train the neural networks, when they do not know what dark matter actually looks like. ‘We make a hypothesis for what it will look like, and write down a list of all the signatures we would expect for dark matter,’ said Harris.
‘We’re training on a very generic class of signatures. For example, dark matter, by its nature, will be missing energy in the detector because it will go right through it. If we can use machine learning techniques to optimise the performance to understand missing energy, that improves our sensitivity to dark matter as well,’ said Tran.
The team uses multi-layer perceptron neural networks with a limited number of layers to meet the 100 nanosecond real-time performance requirements of triggers. In addition to AI inference, FPGAs provide the sensor communications, data formatting and pre-filtering compute required for the incoming raw sensor data before it reaches the inference-driven trigger, thereby accelerating the whole detector application.
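To give a sense of scale, the sketch below runs a forward pass through a very small multi-layer perceptron and converts the 100 nanosecond budget into clock cycles. The layer sizes, the hand-picked weights and the 200MHz clock are illustrative assumptions, not values from the CMS trigger.

```python
def relu(x):
    return x if x > 0.0 else 0.0


def dense(inputs, weights, biases, activation=None):
    """One fully connected layer: out[j] = sum_i inputs[i] * weights[i][j] + biases[j]."""
    outputs = []
    for j in range(len(biases)):
        acc = biases[j]
        for i, x in enumerate(inputs):
            acc += x * weights[i][j]
        outputs.append(activation(acc) if activation else acc)
    return outputs


def mlp_forward(inputs, layers):
    """Apply each (weights, biases, activation) layer in turn."""
    values = inputs
    for weights, biases, activation in layers:
        values = dense(values, weights, biases, activation)
    return values


def cycles_in_budget(budget_ns, clock_mhz):
    """Number of whole clock cycles that fit in a latency budget."""
    return (budget_ns * clock_mhz) // 1000


# Hypothetical 2-3-1 network: two inputs, three hidden units, one output.
layers = [
    ([[1.0, -1.0, 0.5], [0.5, 1.0, -0.5]], [0.0, 0.0, 0.0], relu),
    ([[1.0], [1.0], [1.0]], [0.0], None),
]
score = mlp_forward([2.0, 1.0], layers)
print(score)                       # [3.0]
print(cycles_in_budget(100, 200))  # 20 cycles at the assumed 200MHz clock
```

With only a few tens of cycles available under these assumptions, every extra layer adds to the pipeline depth, which is why the number of layers has to stay small.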
Tran summarised the benefits of the hls4ml project. ‘In our day-to-day work, it really allows us to access machine learning at every level across the experiment with the trigger. Before, you would have to think about a very specific application and work really hard on developing the model and the firmware for either VHDL or Verilog. Now, you can think more broadly about how we can improve the physics, whether it’s low-level aggregation of hits in some calorimeter all the way up to taking the full event and optimising for a particular topology. It allows the spread and adoption of machine learning more quickly across the experiment.’
Project Catapult
Microsoft’s Project Catapult is another example of how FPGAs are being used to specialise computing architecture. It is essentially a deployment of FPGAs in the cloud, which sit alongside the CPUs to provide an interconnected and configurable compute layer of programmable silicon.
The project started in 2010 when a small team, led by Doug Burger and Derek Chiou, began exploring alternative architectures and specialised hardware such as GPUs, FPGAs and ASICs.
The team developed a system for cloud computing based on the FPGA, which offered better energy efficiency than CPU or GPU-based systems for the same task. The FPGA offered this benefit without the level of risk associated with developing a custom ASIC.
Project Catapult’s board-level architecture is designed to be highly flexible. The FPGA can act as a local compute accelerator, an inline processor, or a remote accelerator for distributed computing. In this design, the FPGA sits between the datacentre’s top-of-rack (ToR) network switches and the server’s network interface chip. Network traffic is routed through the FPGA, which can perform line-rate computation on even high-bandwidth network flows.
Using this acceleration fabric, Microsoft can deploy distributed hardware microservices (HWMS) with the flexibility to harness a scalable number of FPGAs, from one to thousands. Conversely, cloud-scale applications can leverage a scalable number of these microservices, with no knowledge of the underlying hardware. By coupling this approach with nearly a million Intel FPGAs deployed in datacentres, Microsoft has built a supercomputing-like infrastructure, which can compute specific machine learning and deep learning algorithms with incredible performance and energy efficiency.
In today’s big data era, businesses and consumers are inundated by large volumes of data from a variety of sources, including business transactions, social media, and sensor or machine-to-machine data.
This data comes in a number of formats, from structured, numerical data in traditional databases, to unstructured text documents, email, video, audio and financial transactions.
Effective analysis of this data is key to generating insights and driving better decision making, and machine learning (ML) algorithms are extensively used in modern data analytics. Deep convolutional neural networks (DNNs), a specific type of ML algorithm, are becoming widely adopted for image classification, as they excel in recognising objects in images, offering state-of-the-art accuracies.
Current generation DNNs, such as AlexNet and VGG, rely on dense floating-point matrix multiplication (GEMM) which maps well to the capabilities of GPUs, with their regular parallelism and high Tflops.
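For reference, GEMM computes C = alpha × A × B + beta × C. The pure-Python sketch below shows the operation on small lists. The function name and the example matrices are illustrative, and production code would call an optimised library rather than use nested loops.

```python
def gemm(alpha, a, b, beta, c):
    """Return alpha * (a @ b) + beta * c for row-major lists of lists."""
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions of a and b do not match")
    result = []
    for i in range(m):
        row = []
        for j in range(n):
            # Each output element is an independent dot product, which is
            # the regular parallelism that GPUs exploit.
            acc = sum(a[i][p] * b[p][j] for p in range(k))
            row.append(alpha * acc + beta * c[i][j])
        result.append(row)
    return result


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c = [[1.0, 1.0], [1.0, 1.0]]
print(gemm(1.0, a, b, 0.0, c))  # [[19.0, 22.0], [43.0, 50.0]]
```

Because no output element depends on any other, the work divides evenly across many identical arithmetic units.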
While FPGAs are much more energy efficient than GPUs (important in today’s IoT market), their performance on DNNs has not traditionally matched that of GPUs.
A series of tests conducted by Intel evaluated the performance of two FPGAs that were the latest generation at the time (Intel’s Arria 10 and Stratix 10) against the latest, highest-performance GPU (Titan X Pascal) on DNN computation.
GPUs have traditionally been used for DNNs, due to the data parallel computations, which exhibit regular parallelism and require high floating-point computation throughput. Each generation of GPU has incorporated more floating-point units, on-chip RAMs, and higher memory bandwidth, in order to offer increased flops.
However, computations exhibiting irregular parallelism can challenge GPUs, due to issues such as divergence. Also, since GPUs support only a fixed set of native data types, custom-defined data types may not be handled efficiently, contributing to underutilisation of hardware resources and unsatisfactory performance.
Unlike GPUs, FPGA architecture was conceived to be highly customisable and, in recent years, five key trends have led to significant advances in FPGAs, bringing their performance closer to that of state-of-the-art GPUs.
First, next-generation FPGAs incorporate many more on-chip RAMs. Second, technologies such as HyperFlex enable dramatic improvements in frequency. Third, there are many more hard DSPs available. Fourth, the integration of HBM memory technologies leads to an increase in off-chip bandwidth. Finally, next-generation FPGAs use more advanced process technology, such as 14nm CMOS.
The Intel Stratix 10 FPGA has more than 5,000 hardened floating-point units (DSPs), over 28MB of on-chip RAMs (M20Ks), integration with high-bandwidth memories (up to 4x250GB/s/stack or 1TB/s), and improved frequency from the new HyperFlex technology, thereby leading to a peak 9.2 Tflops in FP32 throughput.
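These headline figures can be sanity-checked with simple arithmetic. The sketch below assumes each DSP completes one fused multiply-add (two floating-point operations) per cycle. The DSP count of 5,760 and the 800MHz clock are assumptions chosen for illustration, since the text gives only ‘more than 5,000’ units and does not state a clock frequency. The memory figures are those quoted above.

```python
def peak_tflops(num_dsps, clock_mhz, flops_per_dsp_per_cycle=2):
    """Peak throughput in Tflops, assuming every DSP is busy every cycle."""
    mflops = num_dsps * flops_per_dsp_per_cycle * clock_mhz
    return mflops / 1_000_000


def total_bandwidth_gbs(stacks, gbs_per_stack):
    """Aggregate off-chip bandwidth across memory stacks, in GB/s."""
    return stacks * gbs_per_stack


# Assumed DSP count and clock (not stated in the text).
print(peak_tflops(5760, 800))
# Stack count and per-stack bandwidth as quoted in the text.
print(total_bandwidth_gbs(4, 250))
```

Under those assumptions the result is a little over 9.2 Tflops, in line with the quoted peak, and four stacks at 250GB/s give the stated 1TB/s. A peak figure of this kind is an upper bound that real workloads rarely reach.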
FPGA development environments and toolsets are also evolving, enabling programming at a higher level of abstraction. This makes FPGA programming more accessible to developers who are not hardware experts, speeding up the adoption of FPGAs into mainstream systems.
Recent work by Intel studied various GEMM operations for next-generation DNNs. A DNN hardware accelerator template for FPGA was developed, offering first-class hardware support for exploiting sparse computation and custom data types.
The template was developed to support various next-generation DNNs and can be customised to produce optimised hardware instances for FPGA for a user-given variant of DNN.
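As a rough illustration of why sparsity helps, the sketch below stores only the non-zero weights of a pruned vector and skips the multiplications that would have involved zeros. The storage format and function names are hypothetical and are not taken from the Intel template.

```python
def to_sparse(weights):
    """Keep only (index, value) pairs for the non-zero weights."""
    return [(i, w) for i, w in enumerate(weights) if w != 0]


def sparse_dot(sparse_weights, activations):
    """Dot product that performs one multiply per non-zero weight."""
    return sum(w * activations[i] for i, w in sparse_weights)


weights = [0, 3, 0, 0, -2, 0, 0, 1]  # pruned: most entries are zero
activations = [4, 1, 7, 2, 5, 9, 3, 6]
sparse = to_sparse(weights)
print(len(sparse), "multiplies instead of", len(weights))
print(sparse_dot(sparse, activations))  # -1, same as the dense dot product
```

The access pattern now depends on where the non-zero values fall, which is the kind of irregularity that fixed-function parallel hardware handles less comfortably than a customised datapath.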
This template was then used to run and evaluate various key matrix multiplication operations for next-generation DNNs on the current and next generations of FPGAs (Arria 10 and Stratix 10), as well as on the latest high-performance Titan X Pascal GPU.
The work found that the Stratix 10 FPGA delivered 10 per cent, 50 per cent and 5.4 times better performance (TOP/sec) than the Titan X Pascal GPU on GEMM operations for pruned, Int6 and binarised DNNs, respectively.
These tests also showed that both Arria 10 and Stratix 10 FPGAs offered compelling energy efficiency (TOP/sec/watt), with both devices delivering between three and 10 times better energy efficiency than the Titan X GPU.
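The binarised case mentioned above can be illustrated with a short example. When weights and activations are restricted to +1 and -1, they can be packed one bit each, and a dot product reduces to an XNOR followed by a bit count. The sketch below shows this general trick. The bit encoding (1 for +1, 0 for -1) and the function names are assumptions for illustration, not details of the Intel design.

```python
def pack_bits(values):
    """Pack a list of +1/-1 values into an int, one bit per value (1 means +1)."""
    word = 0
    for position, value in enumerate(values):
        if value not in (1, -1):
            raise ValueError("binarised values must be +1 or -1")
        if value == 1:
            word |= 1 << position
    return word


def binary_dot(a_bits, b_bits, length):
    """Dot product of two packed +1/-1 vectors using XNOR and a bit count."""
    mask = (1 << length) - 1
    agree = ~(a_bits ^ b_bits) & mask  # bit set where the signs match
    matches = bin(agree).count("1")
    # Each match contributes +1 and each mismatch contributes -1.
    return 2 * matches - length


a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, -1]
print(binary_dot(pack_bits(a), pack_bits(b), len(a)))  # 2
print(sum(x * y for x, y in zip(a, b)))                # 2, the reference result
```

Operations of this kind need only simple logic and no floating-point units, so they map naturally onto hardware that can be configured for narrow, custom data types.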
Although GPUs have traditionally been the undisputed choice for supporting DNNs, recent performance comparisons between two generations of Intel FPGAs (Arria 10 and Stratix 10) and the latest Titan X GPU show that current trends in DNN algorithms may favour FPGAs, and that FPGAs may even offer superior performance.
The paper concludes that: ‘With results showing that the Stratix 10 out-performs the Titan X Pascal, while using less power, FPGAs may be about to become the platform of choice for accelerating DNNs’.