Nvidia’s GPU Technology Conference (GTC) takes place this week in the US, with the company launching several new technologies aimed at increasing AI performance – but it is not the only company with new technology that could stir up renewed competition in AI and HPC.
As in previous years, GTC took place in Silicon Valley, with Jensen Huang, CEO, delivering a keynote on the second day of proceedings detailing Nvidia’s latest developments.
There were several notable announcements focusing on graphics, medical imaging, autonomy and self-driving technology, but two announcements concerned AI and datacentre products: a new 32GB NVIDIA Tesla V100 GPU and a GPU interconnect fabric called NVIDIA NVSwitch, which enables communication between up to 16 Tesla V100 GPUs.
Jensen Huang, CEO and founder of Nvidia, noted that there has been huge development in GPU technology over the past five years. Comparing the Tesla K20 with today’s high-end devices, Huang stressed that GPU performance had continued to grow at a faster rate than Moore’s Law.
Huang said: ‘Clearly the adoption of GPU computing is growing and it’s growing at quite a fast rate. The world needs larger computers because there is so much work to be done in reinventing energy, trying to understand the Earth’s core to predict future disasters, or understanding and simulating weather, or understanding how the HIV virus works.’
In an interview with Scientific Computing World, Ian Buck, vice president of the Tesla Data Center Business at NVIDIA, stated that the announcements centred on AI demonstrate the company’s focus on delivering ‘GPUs with more capabilities and more performance’.
‘AI has been at the forefront of changing the way we interact with devices and companies are transforming their businesses with new AI services,’ said Buck.
‘Just five years ago we had the first AlexNet, which was the first neural network to become famous for image recognition. Today’s modern neural networks for image recognition – such as Inception-v4 from Google – are up to 350 times larger than the original AlexNet. It is larger because it is more intelligent, more accurate and it can recognise more objects,’ added Buck.
Nvidia made two key hardware announcements alongside a host of other improvements, such as TensorRT 4 and advancements to its DRIVE system. The key additions to the NVIDIA platform – which, Huang stated, had been adopted by every major cloud service provider and server maker – were the new GPU and the NVSwitch GPU fabric.
Buck explained that the 32GB Tesla V100 GPU ‘will deliver the same performance in terms of floating point performance, same number of Tensor Cores, mechanicals and form factor. We simply have more memory now, and that memory allows us to train some of those larger and more complicated neural networks.’
‘We will still offer the 16GB version, but we will now also offer the 32GB version. Because we kept the form factor identical, it has been very easy for our channel partners to accept this product and add it to their product lines, and they will be making their own announcements this week at GTC,’ he added.
Going further
‘Today’s high-end AI systems incorporate eight of our Tesla V100 GPUs connected with NVLink through a hybrid cube mesh topology. Can we go higher? Can we add more GPUs to provide our AI community with even more powerful GPUs for training?’ asked Buck.
‘One of the challenges to that is how do we scale up NVLink? When we first introduced NVLink it was a point-to-point, high-speed interconnect offering 300GB/s of bandwidth per GPU – we directly connected all of the GPUs together in what we called a hybrid cube mesh topology.
‘To go further we need to expand the capabilities of our NVLink fabric. With that we have invented a new product, which we call the NVSwitch. This is a fully switched device for building an NVLink fabric, allowing us to put up to 18 NVLink ports with 50GB/s per port, giving a grand total of 900GB/s of bandwidth in this fully connected internal crossbar, which is actually a two-billion-transistor switch.’
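Taken at face value, those two figures account for the switch’s aggregate bandwidth (a quick sanity check, assuming the 50GB/s quoted is the total bidirectional bandwidth of each NVLink port):

\[
18~\text{ports} \times 50~\mathrm{GB/s~per~port} = 900~\mathrm{GB/s}
\]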
The NVSwitch fabric can enable up to 16 Tesla V100 GPUs to communicate at a speed of 2.4 terabytes per second. Huang declared that ‘the world wants a gigantic GPU, not a big one, a gigantic one – not a huge one, a gigantic one’.
Combining these new technologies into a single platform yields Huang’s next announcement: the Nvidia DGX-2. Huang explained to the crowd at the conference keynote that the DGX-2 is the first single server capable of delivering two petaflops of computational power.
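That headline figure is consistent with the GPU count if, as seems likely, it refers to Tensor Core mixed-precision throughput rather than conventional double-precision floating point – each Tesla V100 is rated at roughly 125 teraflops for deep learning workloads:

\[
16~\text{GPUs} \times 125~\mathrm{TFLOPS} = 2{,}000~\mathrm{TFLOPS} = 2~\mathrm{PFLOPS}
\]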
DGX-2 has the deep learning processing power of approximately 300 servers occupying 15 racks of datacentre space, while being 60 times smaller and, the company claimed, up to 18 times more power-efficient.
Disrupting HPC
As noted in the processor feature in the February/March issue of Scientific Computing World, Intel is developing FPGA technology through its acquisition of Altera and the newly formed Intel Programmable Solutions Group. Later this year the company will release a programmable acceleration card (PAC), a PCIe-based FPGA accelerator card built around the Arria 10 FPGA.
The PAC connects to the host server through PCIe Gen3 and comes with 8GB of DDR4 memory, along with 128MB of flash. Intel will also provide a supporting software stack to give users additional support and encouragement to pick up the new technology. Some of this will be based on Intel’s own Open Programmable Acceleration Engine (OPAE) technology, but support for OpenCL will also be included.
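As a rough illustration of what that OpenCL support means in practice – this is a generic host-side sketch, not Intel’s own toolchain, and the platform and device names it reports depend entirely on the runtime installed – an application would typically begin by locating the card as an accelerator-class OpenCL device:

/* Minimal OpenCL host sketch: enumerate accelerator-class devices
 * (such as a PCIe FPGA card) and print their names.
 * Illustrative only; assumes an OpenCL runtime for the board is
 * installed. Build with: cc probe_fpga.c -lOpenCL */
#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platforms[8];
    cl_uint num_platforms = 0;

    if (clGetPlatformIDs(8, platforms, &num_platforms) != CL_SUCCESS ||
        num_platforms == 0) {
        fprintf(stderr, "No OpenCL platforms found\n");
        return 1;
    }
    if (num_platforms > 8) num_platforms = 8;

    for (cl_uint p = 0; p < num_platforms; ++p) {
        cl_device_id devices[8];
        cl_uint num_devices = 0;

        /* FPGA boards are normally exposed as CL_DEVICE_TYPE_ACCELERATOR */
        if (clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ACCELERATOR,
                           8, devices, &num_devices) != CL_SUCCESS)
            continue;
        if (num_devices > 8) num_devices = 8;

        for (cl_uint d = 0; d < num_devices; ++d) {
            char name[256];
            clGetDeviceInfo(devices[d], CL_DEVICE_NAME,
                            sizeof(name), name, NULL);
            printf("Accelerator device: %s\n", name);
        }
    }
    return 0;
}

One practical difference from OpenCL on GPUs is that FPGA kernels are normally compiled offline into a bitstream and loaded with clCreateProgramWithBinary, rather than being built from source at runtime.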
While this will go a long way towards overcoming some of the trepidation around the use of FPGA technology, Intel PSG is still targeting specific applications to demonstrate the potential of FPGAs.
Mike Strickland, director, solutions architect at the Intel Programmable Solutions Group, noted that networking acceleration is one key area the company is exploring, as the additional bandwidth provided by FPGAs could be useful there.
‘You can take networking traffic directly into the FPGA, or the FPGA can access data directly. For instance, if you have an FPGA connected to a Xeon processor, there is an embedded PCI Express switch in the Xeon, so the FPGA could access NVMe drives and pull data directly from those drives. For data analytics acceleration, there is really no way to achieve that level of performance without this multifunction play.
‘Some of the things we have been talking about for HPC carry over into the embedded space. One area where there is an absolute overlap is in military intelligence and traditional HPC. There is about an 80 per cent overlap because they have a lot of the same problems,’ added Strickland.
The addition of HBM2 memory to FPGAs could also be an important step, as it will allow much higher memory bandwidth and data throughput. However, Strickland noted that, although it will be included with the PAC card, applications have so far needed it less than might be expected.
‘I can say that two years ago I thought I was going to need the HBM memory for machine learning and things like matrix multiplication but we have really done a good job of reusing data on the FPGA. For instance in machine learning if you have got a model with 20 to 40 layers we will pick part of the image and drill through all the layers as well, which gives us a really high data reuse rate,’ said Strickland.
‘Beyond networking, there are opportunities in data analytics so you can kind of think of it as a caching tier now – a very high bandwidth caching tier – so you can expect to see some applications there as well,’ Strickland added.
Intel PSG is also targeting AI research and recently announced how Bing Intelligent Search is using FPGA technology to power its AI platform.
Reynette Au, vice president of marketing for Intel’s Programmable Solutions Group, commented: ‘In today’s data-centric world, users are asking more of their search engines than ever. Advanced Intel technology gives Bing the power of real-time AI to deliver more intelligent search to users every day. This requires incredibly compute-intensive workloads that are accelerated by Microsoft’s AI platform for deep learning, Project Brainwave, running on Intel Arria and Intel Stratix FPGAs.’
‘Intel FPGAs power the technology that allows Bing to quickly process millions of articles across the web to get you contextual answers. Using machine learning and reading comprehension, Bing will now rapidly provide intelligent answers that help users find what they’re looking for faster, instead of a list of links for the users to manually check.’
Intel is keen to demonstrate that it is not just GPUs that can deliver AI performance, and the repetitive nature of tasks such as inference and training of AI deep learning networks could make them good candidates for FPGA optimisation.
‘Intel FPGAs are making real-time AI possible by providing completely customisable hardware acceleration to complement Intel Xeon CPUs for computationally heavy parts of the deep neural networks while retaining the flexibility to evolve with rapidly changing AI models and tune to achieve high throughput and maximum performance to deliver real-time AI,’ concluded Au.
While some of the biggest players in both HPC accelerators and CPUs are waging war over the AI revolution, AMD has been quietly developing its latest CPU platform EPYC to compete with the latest Xeon CPUs.
The company made headlines at the end of last year when its new range of server products was released, but Greg Gibby, senior product manager of data center products at AMD, notes that he expects the company to begin to see some momentum, as several ‘significant wins’ have already been completed.
‘If you look at the HPC customer set these guys tend to be leading-edge. They are willing to go do ‘unnatural acts’ to go get every ounce of performance out of an application. You have seen that through things like GPU or FPGA acceleration where HPC users are willing to go change code and other things to go get every bit of performance they can,’ said Gibby.
‘I believe that as we get customers testing the EPYC platform on their workloads they see the significant performance advantages that EPYC brings to the market. I think that will provide a natural follow through of us gaining share in that space. Towards the timing and how soon that will all happen, I cannot say specifically but I think you will see in 2018 that we will make some pretty significant advances,’ Gibby added.
Gibby noted that one thing he thought would help drive HPC customers to the AMD platform would be the number of PCIe lanes available to users in a single socket: ‘We have a no-compromise, one-socket system, and what I mean by that is if you look at the way our competitor has designed its platform, the PCIe lanes are tied to the CPU, and with Skylake you get 48 PCIe lanes on each processor. If you wanted to attach a number of GPUs, or if you wanted to have a lot of NVMe drives, you have got to buy a two-socket system.’
‘Now you have got some latency, where you have to go from socket to socket, and a lot of times you end up with limitations on the PCIe lanes, which require you to put in a switch. With our solution, because you get 128 PCIe lanes off that single socket, you no longer have that limitation. You can put 24 NVMe drives or up to six GPUs all on a single socket, without having to introduce the latency and cost of adding PCIe switches.’
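Gibby’s examples are consistent with that lane count, assuming the usual x4 link per NVMe drive and x16 link per GPU:

\[
24~\text{NVMe drives} \times 4~\text{lanes} = 96~\text{lanes} \le 128
\]
\[
6~\text{GPUs} \times 16~\text{lanes} = 96~\text{lanes} \le 128
\]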
‘That is one thing that is a little bit unique which I think could be potentially disruptive in the market when it comes to the HPC side,’ Gibby concluded.