At the Hot Chips conference, held this week in California, Microsoft revealed its latest deep learning acceleration platform, known as Project Brainwave.
Microsoft has designed the system around ‘real-time AI’, meaning it processes requests as fast as it receives them, with ultra-low latency, using Intel’s Stratix 10 FPGAs.
In a recent blog post, Doug Burger, Distinguished Engineer at Microsoft, stated: ‘Real-time AI is becoming increasingly important as cloud infrastructures process live data streams, whether they be search queries, videos, sensor streams, or interactions with users.’
‘By attaching high-performance FPGAs directly to our datacentre network, we can serve DNNs as hardware microservices, where a DNN can be mapped to a pool of remote FPGAs and called by a server with no software in the loop,’ Burger commented. ‘This system architecture both reduces latency, since the CPU does not need to process incoming requests, and allows very high throughput, with the FPGA processing requests as fast as the network can stream them,’ he added.
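To make the ‘no software in the loop’ idea concrete, here is a minimal client-side sketch of calling a DNN served as a hardware microservice. The endpoint addresses, port, and wire format are all hypothetical; Microsoft has not published the actual protocol.

```python
import socket
import struct

# Hypothetical pool of network-attached FPGAs all serving the same DNN.
FPGA_POOL = [("10.0.0.11", 7000), ("10.0.0.12", 7000)]

def infer(features, endpoint=FPGA_POOL[0], timeout=0.001):
    """Send one inference request straight to an FPGA and await the result.

    The CPU only moves bytes: the DNN itself runs on the remote FPGA,
    which is what keeps software out of the request-processing loop.
    """
    payload = struct.pack(f"<{len(features)}f", *features)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)            # sub-millisecond latency budget
        sock.sendto(payload, endpoint)
        data, _ = sock.recvfrom(65536)      # raw result tensor comes back
    return struct.unpack(f"<{len(data) // 4}f", data)
```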
This is just one of many applications using FPGAs that Microsoft has deployed since Intel acquired the FPGA manufacturer Altera in 2015. The Stratix 10 is the latest FPGA product from Intel and is built on a 14nm fabrication process.
‘Even on early Stratix 10 silicon, the ported Project Brainwave system ran a large GRU model—five times larger than ResNet-50—with no batching, and achieved record-setting performance,’ stated Burger. ‘The demo used Microsoft’s custom 8-bit floating point format (“ms-fp8”), which does not suffer accuracy losses (on average) across a range of models. We showed Stratix 10 sustaining 39.5 Teraflops on this large GRU, running each request in under one millisecond.’
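Microsoft has not published the exact bit layout of ms-fp8, but the idea behind a narrow floating-point format is easy to illustrate. The sketch below quantises values to an assumed 1-sign/4-exponent/3-mantissa split, chosen purely for illustration, and measures the rounding error such a format would introduce on random weights.

```python
import numpy as np

def quantize_fp8(x, exp_bits=4, man_bits=3):
    """Round values to an assumed 8-bit float grid (sign + exp + mantissa)."""
    m, e = np.frexp(x)                  # x == m * 2**e, with m in [0.5, 1)
    scale = 1 << (man_bits + 1)         # implicit leading bit + mantissa bits
    m = np.round(m * scale) / scale     # round mantissa to man_bits precision
    e_max = 1 << (exp_bits - 1)         # clamp exponent to what the field holds
    e = np.clip(e, -e_max + 2, e_max)
    return np.ldexp(m, e)

weights = np.random.randn(1000).astype(np.float32)
err = np.abs(weights - quantize_fp8(weights)) / np.abs(weights)
print(f"mean relative rounding error: {err.mean():.3%}")  # a few per cent
```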
‘Running on Stratix 10, Project Brainwave thus achieves unprecedented levels of demonstrated real-time AI performance on extremely challenging models. As we tune the system over the next few quarters, we expect significant further performance improvements,’ Burger concluded.
The Project Brainwave system is built in three main layers: a high-performance, distributed system architecture; a hardware DNN engine synthesised onto FPGAs; and a compiler and runtime for low-friction deployment of trained models.
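Taken together, the three layers imply a deployment flow along the following lines. This is only a structural sketch: every class, method, and file name is hypothetical, since the Brainwave toolchain itself has not been released.

```python
class BrainwaveStyleService:
    """Structural sketch: trained model -> compiler -> FPGA pool -> service."""

    def __init__(self, trained_model_path):
        self.model_path = trained_model_path
        self.program = None
        self.pool = []

    def compile(self, data_type="ms-fp8"):
        # Layer three: a compiler lowers the trained DNN graph to the soft
        # DPU's instruction stream, fixing the data type at synthesis time.
        self.program = {"model": self.model_path, "dtype": data_type}
        return self

    def map_to_pool(self, fpga_endpoints):
        # Layer one: the compiled DNN is mapped onto a pool of remote FPGAs
        # reachable directly over the datacentre network.
        self.pool = list(fpga_endpoints)
        return self

    def __call__(self, request):
        # Layer two: each request streams through the hardware DNN engine
        # as fast as the network can deliver it (stubbed out in this sketch).
        raise NotImplementedError("hardware path is a sketch only")

service = (BrainwaveStyleService("large_gru.model")   # hypothetical file
           .compile(data_type="ms-fp8")
           .map_to_pool([("10.0.0.11", 7000), ("10.0.0.12", 7000)]))
```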
Burger explained that, first, Project Brainwave leverages this FPGA infrastructure by attaching high-performance FPGAs directly to the datacentre network, which allows Microsoft to serve ‘DNNs as hardware microservices, where a DNN can be mapped to a pool of remote FPGAs and called by a server with no software in the loop.’ This architecture both reduces latency and allows very high throughput.
Next, the system uses a ‘soft’ DNN processing unit (DPU), synthesised onto commercially available FPGAs. ‘A number of companies—both large companies and a slew of startups—are building hardened DPUs. Although some of these chips have high peak performance, they must choose their operators and data types at design time, which limits their flexibility,’ said Burger.
However, Microsoft’s new system takes a different approach: the company claims the design scales across a range of data types, with the desired data type being a synthesis-time decision.
‘The design combines both the ASIC digital signal processing blocks on the FPGAs and the synthesisable logic to provide a greater and more optimised number of functional units. This approach exploits the FPGA’s flexibility in two ways. First, we have defined highly customised, narrow-precision data types that increase performance without real losses in model accuracy. Second, we can incorporate research innovations into the hardware platform quickly (typically a few weeks), which is essential in this fast-moving space,’ stated Burger.
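The point about synthesis-time data types can be sketched in software terms: a functional unit is generated once for a chosen precision, rather than supporting every data type at run time. The names below are illustrative, not Microsoft’s.

```python
def make_mac_unit(quantize):
    """Generate a multiply-accumulate unit specialised for one data type,
    mirroring how a soft DPU fixes precision when it is synthesised."""
    def mac(acc, a, b):
        # Operands pass through the chosen narrow-precision quantiser
        # before the multiply, as the hardware's functional unit would.
        return acc + quantize(a) * quantize(b)
    return mac

# Two 'synthesised' variants: full precision vs. a narrow fixed-point type.
mac_fp32 = make_mac_unit(lambda v: v)
mac_q4   = make_mac_unit(lambda v: round(v * 16) / 16)   # 4 fractional bits

print(mac_fp32(0.0, 0.333, 0.777))   # 0.258741
print(mac_q4(0.0, 0.333, 0.777))     # 0.234375: coarser, cheaper in hardware
```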