HPC systems have all but turned away from a single giant piece of iron in favour of distributed computing using commercial off-the-shelf servers with hundreds or even thousands of nodes, all of which must be interconnected. For many people configuring an HPC system, the first thought when it comes to the interconnect might be to use the Ethernet ports that come as standard on virtually every server and thus are essentially ‘free’, adding no apparent cost to the system. But does this logic really hold water?
In many cases, it does not. By adding an interconnect that transfers data at much greater speeds and with lower latency, you can improve system performance to the point where applications run much more quickly, saving engineering and design time, and additional servers might not be necessary.
What, then, are these high-performance alternatives to Ethernet? If you check the latest Top500 list of supercomputers from June, it’s true that more than half use gigabit Ethernet – but that also means that almost as many do not. The dominant alternative is InfiniBand, which has overtaken other schemes such as Myrinet and, in this latest list, increased its share somewhat to 28 per cent of all Top500 systems – more than all other alternatives to Ethernet combined. In addition, a recent study by Tabor Research indicates that InfiniBand is the clear leader in new HPC shipments, stating that 60 per cent of surveyed HPC systems installed since the start of 2007 use it as a system interconnect. The market research company IDC adds that InfiniBand is the only true unified wire solution shipping in volume today, with approximately five million ports to date.
You might not be aware of it, but InfiniBand is already a decade old. It evolved from the merger of two earlier designs: Future I/O (developed by Compaq, IBM and Hewlett-Packard) and Next Generation I/O (developed by Intel, Microsoft and Sun). While the InfiniBand Trade Association has more than three dozen members, there are at present three major suppliers of InfiniBand adapters: Mellanox, QLogic and Voltaire. Of these, Mellanox was until recently the sole supplier of InfiniBand chips, but QLogic has also begun supplying the required silicon for these interfaces.
What is driving this growth? As Figure 1 shows, the primary factor is raw speed, with data-transfer rates now at 40Gb/s, along with very low latency. In reviewing various products, you’ll encounter several flavours of InfiniBand: SDR (single data rate) at 10Gb/s, DDR (double data rate) at 20Gb/s and, just recently, QDR (quad data rate) at 40Gb/s. InfiniBand can continue to evolve, adds Gilad Shainer, director of technical marketing at Mellanox, who has seen this technology double its speed every two years; by the end of next year he predicts that EDR (eight data rate) InfiniBand at 80Gb/s will become available. He contrasts that with the slower development of Ethernet, where the specification for 10Gb Ethernet was completed in 2003 and widespread implementations are only now starting to appear – and that scheme should dominate for the next five to six years.
Price is also a factor and, as shown in the table – counter to what you might think – the hardware cost for extra speed drops. IDC also states that, compared to a single 40Gb/s QDR InfiniBand fabric, traditional fabrics can more than double the cost of operating and managing the I/O behind a virtual server infrastructure.
What is InfiniBand?
InfiniBand is a point-to-point, switched I/O fabric architecture in which each communication link extends between only two devices – whether between two processors or between a processor and an I/O device such as mass storage – and each has exclusive access to the communication path. Data moves directly into application memory, which accounts for the scheme’s low latency. By adding switches, multiple points can be interconnected to create a fabric; as more switches are included, the fabric’s aggregated bandwidth increases. By providing multiple paths between devices, switches also add a greater level of redundancy.
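To see how a node’s software views the adapters that terminate these point-to-point links, the short C sketch below uses libibverbs, the user-space verbs library commonly shipped with InfiniBand stacks, to list the host channel adapters present in a server. It is only a minimal illustration and assumes a machine with the verbs library and at least one adapter installed.

```c
/* list_hcas.c - minimal sketch: enumerate InfiniBand HCAs with libibverbs.
 * Build (assuming libibverbs is installed): gcc list_hcas.c -o list_hcas -libverbs */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num_devices = 0;
    struct ibv_device **devices = ibv_get_device_list(&num_devices);
    if (!devices) {
        perror("ibv_get_device_list");
        return 1;
    }

    printf("Found %d RDMA device(s)\n", num_devices);
    for (int i = 0; i < num_devices; ++i) {
        /* Each device is one HCA; its GUID (shown in raw network byte order)
         * identifies it on the fabric. */
        printf("  %s (node GUID 0x%016llx)\n",
               ibv_get_device_name(devices[i]),
               (unsigned long long)ibv_get_device_guid(devices[i]));
    }

    ibv_free_device_list(devices);
    return 0;
}
```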
As clusters and the number of processors per cluster grow to address increasingly complex HPC applications, the communication needs of those applications rise dramatically, making interconnect performance a crucial factor in scaling successfully. An interconnect that transfers messages between cluster nodes efficiently is required if applications are to scale.
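Whether a given interconnect actually delivers low-latency messaging is easy to check empirically. The sketch below is a classic MPI ping-pong microbenchmark, written here as a generic illustration rather than any vendor’s test code; run across two nodes, it reports the average one-way latency for small messages, which is exactly the figure that limits fine-grained parallel scaling.

```c
/* pingpong.c - sketch of an MPI ping-pong latency test between rank 0 and rank 1.
 * Run with: mpicc pingpong.c -o pingpong && mpirun -np 2 ./pingpong */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    const int iters = 10000;
    char msg[8] = {0};              /* small 8-byte message */
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();
    for (int i = 0; i < iters; ++i) {
        if (rank == 0) {
            MPI_Send(msg, sizeof msg, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(msg, sizeof msg, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(msg, sizeof msg, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(msg, sizeof msg, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double elapsed = MPI_Wtime() - start;

    if (rank == 0)   /* half the round trip is the one-way latency */
        printf("average one-way latency: %.2f microseconds\n",
               elapsed / iters / 2.0 * 1e6);

    MPI_Finalize();
    return 0;
}
```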
Figure 1: Comparison of the most popular HPC interconnect schemes. Data courtesy of Voltaire Inc.
InfiniBand is designed to be scalable. Typical Ethernet environments take a multi-tier approach using Layer 3-7 services whereas InfiniBand switching uses a relatively flat Layer 2 architecture. ‘You can build a very large data centre with thousands of nodes with only two tiers of switches,’ explains Asaf Somekh, Voltaire’s VP of marketing, ‘but with competitive Ethernet solutions you need three, sometimes four tiers to build something of similar size.’ The advantage of a flatter architecture is not only reduced costs, but also lower latency because of fewer hops in the data path.
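A back-of-envelope calculation shows why two tiers go such a long way. Assuming, purely for illustration, 36-port switch silicon (typical of QDR-era building blocks) and a fully non-blocking fat-tree layout, the short C sketch below computes how many end nodes two and three tiers of switches can serve.

```c
/* fat_tree.c - back-of-envelope node counts for non-blocking fat-tree fabrics.
 * Assumes identical switches of radix k, with half of each lower-tier switch's
 * ports facing hosts (or the tier below) and half facing the tier above. */
#include <stdio.h>

/* Maximum end nodes in a fully non-blocking fat tree with the given number
 * of tiers: k * (k/2)^(tiers-1). */
static long max_nodes(long k, int tiers)
{
    long nodes = k;
    for (int t = 1; t < tiers; ++t)
        nodes = nodes * k / 2;
    return nodes;
}

int main(void)
{
    const long k = 36;   /* assumed switch radix (ports per switch chip) */
    printf("two tiers of %ld-port switches: %ld nodes\n", k, max_nodes(k, 2));
    printf("three tiers of %ld-port switches: %ld nodes\n", k, max_nodes(k, 3));
    return 0;
}
```

With those assumptions, two tiers already reach 648 nodes and three tiers more than 11,000 – which is why large InfiniBand fabrics can stay comparatively flat.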
How does InfiniBand achieve its high speed and low latency? Put simply, an Ethernet port uses the host CPU to control the flow of data, whereas InfiniBand adapters have hardware dedicated to that task so the system CPU is freed up; InfiniBand bypasses the CPU and operating system for data transfers.
Another factor is InfiniBand’s use of RDMA (Remote Direct Memory Access), which allows data to move directly from the memory of one computer into that of another without involving either one’s operating system. Such transfers require no work from the CPUs, cause no cache pollution or context switches, and proceed in parallel with other system operations. When an application performs an RDMA Read or Write request, the application data is delivered directly to the network, reducing latency and enabling fast message transfer.
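In code, issuing an RDMA transfer amounts to handing the adapter a work request that names the remote memory directly. The fragment below is a heavily simplified sketch using libibverbs: it assumes the queue pair has already been created and connected, that the local buffer has been registered with ibv_reg_mr, and that the peer’s address and rkey have been exchanged out of band – all of that connection plumbing is omitted here.

```c
/* rdma_write_sketch.c - fragment showing how an RDMA Write is issued with libibverbs.
 * Only a sketch: queue-pair setup and the exchange of remote_addr/rkey are assumed
 * to have been done elsewhere. */
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Write 'len' bytes from local registered memory directly into the peer's memory. */
int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr, void *local_buf,
                    size_t len, uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,  /* local buffer, must lie inside 'mr' */
        .length = (uint32_t)len,
        .lkey   = mr->lkey,              /* local key from ibv_reg_mr() */
    };

    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof wr);
    wr.opcode              = IBV_WR_RDMA_WRITE;  /* data lands in remote memory with
                                                    no remote CPU involvement */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;  /* request a completion entry */
    wr.wr.rdma.remote_addr = remote_addr;        /* obtained from the peer out of band */
    wr.wr.rdma.rkey        = rkey;               /* peer's memory-region key */

    return ibv_post_send(qp, &wr, &bad_wr);      /* the adapter hardware takes it from here */
}
```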
The issue of efficient data transfer is becoming more critical as multi-core servers become the standard. It’s necessary to move data in and out of each of these processors with the lowest possible latency so that a CPU does not sit idle while waiting for other CPUs to be serviced with data. Applications and the problems they solve are getting more complex, and many machines run multiple jobs in parallel, so the load on the network is increasing.
Ethernet will continue to play a major role in all types of computing. It is designed to connect a wide variety of devices – PCs, telephones, video security cameras, printers and servers – to a common data centre network. Furthermore, it is much more common, has thousands more devices that can connect to it, and it still offers the best plug-and-play experience along with backward compatibility with legacy systems and applications. Ethernet represents a market of roughly $2bn, compared with hundreds of millions of dollars for InfiniBand.
Ethernet suppliers like to point out what they perceive as issues with InfiniBand. For instance, says Charles Ferland, VP for the EMEA Regions for Blade Network Technologies, InfiniBand is not always as fast as advertised; while the QDR version claims 40Gb/s performance, he counters that it actually delivers closer to 26Gb/s due to hardware limits in the Gen 2 PCIe bus, which accommodates InfiniBand adapters. He adds that InfiniBand certainly doesn’t ‘work and play well’ with others, and that while it might be the obvious choice in the HPC world, compared to well-known and widely used Ethernet standards, the comfort level of InfiniBand might be equated to ‘sleeping on a bed of nails’.
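That 26Gb/s figure is close to what simple arithmetic predicts once encoding and bus overheads are counted. The C sketch below walks through the back-of-envelope calculation; the roughly 80 per cent PCIe protocol efficiency used here is an assumption for illustration, not a measured value.

```c
/* qdr_throughput.c - back-of-envelope calculation of usable QDR InfiniBand bandwidth. */
#include <stdio.h>

int main(void)
{
    double signalling = 40.0;                      /* QDR: 4 lanes x 10Gb/s signalling rate */
    double ib_data    = signalling * 8.0 / 10.0;   /* 8b/10b encoding -> 32Gb/s of data */

    double pcie_raw   = 8 * 5.0 * 8.0 / 10.0;      /* PCIe Gen2 x8: 5GT/s per lane, 8b/10b -> 32Gb/s */
    double pcie_eff   = pcie_raw * 0.80;           /* assume ~80% efficiency after packet
                                                      headers and flow control (assumption) */

    double usable = ib_data < pcie_eff ? ib_data : pcie_eff;
    printf("InfiniBand QDR data rate:  %.1f Gb/s\n", ib_data);
    printf("PCIe Gen2 x8 usable rate:  %.1f Gb/s\n", pcie_eff);
    printf("Expected end-to-end ceiling: about %.0f Gb/s\n", usable);
    return 0;
}
```

On those assumptions it is the PCIe Gen2 x8 slot, rather than the InfiniBand link itself, that caps throughput at around 26Gb/s – essentially the limit Ferland describes.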
Look at network efficiency
In comparing InfiniBand and Ethernet, states Voltaire’s Somekh, one of the most important parameters people should look at is network efficiency: what is the impact of the network on application efficiency? This simple metric, he believes, articulates everything about this alternative approach. With large data transfers, Ethernet consumes as much as 50 per cent of the CPU cycles; with InfiniBand the loss averages 10 to 20 per cent or less. So, while you might not have to pay more in hardware costs to implement an Ethernet network, for HPC you will spend longer running applications to get results – which means extra development and analysis time – or you might end up purchasing extra compute nodes to provide the horsepower.
For specific details, first consider some statistics we reported in a previous article in this magazine (‘When models outgrow hardware, turn to HPC’, October/November 2008). In addition, a number of examples are available on the website of the HPC Advisory Council (www.hpcadvisorycouncil.com/best_practices.php). Among them is one running the Fluent package from Ansys to compute the Eddy_417K benchmark (a reacting flow case using the eddy dissipation model with approximately 417,000 hexahedral cells). As Figure 2 shows, InfiniBand shows increasing improvement as the number of nodes increases, reaching 192 per cent improvement at 24 nodes.
Figure 2: Comparison of Fluent software running a common benchmark on a system with InfiniBand vs. one with Ethernet (source: HPC Advisory Council).
There are also power savings to be had, and this is critical when HPC facilities are confronting major issues with power supplies, cooling and costs. The same study indicates that InfiniBand cuts power costs considerably to finish the same number of Fluent jobs compared to Gigabit Ethernet; as cluster size increases, more power can be saved.
Products becoming more flexible
InfiniBand adapters aren’t something you’ll find in the local computer store, as most are sold to system developers and server manufacturers. These are generally bus-based products using schemes such as PCI-Express, and they typically use microGiGaCN or QSFP (Quad Small Form-factor Pluggable) connectors. To ease the transition to InfiniBand, vendors have developed adapter cards that support multiple schemes. Mellanox’s ConnectX adapters, for instance, provide two ports, each with auto-sense capability, so a port can identify and operate on 40Gb/s InfiniBand or 10Gb Ethernet with FCoE (Fibre Channel over Ethernet).
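From the software side, such a dual-personality port can be queried to see which link layer it has come up on. The minimal C sketch below, again using libibverbs and assuming at least one adapter is present, opens the first device and reports whether port 1 negotiated InfiniBand or Ethernet.

```c
/* port_check.c - sketch: report the link layer negotiated by port 1 of the first HCA.
 * Build: gcc port_check.c -o port_check -libverbs */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int n = 0;
    struct ibv_device **devs = ibv_get_device_list(&n);
    if (!devs || n == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_port_attr attr;
    if (ctx && ibv_query_port(ctx, 1, &attr) == 0) {
        printf("port 1 link layer: %s, state: %s\n",
               attr.link_layer == IBV_LINK_LAYER_ETHERNET ? "Ethernet" : "InfiniBand",
               attr.state == IBV_PORT_ACTIVE ? "active" : "not active");
    }

    if (ctx) ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```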
Once you’ve set up an InfiniBand link on a server, you connect it to a switch to link to other servers. Here InfiniBand supports a very flat architecture compared to Ethernet, and one switch can address hundreds of servers, with models from all major suppliers now supporting 864 InfiniBand ports as the current high-end standard.
For instance, Voltaire’s Grid Director 4700 has as many as 864 ports of 40Gb/s InfiniBand connectivity. The switch’s HyperScale architecture provides an inter-switch link capability for stacking multiples of 324 ports to form scalable fabrics. The company is also proud of its Unified Fabric Manager (UFM) software, which automatically discovers, virtualises, monitors and optimises the fabric infrastructure and accelerates active applications. This approach – in which the management software is not installed on individual adapters but runs on one server in the cluster and talks with all the switches and servers, so there is no need to change the application software – is common to other vendors’ switches as well. In this way, the management software can isolate application traffic from storage traffic and maintain application performance.
Recently announcing an 864-port product is Mellanox, whose IS5600 40Gb/s switch supports adaptive routing (which dynamically and automatically reroutes traffic to alleviate congested ports) and static routing (which produces superior results in networks where traffic patterns are more predictable). In turn, QLogic’s Model 12800 features as many as 864 InfiniBand ports at 40Gb/s, with support for DDR and SDR versions; the firm also believes it offers the best power efficiency currently available, at 7.8W per port.
Another interesting group of products addresses the issue of interconnect convergence. Some clusters today use one fabric for data centre operations (Ethernet), another for data transfers (InfiniBand) and yet another for data storage (Fibre Channel). If you want to reduce infrastructure costs by reducing the number of networks, why not instead use just one wire – the InfiniBand pipe at 40Gb/s – for all three? That has become possible today, and Voltaire says that almost 90 per cent of its customers use the same wire for data and a scalable file system.
Along these lines, an interesting product is the Mellanox BridgeX gateway, which delivers a fabric-agnostic I/O consolidation scheme that allows end users to run applications over 40Gb/s InfiniBand or 10Gb Ethernet as well as 8Gb/s Fibre Channel. By bridging all three schemes on a single 40Gb fabric, Mellanox makes it possible for companies to migrate to a converged, higher-speed data centre fabric without abandoning legacy investments in adapters and switches.
InfiniBand features on Ethernet
Another trend is the effort to add InfiniBand features such as RDMA to improve Ethernet performance in what is variously called CEE (Converged Enhanced Ethernet), DCB (Data Centre Bridging) or DCE (Data Centre Ethernet). By the end of the year, Voltaire plans to start shipping its Model 8500, a CEE-compliant non-blocking switch. Voltaire claims that, on a 1,000-node server setup, its 10 GigE gear will deliver up to 10x lower latency and 4x the core capacity while consuming a third of the power of a comparable solution – and for half the cost. Similarly, Mellanox recently demonstrated its LLE (low-latency Ethernet) scheme running over its ConnectX Ethernet adapters with a latency of 3μsec.