In the past few years, operators of HPC facilities have become keenly aware of power issues; they often spend as much on removing heat as they do on powering the servers. Where power was once 15 per cent of the cost of a data centre, today it's roughly 50 per cent. Although much work is being done on better ways to handle and remove heat from facilities, server manufacturers are busy combating the source of the problem – optimising the units that generate the heat in the first place. The limitations of cooling approaches affect not only the direct cost of electricity but also the density with which servers can be packed into a facility. In a data centre, each processor is similar to a 100W light bulb, each server to an electric bar heater, and each rack to a domestic heating boiler. In other words, heat-optimised servers lead to cost avoidance by allowing more compute power to be packed into existing infrastructure.
To see the trend in power and heat, consider that a machine near the top of the Top500 list in 2002 delivered 3.7 teraflops from 512 servers in 25 racks and consumed 128kW; five years later, 3.7 teraflops didn't even make the list – by 2007 that performance came from a single rack of 53 servers consuming 21kW. Today you can get seven teraflops in one rack of 64 servers for 30kW.
Where’s the power going?
By examining where the power in a typical server goes, manufacturers know better where to focus their efforts. We've traditionally thought of the CPU as the 'power hog' in a server, but that's not the case. The figure, right, shows the major contributors to the power budget for a typical server (a 1U, dual-processor, quad-core HPC design with a total of 16GB of memory). There are many contributors, and according to Mike Patterson, senior power and thermal architect in Intel's Eco-Technology Program Office, CPUs are becoming a shrinking part of the pie.
Chip vendors have reached the point where they can no longer deliver additional computational power simply by increasing clock speeds, so they have focused instead on adding cores while trying to hold the power envelope for each chip to a 130W peak.
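As a rough illustration of that trade-off – and it is only an illustration, using the textbook dynamic-power relation P ≈ C·V²·f with made-up scaling factors rather than any vendor's figures – consider what happens to power and throughput when the clock is pushed up versus when cores are added:

```python
# Illustrative only: why adding cores beats raising clock speed for a fixed
# power envelope. Uses the textbook dynamic-power relation P ~ C * V^2 * f,
# with the simplifying assumption that voltage scales roughly with frequency.
# All numbers below are invented for the sake of the arithmetic.

def relative_power(freq_scale, cores=1):
    """Dynamic power relative to one core at nominal frequency/voltage."""
    volt_scale = freq_scale            # crude assumption: V tracks f
    return cores * (volt_scale ** 2) * freq_scale

def relative_throughput(freq_scale, cores=1):
    """Idealised throughput for a perfectly parallel workload."""
    return cores * freq_scale

one_fast  = (relative_power(1.3, cores=1), relative_throughput(1.3, cores=1))
four_slow = (relative_power(0.8, cores=4), relative_throughput(0.8, cores=4))

print(f"1 core at 130% clock : power {one_fast[0]:.2f}, throughput {one_fast[1]:.2f}")
print(f"4 cores at 80% clock : power {four_slow[0]:.2f}, throughput {four_slow[1]:.2f}")
# 1 core at 130% clock : power 2.20, throughput 1.30
# 4 cores at 80% clock : power 2.05, throughput 3.20
```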
To give an idea of how chip vendors are doing so, Patterson first points to trends in dielectric materials. Today's processors have very small features; in places, the silicon dioxide gate dielectric is just five atomic layers thick. Leakage current through these thin dielectrics has come to account for 50 per cent or more of a chip's power, so a processor drawing 80W devotes only half of that to actual computation. New materials, and the use of metal gates in place of polysilicon, have produced high-k gate dielectrics that reduce gate leakage by over 100 times, so that now nearly all of the power is available for computation.
At the CPU level, Intel is not alone in using tricks such as powering down portions of the CPU when they are not in use, or reducing clock speed based on the load. Certain support circuitry must always run, but with a four-core device, depending on the application, you might be able to shut down three of the cores.
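The processors do this in hardware, but the idea can be sketched from the operating system side. The snippet below is only an illustration of the concept and assumes a Linux host that exposes the standard CPU hot-plug controls in sysfs:

```python
# Sketch of the 'shut down unused cores' idea on a Linux host, assuming the
# standard sysfs hot-plug interface (/sys/devices/system/cpu/cpuN/online).
# Requires root; cpu0 typically cannot be taken offline. Illustrative only --
# the servers described here do this in firmware/OS power management.
import os

def set_core_online(cpu: int, online: bool) -> None:
    path = f"/sys/devices/system/cpu/cpu{cpu}/online"
    if not os.path.exists(path):          # e.g. cpu0 has no 'online' file
        return
    with open(path, "w") as f:
        f.write("1" if online else "0")

# On a four-core device running a single-threaded job, keep core 0 and
# park the other three; bring them back when the load returns.
for cpu in (1, 2, 3):
    set_core_online(cpu, False)
```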
Monitoring temperature on the chip
A somewhat opposite approach is Turbo mode, in which the processor monitors its temperature, power and current draw; if it sees headroom and the load requires it, the chip can raise its clock speed above the rated value. In this way you can coax more performance from the same amount of power. The momentary power consumption doesn't drop, but performance per watt goes up – the work finishes sooner, so cores can be shut down earlier and energy is saved overall.
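A back-of-the-envelope sketch of that 'finish sooner, idle sooner' argument, with invented numbers rather than measured ones:

```python
# Why a higher performance/watt figure can save energy even when the
# instantaneous power rises: the job completes sooner, so the cores reach
# their idle state earlier. All numbers are illustrative assumptions.
job_work    = 1000.0   # arbitrary units of work
base_rate   = 10.0     # work units/second at the rated clock
base_power  = 80.0     # watts at the rated clock

turbo_rate  = 12.0     # ~20% faster in turbo (assumed)
turbo_power = 92.0     # turbo draws more power momentarily (assumed)
idle_power  = 10.0     # package power once the cores are parked (assumed)
window      = 120.0    # seconds over which energy is accounted

def energy(rate, power):
    busy = job_work / rate
    return power * busy + idle_power * (window - busy)

print(f"rated clock: {energy(base_rate, base_power):.0f} J over {window:.0f} s")
print(f"turbo clock: {energy(turbo_rate, turbo_power):.0f} J over {window:.0f} s")
# rated clock: 8200 J, turbo clock: 8033 J -- less total energy despite the
# higher momentary draw, because the idle state is reached sooner.
```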
Such approaches show the importance of monitoring temperature directly on the chip. With the proprietary vector processor in its SX-9 computer, NEC has implemented a new chip-level temperature-management technique. Measuring chip temperature is normally done with a thermal diode, but such a diode must be quite large, and it is not easy to measure the analogue junction current and convert it to a digital signal. NEC instead uses a single large transistor configured to generate only leakage current, which is highly temperature sensitive, together with a capacitor and a comparator circuit. The result is a 35 × 35μm temperature sensor with an inherently digital output. In the SX-9, a dedicated chip performs the fine-grained thermal management, shifting loads among the CPUs and scaling down the voltage and frequency of each processor to keep temperatures relatively constant.
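The measurement principle can be illustrated with a toy model – the constants below are invented for the sake of the arithmetic and are not NEC's circuit values:

```python
# Toy model of a leakage-based digital temperature sensor: a transistor's
# leakage current (strongly temperature dependent) discharges a small
# capacitor, and the number of clock cycles until a comparator trips encodes
# the temperature directly as a digital count. Constants are illustrative.
import math

C       = 1e-12          # capacitor, farads (assumed)
V_START = 1.0            # volts on the capacitor at the start of a cycle
V_TRIP  = 0.5            # comparator threshold, volts
F_CLK   = 100e6          # counter clock, Hz
I0, T0, ALPHA = 10e-9, 25.0, 0.05   # leakage ~ I0 * exp(ALPHA*(T - T0)), assumed

def comparator_count(temp_c: float) -> int:
    """Clock cycles until the leakage current pulls the cap below threshold."""
    i_leak = I0 * math.exp(ALPHA * (temp_c - T0))
    t_trip = C * (V_START - V_TRIP) / i_leak
    return int(t_trip * F_CLK)

for t in (25, 50, 75, 100):
    print(f"{t:3d} C -> count {comparator_count(t)}")
# Hotter chip -> more leakage -> faster discharge -> smaller count.
```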
Issues with power conversion
When considering the power consumed by any component – whether a CPU, memory, hard disk or fan – note that there are unavoidable losses in the power-conversion steps, meaning you must always feed more power into the server than it can actually supply to its components. As Ed Seminaro, IBM chief power HPC hardware system architect, explains, power conversion in a typical server takes place in four steps: the server's power supply rectifies the incoming AC, boosts it to 400V DC and then bucks it down to 12V DC. These three steps combined can be in the order of 80 per cent efficient in some low-cost servers, but modern components and higher-quality designs are bringing this number to 95 per cent. The fourth and final step takes place in voltage regulators on the motherboard, where 12V is bucked down to between 1V and 1.5V to power the chips. It is not unheard of for this final step to be only 82 per cent efficient, but market leaders have achieved efficiencies of up to 94 per cent. The overall result for the four steps can range from 66 per cent for an older bare-bones design to 89 per cent for a best-of-breed design. While that might sound low, 80 per cent overall was considered a very good number just a short time ago.
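The overall figures are simply the product of the stage efficiencies; a short sketch of the arithmetic:

```python
# Worked version of the conversion-loss arithmetic quoted above: end-to-end
# efficiency is the product of the stages, and 1/overall is the extra power
# the facility must feed in per watt delivered to the chips.
def end_to_end(psu_eff, vrm_eff):
    return psu_eff * vrm_eff

for label, psu, vrm in [("older bare-bones design", 0.80, 0.82),
                        ("best-of-breed design",    0.95, 0.94)]:
    overall = end_to_end(psu, vrm)
    print(f"{label}: {overall:.0%} overall, "
          f"{1/overall:.2f} W in per W delivered to the chips")
# older bare-bones design: 66% overall, 1.52 W in per W delivered
# best-of-breed design:    89% overall, 1.12 W in per W delivered
```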
While CPUs continue to consume a great deal of energy in servers, other components are also candidates for power savings. (source: Intel Corp.)
Beyond CPUs, memory is becoming a primary concern for thermal issues. Not all HPC applications are as hungry for memory bandwidth as one might think, which provides the opportunity to use various techniques to reduce power consumption. When a memory chip isn't being used it can be put to 'sleep', but that has a negative effect on latency. Memory DIMMs today come in various configurations of DRAM bit width and capacity. Chips with wider bit slices and higher capacities are more energy efficient, because fewer of them are needed to move a given amount of data; however, their use can reduce bandwidth, depending on the configuration. If some reduction in bandwidth and some increase in latency do not dramatically affect application performance, memory power consumption can be cut significantly.
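To make the device-count argument concrete, here is a rough sketch; the per-device power figure is an assumption for the arithmetic, not a datasheet value:

```python
# Rough illustration of why higher-capacity, wider DRAMs save power: a 16GB
# memory system built from denser chips needs fewer devices in total, and a
# 64-bit-wide access touches fewer devices when each chip supplies 8 bits
# rather than 4. The per-device power figure is assumed, not from a datasheet.
TOTAL_GBIT     = 16 * 8      # 16GB of memory, expressed in gigabits
POWER_PER_CHIP = 0.4         # watts per active DRAM (assumed)

for density_gbit, width_bits in [(1, 4), (2, 8)]:
    total_chips    = TOTAL_GBIT // density_gbit   # chips needed for capacity
    chips_per_rank = 64 // width_bits             # chips touched per 64-bit access
    print(f"{density_gbit}Gbit x{width_bits}: {total_chips:3d} chips in total, "
          f"{chips_per_rank:2d} per access, "
          f"~{total_chips * POWER_PER_CHIP:.0f}W if all active (assumed)")
# 1Gbit x4: 128 chips in total, 16 per access, ~51W if all active (assumed)
# 2Gbit x8:  64 chips in total,  8 per access, ~26W if all active (assumed)
```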
Other overhead
Now move on to the overhead due to I/O. Today's standard HPC system is a two-socket server (every two sockets make up a node with its own hard disks, I/O controllers and other overhead). In contrast, IBM's Power 575 is a 16-socket server in which all the processors are interconnected, so only two hard disks are needed for all 16 chips. If a 2.5-inch disk needs 7W, a standard x86 cluster with 16 sockets needs 16 x 7W = 112W; taking conversion losses into account, it draws about 1.4x that amount of incoming power, or around 160W. With just two disks there's a saving of roughly 140W.
Savings with IBM's 16-socket Power 575 are likewise possible because less interconnect is needed. It's now possible to build cost-effective SMP compute domains large enough that high-speed data transfers between SMPs aren't necessary. That won't work for every application, but with many CFD applications, for example, everything often fits in one node. One port of InfiniBand QDR draws about 20W per node, meaning a link requires 40W. For eight two-socket servers that is 8 x 40 x 1.4 = 448W, whereas one large SMP image needs no interconnect power at all.
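Both comparisons reduce to a few lines of arithmetic, using the same 1.4x conversion-loss factor as above:

```python
# The two overhead comparisons above, as arithmetic. The 1.4x factor is the
# extra incoming power needed to cover conversion losses, as in the text.
CONVERSION = 1.4

# Disks: 16 disks at 7W each across the two-socket nodes, vs two disks
# serving one 16-socket machine.
disk_x86  = 16 * 7 * CONVERSION      # ~157 W at the wall
disk_p575 = 2  * 7 * CONVERSION      # ~20 W at the wall
print(f"disk overhead: {disk_x86:.0f} W vs {disk_p575:.0f} W "
      f"(saving ~{disk_x86 - disk_p575:.0f} W)")   # roughly the 140W quoted

# Interconnect: eight two-socket nodes at ~40W per QDR InfiniBand link,
# vs a single large SMP image that needs no external link at all.
ib_x86 = 8 * 40 * CONVERSION         # 448 W
print(f"interconnect overhead: {ib_x86:.0f} W vs 0 W for one SMP image")
```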
Next, don't forget fans. These days most equipment uses brushless DC motors, which easily vary their speed to move the required air. The amount of air needed depends on how much work the server is doing, tolerances in component power dissipation and thermal resistance, the ambient room temperature and the altitude – as air gets thinner, more of it must be moved to remove a given amount of heat. Most servers are designed for a maximum operating altitude of 3,000 to 4,000 metres. Further, the bigger the fan and its blades, the better the energy efficiency. With a 1U server package, though, you're stuck with lots of smaller fans that aren't as efficient. IBM's BladeCenter instead uses two to four relatively large air-moving devices to cool 14 blade servers, and IBM's Power 575 uses just two blowers to cool 16 processors.
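The fan affinity laws put numbers on the altitude point; the sketch below assumes the same mass of air must be moved at altitude and uses approximate standard-atmosphere densities:

```python
# Why fan speed and altitude matter so much: by the fan affinity laws, airflow
# scales with fan speed but fan power scales with speed cubed, and thinner air
# at altitude forces a higher volume flow for the same cooling. Densities are
# approximate standard-atmosphere values; mass-flow equivalence is assumed.
rho_sea_level = 1.225     # kg/m^3
rho_3000m     = 0.909     # kg/m^3, roughly

# The same mass of air (same heat removal) needs more volume flow at altitude...
flow_scale  = rho_sea_level / rho_3000m        # ~1.35x the volume flow
# ...which means spinning the fans ~1.35x faster, and fan power goes as speed^3
power_scale = flow_scale ** 3

print(f"volume flow at 3,000m: {flow_scale:.2f}x, fan power: {power_scale:.2f}x")
# volume flow at 3,000m: 1.35x, fan power: 2.45x
```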
Another development is the ability of servers to run directly from a building's substation transformer anywhere in the world, eliminating the step-down transformer that typically reduces 480V AC to 200-240V AC. The IBM Power 575, for instance, runs from 200-480V AC, removing those transformers and the two to five per cent of loss associated with them.
The bottom line, says Seminaro, is that while people are fighting to get a PUE (the ratio of total facility power to IT equipment power) of 1.3 today, IBM has used the techniques just mentioned to achieve a PUE of 1.18 for a prototype system in a fairly conventional facility – using 480V AC directly from the substation, campus-level chilled water and a water-cooling tower outside the building. In more custom facilities in cooler geographies, IBM will reach a PUE of less than 1.1 with nothing elaborate at all – just a 480V AC subsystem or a simple DC UPS, plus a cooling tower with water-circulation pumps.
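To see what those PUE figures mean in energy terms, here is the ratio worked through for an assumed 1MW IT load (the load is chosen purely for scale; it is not an IBM figure):

```python
# PUE in arithmetic form: total facility power divided by IT equipment power.
# The annual-energy comparison uses an assumed 1MW IT load to show the scale
# of the difference between the PUE figures quoted above.
HOURS_PER_YEAR = 24 * 365
it_load_kw     = 1000          # assumed IT load

for pue in (1.3, 1.18, 1.1):
    facility_kw  = it_load_kw * pue
    overhead_kwh = (facility_kw - it_load_kw) * HOURS_PER_YEAR
    print(f"PUE {pue}: {facility_kw:.0f} kW total, "
          f"{overhead_kwh/1e6:.2f} GWh/year spent on overhead")
# PUE 1.3 : 1300 kW total, 2.63 GWh/year on overhead
# PUE 1.18: 1180 kW total, 1.58 GWh/year on overhead
# PUE 1.1 : 1100 kW total, 0.88 GWh/year on overhead
```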
Adding up all these losses and overheads, along with the reductions in processor and memory power, it's easy to see how the power consumed by an HPC deployment can be cut by roughly a factor of three. Energy-efficient, high-density packaging designs also take up a small fraction of the space of a traditional system.
Sea of sensors
With its ProLiant G6 servers, HP claims to have the most energy-efficient x86 servers in the industry, thanks to several innovations. One is a 'sea of sensors', a collection of 32 smart sensors that dynamically adjust system components such as fans, memory and I/O processing. Next is the Common Power Slot design, which helps minimise power waste by allowing customers to choose from four power supplies to match their workload, achieving more than 92 per cent efficiency in most real-world configurations. Depending on their most urgent requirements, users can choose between 92 per cent efficient supplies at roughly $200 and 94 per cent efficient supplies for about double that amount. Richard Kaufmann, technologist at HP, adds that today's power supplies are most efficient at around 50 per cent load, so adding a redundant supply and sharing the load between the two results in peak efficiency.
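To see why sharing the load across a redundant supply helps rather than hurts, consider a toy efficiency curve that peaks near 50 per cent load – an assumed shape, not HP's measured data:

```python
# Illustration of the load-sharing point: if a supply's efficiency peaks near
# 50 per cent load, two supplies sharing one server's draw each sit near that
# peak, so the 'redundant' supply is not wasted. The curve and the wattages
# below are assumed for illustration.
def psu_efficiency(load_fraction):
    """Toy curve peaking at 94% efficiency at 50% load, falling off either side."""
    return 0.94 - 0.25 * (load_fraction - 0.5) ** 2

server_draw_w = 450.0
psu_rating_w  = 500.0

single = psu_efficiency(server_draw_w / psu_rating_w)        # one supply at 90% load
shared = psu_efficiency(server_draw_w / (2 * psu_rating_w))  # two supplies at 45% each

print(f"single supply at 90% load : {single:.1%} efficient")
print(f"two supplies at 45% each  : {shared:.1%} efficient")
# single supply at 90% load : 90.0% efficient
# two supplies at 45% each  : 93.9% efficient
```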
Iceotope servers use a sealed motherboard compartment containing dielectric coolant fluid to transfer heat to a finned heat-transfer surface that contains water channels for convection.
Adding to the savings is Dynamic Power Capping, which reallocates power and cooling resources by dynamically setting the power drawn by servers. By precisely identifying the power requirements of each server and setting a limit based on that usage, customers can reclaim over-provisioned power and cooling capacity.
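The provisioning arithmetic behind capping can be sketched with assumed figures:

```python
# Where the reclaimed capacity comes from: racks are usually budgeted on
# nameplate power, but measured peaks are lower, so capping servers at the
# measured figure lets more of them share the same feed. All numbers are
# assumed for illustration, not HP measurements.
rack_budget_w   = 8000
nameplate_w     = 500      # label rating per server (assumed)
measured_peak_w = 350      # observed worst case under real workloads (assumed)

without_capping = rack_budget_w // nameplate_w       # 16 servers per rack
with_capping    = rack_budget_w // measured_peak_w   # 22 servers per rack

print(f"servers per rack, nameplate budgeting : {without_capping}")
print(f"servers per rack, capped at measured  : {with_capping}")
```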
Liquid down to the chip
Gases are not as effective as liquids at transporting heat; water is roughly 4,000x better than air in this role. Hence the trend to move cooling water closer to the heat-generating chips. The key to replacing air with liquid is finding a way to ensure that sensitive electronic circuits are not damaged by direct contact with water. And while some vendors have tried synthetic coolants, these approaches generally recirculate the primary coolant and thus require pumps, complex plumbing and control systems, putting them beyond the cost range of increasingly commoditised hardware.
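The 'roughly 4,000x' figure follows from the volumetric heat capacities of the two fluids; a quick check with textbook values:

```python
# Per unit volume, water absorbs far more heat than air for the same
# temperature rise (volumetric heat capacity = density x specific heat).
# Values are standard textbook figures at room conditions.
water_rho, water_cp = 1000.0, 4186.0     # kg/m^3, J/(kg*K)
air_rho,   air_cp   = 1.2,    1005.0     # kg/m^3, J/(kg*K)

water_vol_heat = water_rho * water_cp    # ~4.19 MJ per m^3 per K
air_vol_heat   = air_rho   * air_cp      # ~1.2  kJ per m^3 per K

print(f"water/air ratio: ~{water_vol_heat / air_vol_heat:,.0f}x")
# water/air ratio: ~3,471x -- the same order as the figure quoted above
```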
Sheffield-based start-up Iceotope is working on an approach, first demonstrated at the SC09 exhibition, that immerses server motherboards in individually sealed baths of dielectric coolant, which then passes its heat to a secondary water circuit. Because the primary cooling stage works by natural convection, the compartment can remain sealed, with no moving parts such as fans, pumps or impellers. Pre-assembly burn-in tests are run on motherboards before coolant filling, to minimise the need to access the components later. The secondary stage moves heat from the rack to the outdoors. In some cases the modules can connect directly to the building's water circuit, but having an isolated secondary circuit means that pressures, flow rate and temperature can be more carefully managed. The entire system fits within a 10U rack with a tandem arrangement of two modules per position, so that 16 modules can fit within a 1m-deep 19-inch rack.
With this method, the system has been optimised to work with modern processors that consume 150W or more. It can deliver almost room-neutral cooling with standard racks of up to 30kW while requiring as little as 1kW (for a PUE of roughly 1.03 or less) for complete end-to-end cooling overhead, without the need for traditional cooling infrastructure. A system that runs at up to 50°C can operate without refrigeration in ambient temperatures of up to 40°C, and thus almost anywhere in the world outside the tropics. The company estimates that, at $0.10/kWh, the 300kW of cooling power needed for a comparable air-cooled installation would cost $788,400 over three years; the 20kW needed to cool 600kW of liquid-immersion servers costs $52,560 – a saving of 93 per cent. At eight servers per rack, the air-cooled facility requires 125 racks; at 30kW per rack, the liquid-cooled facility needs just 20, a space saving of 84 per cent. Other benefits include the ability to reuse the heat, removal of the requirement for a raised floor, and the potential for containerisation and ruggedisation for extreme environments.
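Those estimates reproduce with straightforward arithmetic (three years of continuous operation at the quoted rate):

```python
# The cost and space estimates above, reproduced as arithmetic so the
# assumptions are visible (three years of continuous operation at $0.10/kWh).
HOURS_3YR = 24 * 365 * 3
RATE      = 0.10                               # $/kWh

air_cooling_kw    = 300                        # cooling power, air-cooled case
liquid_cooling_kw = 20                         # cooling power, immersion case

air_cost    = air_cooling_kw    * HOURS_3YR * RATE   # $788,400
liquid_cost = liquid_cooling_kw * HOURS_3YR * RATE   # $52,560
print(f"3-year cooling cost: ${air_cost:,.0f} vs ${liquid_cost:,.0f} "
      f"({1 - liquid_cost/air_cost:.0%} saving)")

racks_air    = 1000 // 8        # 1,000 servers at 8 per air-cooled rack
racks_liquid = 600 // 30        # 600kW at 30kW per liquid-cooled rack
print(f"racks: {racks_air} vs {racks_liquid} "
      f"({1 - racks_liquid/racks_air:.0%} space saving)")
```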