The basic concept behind cluster management is relatively simple: you provision compute nodes with the operating systems, middleware and applications they need, and then assign jobs so as to make maximum use of those resources. The task, however, is getting ever more complicated as system architectures add new twists and turns. For example, how are vendors handling the cluster-management aspects of cloud computing? To what degree do they support the GPGPUs that are playing an increasingly important role in HPC? With computer architectures evolving so rapidly, achieving effective provisioning and workload management is like shooting at a moving target – and the target keeps moving faster.
Further, adds Steve Conway, research VP for technical computing at market-research firm IDC, hardware is advancing and growing in scope faster than people can exploit it, and all the pressure is now on the software side. The largest clusters today have 250,000 processors and they’re headed to a million, while some people are talking about eventually reaching a billion cores – and the task of keeping all of them running at peak efficiency is growing just as rapidly, if not more so.
Another factor is that, because of all the new technological options available, there’s pressure on companies to add heterogeneous resources by including nodes with GPUs, Cell processors, high-speed networking or whatever else is the latest technology. At the same time, there’s pressure to consolidate computing resources, and running multiple specialised heterogeneous systems isn’t cost-effective. Not all applications, however, run best on the same size or type of system. Flexibility is becoming the operative word.
Impact on cluster management
All this has had an effect on cluster management in a big way. Gary Tyreman, senior VP of products and alliances at Univa UD, notes that a year ago cluster managers were focusing on how to deal with bare metal, but since then the industry has been on an amazing journey into virtualisation and cloud computing. He points out the need for software to handle checkpointing (saving a snapshot of a virtual machine in case a machine fails during a long job), swapping machines or growing/shrinking the number of CPUs in a virtual machine to reduce swapping, and mobility. It’s also desirable to be able to migrate a job among machines so that you can, for instance, take a machine out for repair without disrupting jobs. And a typical scientist or engineer should experience only negligible performance degradation when switching from a bare-metal machine to a virtual machine. The key is provisioning bare metal to virtual machines with a policy engine that takes into account the infrastructure and business rules. Now everyone seems to be building a cluster; Univa UD, for instance, has rewritten its UniCloud to support all of these – bare metal, virtual machines, and cloud suppliers including Amazon and Rackspace – to handle ‘bursting’.
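To make the idea of a policy engine concrete, the sketch below shows the kind of placement decision such tools automate: given a job and the current state of the infrastructure, choose bare metal, a local virtual machine, or a cloud burst. All names, thresholds and rules here are illustrative assumptions, not Univa's actual API.

```python
# Hypothetical sketch of a policy-engine placement decision.
# Names and thresholds are illustrative only.
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    cores: int
    needs_bare_metal: bool = False   # e.g. latency-sensitive MPI codes

def place_job(job, idle_bare_metal, idle_vm_capacity, allow_cloud_burst):
    """Return a placement decision for one job based on simple business rules."""
    if job.needs_bare_metal and idle_bare_metal >= job.cores:
        return "bare metal"
    if idle_vm_capacity >= job.cores:
        return "local virtual machine"      # near-native performance expected
    if allow_cloud_burst:
        return "burst to cloud (e.g. Amazon or Rackspace)"
    return "queue until local resources free up"

print(place_job(Job("cfd_run", cores=64), idle_bare_metal=0,
                idle_vm_capacity=32, allow_cloud_burst=True))
```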
In thinking about provisioning and workload management, however, one must be specific about the differences between grid computing and cloud computing. According to Martin Harris, director of product management at Platform Computing, enterprise grids typically deliver infrastructure services as a utility – ask any grid veterans, he adds, and they’ll claim to have been doing cloud-like computing for years. But when you look at the definition of IaaS (infrastructure as a service) that most analysts adhere to – elastic application environments, internet self-service, pooling of resources, and metered usage – then ‘grid’ is missing the first two elements.
As for the first element, grids are not really elastic, meaning applications that share the same OS build will also typically share the same grid in a relatively static manner. You’ll need multiple grids (or dedicated resource pools within the same grid) to handle the different OS builds that apps require. When adding virtualisation and/or physical host provisioning to the mix, you now enable repurposing of hosts based on policies that understand specific requirements of each application. For example, you might have a high-priority application on Windows that needs more resources at a specific time. Previously this application team would have been out of luck, but now the grid can serve up resources that are repurposed from a lower priority environment for this need. A set of Linux boxes can be quickly modified by tearing down the Linux virtual machine (VM) and firing up the Windows VM, or by rebuilding from bare metal with tools like dual boot, and the ‘new’ Windows servers will automatically join the app environment.
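The repurposing logic described above can be sketched in a few lines: when a high-priority Windows workload needs capacity, reclaim hosts from a lower-priority Linux pool and flip them over, either by swapping VMs or by rebuilding from bare metal. This is purely illustrative and not Platform Computing's interface; the host attributes and priority cut-off are made up for the example.

```python
# Illustrative sketch of policy-driven host repurposing (not a vendor API).
def repurpose_hosts(hosts, demand_cores, method="vm"):
    """Pick low-priority Linux hosts and flip them to Windows until demand is met."""
    reclaimed = []
    for host in sorted(hosts, key=lambda h: h["priority"]):
        if demand_cores <= 0:
            break
        if host["os"] == "linux" and host["priority"] < 5:
            if method == "vm":
                host["action"] = "tear down Linux VM, boot Windows VM"
            else:
                host["action"] = "reboot into Windows image (dual boot / bare metal)"
            host["os"] = "windows"
            demand_cores -= host["cores"]
            reclaimed.append(host["name"])
    return reclaimed

pool = [{"name": "node01", "os": "linux", "cores": 8, "priority": 2},
        {"name": "node02", "os": "linux", "cores": 8, "priority": 9}]
print(repurpose_hosts(pool, demand_cores=8))   # -> ['node01']
```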
For the second element, grids provide APIs and command-line interfaces (CLIs) for job submission, but don’t really provide self-service tools to request infrastructure on demand or adapt to calendar-based requirements. This is where the addition of an IaaS self-service resource-request framework via a web portal, API and CLI enables application teams to request and manage the resources they need more flexibly. In this regard, Harris notes that Platform Computing’s cooperation with CERN demonstrates how that organisation is delegating more control out to the application teams to provide better performance and responsiveness for the services they provide. CERN also benefits from the automation required to make these self-service components operate (specifically, application service definitions that contain the instruction set to auto-provision and auto-scale full application environments), significantly reducing the manual IT labour required to service its application teams.
Reprovisioning a system for a different OS with Platform ISF Adaptive Cluster can help adapt to changing usage patterns.
To address such needs, Platform Computing recently announced the Platform ISF Adaptive Cluster, which turns static clusters and grids into dynamic, shared environments using heterogeneous physical and virtual HPC resources. It allocates resources dynamically, based on Platform LSF and Platform Symphony workload demands. This product helps eliminate cluster and queue sprawl, removes application stack silos and reduces large job starvation.
New names reflect new realities
There’s such a sea-change taking place that at least one company has changed its name to reflect the new environment. Previously known as Cluster Resources, the company is now Adaptive Computing, with its well-known Moab product line. Particularly relevant is Moab Adaptive HPC Suite, which likewise creates an adaptive operating environment that responds to changing requirements of applications and workloads. It allows a compute environment to dynamically accommodate workload surges. To do so it changes a node’s operating system as well as software and other resources, on the fly, in response to workload needs. It automatically triggers an OS change on the needed number of nodes using a site’s preferred OS-modification technology – whether it be dual boot, diskful or stateless provisioning.
Beyond the operating system, these other resources can take a number of shapes, one example being networking. Moab can be aware of the network topology and route jobs that require high throughput to nodes or clusters that can provide it. When it comes to GPUs, you configure the software to identify which resources have Cuda processors. Some applications can run in either form – on CPUs or on GPUs – and you don’t want them blocked when one type of resource is busy, so rather than a hard requirement you can set up an affinity to a particular resource.
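A minimal sketch of that affinity idea follows: prefer Cuda-equipped or well-connected nodes when a job would benefit, but fall back to other nodes rather than blocking the job. The scoring function and node attributes are assumptions for illustration, not Moab's actual scheduling model.

```python
# Illustrative affinity-based placement: GPU and fast-network nodes are preferred,
# not required, so the job is never blocked waiting for a busy resource type.
def pick_node(job, nodes):
    def score(node):
        s = 0
        if job.get("wants_gpu") and node.get("has_cuda"):
            s += 10            # affinity bonus, not a hard requirement
        if job.get("needs_fast_network") and node.get("interconnect") == "infiniband":
            s += 5
        return s - node.get("load", 0)
    free = [n for n in nodes if n.get("free_cores", 0) >= job["cores"]]
    return max(free, key=score)["name"] if free else None

nodes = [{"name": "gpu01", "has_cuda": True, "interconnect": "infiniband",
          "free_cores": 8, "load": 1},
         {"name": "cpu07", "has_cuda": False, "interconnect": "ethernet",
          "free_cores": 16, "load": 0}]
print(pick_node({"wants_gpu": True, "cores": 4}, nodes))   # -> gpu01
```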
A final point in this regard is adaptive energy savings. With good cluster management, says Adaptive Computing’s president Michael Jackson, you can make maximum use of nodes already running and it should never be necessary to wait for a power-on cycle to take place. He says that one customer saved enough in energy bills in one month to pay for the software.
With grid computing, you can provision computing resources as a utility that can be turned on or off. Cloud computing goes one step further with on-demand resource provisioning.
Another name-change to be aware of is Bright Computing, which is a spin-out from the European HPC cluster company ClusterVision. This company specialises in provisioning clusters, but for workload management it works with third parties.
For GPUs, the Bright Cluster Manager installs the Cuda libraries in the proper places. An ‘environment module’ lets you manage multiple versions and ensures that an application or compiler knows where to find everything it needs. Suppose, says commercial director Matthijs van Leeuwen, that user A and user B need two different versions of Cuda; without the environment module, each would have to specify the paths to the right version explicitly. In addition, with the ‘rack view’ feature you can visualise any metrics made available to the operating system, including teraflops or chip temperature.
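The sketch below shows what an environment module effectively does in the two-versions-of-Cuda case: point each user's session at the right toolkit. The install paths and versions are invented for the example; a real site would use the environment-modules package (for example `module load cuda/3.0`) rather than this hand-rolled code.

```python
# What an environment module does under the hood, in spirit: set the session's
# paths for the chosen Cuda version. Paths/versions below are hypothetical.
import os

CUDA_ROOTS = {
    "2.3": "/opt/cuda-2.3",
    "3.0": "/opt/cuda-3.0",
}

def load_cuda(version):
    """Prepend the chosen Cuda toolkit to PATH and LD_LIBRARY_PATH."""
    root = CUDA_ROOTS[version]
    os.environ["CUDA_HOME"] = root
    os.environ["PATH"] = f"{root}/bin:" + os.environ.get("PATH", "")
    os.environ["LD_LIBRARY_PATH"] = f"{root}/lib64:" + os.environ.get("LD_LIBRARY_PATH", "")

load_cuda("3.0")   # user B; user A would call load_cuda("2.3")
print(os.environ["CUDA_HOME"])
```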
Cluster management software can also include a total view of all the cores, here showing their temperature with ‘rack view’ from the Bright Cluster Manager.
The company also touts Bright Health, which addresses the fact that an HPC cluster keeps evolving as hardware and software change. An interesting feature is the ‘prejob checker’, which consists of a number of tests that run in a few seconds before a job starts. If a test fails, the faulty node is taken offline in the workload manager, the administrator is notified and the job is re-queued – the workload manager is not flushed empty if one of the nodes is faulty, and the job doesn’t have to go to the bottom of the queue.
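The control flow of such a prejob checker can be sketched as follows: run quick tests, and on failure offline the node, notify the administrator and requeue the job without losing its place. The function and class names are illustrative, not Bright's actual interface.

```python
# Sketch of a pre-job health check; all names are hypothetical.
def prejob_check(node, tests):
    """Run each fast test against the node; True only if all pass."""
    return all(test(node) for test in tests)

def launch_or_requeue(job, node, tests, workload_manager):
    if prejob_check(node, tests):
        workload_manager.start(job, node)
    else:
        workload_manager.offline(node)                       # remove faulty node from scheduling
        workload_manager.notify_admin(node)
        workload_manager.requeue(job, keep_position=True)    # job keeps its place in the queue

class FakeWLM:                      # stand-in workload manager for the example
    def start(self, job, node): print(f"start {job} on {node}")
    def offline(self, node): print(f"offline {node}")
    def notify_admin(self, node): print(f"alert admin: {node} failed checks")
    def requeue(self, job, keep_position=True): print(f"requeue {job}")

tests = [lambda n: True, lambda n: n != "node13"]   # e.g. disk, memory, GPU checks
launch_or_requeue("job42", "node13", tests, FakeWLM())
```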
All on a single DVD
Like the previous company mentioned, Clustercorp specialises primarily in provisioning, but company president Tim McIntire says its main added value is putting all the needed pieces into one distribution, called Rocks. That, in turn, uses a paradigm called ‘Rolls’, which automates the installation of software across a cluster and prevents ‘software skew’. A Roll can contain any software package built for Linux. Clustercorp’s Rolls include critical stacks from suppliers such as Mellanox (OFED InfiniBand), Intel (compilers), Portland Group (compilers), Platform (LSF), Cluster Resources (Moab), TotalView (debugger), Nvidia (Cuda) and Panasas (storage). Further, with the Xen Roll you can use the same tools to spin off virtual clusters inside virtual clusters, each with a VPN (virtual private network). When you provision a node for the first time, you select an appliance type such as a compute node, storage node or, for GPUs, a Cuda node. During provisioning, each Cuda appliance automatically gets the required modules.
With GPUs, points out Platform Computing’s director of product marketing William Lu, there is one issue to be aware of before cluster management software can do a complete job. Specifically, the software needs to detect when a particular GPU is used by a specific application, to determine when capacity is or will be available. Today it’s possible to detect the number of GPUs in a node, but that’s a static assignment; Cuda provides no indication of whether a particular core is in use. Without this information, users might make wrong assumptions about GPU usage and inadvertently leave a GPU idle or try to use two different GPUs – the cluster software just doesn’t know. He relates that his company is working with Nvidia to resolve this technology gap, and future versions of Cuda are likely to have enhancements along these lines.
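The gap Lu describes can be illustrated with the static side of detection: a node can report how many GPUs it has (here by parsing the output of `nvidia-smi -L`, assuming that tool is installed), but that count alone tells the scheduler nothing about whether each GPU is currently busy.

```python
# Static GPU detection only: counts devices, says nothing about per-GPU usage.
import subprocess

def count_gpus():
    """Return the number of GPUs reported by nvidia-smi, or 0 if unavailable."""
    try:
        out = subprocess.run(["nvidia-smi", "-L"], capture_output=True,
                             text=True, check=True).stdout
    except (OSError, subprocess.CalledProcessError):
        return 0
    return sum(1 for line in out.splitlines() if line.startswith("GPU "))

print(f"GPUs detected: {count_gpus()} (utilisation per GPU: unknown to the scheduler)")
```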
Another limit is that, today, a workload manager can only send a job and hope that the application benefits from the selected environment. The advice from Jochen Krebs, director of enterprise solutions sales for Altair, is to consider the application and your users (their privileges and rights) and then write a site-policy definition that reserves GPUs for the applications that benefit most from them.
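A site policy along those lines might reduce to a simple admission check: only applications known to benefit from GPUs, submitted by entitled users, are allowed onto GPU nodes. The application and group lists below are made-up assumptions, not Altair's policy syntax.

```python
# Illustrative site-policy admission check for a GPU queue.
GPU_FRIENDLY_APPS = {"md_simulation", "seismic_rtm"}     # apps that benefit most from GPUs
GPU_ENTITLED_GROUPS = {"research", "engineering"}        # users with GPU privileges

def admits_to_gpu_queue(app_name, user_group):
    """Reserve GPU nodes for applications and users that will actually use them."""
    return app_name in GPU_FRIENDLY_APPS and user_group in GPU_ENTITLED_GROUPS

print(admits_to_gpu_queue("md_simulation", "research"))     # True
print(admits_to_gpu_queue("spreadsheet_batch", "finance"))  # False
```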
Built with the OS
Besides those firms that provide cluster management software for existing Linux operating systems, some are developing an OS with the cluster software already included. One such company is T-Massive Computing, Russia’s largest HPC supplier, with its Clustrx HPC OS. It includes a POSIX-compliant resource manager (based on SLURM, the open-source Simple Linux Utility for Resource Management) along with deployment, job-scheduling, monitoring, power-management, cluster-management and provisioning subsystems. This software is being used on the 420-teraflop Lomonosov supercomputer at Moscow State University, a system with 35,776 cores currently in 12th position on the TOP500 list. Clustrx provides hybrid and GPU-based support with a hybrid MPI, and the firm is working to enhance it with adaptive task management and scheduling, including dynamic task profiling, as well as some level of GPU (or other accelerator) node virtualisation.