As the complexity of HPC systems continues to increase, the effective management of these systems becomes increasingly critical to maximising the return on investment in scientific computing.
Ensuring that a cluster is utilised efficiently is a complicated job. However, employing the latest software can reduce the burden of supporting an HPC cluster, reduce the number of people needed to manage the resource, or allow more science and engineering to be completed by making better use of the hardware.
The options available for cluster management software are as varied as the types of computing systems on which they can be deployed. Whether you are an academic institution leveraging open-source software due to budget restrictions, or a commercial company paying for software with additional support and maintenance, choosing the right software package can save key resources.
There are, however, several tried and tested approaches well suited to HPC and AI workloads, which are outlined below. Many of these tools are available under both open-source and commercial licences, depending on the level of support required.
Products available
Aspen Systems has experience installing high-performance computing software stacks. The company offers a proprietary, command-line-driven cluster management suite, the Aspen Cluster Management Environment (ACME). It also provides Nvidia Bright Cluster Manager, a commercial product, as well as open-source solutions such as xCAT2, Warewulf and OpenHPC, along with the configuration management, monitoring, container and scientific-software package management tools found on supercomputers.
Aspen Systems’ cluster management software comes as standard with all of its HPC clusters, included in its standard service package at no additional cost. It is compatible with most Linux distributions and is supported for the life of the cluster.
Nvidia Bright Cluster Manager offers fast deployment and end-to-end management for heterogeneous high-performance computing (HPC) and AI server clusters at the edge, in the data centre, and in multi/hybrid-cloud environments. It automates provisioning and administration for clusters ranging in size from a couple of nodes to hundreds of thousands, supports CPU-based and Nvidia GPU-accelerated systems, and enables orchestration with Kubernetes.
Nvidia Bright Cluster Manager allows you to deploy complete Linux clusters over bare metal and manage them reliably, from edge to core to cloud. Providing cluster management solutions for the new era of high-performance computing (HPC), Nvidia Bright Cluster Manager combines provisioning, monitoring, and management capabilities in a single tool that spans the entire lifecycle of your Linux cluster.
Advanced Clustering Technologies has designed ClusterVisor to enable you to easily deploy your HPC cluster and manage everything from the hardware and operating system to software and networking using a single GUI.
The full-featured ClusterVisor tool gives you everything you need to manage and make changes to your cluster over time. ClusterVisor is highly customisable, so you can manage your cluster and organise your data in the way that makes the most sense for you.
eQUEUE from ACT is a software solution that allows system administrators to create easy-to-use, web-based job submission forms. It is designed to increase cluster utilisation by bringing in users who would ordinarily stay away because of the complexity of submitting jobs to a cluster. There is no need to learn Linux or scripting: the end user simply enters their data into predefined fields and the job is placed in the cluster’s queue to run.
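To make the concept concrete, the sketch below shows, in very rough terms, how a handful of predefined form fields might be translated into a scheduler submission behind the scenes. It is a generic illustration only, not eQUEUE’s actual code: the field names and the use of Slurm’s sbatch command are assumptions made for the example.

```python
# Generic illustration only: how predefined web-form fields might map onto a
# batch submission behind the scenes. This is NOT eQUEUE's actual code or API.
import subprocess

def submit_from_form(form: dict) -> str:
    """Turn a dict of web-form fields into a Slurm batch script and submit it."""
    script = f"""#!/bin/bash
#SBATCH --job-name={form['job_name']}
#SBATCH --ntasks={form['cores']}
#SBATCH --time={form['walltime']}
{form['application']} {form['input_file']}
"""
    # sbatch reads the script from stdin and prints the new job ID.
    result = subprocess.run(["sbatch"], input=script, text=True,
                            capture_output=True, check=True)
    return result.stdout.strip()

# Example form data, as a user might enter it into the predefined fields.
print(submit_from_form({"job_name": "blast_run", "cores": 16,
                        "walltime": "02:00:00",
                        "application": "blastn", "input_file": "query.fasta"}))
```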
The Scalable Cube from HPC Scalable is an enterprise-ready, supported distribution of an open-source workload scheduler that supports a wide variety of HPC and analytic applications. Whether deployed on site, on virtual infrastructure, or in the cloud, customers can take advantage of top-quality support services from HPC Scalable, helping to ensure success in managing their HPC workloads.
Azure high-performance computing (HPC) from Microsoft is a complete set of computing, networking and storage resources integrated with workload orchestration services for HPC applications. With purpose-built HPC infrastructure, solutions and optimised application services, Azure offers competitive price/performance compared to on-premises options with additional high-performance computing benefits. In addition, Azure includes next-generation machine learning tools to drive smarter simulations and empower intelligent decision making.
Adaptive Computing’s Moab HPC Suite is a workload and resource orchestration platform that automates the scheduling, management, monitoring and reporting of HPC workloads at massive scale. Its patented intelligence engine uses multi-dimensional policies and advanced future modelling to optimise workload start and run times on diverse resources.
These policies balance high utilisation and throughput goals with competing workload priorities and SLA requirements, thereby accomplishing more work in less time and in the right priority order. Moab HPC Suite optimises the value and usability of HPC systems while reducing management cost and complexity.
Omnia is a deployment tool that configures Dell EMC PowerEdge servers running standard RPM-based Linux OS images into clusters capable of supporting HPC, AI and data analytics workloads. It uses Slurm, Kubernetes and other packages to manage jobs and run diverse workloads on the same converged solution. Omnia is an open-source collection of Ansible playbooks and is constantly being extended to support a broader range of workloads.
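Because Omnia is driven by Ansible, deploying a cluster boils down to running playbooks against an inventory of nodes. The minimal sketch below simply wraps the standard ansible-playbook command from Python; the playbook and inventory file names are placeholders rather than Omnia’s own.

```python
# Minimal sketch: invoking an Ansible playbook from Python via the standard
# ansible-playbook CLI. The playbook and inventory names below are placeholders.
import subprocess

def run_playbook(playbook: str, inventory: str) -> int:
    """Run an Ansible playbook against an inventory of cluster nodes."""
    cmd = ["ansible-playbook", "-i", inventory, playbook]
    return subprocess.run(cmd).returncode

if __name__ == "__main__":
    # e.g. a cluster-provisioning playbook applied to a list of PowerEdge nodes
    rc = run_playbook("cluster.yml", "hosts.ini")
    raise SystemExit(rc)
```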
Altair PBS Professional is a workload manager designed to improve productivity, optimise utilisation and efficiency, and simplify administration for clusters, clouds, and supercomputers – from the biggest HPC workloads to millions of small, high-throughput jobs. PBS Professional automates job scheduling, management, monitoring, and reporting, and it’s a trusted solution for complex Top500 systems as well as smaller clusters.
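As a rough illustration of what an automated submission looks like in practice, the sketch below assumes a PBS Professional installation with qsub on the PATH; the job name, script path and resource request (two chunks of eight CPUs for one hour) are illustrative values, not recommendations.

```python
# Minimal sketch, assuming a PBS Professional installation with qsub on the PATH.
# The job name, script path and resource request below are illustrative only.
import subprocess

def submit(script_path: str) -> str:
    """Submit a script to PBS Professional and return the job ID it prints."""
    result = subprocess.run(
        ["qsub", "-N", "demo_job",
         "-l", "select=2:ncpus=8",
         "-l", "walltime=01:00:00",
         script_path],
        capture_output=True, text=True, check=True)
    return result.stdout.strip()   # e.g. '1234.pbsserver'

print(submit("run_case.sh"))
```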
Altair’s Univa Grid Engine is a distributed resource management system for optimising workloads and resources in thousands of data centres, improving performance and boosting productivity and efficiency. Grid Engine helps organisations improve ROI and deliver faster results by optimising the throughput and performance of applications, containers and services while maximising shared compute resources across on-premises, hybrid and cloud infrastructures.
Google’s Kubernetes is an open-source system for automating containerised application deployment, scaling, and management. It groups containers that make up an application into logical units for easy management and discovery. Kubernetes builds upon 15 years of experience running production workloads at Google, combined with best-of-breed ideas and practices from the community.
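For a flavour of how those logical units are exposed programmatically, the short sketch below uses the official Kubernetes Python client (assumed to be installed as the kubernetes package and pointed at an existing cluster via a kubeconfig file) to list the pods it can see.

```python
# Minimal sketch using the official Kubernetes Python client, assuming it is
# installed ('pip install kubernetes') and a valid kubeconfig is present.
from kubernetes import client, config

def list_pods() -> None:
    """Print every pod visible to the current kubeconfig context."""
    config.load_kube_config()          # reads ~/.kube/config by default
    v1 = client.CoreV1Api()
    pods = v1.list_pod_for_all_namespaces(watch=False)
    for pod in pods.items:
        print(f"{pod.metadata.namespace}/{pod.metadata.name} "
              f"on node {pod.spec.node_name}")

if __name__ == "__main__":
    list_pods()
```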
IBM Spectrum LSF (Load Sharing Facility) is enterprise-class software designed to distribute work across existing heterogeneous IT resources. This creates a shared, scalable and fault-tolerant infrastructure that delivers faster, more reliable workload performance and reduces cost. LSF balances load, allocates resources and provides access to those resources.
LSF provides a resource management framework that takes your job requirements, finds the best resources to run the job, and monitors its progress. Jobs always run according to host load and site policies.
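As a rough sketch of that submit-and-monitor cycle, the example below assumes an LSF installation with the bsub and bjobs commands on the PATH; the queue name, core count and application command are illustrative values only.

```python
# Minimal sketch, assuming an LSF installation with bsub and bjobs on the PATH.
# The queue name ('normal'), core count and command are illustrative values.
import subprocess

def submit(cores: int, queue: str, command: str) -> str:
    """Submit a job to LSF and return bsub's confirmation message."""
    result = subprocess.run(
        ["bsub", "-n", str(cores), "-q", queue, "-o", "job.%J.out", command],
        capture_output=True, text=True, check=True)
    return result.stdout.strip()   # e.g. 'Job <12345> is submitted to queue <normal>.'

def show_jobs() -> str:
    """Return the current user's job list as reported by bjobs."""
    return subprocess.run(["bjobs"], capture_output=True, text=True).stdout

print(submit(8, "normal", "./my_solver input.dat"))
print(show_jobs())
```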
Slurm is an open-source, fault-tolerant and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm requires no kernel modifications for its operation and is relatively self-contained.
As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.
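A minimal sketch of those three roles, assuming the standard Slurm command-line tools (sbatch, squeue and scancel) are available on the PATH, might look like the following; the batch script contents are illustrative only.

```python
# Minimal sketch of Slurm's three roles, assuming the standard command-line
# tools (sbatch, squeue, scancel) are on the PATH. Script contents illustrative.
import subprocess

SCRIPT = """#!/bin/bash
#SBATCH --job-name=demo          # name shown in the queue
#SBATCH --nodes=2                # request an allocation of two compute nodes
#SBATCH --time=00:10:00          # wall-clock limit
srun hostname                    # start the (parallel) work on the allocation
"""

def submit(script: str) -> str:
    """Submit a batch script and return the job ID printed by sbatch."""
    out = subprocess.run(["sbatch", "--parsable"], input=script, text=True,
                         capture_output=True, check=True).stdout
    return out.strip().split(";")[0]

job_id = submit(SCRIPT)                             # 1) allocate resources / start work
print(subprocess.run(["squeue", "--job", job_id],   # 2) monitor the pending/running job
                     capture_output=True, text=True).stdout)
# 3) contention is arbitrated by Slurm's queue of pending work; a job can be
#    withdrawn with: subprocess.run(["scancel", job_id])
```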