Rob Lalonde, Univa’s cloud VP general manager, considers the unique challenges posed by HPC.
To say that cloud computing is enjoying rapid adoption is an understatement. Gartner projects that annual spending on cloud infrastructure as a service (IaaS) will grow 27.6 per cent to reach $39.5bn in 2019, making IaaS the fastest-growing segment of cloud revenue. While HPC has been a relative laggard when it comes to cloud adoption, this is changing fast.
According to a research note from Hyperion in March, 74 per cent of worldwide HPC sites now run at least some workloads in public cloud – a five-fold increase from just a few years ago. While the use of cloud is still modest, accounting for less than 10 per cent of HPC workloads in 2018, HPC in the cloud appears poised for rapid growth. Factors propelling adoption include increasingly capable HPC services from cloud service providers (CSPs), the desire to shift Capex to Opex, and the widespread use of machine learning (ML) and High-Performance Data Analytics (HPDA), which often need specialised resources.
Cost is a bigger challenge for HPC users
While some perceive cloud computing as less expensive, HPC managers are under no such illusions. Opinions vary, but Univa customers estimate that cloud computing is between four and nine times the cost of on-premise capacity if not properly managed. Cloud computing often makes sense, but given the cost premium, it needs to be carefully managed.
HPC sites differ from their enterprise counterparts in two important respects. First, most HPC sites are relatively efficient. They are mature in their use of workload managers, and tend to squeeze every ounce of performance from expensive on-premise hardware. Gartner estimates that typical data centre utilisation is around 18 per cent, but in our experience, HPC centres typically have utilisation rates in the range of 70 to 80 per cent or higher.
A second difference is in the scale of HPC workloads. In a recent well-publicised project, Western Digital ran a large-scale physics simulation on AWS involving 40,000 spot instances and more than a million vCPUs. While the business results were impressive (a 60-fold reduction in elapsed time), consuming cloud resources at this scale costs approximately $17,000 per hour. The business need may justify the expense, but it’s easy to see how costs can get out of control. Most HPC users have an insatiable appetite for compute cycles.
Effective management is essential to containing cost
According to InfoWorld, as much as 35 per cent of cloud spending is wasted. The ease with which cloud services are consumed can cause costs to get out of control. Users frequently over-provision compute instances and storage, or start cloud services and forget them. Tiered pricing schemes, bandwidth-related costs, and fluid rate structures that vary by region make costs in the cloud exceedingly difficult to manage.
Complicating things, multi-cloud deployments are a fact of life. They arise organically through mergers and acquisitions, collaborations with third parties, SaaS or PaaS offerings unique to specific CSPs, and lines of business (LOBs) making independent decisions.
Gartner estimates that 80 per cent of IaaS users will overshoot their budgets, largely because most organisations lack necessary internal process controls to deal with costs in the cloud.
You can’t manage what you can’t measure
To deal with these challenges, HPC managers are putting more rigorous policies in place for cloud usage and using cloud-specific tools to monitor and report on resource consumption. While most CSPs offer tools to help manage spending against budgets, each CSP has its own reporting system. Some clients are embracing Cloud Service Expense Management systems to aggregate and track costs across clouds, but by the time an alert is received, cost over-runs can be substantial.
Beyond expense monitoring to cloud automation
Costs in the cloud are tied directly to consumption, so managing access to cloud resources and maximising utilisation are the keys to minimising expenses. Effectively controlling cloud spending requires more than just expense monitoring – it requires cloud automation tools that are tightly integrated with HPC workload managers. The latest generation of cloud automation tools is application-, resource-, and budget-aware, and can adjust workload and resource deployments on the fly across multiple clouds, taking into account planned spending by project, on-premise capacity, and actual versus planned resource consumption.
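To make the idea of a budget-aware placement decision concrete, here is a minimal sketch in Python. It is not any vendor's implementation: the names ProjectBudget and place_job, the inputs, and the simple rule (use on-premise capacity first, burst to cloud only while the project stays within its planned spend) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ProjectBudget:
    """Hypothetical per-project spending plan (illustrative names only)."""
    planned_spend: float  # planned cloud spend for the period, in dollars
    actual_spend: float   # cloud spend so far in the period, in dollars


def place_job(free_onprem_cores: int, cores_needed: int,
              est_cloud_cost: float, budget: ProjectBudget) -> str:
    """Return 'on-premise', 'cloud', or 'queue' for a pending job."""
    # Prefer capacity that has already been paid for.
    if cores_needed <= free_onprem_cores:
        return "on-premise"
    # Burst to cloud only if the job keeps the project within its plan.
    if budget.actual_spend + est_cloud_cost <= budget.planned_spend:
        return "cloud"
    # Otherwise hold the job until on-premise capacity frees up.
    return "queue"
```

A real tool would draw these inputs from the workload manager and from each CSP's billing data, and would weigh many more factors, but the ordering of the checks reflects the principle that cloud spend should follow policy rather than demand alone.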
By interacting with the workload manager and obtaining metrics directly from multiple CSPs, cloud automation tools can shut down idle services, right-size instance selection, automate data movement to minimise storage and network costs, and take a variety of other actions automatically to maximise efficiency, while keeping a tight rein on cloud spending. With cloud taking a bigger bite out of IT budgets, finding ways to rein in spending is a growing concern. Fortunately, new tools are on the way that go beyond simple expense management and apply policy-driven cloud automation to reduce costs.
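As a rough illustration of one such automated action, the sketch below picks out cloud instances that have been idle longer than a policy limit so they can be shut down. The function name instances_to_stop, the 15-minute default and the shape of the input are hypothetical; in practice the last-busy times would come from the workload manager.

```python
from datetime import datetime, timedelta


def instances_to_stop(last_busy: dict, now: datetime,
                      idle_limit: timedelta = timedelta(minutes=15)) -> list:
    """Return IDs of instances idle for longer than idle_limit.

    last_busy maps an instance ID to the time it last ran a job
    (assumed to be reported by the workload manager).
    """
    return sorted(
        instance_id
        for instance_id, last_seen in last_busy.items()
        if now - last_seen > idle_limit
    )
```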
References:
Gartner Forecasts Worldwide Public Cloud Revenue
Hyperion Research - Cloud Computing for HPC Comes of Age
Western Digital HDD simulation at cloud scale
35 per cent of cloud spending is wasted
How to Identify Solutions for Managing Costs in Public Cloud IaaS