Robert Roe explores the role of maintenance in ensuring HPC systems run at optimal performance.
Employing the correct maintenance procedures can reduce the downtime of HPC systems and help to predict the failure of systems or components. Increasing the time that a computing service is available also increases the amount of scientific output that a HPC system can generate.
Russel Slack, operations director at OCF, explains: ‘From my perspective and the perspective of OCF, maintenance is about ensuring service availability, so it is preventative and proactive.’
‘Most of the customer feedback, when there are problems, is based on the service being down. System admins realise that it is not their shiny toy in the server room for them to log in to and administer – they are acutely aware that users are the important people here, so you have to focus on service availability for the user community.’
Ensuring efficient use of a supercomputer is necessary not only to support the user community but also to generate sufficient return on investment for the organisation or enterprise that funds the HPC system.
If a system goes down unexpectedly, there will be a backlog of jobs, as well as the additional costs of running the datacentre and getting the service fixed in a timely manner. These costs will continue to mount up with no return on investment until the service can be resumed.
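To make the arithmetic behind this point concrete, here is a minimal Python sketch of how an operator might estimate availability and the cost of an outage. The function names, the utilisation factor and the cost inputs are illustrative assumptions, not figures or methods from OCF.

```python
def availability(uptime_hours, downtime_hours):
    """Fraction of the period during which the service was available."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("total hours must be positive")
    return uptime_hours / total


def lost_core_hours(cores, downtime_hours, utilisation=1.0):
    """Core-hours of work not delivered while the system was down.

    `utilisation` is an assumed typical fraction of cores in use (0 to 1).
    """
    return cores * downtime_hours * utilisation


def outage_cost(downtime_hours, running_cost_per_hour, repair_cost=0.0):
    """Money spent during an outage with no scientific output in return.

    `running_cost_per_hour` is a hypothetical datacentre cost (power,
    cooling, staff) that accrues whether or not jobs are running.
    """
    return downtime_hours * running_cost_per_hour + repair_cost
```

For example, under these assumptions a system that is down for 10 hours out of a 1,000-hour period has an availability of 0.99, and the backlog grows with both the core count and the length of the outage.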
‘If you are running a system for years and then it falls over because you have not patched, then lots of questions will be asked of the system administration team,’ said Slack.
There is no one size fits all
OCF offers different options for maintenance contracts which are built into a service level agreement (SLA) prior to the installation of the system and tailored to a customer’s needs. ‘We will speak to the customer about their technical capability within the team and their requirements for uptime. Is it a research tool or a production tool?’
‘This allows us to get a feel for their ability to manage and maintain the system properly, and also the general expectation of the user community,’ commented Slack.
The SLA is then based on several different options for maintenance and support. Frontline support focuses on the replacement of failed hardware – common items that might fail over time such as memory DIMMs or fans and other small components.
‘We will then put together an element of software support that is break-fix based on the software stack that we deliver on the service,’ said Slack. ‘If a certain piece of software has broken, for whatever reason, then we will dial in and work on that to resolve it for them – and then we tend to add some other options, based on their requirements.’
‘These requirements might include remote monitoring at a frequency that has been decided between the client and us – it could be daily, weekly or once a month. We will perform a sweep of the service, looking at the logs and the hardware to check that things are looking healthy and working optimally,’ said Slack.
This information is then gathered and fed back to the admin team in the form of reports. Reports are provided by OCF at specific, predetermined intervals to state that the system has a clean bill of health – which provides some peace of mind that the service will run as expected.
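The kind of sweep and report described above could be sketched roughly as follows. This is not OCF's tooling: the warning patterns, category names and report wording are hypothetical, and real log formats vary by vendor and distribution.

```python
import re
from collections import Counter

# Hypothetical warning patterns, chosen only for illustration.
PATTERNS = {
    "memory": re.compile(r"\b(ECC|DIMM|memory error)\b", re.IGNORECASE),
    "fan": re.compile(r"\bfan\b.*\b(fail|fault)", re.IGNORECASE),
    "disk": re.compile(r"\b(I/O error|SMART)\b", re.IGNORECASE),
}


def sweep(lines):
    """Count log lines that match each warning category."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def report(counts):
    """Turn the counts into a one-line summary for the admin team."""
    if not counts:
        return "Clean bill of health: no known warning patterns found."
    parts = [f"{name}: {n}" for name, n in sorted(counts.items())]
    return "Attention needed: " + ", ".join(parts)
```

A script like this could be run at whatever interval the client and provider agree on, with `sweep` fed the lines of the collected logs and `report` producing the summary that goes back to the admin team.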
This is packaged into suitable time and maintenance windows, based on the needs of the customer. ‘We offer service credits that can be used in service windows, and this is where the service may be taken offline for a period to do some preventative maintenance,’ explained Slack.
‘It could be that Red Hat has released a kernel update that has a security fix in it. The HPC guys have decided that they want to put it on the system, so they can use a portion of the service credits that have been bundled into the SLA – so we will dial in and roll out that kernel patch with them.’
‘It might be that we have discussed it and decide that this is a serious piece of work – we are going to take the whole platform down, and then we will come on site to spend a few days with the customer during this maintenance window. Together as a team we will deploy these fixes or apply upgrades – whatever it might be.
‘This kind of work would usually be decided each quarter, but it can vary based on an organisation. OCF can offer anything from HPC hardware to a fully managed service, so it is important to understand the customers’ needs and manage expectations accordingly.
‘Some customers want a HPC system, but they only have users and not an admin team. We will act as a virtual system administrator and an on-site admin that can do everything,’ added Slack.
Finding a balance
As with so many aspects of HPC, it is about finding a balance – in this case, between the technical staff available to manage a service and the number of users. ‘We speak with a customer before we deliver a solution, to try and find out where they might have holes in their expertise or staff. We do not want to add extra work or extra wheels to this system that might slow them down,’ explained Slack.
‘If they have got a good team of technical staff and they are ‘au fait’ with how to do this, then we can step back and wait for them to request something. We can be as hands on as they want us to be,’ said Slack.
Preventative maintenance helps HPC operators avoid some of the downtime from failed components, but in some cases failures are unavoidable. In these situations it is important that there are proper steps in place to mitigate lost time and recover data.
While individual, random component failures are a potential issue for any system, one thing that cannot be allowed is a loss of power, which would require a HPC system to shut down and abandon any jobs currently running. To avoid this, supercomputers rely on an uninterruptible power supply (UPS), which ensures power to the computing elements.
UPS systems also help to reduce the need for maintenance by preventing power spikes and noise from reaching the delicate systems, as Leo Craig, general manager at Riello UPS, comments.
‘A UPS is really an insurance policy against a number of potential problems. The first main thing is to insure against power outages on the grid or locally so that the HPC system is unaffected. It also protects against problems on the mains supply, such as surges, spikes, high-frequency noise and frequency variation.’
‘Things like spikes and high-frequency noise can actually shorten the lifespan of computing equipment, because the power supply and the circuits are completely bombarded by this noise. It might not cause an instantaneous failure but it will build up over time and then you will end up with an infantile failure on servers,’ added Craig.
Something as simple as flicking a light switch can cause a small power spike that will travel through the mains supply of a building, diminishing as it travels. However, if it were to hit a sensitive piece of equipment it might cause damage.
‘At the other end of the scale, the national grid is doing what is called ‘grid switching’ to manage the power that is sent to computer facilities,’ said Craig.
‘That is becoming more prevalent because of the introduction of renewable energy into the grid. Instead of having a few large power stations we have got lots of smaller generators like wind or solar farms.’
‘Renewable energy is causing a problem for the national grid in terms of much more frequent grid switching and that grid switching creates noise and spikes which is damaging to computing equipment.’
Craig also noted that UPS systems may begin to play a more important role in the future as stricter regulations are brought in on the use of diesel generators.
‘They are big engines emitting large amounts of CO2 and pollutants. While the latest ones meet current EU standards, the generators installed in datacentres might not be new – they could be five to 10 years old,’ concluded Craig.