High-performance computing can help pharmaceutical companies cut the time they spend on research and development, but as Richard Holland of the Pistoia Alliance points out, HPC is not a panacea and its deployment needs to be nuanced and appropriate.
The challenges facing informatics systems in the pharmaceutical industry’s R&D laboratories are changing. The number of large-scale computational problems in the life sciences is growing, and they will demand higher-performance solutions than their predecessors did. But the response has to be nuanced: HPC is not a cure-all for the computational problems of pharma R&D. Some applications are better suited to HPC than others, and its deployment needs to be considered right at the beginning of experimental design.
With this in mind, the Pistoia Alliance hosted a webinar at the beginning of October on opportunities for high-performance computing (HPC) in pharmaceutical R&D. The Pistoia Alliance is a global, not-for-profit alliance of life science companies, vendors, publishers, and academic groups that work together to lower barriers to innovation in R&D. We bring together the key constituents to identify the root causes of inefficiencies in R&D through pre-competitive collaboration.
Professor Peter Coveney of University College London chaired the webinar, and his key point, backed by the other speakers (Matt Gianni of Cray and Darren Green of GSK), was that for HPC to be useful in pharma R&D it should produce results that are rapid, accurate, and reproducible. To illustrate the potential impact of well-designed HPC solutions, Coveney described his group’s work on an improved virtual screening tool for predicting the binding affinities of compounds with target proteins. The screening could run in only ten hours on an HPC facility with 10,000 cores, a calculation that would have been unfeasibly slow on more traditional computing resources.
During the discussion, Gianni proposed that an explosion in data volume, variety, and complexity is driving the need for HPC. It is an observation that will be familiar to anyone working in pharma R&D informatics today. Gianni predicted that the growing number of large-scale computational problems in the life sciences will need more nuanced and high-performance solutions than has been the case hitherto. Such problems include rapid and accurate genome assembly, particularly for more complex high-repeat or polyploid genomes; real-time analysis of genomic data; and the text-mining and contextual annotation of vast quantities of unstructured data such as health records.
As with all technological advances, it is important to consider, at the very beginning of the experimental design process, the value that the technology can add and the results it makes possible, rather than treating it only as an analytical afterthought once the data has already been generated. Just as there was a habit of ‘sequence first, ask questions later’ when next-generation sequencing (NGS) first became mainstream, it is tempting to see HPC as a cure-all just waiting for data to be thrown at it, but this approach will rarely produce good-quality results. For HPC to be truly valuable to pharma R&D, its strengths and weaknesses must be considered and designed into the very core of the laboratory experiments that will later feed it with data to analyse.
HPC for advanced, embarrassingly parallel text-mining
Having said that, plenty of data already exists that was never intended to be analysed in an HPC setting, or indeed in any computational setting at all. Health records, for instance, are created by doctors primarily for their own internal use, as a means of recording a patient’s progress through the medical system, yet they contain a wealth of data that could be valuable in the development of new treatments. The existing body of data is massively diverse and, if stored electronically at all, conforms to no single standard or quality specification. To make sense of this in a timely and reproducible manner requires highly advanced, embarrassingly parallel text-mining techniques with plenty of contextual analysis and natural-language processing. While not at all trivial to design and implement, this is the kind of task that HPC is very well suited to.
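To make the ‘embarrassingly parallel’ point concrete, the sketch below is a hypothetical illustration rather than a production pipeline: it fans free-text records out across worker processes and merges the per-record results. The simple keyword pass stands in for the far richer contextual and natural-language analysis described above, but the fan-out-and-merge structure is what lets such workloads scale across thousands of cores.

```python
# Minimal sketch of embarrassingly parallel text-mining over free-text records.
# The term list and the regex pass are placeholders for real NLP and curated
# vocabularies; only the parallel structure is the point being illustrated.
import re
from collections import Counter
from multiprocessing import Pool

# Hypothetical terms of interest; a real pipeline would use curated ontologies
# and contextual analysis, not a keyword match.
TERMS = re.compile(r"\b(hypertension|metformin|adverse event|biopsy)\b", re.IGNORECASE)

def mine_record(text: str) -> Counter:
    """Count term mentions in a single record: one fully independent task."""
    return Counter(match.lower() for match in TERMS.findall(text))

def mine_corpus(records: list[str], workers: int = 8) -> Counter:
    """Fan records out across worker processes and merge the partial counts.
    Because no record depends on any other, throughput scales with the number
    of cores (or, on an HPC cluster, nodes)."""
    totals = Counter()
    with Pool(processes=workers) as pool:
        for partial in pool.imap_unordered(mine_record, records, chunksize=64):
            totals.update(partial)
    return totals

if __name__ == "__main__":
    demo_records = [
        "Patient reports hypertension; started metformin last month.",
        "Biopsy scheduled; no adverse event recorded.",
    ]
    print(mine_corpus(demo_records, workers=2))
```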
Genomics – data movement, not HPC?
Genomic analysis is less easily adapted to HPC, although certain areas work well. Gianni described an example in which gene-gene interaction analysis for a set of 36,000 genes was reduced to 20 minutes on a Cray HPC facility, compared with 25 days on more traditional infrastructure, a task made possible by the fact that the data only needed to be cross-referenced against itself (see the sketch below). Many genomic analyses are more complex than that, requiring the import of multiple large-scale external datasets so that newly generated private data can be compared against, or integrated with, results from the public record, collaborators, and research partners. At such data volumes, the sheer logistics of moving the data around and ensuring it is accessible to the right cores at the right times can outweigh any questions about the scalability of the computational power itself. CPU core performance and parallelisation are therefore not always the limiting factors, and in genomics in particular they frequently are not.
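That self-contained, all-against-all structure is what makes the task parallelise so cleanly. The sketch below is not Gianni’s actual pipeline: it uses a toy expression matrix and absolute Pearson correlation as a stand-in interaction statistic, but it shows how independent blocks of gene pairs can be scored on separate cores with no communication between them.

```python
# Minimal sketch of all-against-all gene-gene analysis: the expression matrix is
# the only input, so each block of gene pairs can be scored independently.
import numpy as np
from itertools import combinations
from multiprocessing import Pool

rng = np.random.default_rng(0)
N_GENES, N_SAMPLES = 200, 50                   # toy sizes; the webinar example used 36,000 genes
EXPR = rng.normal(size=(N_GENES, N_SAMPLES))   # genes x samples expression matrix

def score_block(pairs):
    """Score one block of gene pairs; absolute Pearson correlation stands in
    for whatever interaction statistic a real analysis would use."""
    return [(i, j, abs(np.corrcoef(EXPR[i], EXPR[j])[0, 1])) for i, j in pairs]

def all_pairs_scores(workers: int = 4, block_size: int = 5000):
    """Split the ~N^2/2 pairs into blocks and fan them out across processes;
    no block depends on any other, so throughput scales with core count."""
    pairs = list(combinations(range(N_GENES), 2))
    blocks = [pairs[k:k + block_size] for k in range(0, len(pairs), block_size)]
    with Pool(processes=workers) as pool:
        results = pool.map(score_block, blocks)
    return [hit for block in results for hit in block]

if __name__ == "__main__":
    scores = all_pairs_scores()
    print(f"scored {len(scores)} gene pairs")
```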
HPC for late-stage interpretation
Many of these genomics-specific limitations apply at a stage of analysis well before any in-depth interpretation occurs. Triaging the incoming flow of data and transforming it into a simplified set of annotated results for later analysis is necessary but uninspiring work, and not necessarily something that HPC can help with; better data storage and transfer solutions would possibly have a greater impact. Where HPC comes into its own is late-stage interpretation. The ability to call on massive computational power for near-real-time analysis of the annotated data will help researchers create, justify, and refine hypotheses over greatly reduced periods of time, allowing them to rule out ineffective and potentially expensive lines of investigation at a much earlier stage.
On the one hand, the computational scale and efficiency offered by HPC lets researchers design highly effective experiments without worrying about whether their informatics infrastructure can cope. On the other hand, using HPC to mine and make better use of the data we already have may be a better use of the technology than generating ever more new data to feed the computational beast.
Richard Holland is Executive Director, Operations, at the Pistoia Alliance.