Tom Wilkie reports on two examples of how the growth of scientific data sets is driving computing into the cloud, and asks how profoundly this will change computing for science
At the beginning of January, researchers in the life sciences gathered at the Wellcome Trust in London to hear the first results of data analysis carried out using a new private academic cloud, set up by eMedLab, a consortium of seven academic research institutions. On the other side of the Atlantic, a couple of months earlier, the US National Science Foundation (NSF) awarded a five-year, $5 million grant to the Aristotle Cloud Federation to federate the private academic clouds of three, geographically dispersed universities.
Both the private and the federated clouds are attempts to solve the same two scientific problems: how can academic institutions with limited budgets afford the compute power necessary to analyse the huge data sets produced by modern science; and how can they share these data sets efficiently, without having to replicate them many times?
The two projects exemplify the recent surge of interest in high-performance cloud computing, as described in the feature article in the Feb/March issue of Scientific Computing World: 'HPC finally climbs into the cloud'.
According to Jacky Pallas, director of research platforms at University College London, and programme manager for the eMedLab project: ‘A lot of biomedical projects are wanting to access the same core datasets. For example, the International Cancer Genome Consortium dataset is two petabytes and we don’t want to be in a situation where datasets of such large scale are duplicated and replicated across different organisations.’
Data derived from patients or volunteers for medical research is sensitive, and there are legal and ethical constraints over who can have access and over where the data can physically be held. Just moving petabytes of data around is a challenge in itself: it takes time. Pallas estimates that even with a dedicated 10 Gigabit connection provided by the UK’s Joint Academic Network (Janet), it would still take a month to get a petabyte of data from the European Bioinformatics Institute into the eMedLab architecture. And replicating very large datasets, quite apart from the issue of moving the copies, quickly becomes onerous.
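As a rough illustration of the arithmetic behind that estimate, the sketch below works out the transfer time for a petabyte over a 10 Gigabit link; the 30 per cent effective-throughput figure is an assumption, since shared links, protocol overhead and storage bottlenecks rarely deliver the nominal line rate.

```python
# Back-of-the-envelope transfer time for 1 PB over a 10 Gb/s link.
# The effective-throughput fraction is an assumption, not a measured figure.

PETABYTE_BITS = 1e15 * 8           # 1 PB expressed in bits
LINE_RATE_BPS = 10e9               # 10 Gb/s nominal link speed
EFFECTIVE_FRACTION = 0.3           # assumed real-world efficiency

seconds = PETABYTE_BITS / (LINE_RATE_BPS * EFFECTIVE_FRACTION)
print(f"~{seconds / 86400:.0f} days")   # roughly a month at 30% efficiency
```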
Bring the compute close to the data
Part of the driver behind the creation of eMedLab was to have a petabyte data storage system very closely coupled to the compute infrastructure. In this way, Pallas explained: ‘We could hold these large data sets and have them analysed by multiple research groups asking different questions of these data’.
The reason the consortium opted for a cloud solution, rather than a straightforward HPC cluster, was, she continued, because ‘many different research groups were envisaging asking for resource to use quite different types of code and analysis pipelines, to ask different questions of the datasets’. With a cloud solution, ‘bioinformatics researchers could build their virtual machines – their preferred suite of pipelines -- on their desktops and port it into eMedLab’. Users can request as much compute as their analysis requires, up to 6,000 cores.
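The sketch below illustrates, in broad terms, how a researcher might upload a locally built virtual machine image to an OpenStack-based cloud such as eMedLab’s and launch it, using the openstacksdk Python library; the cloud name, image file, flavour and network names are hypothetical placeholders, not details of the eMedLab setup.

```python
import openstack

# Connect using credentials from a clouds.yaml entry or environment
# variables; the cloud name here is a hypothetical placeholder.
conn = openstack.connect(cloud="emedlab")

# Upload a VM image built on the researcher's own machine (e.g. a qcow2 file).
image = conn.create_image(
    "variant-calling-pipeline",            # hypothetical image name
    filename="pipeline.qcow2",             # hypothetical local image file
    disk_format="qcow2",
    container_format="bare",
    wait=True,
)

# Launch it on the cloud with a flavour sized to the analysis.
server = conn.create_server(
    "pipeline-run-01",
    image=image,
    flavor=conn.get_flavor("m1.xlarge"),   # hypothetical flavour name
    network="project-net",                 # hypothetical network name
    wait=True,
)
print(server.status)
```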
Strikingly similar challenges faced the Aristotle Cloud Federation, according to David Lifka, director of the Cornell University Center for Advanced Computing (CAC) and joint leader of the project: ‘Big data equals big dollars. People have to have a data management plan, and have to say how they are going to share the data and make it available. People are struggling with this. In different disciplines, notably genomics and astronomy, they are getting buried in data. They do not have a good way to share the data without just replicating it and, when you are talking about petabytes, it’s hard to replicate. If you can share the data by being able to analyse it at the source rather than move it, that’s a very cost effective model and makes it more manageable.’
Researchers in the driving seat
Part of the rationale was also to put the researchers in the driving seat: ‘We figured out that if you let academic collaborations drive the data sharing, then you need infrastructure to support that and thus a federation. You have so much data that you need to federate and share the resources across multiple institutions.’
eMedLab is a consortium of research institutions mostly located in London: University College London; Queen Mary University of London; the London School of Hygiene & Tropical Medicine; King’s College London; the Francis Crick Institute; the Wellcome Trust Sanger Institute; and the EMBL European Bioinformatics Institute. The cloud is physically located in a commercial data centre provider’s premises in Slough, a town to the west of London. The hardware was put together by UK-based integrator OCF, whose work also extended to the OpenStack software. ‘OCF have been brilliant in supporting the community,’ Pallas said.
The Aristotle Cloud Federation has no such compact geography but spreads from the east to the west coast of the USA: it is a joint undertaking by Cornell University (CU), the University at Buffalo (UB), and the University of California, Santa Barbara (UCSB). Each site has its own cloud infrastructure and so ‘it is truly a federation and the hardware is truly distributed,’ Lifka said.
Like eMedLab, the Aristotle federation has 10Gig connectivity, ‘and I can tell you that we are already thinking about 100 Gig connectivity in the future,’ Lifka added. The consortium uses Globus Online to move the data. In part this is because of reliability but it is also because of ease of authentication of users and access. The federation is using InCommon, a standard trust framework for US education and research that allows trustworthy shared management of access to on-line resources, as a way to authenticate users, and Globus supports that. ‘So with a single log-in, you have a standard way to move data, a standard way to authenticate each cloud, a standard way to launch your VMs, and it just becomes a matter of learning how to do that,’ Lifka said.
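A minimal sketch of what a scripted Globus transfer between two federation endpoints might look like, using the globus_sdk Python package, is shown below; the endpoint UUIDs, paths and access token are placeholders, and a real deployment would obtain the token through the InCommon-backed login flow described above.

```python
import globus_sdk

# Placeholders: real endpoint UUIDs, paths and the access token come from
# the federation's Globus setup and its InCommon-backed login flow.
SRC_ENDPOINT = "SOURCE-ENDPOINT-UUID"
DST_ENDPOINT = "DEST-ENDPOINT-UUID"
TRANSFER_TOKEN = "ACCESS-TOKEN"

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(TRANSFER_TOKEN)
)

# Describe the transfer: a recursive copy with checksum verification.
tdata = globus_sdk.TransferData(
    tc, SRC_ENDPOINT, DST_ENDPOINT,
    label="federated dataset copy",
    sync_level="checksum",
)
tdata.add_item("/shared/genomics/run42/", "/scratch/run42/", recursive=True)

# Submit the transfer and let Globus manage retries and integrity checks.
task = tc.submit_transfer(tdata)
print("Task ID:", task["task_id"])
```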
‘The beauty of the cloud in all of this is that, if you have an HPC cluster and you share an HPC cluster with someone else’s HPC cluster, you’re stuck with their software stack. Every time you want to change it, to customise for something you need, you have to have a fully connected graph of everyone making their cluster behave like everybody else’s cluster. It just does not scale.’ But, Lifka continued, ‘in a cloud, you just move your VM over and you have the environment you need to analyse your data.’
He stressed that the purpose was to make things easy for the researchers: ‘Our local clouds are always going to be modest, but you want to be able to make it as easy as possible to move -- you don’t want to hinder your researchers. So then you’re cherry picking: you’re giving the researchers the ability to optimise their budget; optimise their time to science; optimise the data that they have access to. Those are very difficult things to do on a standard HPC cluster.’
Differing views of the commercial cloud
Despite the similarities, there are significant differences between the two set-ups, deriving in part from geography and also from different legal constraints. In particular, these colour attitudes to ‘bursting out’ into the commercial cloud.
The original grant from the UK’s Medical Research Council focused on three disease areas: cancer, cardiovascular, and rare diseases. Pallas pointed out that the architecture of eMedLab was designed very specifically for this type of medical and bioinformatics research. ‘There is the issue around commercial cloud provider architecture – it’s very much commodity, not optimised for the kind of specialised architectures that we use across academia,’ she continued. In addition, in Europe there are legal constraints on where the data can be held physically: essentially it has to be located in countries, and under the control of organisations, that are subject to EU data protection legislation, and that makes US-based commercial providers legally sensitive.
A further issue is the speed at which a dataset could be moved into the commercial cloud, and the associated pricing: commercial cloud providers charge for data egress as well as for the analysis. ‘Data egress charges are at present a barrier to research groups; they can be quite significant if you are moving data around,’ Pallas concluded. However, she did not rule it out completely: ‘I’m not saying we’ll never burst out into the commercial cloud, I certainly think there is a value there.’
Eucalyptus or OpenStack?
Lifka was more upbeat about the potential of the commercial cloud for science. The Aristotle cloud has adopted Helion Eucalyptus, from Hewlett Packard Enterprise (HPE), as its software, rather than the OpenStack chosen by eMedLab. The reason is that Eucalyptus is an open-source implementation of the Amazon Web Services (AWS) APIs. Lifka said: ‘It is clear to us that Amazon is the number one public provider that people want to use and Eucalyptus is 100 per cent compatible.’
Eucalyptus fulfils the function of all cloud-enabling software by allowing users to pool compute, storage, and network resources and scale them up or down dynamically, as application workloads change. Anyone can download the software for free and build production-ready private and hybrid clouds compatible with the AWS APIs. Optional consulting is available from HPE.
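Because Eucalyptus exposes the AWS APIs, a standard AWS client library can, in principle, be pointed at a private cloud simply by overriding the endpoint URL. The sketch below assumes a hypothetical Eucalyptus endpoint, image ID and credentials; it illustrates API compatibility in general, not the Aristotle configuration.

```python
import boto3

# Point a standard AWS client at a private, AWS-compatible cloud by
# overriding the endpoint URL. All values below are placeholders.
ec2 = boto3.client(
    "ec2",
    endpoint_url="https://cloud.example.edu:8773/services/compute",
    region_name="eucalyptus",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Launch an instance exactly as one would on AWS itself.
response = ec2.run_instances(
    ImageId="emi-12345678",        # placeholder machine image ID
    InstanceType="m1.medium",
    MinCount=1,
    MaxCount=1,
)
print(response["Instances"][0]["InstanceId"])
```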
Three tiers in the federated cloud
Lifka envisages a three-tier cloud model: ‘First you run at home; then, when home is saturated, you burst to partner sites; and then, when that saturates, you move to an NSF cloud or to Amazon.’ On the pricing issue, Lifka is clear that ‘if you can keep a resource busy, then it is cheaper to do it at home; but if you can’t, it’s better to outsource it. If you never drive to work, why buy a car to have it sitting in your drive every day? But if you drive to work every day, it’s cheaper to buy that car than rent one.’
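Lifka’s car analogy amounts to a break-even calculation on utilisation, sketched below with invented cost figures; a real comparison would need local capital, power, staffing and cloud pricing data.

```python
# Illustrative break-even between owning capacity and renting it on demand.
# Both cost figures are invented for illustration only.

owned_cost_at_full_use = 0.03    # amortised capital + operations per core-hour
cloud_cost = 0.10                # assumed on-demand rate per core-hour

# Owning gets cheaper per useful core-hour the busier the machine is kept:
# effective cost = full-utilisation cost / utilisation fraction.
for utilisation in (0.1, 0.3, 0.5, 0.9):
    effective = owned_cost_at_full_use / utilisation
    cheaper = "own" if effective < cloud_cost else "rent"
    print(f"utilisation {utilisation:.0%}: {effective:.3f}/core-hour -> {cheaper}")
```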
The Aristotle project has grown out of Cornell’s early experiments with ‘a very modest-sized cloud, a complementary resource to the true HPC clusters we have at Cornell.’ But there was a realisation that even for a leading university such as Cornell there was a limit to the capital expenditure on compute resources. Capital cost could be spread if more than one institution joined together in the federated cloud but, he pointed out: ‘When people burst to Amazon, it’s because they need a lot more resources than we are going to capitalise. But we can offer a better price for the scale that we can keep busy, so we make it easy for everyone to pick the most cost effective price/performance solution.’ Metrics developed by Aristotle partners UB and UCSB will enable scientists to make informed decisions about when to use federated resources outside their institutions.
Although, as is the case for eMedLab, the NSF grant to Aristotle is focused on data-intensive applications, Lifka believes there will also be a lot of opportunities for computationally intensive jobs. However, like Pallas, he accepts that the commercial cloud providers ‘are never going to adopt tightly coupled infrastructure as their core business, because that hardware carries a premium and they’re not going to get enough business to recover it. They’re going to go straight commodity – throw-away pizza-box servers. That’s where the volume market is and that’s what they’re betting on.’ Even so, he is upbeat about this and believes that the research community will adapt the way it does computing to fit the type of compute resources it has available.
‘Time to science’ is what matters
He draws a parallel with the last episode in the history of HPC, when commodity servers replaced specialised components. ‘Go back to the 1980s when everyone was buying their favourite colour supercomputer. Then Intel came out and said “You can build a Beowulf cluster – it’ll do almost everything that Big Iron can do.” People scoffed and said: “That will never work. You have to have a Thinking Machines’ Connection Machine; or you’ve got to have an IBM SP.” Now look at where we are today. The industry drove the volume market and research adapted -- which is what research is really good at.’
Lifka sees a similar paradigm shift happening today, because what is important to researchers is ‘time to science’, not the length of time a job takes to compute. ‘If you can wait in line at a national supercomputing centre and it takes five days in the queue for your job to run, and then you get 50,000 cores and your job runs in a few hours, that’s great. But what if you could get those 50,000 cores right now, no waiting, and your job takes longer to run but it would still finish before your other job would start on the big iron machine?’
‘Time to science is what matters,’ he insisted, ‘not how many cores you can use in a tightly coupled fashion. Researchers will adapt. What they care about is results; the best price; and least time in the queue.’
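The trade-off Lifka describes can be written as a simple sum, time to science equals waiting time plus runtime, sketched below with the queue figures from his hypothetical scenario and an assumed runtime for the immediately available resource.

```python
# 'Time to science' = waiting time + runtime. Numbers follow the scenario
# above, with an assumed runtime for the immediately available resource.

queue_wait_hours, hpc_runtime_hours = 5 * 24, 3       # big iron: 5-day queue, fast run
cloud_wait_hours, cloud_runtime_hours = 0, 48         # assumed: start now, slower run

hpc_total = queue_wait_hours + hpc_runtime_hours      # 123 hours
cloud_total = cloud_wait_hours + cloud_runtime_hours  # 48 hours
print("Faster time to science:", "cloud" if cloud_total < hpc_total else "big iron")
```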
He is not suggesting that tightly coupled supercomputers are redundant: ‘The people that really need the high-end stuff are going to still need it and they are going to run it at national supercomputing centres. But very few academic institutions are going to be able to afford systems the size of Blue Waters or Stampede – it’s going to be Federally funded or game over.’
In Lifka’s view, this is going to force the majority of users -- those who can get by without that kind of nationally funded compute resource -- to figure out a new way of doing scientific computing. ‘For the first time, I see Administrative IT driving the trends instead of Research IT. People are containerising enterprise apps and making codes scale on demand using cloud infrastructure. As they do that, the research community is starting to catch on and to see the benefit. I think it is going to be a game changer.’
The revolution will not happen tomorrow: ‘We did not go from Intel-based laptops to Intel-based supercomputers in a day either. I think there will always be the need for big iron, but this is a complementary resource and it is going to free up the time on those big iron resources for the researchers that need it most.’ He concluded: ‘If everyone gets their time to science improved, everybody wins.’