Today the European Bioinformatics Institute (EBI) maintains the world’s most comprehensive range of freely available and up-to-date molecular data resources. It provides 307 petabytes of raw data storage for bioinformatics data and receives more than 62 million web requests per day.
The focus on open science and on delivering infrastructure that gives scientists access to scientific data is at the heart of EBI’s mission to support bioinformatics research. The first step towards bioinformatics data resources that let scientists and researchers in Europe share sequence data came in 1980, with the creation of the EMBL Nucleotide Sequence Data Library (now EMBL Bank, part of the European Nucleotide Archive) at the European Molecular Biology Laboratory (EMBL) in Heidelberg, Germany.
Johanna McEntyre, associate director of EMBL-EBI services, senior scientist and head of literature services, said: ‘EBI was started more than 25 years ago now with a remit to provide data resources for life sciences. The original resource, the EMBL data library, housed nucleotide sequences for research. Sequencing technologies were taking off. They had moved out of being a research project in themselves, to something that more and more people were doing.
‘There was a requirement for a database to keep these things, because there is a lot of value in comparing sequences from different organisms.’
This initial EMBL data library grew and scientists realised other resources were being created that would also be of great value if they could be shared with the wider community. What began as a straightforward task of abstracting information from scientific literature soon grew to a major database, with researchers submitting data directly.
In 1992, EMBL Council voted to establish the EMBL-European Bioinformatics Institute (EMBL-EBI) and locate it on the Wellcome Trust Genome Campus in Hinxton, UK, where it would be in close proximity to the Wellcome Sanger Institute. In September 1994, EMBL-EBI was established in the UK.
‘EBI was created almost 27 years ago. The institute grew from just a few pioneers at that time to an 800-strong institute today,’ noted McEntyre.
EBI’s mission is split into five distinct areas: computational research; supporting industry use of bioinformatics data; providing training on the use of these resources; hosting the Elixir hub; and supporting scientific services and resources such as the European Nucleotide Archive (previously the EMBL Data Library).
Open access infrastructure supports Covid-19 research
Another aspect related to this open access data is research paper preprints, which have been very valuable for sharing data during the pandemic. A preprint allows researchers to share results with the scientific community in advance of peer review, making data available much faster than was previously possible.
‘Another very recent thing that we have done is a project based on Covid-19 preprints,’ said McEntyre. ‘Instead of being behind closed doors you just post your finished manuscript to a preprint server. When you do that, what happens is a very light screening process. That means that your results are available in 48 hours of submitting it, as opposed to months.
‘During the pandemic, preprints have been very important in very quickly sharing results. The model is that it doesn’t avoid peer review, it’s just that peer review happens after the fact,’ McEntyre added.
Need for open access research data
Today EBI has a huge collection of tools and data resources, including Europe PMC, a full-text database of research articles and abstracts that are openly available for everyone to read, with a subset of those available for reuse.
‘The point of running this database is because funders of life science research in the UK needed some infrastructure to support their open-access policies. Very simplistically put, their policies will say something along the lines of “we expect all our researchers to publish the outcomes of their research openly”,’ said McEntyre. ‘That will often specify that these papers should have a CC-BY licence, a licence that allows people to reuse that work without having to request permission – this can be very important for machine learning and AI.’
As the data volumes and types of resources increased, some of the organisations funding research were trying to find a way to make results openly available. ‘A lot of research funders, not just EBI, needed a repository to support that open access policy. They invented what was called UK PMC at the time, which has also now grown to include 30 funders across Europe, and so it is called Europe PMC,’ said McEntyre.
Europe PMC is a global, free, biomedical literature repository, providing access to worldwide life sciences articles, preprints, micropublications, books, patents and clinical guidelines. The resource currently contains more than 36 million abstracts and more than five million full-text articles. A subset of the full-text information corpus is the open access literature that can be downloaded and used from the FTP site, for example for text-mining research.
‘This database was originally created to support those open access policies introduced by research funders, but it does lots of other things as well. One of the most important things is to provide bulk downloads and APIs for programmatic access to the content,’ said McEntyre.
‘The second very important thing is linking to the data. When someone deposits data into one of our databases, it is usually because they have generated the data in the course of doing some experiments, and typically those experiments will be written up in the form of a research paper. You want to link the data to the literature for the biological context, and you want to link from the paper to the data, to show the provenance of the research results,’ McEntyre said.
‘This means that you can look at the data behind the paper and see for yourself whether it supports the assertions that have been made in the research paper. That linking between the literature and the data is very important in both directions.’
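The programmatic access McEntyre describes can be illustrated with a short Python sketch against the Europe PMC REST search service. This is a minimal illustration under stated assumptions, not an official client: the helper names (build_search_url, extract_titles, search) are hypothetical, and the endpoint, the query parameters and the response fields (resultList, result, id, title) are assumed from the public web-service documentation and should be checked against the current version before use.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the Europe PMC REST search service; verify against
# the current web-service documentation before relying on it.
BASE_URL = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_search_url(query, page_size=5):
    """Build a search URL that asks for JSON results (hypothetical helper)."""
    params = {"query": query, "format": "json", "pageSize": page_size}
    return f"{BASE_URL}?{urlencode(params)}"


def extract_titles(payload):
    """Return (id, title) pairs from a decoded search response.

    Assumes hits are nested under payload["resultList"]["result"];
    missing keys yield an empty list or None values rather than an error.
    """
    results = payload.get("resultList", {}).get("result", [])
    return [(record.get("id"), record.get("title")) for record in results]


def search(query, page_size=5):
    """Run a live search over HTTP and return (id, title) pairs."""
    with urlopen(build_search_url(query, page_size), timeout=30) as response:
        return extract_titles(json.load(response))
```

Calling search("covid-19") would send a single request and return up to five (id, title) pairs. The URL building and response parsing are kept in separate functions so that each can be checked without a network connection.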
Open access to the data generally improves the reproducibility of the science, as more people can see the outputs and access the contextual data that supports the research paper. However, linking the data is also very important for research, and trust in that research.
‘Coming from the data point of view, you might argue that linking to the data is less important, but where that really comes into its own is where people have developed algorithms to search through the data,’ stressed McEntyre. ‘I think, and many other people think, that linking to the data is very important, because you need to have as big a collection as possible of open access literature for people to invent new ways of doing things. For example, searching or browsing literature.’
For AI and machine learning to be possible, data needs to be made available in a way that allows scientists to generate huge datasets. Creating a large database of open data enables researchers to access data from several different sources that have all been stored and made available for reuse.
If data is not managed in the correct way, these types of activities are made much harder, as individuals would need to seek out that data manually from several different sources. This data may be stored using different formats, making data cleansing and pre-processing necessary. ‘Managing in the correct way means open data, it also means storing in formats that can readily be consumed by machines and humans, and having the appropriate amount of metadata,’ said McEntyre.
Open science infrastructure can open up new ways of conducting research and interrogating data, but ultimately the aim of open access research is to drive scientific discovery by sharing insights with other researchers.
‘Open science is a great equaliser. It is far more likely that someone in a developing country is going to have an internet connection, than access to a library of scientific data.
‘It opens up research to everybody in the world,’ said McEntyre.