An international research consortium is to embark on the most ambitious genomics program to date, with the task of sequencing 1,000 genomes within three years. The data from the project will be freely available online, providing a reference point for researchers investigating the genetic factors involved in disorders such as heart disease, cancer and schizophrenia.
The research will provide a better understanding of the way our genes differ from person to person. Most of our gene sequences are very similar, but subtle variations between them can have important effects. By comparing different sequences, scientists have been able to find areas of the genome that contribute to the likelihood of developing a disease and to the efficacy of different therapies.
Currently, only a handful of human genomes have been sequenced. This has been sufficient to find areas of the genome that commonly vary from person to person, but so far the sample of subjects has not been big enough to find variations that occur in only a small fraction of the population.
‘We have an incomplete map of variation in our genome at the moment – it’s a bit hit and miss,’ Dr Richard Durbin from the Sanger Institute, a member of the consortium, told scientific-computing.com. ‘The 1000 Genomes Project will provide a much better picture.’
The consortium, which includes the Wellcome Trust Sanger Institute in Cambridge, UK, the Beijing Genomics Institute in China and the US National Human Genome Research Institute, hopes to close this gap by sequencing enough genomes to track genetic variations that occur in just one per cent of the global population.
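To see why sample size matters for rare variations, the following is a minimal sketch rather than the consortium's own method. It assumes that 'one per cent' can be read as the frequency of a variant among chromosome copies, that each person contributes two independent copies, and that people are sampled at random. The function name `detection_probability` is hypothetical, chosen for illustration.

```python
def detection_probability(variant_frequency: float, n_people: int) -> float:
    """Chance that a variant shows up at least once among n_people genomes.

    Assumption (not from the article): each person carries two independent
    chromosome copies, and each copy has the variant with the same probability.
    """
    n_copies = 2 * n_people
    # Probability that no copy carries the variant, subtracted from one.
    return 1.0 - (1.0 - variant_frequency) ** n_copies


if __name__ == "__main__":
    # Compare a handful of genomes with a sample of 1,000.
    for n in (5, 100, 1000):
        print(n, detection_probability(0.01, n))
```

Under these assumptions, a handful of genomes would usually miss a one per cent variant altogether, while a sample of 1,000 would almost certainly contain it. Seeing a variant once is not the same as confirming it, so real studies need more evidence than this simple calculation suggests.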
The project presents an enormous bioinformatics task: over its three-year period, six trillion DNA bases will be recorded, representing more genetics data than has been collected in the past 25 years. The consortium is developing new ways to cope with this volume of data.
Part of the solution is the Sanger Institute’s ‘compute farm’, which contains a large number of processors and more than 300TB of memory to process and store the large volumes of data. The institute is also developing software that compares the new data with older versions of the genome to determine whether the differences are true variations between individuals or errors in the sequencing process. In addition, the software will use statistical techniques to test whether the presence or absence of certain variations affects other genes in the genome.
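The idea of separating true variations from sequencing errors can be illustrated with a simple statistical test. The sketch below is not the Sanger Institute's software. It assumes that each read covering a position is wrong independently with a fixed error rate, and asks how likely the observed number of mismatching reads would be if errors were the only cause. The function names, the one per cent error rate and the threshold are all hypothetical values chosen for illustration.

```python
from math import comb


def prob_errors_alone(depth: int, mismatches: int, error_rate: float) -> float:
    """Binomial tail: chance of at least `mismatches` mismatching reads out of
    `depth` if every mismatch is an independent sequencing error."""
    return sum(
        comb(depth, k) * error_rate**k * (1.0 - error_rate) ** (depth - k)
        for k in range(mismatches, depth + 1)
    )


def looks_like_true_variant(
    depth: int,
    mismatches: int,
    error_rate: float = 0.01,   # hypothetical per-read error rate
    threshold: float = 1e-3,    # hypothetical significance cut-off
) -> bool:
    """Flag a position as a likely true variation when sequencing errors
    alone would rarely produce this many mismatching reads."""
    return prob_errors_alone(depth, mismatches, error_rate) < threshold


if __name__ == "__main__":
    # One stray mismatch in 20 reads is easily explained by error;
    # ten mismatches in 20 reads is not.
    print(looks_like_true_variant(depth=20, mismatches=1))
    print(looks_like_true_variant(depth=20, mismatches=10))
```

A production pipeline would also weigh base quality, alignment quality and evidence from other individuals; this example only shows the basic reasoning behind telling a real difference from a machine error.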