The Genome Analysis Centre (TGAC) has hosted a four-day workshop, ‘Statistics for Ecology, Genetics and Genomics Using R’, to advance participants’ understanding of the role of statistical modelling in the analysis of biological data.
With the advent of high-throughput technologies, the quantity and complexity of data available for analysis have surged in biological fields such as ecology, genetics and genomics. This data has the potential to unlock a better understanding of the world; however, insights and progress often require intricate data analysis and interpretation.
R is a programming language used in statistical computing for data analysis and the development of statistical software. The workshop, designed to give trainees practical skills in R, explored an array of statistical modelling techniques, methods and concepts. The course increased the attendees’ knowledge in this area and demonstrated the benefits of applying such methods in their research. As the course progressed, the group studied a variety of complex models, including polygenic models and models for genome-wide association studies, as well as mixture models and hidden Markov models for investigating underlying structure.
Dr Vicky Schneider, head of the 361° Division at TGAC and co-organiser of the workshop, said: ‘More than ever, equipping biologists and ecologists with the ability to handle and analyse data through powerful open-source statistical packages in R is fundamental to their ability to face the challenges associated with high-throughput data. I am thrilled to be joined by Dr Tom Van Dooren, with whom we first organised an R course back in Leiden more than 12 years ago. Back then, we could not foresee how popular and widely adopted by the community R would become.’
Dr Tom Van Dooren, Senior Research Fellow, Institute of Ecology and Environmental Sciences Paris, co-organiser and main tutor, added: ‘What I’m trying to get across to the participants is very simple: to not just accept what the software package is doing for you but try to explore some alternatives and, if that’s possible, to go beyond the standard pipeline. Very often you have to use a tool for your bioinformatics data that is embedded in a pipeline, but there are alternative methods where you can modify the pipeline and do something new.’
Instructor Marie-Laure Martin-Magniette, Researcher at INRA and AgroParisTech, added: ‘The key point of this course is to explain the important statistical models for biological data. All software gives an answer, but some answers are wrong because you haven’t put sufficient input into the software. We’re trying to explain the models and the kind of interpretations you can do to resolve this.’
‘Whenever you have to deal with a dataset, whatever the dataset, you will have some kind of statistics involved,’ said course instructor Tristan Mary-Huard, Researcher at INRA and AgroParisTech. ‘You need to ensure that what you are doing from a statistical point of view is relevant and, as you start with a biological question but end with a statistical response, you have to make sure that there is a connection between the two. For this you have to dig a little bit into the model. The good news is that most biologists attending the training are already used to doing this; we are just helping them to push themselves forward and advance their usual analysis.’