In 2014, the State of Utah Science Technology and Research (USTAR) initiative and the University of Utah Health Sciences Center established the USTAR Center for Genetic Discovery (UCGD), intending to leverage Utah’s unique resources to create a computational genomics hub in the state. Today, the centre develops algorithms, software tools, analysis pipelines, and data management systems that enable researchers and clinicians to visualise and interpret genomic data.
Carson Holt received his PhD from the University of Utah under Mark Yandell before taking a postdoc position at the Ontario Institute for Cancer Research in Toronto. Holt was recruited back to the University of Utah by Yandell and has been at UCGD for the last ten years. He has worked on a number of projects, including the development of MAKER, a portable and easily configurable genome annotation pipeline, which formed his PhD thesis project.
What enticed you to return to UCGD?
Holt: Mark Yandell, my boss. He's just a great person. And I love the work available here at the University of Utah. They have great computational resources, for example the Center for High Performance Computing. I have access to thousands of CPUs, great people, and excellent storage resources for large data research. People don't realise how central the University of Utah has been in the history of computing.
You mention storage resources. How important are they for your work?
Holt: So much of what we do is data-limited. It's the biggest bottleneck. I have access to CPUs, but the I/O does not keep up with the CPUs. I just can't read and write the data fast enough. Having specialised fast storage is complex, and one of the issues is that you can't just buy a couple of hundred gigabytes. I have to buy four petabytes at a time. So finding the financial resources to make those large purchases every few years is difficult but also extremely important. My limiting factor is whether I can read and write the data fast enough, and it doesn't matter how many CPUs I have. They're going to sit idle if my storage doesn't keep up.
You have to educate people on the importance of storage because it's not very exciting to hear about storage. CPUs are exciting. GPUs, AI, that's exciting. Storage is not, but it really is the bottleneck of everything here at UCGD. So, yeah, it's very difficult. Every time we ask for money for a research project, medical genetics, etc., we always put in part of the budget for storage, and we make a very big point of explaining why it is important, because otherwise these projects won't get done. It sounds like boring minutiae, but it's the crux of everything.
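The point about idle CPUs can be put into rough numbers. The following is a minimal back-of-the-envelope sketch, not a description of UCGD's actual systems: the function names and every figure in it (50 MB/s of I/O per busy core, 10,000 MB/s of aggregate storage throughput) are hypothetical and chosen only to illustrate the argument.

```python
def sustained_cores(cores, mb_per_s_per_core, storage_mb_per_s):
    """Return how many cores the shared storage can keep busy.

    Assumption: every busy core streams mb_per_s_per_core from storage
    that delivers at most storage_mb_per_s in aggregate.
    """
    if mb_per_s_per_core <= 0:
        return float(cores)  # purely compute-bound job: storage never limits it
    return min(float(cores), storage_mb_per_s / mb_per_s_per_core)


def utilisation(cores, mb_per_s_per_core, storage_mb_per_s):
    """Fraction of the allocated cores that are doing useful work."""
    return sustained_cores(cores, mb_per_s_per_core, storage_mb_per_s) / cores


if __name__ == "__main__":
    # Hypothetical workload: 50 MB/s per core against 10,000 MB/s of storage.
    for cores in (100, 1000, 4000):
        busy = sustained_cores(cores, 50, 10_000)
        print(f"{cores} cores allocated, {busy:.0f} busy, "
              f"{utilisation(cores, 50, 10_000):.0%} utilised")
```

Under these assumed figures the storage can feed 200 cores, so going from 1000 to 4000 allocated cores adds no throughput at all; only faster storage moves the ceiling.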
What is the UCGD and how does it support genetics research?
Holt: The Utah Center for Genetic Discovery is this big academic initiative involving more than 50 researchers, scientists and developers. The three main PIs, Gabor Marth, Aaron Quinlan and Mark Yandell, are heavy hitters in bioinformatics. Mark developed MAKER as well as other tools that are heavily used for genome annotation and human disease research. Gabor Marth created FreeBayes, one of the first-ever variant callers. Aaron Quinlan developed BEDTools. Everybody uses BEDTools, and they have developed several other tools as well.
UCGD was created to share the software tools being developed by the centre with the wider academic community.
We take the research coming out of these labs and package it with our computational expertise because we're also part-time employees of the labs and have the computational resources. Collaborators have an easy location where they can go and say, “I want to use MAKER to annotate a genome” or “I want to use these other tools to explore human disease.”
They can come to the centre, and we have the computational background, the tools we help develop, and the computational resources the university provides. We can get these collaborations going quickly based on a simple recharge model. If a collaborator only needs four hours of help, we can charge them for four hours. If they need 25% of an employee's effort for three years, we can do that, too.
Who is a typical UCGD Core collaborator?
Holt: It's a mixture of academic and commercial. It's primarily academic, but we also work with commercial collaborators. One of the projects I can point to is the Utah NeoSeq Project at the University of Utah. Sick children in the NICU are more likely to have a genetic basis for their condition, so we sequence the kids and then do the analysis through UCGD Core. We see if we can identify genetic reasons underlying their condition, and we return that to the clinicians. If we can identify something that's clinically actionable, it goes directly back into the patient's treatment.
We have projects like that through the University of Utah, and we also have collaborations with Intermountain Health Network, where we're doing similar work with children in the NICU. We also work with the Undiagnosed Diseases Network and Penelope Project through Dr Lorenzo Botto at the University of Utah. These clinicians didn't have the bioinformatics expertise to do this part of their project, but they had the patients, the data sets, and the medical know-how. So we collaborate with them and provide this on a recharge basis. Those are a selection of three different projects, but we also do genome annotation. I annotated the Asian and African elephant genomes in collaboration with labs at the University of Utah. For any kind of significant bioinformatics research, we will help where we can.
Why does UCGD develop its own software tools?
Holt: We help develop a lot of these tools. I help develop MAKER. Barry Moore is one of the analysts in our group. He identified the first-ever human disease to be discovered through next-generation sequencing. It was the first time a new disease had been identified using VAAST, which was developed in the Yandell Lab. He commonly uses the successor to that tool, Viqgem, to analyse human disease data sets and see if he can identify the potential cause, and that's a tool we developed at the University of Utah.
We have many of those types of tools. There's an active back-and-forth where we use the tools we've previously developed and the data sets we get to create new tools. Once those tools are developed, we can offer them as services through UCGD Core.
Can you give me an example of how these tools evolve?
Holt: Okay, so an example would be VAAST. This software uses a burden test to identify the probable cause of a genetic disease, but that might not give us enough statistical power in all situations, so we have a new version called VAAST 2, pVAAST (pedigree Variant Annotation, Analysis & Search Tool), that now uses pedigree-based information.
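For readers unfamiliar with the term, the general idea of a burden test can be sketched as follows. This is a generic collapsing burden test with a one-sided Fisher's exact test, offered only as an illustration of the concept; it is not VAAST's own scoring method, and the function names, sample names and genotypes are all hypothetical.

```python
from math import comb


def count_carriers(genotypes):
    """Collapse a gene's rare variants into a per-sample yes/no.

    genotypes maps a sample name to a list of alternate-allele counts,
    one entry per rare variant in the gene. A sample is a carrier if
    any of those counts is above zero.
    """
    return sum(1 for counts in genotypes.values() if any(c > 0 for c in counts))


def burden_p_value(case_genotypes, control_genotypes):
    """One-sided Fisher's exact test: are carriers enriched among cases?"""
    n_cases = len(case_genotypes)
    n_controls = len(control_genotypes)
    case_carriers = count_carriers(case_genotypes)
    carriers = case_carriers + count_carriers(control_genotypes)

    # Hypergeometric tail: probability of seeing at least this many
    # carriers among the cases if carrier status were unrelated to disease.
    denominator = comb(n_cases + n_controls, carriers)
    tail = sum(
        comb(n_cases, k) * comb(n_controls, carriers - k)
        for k in range(case_carriers, min(carriers, n_cases) + 1)
    )
    return tail / denominator


if __name__ == "__main__":
    # Hypothetical toy data: three affected and three unaffected samples.
    cases = {"case1": [1, 0], "case2": [0, 1], "case3": [2, 0]}
    controls = {"ctrl1": [0, 0], "ctrl2": [0, 0], "ctrl3": [0, 0]}
    print(burden_p_value(cases, controls))
```

With only a handful of samples the smallest achievable p-value is large, which is one way to see why a burden test alone can run short of statistical power.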
So we take the family information to see how the disease segregates through the generations and, in conjunction with the burden test, we now have more statistical power. On top of that, we've developed prior probabilities based on a connection network built from phenotype and gene annotation. For example, if there's a GO term associated with a gene, and there are also other GO terms associated with it, we can look at the connections between multiple genes and develop a probability network. When we identify a gene with a specific phenotype, we're able to propagate probabilities across that network to identify new genes that may be associated purely through that network connection. We found that this increased the power to identify disease genes. That was a tool called PHEVOR, which came out a while ago.
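The idea of propagating probabilities across a network can be illustrated with a small sketch. This uses a random walk with restart, a common generic propagation scheme; it is not PHEVOR's actual algorithm, and the gene names, graph and `restart` parameter are hypothetical.

```python
def propagate(adjacency, seeds, restart=0.5, iterations=50):
    """Spread probability from seed genes over an undirected network.

    adjacency maps every gene to the set of genes it is connected to
    (each edge must be listed in both directions). At each step a gene
    shares its score equally among its neighbours, and a fraction
    `restart` of the total mass is returned to the seed genes.
    """
    nodes = sorted(adjacency)
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    score = dict(start)
    for _ in range(iterations):
        updated = {n: restart * start[n] for n in nodes}
        for n in nodes:
            neighbours = adjacency[n]
            if not neighbours:
                # Isolated gene: it has nobody to share with, so it keeps its mass.
                updated[n] += (1.0 - restart) * score[n]
                continue
            share = (1.0 - restart) * score[n] / len(neighbours)
            for m in neighbours:
                updated[m] += share
        score = updated
    return score


if __name__ == "__main__":
    # Hypothetical network: GENE_A - GENE_B - GENE_C, with GENE_D unconnected.
    network = {
        "GENE_A": {"GENE_B"},
        "GENE_B": {"GENE_A", "GENE_C"},
        "GENE_C": {"GENE_B"},
        "GENE_D": set(),
    }
    # GENE_A is the gene already linked to the phenotype.
    scores = propagate(network, seeds={"GENE_A"})
    for gene, value in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(gene, round(value, 3))
```

In this toy network the gene directly connected to the seed ends up ranked above the gene two steps away, and the unconnected gene receives nothing, which is the sense in which a network connection alone can nominate new candidate genes.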
These tools build on top of each other. As part of a collaboration, we might find that a tool would work much better for the work we're doing if we were to add something. So we develop a new tool, find out it works better, and then we can take on new collaborations to do similar work.
Carson Holt is the Director of UCGD Core, part of the UCGD facility at the University of Utah.