For scientists, data is the lifeblood of research. Collecting, organising and sharing data both within and across fields drives pivotal discoveries that make us better off and more secure.
Making data open and available, however, only answers part of the question about how different scientists, often with very different training, can draw useful conclusions from the same dataset. To promote and guide the cultivation and exchange of data, researchers have developed a set of principles intended to make data more findable, accessible, interoperable and reusable, or FAIR, for both people and machines.
Although the FAIR principles were first published in 2016, researchers are still figuring out how they apply to particular datasets. In a new study, a team from the US Department of Energy’s (DOE) Argonne National Laboratory, Massachusetts Institute of Technology, University of California San Diego, University of Minnesota, and University of Illinois at Urbana-Champaign has laid out a set of new practices to guide the curation of high-energy physics datasets and make them more FAIR.
Argonne computational scientist Eliu Huerta, an author of the study, comments: ‘The FAIR principles were created to serve as goals for data producers and publishers to improve data management and stewardship practices. The community expects that adhering to these principles will enhance the capabilities of machines to automate the finding and use of data, thereby streamlining the reuse of data for humans.’
The research, published in Nature Scientific Data, demonstrates how to FAIRify an open simulation dataset drawn from particle physics experiments at the CERN Large Hadron Collider. To highlight the interplay between artificial intelligence (AI) research and scientific visualisation, the study also provides software tools to visualise and explore this FAIR dataset.
In addition to building FAIR datasets, Huerta and his colleagues also sought to understand the FAIRness of AI models. ‘To have a FAIR AI model, we believe you need to have a FAIR dataset to train it on,’ said Yifan Chen, the first author of the paper and a graduate student at Illinois and Argonne’s Data Science and Learning division. ‘Applying the FAIR principles to AI models will automate and streamline the design and use of those models for scientific discovery.’
‘Our goal is to shed new light into the interplay of AI models and experimental data and help create a rigorous framework for the development of AI tools to address the biggest challenges in science,’ Huerta added. Ultimately, Huerta said, the goal of FAIRness is to create an agreed-upon set of best practices and methodologies, which will maximise the impact of AI and pave the way for the development of next-generation AI tools.
‘We’re looking at the entire discovery cycle, from data production and curation, design and deployment of smart and modern computing environments and scientific data infrastructures, and the combination of these to create AI frameworks that greatly advance our understanding of scientific phenomena,’ Huerta added.