The Pistoia Alliance has announced the launch of the second phase of its DataFAIRy: Bioassay project, which aims to convert bioassay data into machine-readable formats that adhere to the FAIR guiding principles of Findable, Accessible, Interoperable and Reusable.
The current pilot phase has been sponsored by AstraZeneca, Bristol Myers Squibb, Novartis and Roche, and has successfully annotated 496 assays using a Natural Language Processing model that has been custom-built to recognise life sciences language. This second phase aims to scale the annotation process by 10- to 100-fold, and eventually to promote the data model as the industry standard.
Dr Vladimir Makarov, project manager of The Pistoia Alliance AI and ML Centre of Excellence, comments: ‘For the duration of my career, which has spanned the last thirty years, unstructured data has been a major problem for scientists. As the volume, variety and complexity of assay information continues to increase, organisations must manage their data more effectively, so that researchers can make the most out of their time and organisations can fully realise the benefits of digital transformation. The DataFAIRy model we have developed will not only reduce the time bench scientists spend searching for assay information; it may also allow them to skip experiments known to have failed in the past. In turn, this will decrease the costs for companies and accelerate vital research.’
Biological assays are analytical methods that are crucial for testing compounds being considered for new drugs, as well as monitoring environmental toxicity. There are currently more than 1.3 million biological assay protocols that exist in plain-text formats, such as published papers or vendor notes.
Selection and validation of assays currently require a labour-intensive search, taking scientists up to 12 weeks per assay. Adhering to the DataFAIRy model will reduce the time scientists spend searching for assays and planning assay experiments. In addition, assay metadata is a popular data type for data mining, but most of these published data and metadata are not in a form suitable for automated mining. They are partially annotated in public data banks, but the volume, depth and quality of these annotations are inadequate for addressing many current and future business questions. Gartner predicts that 85 per cent of AI projects will deliver erroneous outcomes due to data issues, for example, information not being machine-readable. Projects such as DataFAIRy are therefore crucial to AI adoption being successful in the life sciences.
Although digitalisation has made companies more aware of the importance of robust data management, the lack of industry standards is still a barrier to successful annotation and management of protocols, including assays. Adopting the FAIR principles is the first step towards enabling greater data sharing between organisations and helping scientists cope with the growing volume and complexity of data generated.
Additionally, current data models are not built to recognise scientific language, so a new model must be created to automate the annotation of these valuable resources. The second stage of the DataFAIRy project will further develop a model of this kind in a community-wide, collaborative way.
‘AI and Natural Language Processing tools need to be built with scientific terminology in mind in order to be successful,’ continues Dr Makarov. ‘The DataFAIRy model we have built will automate the annotation process so that assays are searchable and reusable, speeding up valuable research. We hope that this model will become the community standard for the publication of new assays and for the management of existing assays across vendors, regulatory agencies, and publishers, in addition to pharma and biotech.’