Data collection and formats in drug discovery


At the very beginning of the drug development process, when identifying targets or looking at the properties of molecules, data managers and scientists need to set parameters for recording the results of experiments.

John McGonigle, Director of Bioinformatics at Insmed, encourages his teams to leave no stone unturned. “I tend to be quite conservative when it comes to data modelling,” he says. “When you’re starting to work with a data set, it’s very difficult to know what you’ll need for subsequent data modelling. I encourage our scientists to think quite exhaustively in terms of what metadata we might want to then collapse down upon later, such as for Quality Control or data lighthousing (i.e. helping to identify the source of a data issue, such as a faulty instrument set-up).

“Once we have that raw metadata, we figure out where we’re going to store that data, be it a data warehouse in a data lake-style approach, or a relational database, for example.”

Daniel Suveges, Senior Bioinformatician, Open Targets Data Team, European Bioinformatics Institute (EBI), agrees that early data decisions are critical. “We try to formulate the data models accordingly and develop schemes to safeguard ourselves from future issues when repeated ingestion is going on,” he says.

Moritz von Stosch, Chief Innovation Officer, DataHow, agrees that breadth of data at the early stage is important. “Scientists have two opposing needs,” he says. “One is to collect all the data you need to answer the question at hand – what data do you need to make a decision; the other is the need to create the data as an ongoing asset. You want to collect data as broadly as possible, so you can reuse that data in the future. Striking that balance is really difficult, because collecting as much data as possible is a huge time investment.”

It’s not just internal data, of course – many drug discovery journeys start with data created elsewhere, as Manasa Ramakrishna, Associate Director, Knowledge Graph – Design and Operations, AstraZeneca, points out: “With external data sources, you usually have a template. For example, Open Targets has a standard set of information that you get. We would discuss with our scientists what fields from Open Targets would be most useful, so we always have one eye on where this data will end up.

“We might then have a second layer, particularly if you’re bringing internal and external data sources together. We will insist on all data sources having minimal criteria, such as standardised identifiers. We are pedantic about it because, otherwise, there’s no common language. Of course, we assist in the standardisation process where help is needed.”
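As a rough illustration of what such minimal criteria could look like in practice, the Python sketch below checks incoming records before they are merged. The field names (`target_id`, `source`) and the rule that targets carry an Ensembl-style gene ID are assumptions made for the example, not a description of AstraZeneca's actual checks.

```python
import re

# Assumption for illustration: targets are keyed by Ensembl-style gene IDs
# ("ENSG" followed by 11 digits) and every record must name its source.
TARGET_ID_PATTERN = re.compile(r"^ENSG\d{11}$")
REQUIRED_FIELDS = ("target_id", "source")


def find_problems(record):
    """Return a list of reasons a record fails the minimal criteria."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or not str(value).strip():
            problems.append(f"missing {field}")
    target_id = str(record.get("target_id") or "").strip()
    if target_id and not TARGET_ID_PATTERN.match(target_id):
        problems.append(f"non-standard target_id: {target_id}")
    return problems


def split_by_criteria(records):
    """Separate records that meet the criteria from those needing help."""
    accepted, needs_work = [], []
    for record in records:
        problems = find_problems(record)
        if problems:
            needs_work.append((record, problems))
        else:
            accepted.append(record)
    return accepted, needs_work
```

Records that land in `needs_work` come back with the reasons attached, which fits the idea of assisting with standardisation rather than simply rejecting the data.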

Common language is critical

That common language is critical in ensuring future decisions can be made on the basis of data collected from multiple sources. Nick Lynch, Founder, Curlew Research, says: “There’s often common minimum information and that drives most of the data model to be able to make those progression decisions. All this data is trying to support decision-making or progress along the pipeline.”

Instrumentation is a key variable in any research project. With laboratories upgrading constantly, there will always be new instruments and their associated software packages being introduced. Darren Green, Director, DesignPlus Cheminformatics Consultancy, says: “There’s always a balance between wanting to allow researchers to use new instrumentation as soon as possible and needing the data schema in place to ingest any results. The need to get and publish results in the short term soon needs a relaxed approach, but data managers will need to impose data standards as soon as possible. It often means retrofitting metadata, but that’s the nature of fast-moving laboratory science.”

Kevin Back, Product Manager, Cambridge Crystallographic Data Centre (CCDC), adds: “With a new instrument purchase, it’s important to consider the data implications: what data that instrument will produce; how you want to store that data; and any metadata you want to capture. Additionally, you might have multiple different instruments producing similar sorts of data. So one needs to ensure it’s possible for that data to be read across formats, which isn’t always the case.

“Hopefully, the efforts of the Allotrope Foundation and other consortia to try and come up with open data formats will allow sharing between different laboratory instruments, but that work is yet to be completed.”
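Until open formats are widely adopted, one common workaround is a small reader per instrument that maps each export onto a shared internal schema. The Python sketch below shows the idea under stated assumptions: both vendor layouts, their field names and the target schema are invented for illustration and do not represent any real instrument or the Allotrope formats.

```python
# Hypothetical exports from two instruments measuring the same quantity.
# Field names and units are invented for illustration only.
VENDOR_A_ROW = {"sample": "S-001", "mass_mg": 12.5, "instrument": "balance-A"}
VENDOR_B_ROW = {"SampleID": "S-002", "Mass (g)": 0.5, "Device": "balance-B"}


def from_vendor_a(row):
    """Vendor A already reports milligrams; only the keys are renamed."""
    return {
        "sample_id": row["sample"],
        "mass_mg": float(row["mass_mg"]),
        "instrument": row["instrument"],
        "source_format": "vendor_a",
    }


def from_vendor_b(row):
    """Vendor B reports grams, so the value is converted to milligrams."""
    return {
        "sample_id": row["SampleID"],
        "mass_mg": float(row["Mass (g)"]) * 1000.0,
        "instrument": row["Device"],
        "source_format": "vendor_b",
    }


# One reader per known format; a new instrument means one new entry here.
READERS = {"vendor_a": from_vendor_a, "vendor_b": from_vendor_b}


def normalise(row, source_format):
    """Map a vendor-specific row onto the shared schema."""
    try:
        reader = READERS[source_format]
    except KeyError:
        raise ValueError(f"no reader registered for {source_format!r}") from None
    return reader(row)
```

Recording `source_format` alongside each value keeps the route back to the original export, which matters when a conversion later turns out to be wrong.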

AstraZeneca’s Ramakrishna says raw data can be stored in such a way that one can always go back to the source: “Some data platforms, such as Databricks, use a medallion structure. At the bronze level is all the raw data; the silver layer is where you’ve made some decisions; and, finally, the gold layer is what the end user actually needs. If a different user needs something different, we can go back to the bronze layer and provide them a different gold layer.”
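The layering Ramakrishna describes can be sketched in a few lines of plain Python. This is a toy model of the idea, not Databricks code: the records, field names and the two "gold" views are all hypothetical.

```python
from statistics import mean

# Bronze: raw measurements kept exactly as received (hypothetical fields).
# A tuple is used so the raw layer is never modified in place.
BRONZE = (
    {"compound": "CPD-1", "ic50_nm": "120", "qc": "pass"},
    {"compound": "CPD-1", "ic50_nm": "80", "qc": "pass"},
    {"compound": "CPD-2", "ic50_nm": "n/a", "qc": "fail"},
    {"compound": "CPD-2", "ic50_nm": "300", "qc": "pass"},
)


def to_silver(bronze):
    """Silver: apply decisions (drop QC failures, parse numbers) to copies."""
    return [
        {"compound": row["compound"], "ic50_nm": float(row["ic50_nm"])}
        for row in bronze
        if row["qc"] == "pass"
    ]


def gold_mean_ic50(silver):
    """Gold view for one user: mean IC50 per compound."""
    grouped = {}
    for row in silver:
        grouped.setdefault(row["compound"], []).append(row["ic50_nm"])
    return {compound: mean(values) for compound, values in grouped.items()}


def gold_qc_failures(bronze):
    """Gold view for a different user, rebuilt directly from bronze."""
    counts = {}
    for row in bronze:
        if row["qc"] != "pass":
            counts[row["compound"]] = counts.get(row["compound"], 0) + 1
    return counts
```

The second view needs rows that the silver layer discarded, which is why the untouched bronze layer is worth keeping.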

Data formats take a while to emerge

Green concludes that, sometimes, issues with data formats take a while to emerge: “One of the challenges you have is when you start wanting to use data outside of its original application. Often, results end up in a PDF, which is really difficult to run machine learning on. You haven’t got the data in a format that’s reusable.”

Lukas Kürten, Digital Innovation Manager, CPI, says one can put guidelines in place to help minimise ambiguity. “Using an Excel set-up for data storage as an analogy,” he says, “if you have a normal field, users can type in anything – and that can be open to interpretation. If that field is, instead, a drop-down menu, it’s clear what is meant. However, the trade-off is that there’s no means to record a note about, say, how the experiment didn’t go the way you expected. We are all in science – and science doesn’t always go the way you expect it to, particularly when it comes to manufacturing. Whatever system is implemented, it needs to strike that balance between removing ambiguity in field data, while allowing the flexibility to add context.”
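Kürten's drop-down analogy translates naturally into a data model: a fixed set of allowed values for the field that must be unambiguous, plus an optional free-text note for context. The Python sketch below is one way to express that balance. The outcome categories and field names are assumptions chosen for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    """The 'drop-down': the only values the outcome field may take."""
    SUCCESS = "success"
    PARTIAL = "partial"
    FAILED = "failed"


@dataclass(frozen=True)
class ExperimentEntry:
    experiment_id: str
    outcome: Outcome
    note: str = ""  # free text for context the fixed choices cannot capture


def parse_entry(experiment_id, outcome, note=""):
    """Build an entry, rejecting outcome values outside the allowed set."""
    try:
        parsed = Outcome(outcome.strip().lower())
    except ValueError:
        allowed = ", ".join(option.value for option in Outcome)
        raise ValueError(
            f"unknown outcome {outcome!r}; choose one of: {allowed}"
        ) from None
    return ExperimentEntry(experiment_id, parsed, note.strip())
```

An entry such as `parse_entry("EXP-7", "failed", "pump stalled after 20 minutes")` stays machine-readable in the outcome field while still recording what actually happened.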

Offering an industry expert view, Jim Thompson, Medical Devices and Pharmaceutical Lead, Siemens Digital Industries Software, says: “The emergence of standards, such as those from Allotrope, Pistoia Alliance or IDMP, is really key for external vendors in the laboratory space, since we want to be able to address the needs of as many customers as possible in a standardised, repeatable way. Consolidation around approaches to data is really important for the future.”

The full report is available to download as a White Paper, which also covers: Data silos and how to avoid them; Data ontologies and efficiency of process development; Cultural change and the digitisation journey; The shift to in silico for experiments; and Process optimisation and technology transfer.


The roundtable and series of articles are sponsored by Siemens Digital Industries.
