The confluence of data, compute power and advances in the design of algorithms for AI (artificial intelligence) and ML (machine learning) is driving new approaches in the laboratory. This gives scientists access to additional tools that can open new avenues for research or accelerate existing workflows.
The increasing interest in AI and ML is driving software companies to examine how they can develop their own software frameworks, or integrate functionality into existing laboratory software to support laboratory scientists’ use of AI.
Some examples of domain areas that are already seeing benefits of early AI adoption include predictive maintenance of instruments; predicting efficacy and potential of small molecules for drug discovery; and image analysis for a variety of different use cases such as crystallography and medical imaging.
Stephen Hayward, product marketing manager at Biovia, Dassault Systèmes, highlights the steps the company has taken to integrate AI functionality into its software: ‘We have a product called Biovia Pipeline Pilot, which is all about data science, data preparation, connecting data sources together and performing various functions on it. When we talk about machine learning and AI, it’s Pipeline Pilot that is core to that.’
As the adoption of AI and ML techniques becomes more widespread, they are beginning to transform how scientific research is conducted. However, organisations need to ensure their teams stay focused on their scientific goals rather than trying to develop expertise in advanced computational methods. While some staff should have a good understanding of AI and of the software frameworks used to build these intelligent systems, it is unreasonable to expect specialised domain scientists in the laboratory to develop skills in computer science or the development of AI frameworks.
This is driving software companies such as Biovia and Dotmatics to develop applications that support AI and ML while abstracting away some of the complexity, so that domain scientists can make use of the technology with limited support from AI experts inside their own organisation.
Pipeline Pilot, for example, allows data scientists to train models with only a few clicks, compare the performance of model types and save trained models for future use. However, expert users can also embed custom scripts in Python, Perl or R to maximise their use across the organisation. ‘Pipeline Pilot is, as the name implies, a data pipelining tool,’ says Hayward. ‘So it is able to open up many different formats of data and perform different functions on it. Sometimes it can be used for cleaning data or transforming data into different formats, or performing statistical operations on it.’
‘One of the key features is that it’s a visual tool, so you build literal pipelines of data and you can see each step that is going to be performed along this process that you put together,’ Hayward continued. ‘That in turn becomes what’s known as a Pipeline Pilot Protocol. So where you’re starting with a certain data format in one location, you’re performing a variety of functions on it, and then you’re outputting it somewhere else, whether that’s a new data file, or whether it’s pushing it into a different system at the end of the protocol.’
Since every model is tied to a protocol, organisations have insight into where the data comes from, how it is cleaned and what models generate the results. With the demand for custom data science solutions increasing, software developers need ways to streamline protocol creation. Pipeline Pilot wraps complex functions in simple drag-and-drop components that can be strung into a workflow. These protocols can be shared between users and groups for reuse, ensuring that solutions are developed faster and standardised.
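To make the pipelining idea concrete, the short Python sketch below strings together hypothetical load, clean, transform and export steps in the spirit of such a protocol. The file names, column names and cleaning rules are illustrative assumptions, not Pipeline Pilot components.

# A minimal sketch of the pipelining idea in plain Python/pandas.
# File paths, column names and cleaning rules are hypothetical.
import numpy as np
import pandas as pd

def load(path: str) -> pd.DataFrame:
    """Read raw instrument results from a CSV file."""
    return pd.read_csv(path)

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop incomplete rows and normalise column names."""
    df = df.dropna(subset=["sample_id", "response"])
    df.columns = [c.strip().lower() for c in df.columns]
    return df

def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Add a derived column, for example a log-transformed response."""
    df["log_response"] = np.log10(df["response"])
    return df

def export(df: pd.DataFrame, path: str) -> pd.DataFrame:
    """Write the processed data to a new file for downstream systems."""
    df.to_csv(path, index=False)
    return df

# The "protocol": an explicit, reusable sequence of steps.
steps = [load, clean, transform, lambda df: export(df, "processed.csv")]

result = "raw_results.csv"
for step in steps:
    result = step(result)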
The software enables scientists to start using AI and ML by making use of built-in AI and ML models. These can be used to run scientific calculations and analyses on multiple data sources, including image data, spectral data, DNA/RNA/protein sequences, chemistry, text, streaming (IoT/IoE) data, financial records and location data.
Pipeline Pilot enables users to automate model building using more than 30 supervised and unsupervised machine learning algorithms, including random forest, XGBoost, neural networks, linear regression, support vector machines, principal component analysis (PCA) and genetic function approximation (GFA).
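As a rough stand-in for that automated model building, the Python sketch below uses scikit-learn to train two of the algorithm types listed above on a synthetic dataset, compare their cross-validated performance and save the better model for reuse. The data and the choice of metric are assumptions for illustration, not Pipeline Pilot's built-in components.

# Illustrative sketch only: scikit-learn stands in for the built-in learners,
# and the synthetic regression dataset is an assumption.
import joblib
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=20, noise=0.1, random_state=0)

candidates = {
    "random_forest": RandomForestRegressor(n_estimators=200, random_state=0),
    "linear_regression": LinearRegression(),
}

# Compare model types with cross-validated R^2, as a user might compare
# algorithms before committing to one.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
best_name = max(scores, key=scores.get)
print(scores)

# Train the winning model on all the data and save it for future use.
best_model = candidates[best_name].fit(X, y)
joblib.dump(best_model, f"{best_name}.joblib")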
It is well understood in AI that the more data used to train a model, the more accurate it will become. This makes it imperative that organisations can draw on large data sets of their own, or get access to additional data from other sources. However, this also creates additional challenges, as the data needs to be standardised, cleaned, transformed and made comparable in order for a model to generate insights.
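As a simple illustration of that standardisation step, the Python sketch below harmonises hypothetical assay results reported by two labs with different column names and units so they can be combined and compared. All names and values are invented for the example.

# Hypothetical sketch: make data from two labs comparable before modelling.
import pandas as pd

lab_a = pd.DataFrame({"compound_id": ["C1", "C2"], "ic50_nM": [12.0, 340.0]})
lab_b = pd.DataFrame({"compound_id": ["C2", "C3"], "IC50 (uM)": [0.5, 1.2]})

# Standardise column names and convert everything to the same unit (nM).
lab_b = lab_b.rename(columns={"IC50 (uM)": "ic50_nM"})
lab_b["ic50_nM"] = lab_b["ic50_nM"] * 1000.0

combined = (pd.concat([lab_a, lab_b], ignore_index=True)
              .dropna()
              .drop_duplicates(subset=["compound_id", "ic50_nM"]))
print(combined)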
Dotmatics software and services are used throughout the industry to help research organisations keep on top of their data. Dotmatics principal consultant Dan Ormsby explains how the company’s software helps customers develop AI solutions: ‘Our customers are already using our services to query and report on all project data across disparate sources. The next natural step is to add an AI layer onto that.’
‘If you want to use artificial intelligence on your data, you must arrange the data and do any necessary gap-finding or aggregation,’ Ormsby continues. ‘For example, if you've got an ID of a compound that you've produced, then you need to be able to link that ID to other information, such as who made it, which bottles contain batches that have been tested in assays, whether they're active against a particular biological target, and so on. This is already accomplished by our software.’
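The kind of linkage Ormsby describes can be pictured as a series of joins across compound, batch and assay tables. The Python sketch below shows the idea on invented tables; the schema and column names are hypothetical stand-ins, not the Dotmatics data model.

# Hypothetical sketch: join a compound registry, batch records and assay
# results on shared IDs to answer "who made it, which batches were tested,
# and were they active?"
import pandas as pd

compounds = pd.DataFrame({"compound_id": ["C1", "C2"],
                          "chemist": ["A. Patel", "J. Lee"]})
batches = pd.DataFrame({"batch_id": ["B1", "B2", "B3"],
                        "compound_id": ["C1", "C1", "C2"]})
assays = pd.DataFrame({"batch_id": ["B1", "B3"],
                       "target": ["Kinase X", "Kinase X"],
                       "active": [True, False]})

linked = (compounds.merge(batches, on="compound_id")
                   .merge(assays, on="batch_id", how="left"))
print(linked)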
Feature engineering
Generating and transforming data to be AI-ready is just one step in the process. As Ormsby explains, applying domain knowledge to the data allows algorithms to focus only on the important features, increasing the chance of meaningful and interpretable results.
‘I’ve started my feature engineering work by focusing on small molecule drug discovery, such as lead optimisation. Dotmatics has many customers with workflows in that area, so you can imagine the level of domain knowledge we can help each of them bring to their data,’ noted Ormsby.
‘Every customer’s existing data is already a training set. There are timestamps on all submitted compounds and data, so we can see if compounds produced in the first month are predictive of the second month, then take the first two months and see if these are predictive of the third month, and so on, gradually using a larger set to retrospectively train the model,’ states Ormsby.
‘Then every customer’s data has a modelability metric calculated, so users can see how the modelability has increased over the course of the project to the present. If you find that past data were predictive, then you know that the model is finding a signal in there, which gives you confidence in it.’
This gives organisations detailed information on a project and its goals: they can see whether work done early in a project was predictive of work done later, which helps them determine whether project goals have been met and how achievable those goals were, based on the actual progress achieved.
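The retrospective scheme Ormsby outlines can be sketched as a rolling, time-ordered train-and-test loop. The Python example below does this on synthetic data, using the month-by-month R² score as a stand-in for a modelability signal; the features, activity values and model choice are all assumptions for illustration.

# Sketch of retrospective, time-ordered training: fit on the first n months,
# test on the next month, and watch how the score evolves. Data are synthetic.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n = 600
data = pd.DataFrame(rng.normal(size=(n, 5)), columns=[f"f{i}" for i in range(5)])
data["activity"] = data["f0"] * 2 + data["f1"] + rng.normal(scale=0.5, size=n)
data["month"] = np.repeat(np.arange(1, 7), n // 6)   # timestamps binned by month

for test_month in range(2, 7):
    train = data[data["month"] < test_month]
    test = data[data["month"] == test_month]
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(train.drop(columns=["activity", "month"]), train["activity"])
    pred = model.predict(test.drop(columns=["activity", "month"]))
    print(f"train months 1-{test_month - 1} -> month {test_month}: "
          f"R^2 = {r2_score(test['activity'], pred):.2f}")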
Applications for early AI adoption

The explosion of AI and ML is creating a wide range of new avenues for scientists and researchers to find insight in scientific data, but these are still fairly new ways of solving problems and the number of use cases continues to expand.
‘While that is the high level of how it works, what scientists are doing with this software is a very open-ended question,’ noted Biovia’s Hayward. ‘You can do different statistical analyses, and apply it to different machine learning problems. We’ve used it in the lab for different things like predictive maintenance of equipment. Another example that we often give is image analysis.’
Predictive maintenance draws on information about usage patterns, the types of samples and experiments being run on an instrument, and its support and maintenance history, and analyses those data points. The resulting model can provide detailed information about when maintenance and service actions need to take place for each instrument, or group of instruments, based on their specific parameters.
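One way to picture such a model is as a classifier trained on instrument usage features to flag units likely to need service soon. The Python sketch below does this on synthetic data; the feature names, labelling rule and choice of algorithm are hypothetical, not a description of Biovia's implementation.

# Hypothetical predictive-maintenance sketch on synthetic usage data.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 1000
usage = pd.DataFrame({
    "hours_since_service": rng.uniform(0, 2000, n),
    "samples_per_day": rng.poisson(40, n),
    "error_events_last_week": rng.poisson(1.5, n),
})
# Label: instrument needed unscheduled maintenance soon (synthetic rule).
needs_service = ((usage["hours_since_service"] > 1500)
                 | (usage["error_events_last_week"] > 4)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    usage, needs_service, test_size=0.25, random_state=0)
model = GradientBoostingClassifier().fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))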
Hayward also pointed out the potential for image analysis and how this has been applied to images of crystal structures: ‘When you’re analysing crystal structures, you’re trying to identify structures, what components you’re seeing and the attributes of the crystals that you’re looking at.’ He noted that this type of analysis can be very time-consuming, so effectively automating the steps to interpret large sets of images can provide huge benefits to laboratory efficiency. ‘But then the question is: what parameters do I need to look at here? How does a computer see this image in a digital way? How do I describe it in digital terms?’ said Hayward.
Once you have a good idea of what features you are looking for and how to describe them in digital terms, you can begin to leverage AI to find answers to some complex questions.
‘That’s where you can start training the system using a set of training images. The model can build out its own set of algorithms and come to the conclusion of how to sort out these images, and then you can apply it in that way going forward,’ noted Hayward. ‘By leveraging AI for these types of tasks, users can save huge amounts of time once the model has been trained.’
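A minimal version of that workflow can be sketched with scikit-learn, using its bundled digit images as a stand-in for crystallography images: train a classifier on a labelled set, then apply it to new images going forward. Real crystal-image analysis would need domain-specific feature extraction, so this is purely illustrative.

# Minimal sketch of training an image classifier from a labelled training set.
# The bundled 8x8 digit images stand in for real laboratory images.
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

digits = load_digits()                               # greyscale images + labels
X = digits.images.reshape(len(digits.images), -1)    # flatten pixels to features
y = digits.target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Once trained, the model can be applied to new images.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))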
One of the main challenges facing scientists and research organisations that want to make use of AI is ensuring that their data is appropriately stored and organised so that it can be used for AI and ML research without large amounts of manual preparation before a project can start.
If organisations take the time now to ensure that data is ready to be used in AI in the future, then they can save huge amounts of time further down the road when AI and ML become a prerequisite to laboratory success.
‘It’s a big challenge,’ notes Hayward. ‘That is what we’ve been seeing with our customers – they are often trying to standardise data so that they can compare it across different labs and make sure that it’s consistent.
‘You want to make sure that the data that you’re putting into the system can be opened up later and compared across the different locations. So regardless of where an experiment is taking place, you can look at it, see these two data sets, compare them and know that that’s an accurate representation,’ Hayward added. ‘For anybody that’s implementing a new platform, that’s a big concern to them, because they don’t want to get locked into just one thing, and then their old data is inaccessible or incomparable.’
Ormsby further stresses this point, noting that organisations will need to adopt AI to stay competitive. ‘You're either doing machine learning, or you're going to be out-competed in the next few years. You have to do this now just to have a chance of staying in business. People who fully embrace AI assistance just win.’