IBM Research has announced that its Deep Search toolkit has been released as open source. Deep Search allows scientists and businesses to search and extract insights from unstructured data. The organisation has now released Deep Search for Scientific Discovery (DS4SD), making the toolkit more versatile and accessible.
Following the launch of the Generative Toolkit for Scientific Discovery (GT4SD) in March, the availability of DS4SD marks the next progression towards building an Open Science Hub for Accelerated Discovery.
To help achieve this goal, IBM chose to publicly release a key component of the Deep Search Experience: its automatic document conversion service. The service allows users to upload documents and inspect their conversion quality. DS4SD has a simple drag-and-drop interface, making it easy for non-experts to use. IBM also released deepsearch-toolkit, a Python package with which users can programmatically upload and convert documents in bulk. Users can point to a folder and direct the toolkit to upload the documents, convert them, and ultimately analyse the contents of the text, tables, and figures.
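The folder-based bulk workflow described above can be sketched in a few lines of Python. This is only an illustration of the pattern, not the deepsearch-toolkit API: `find_documents`, `convert_folder`, and `fake_convert_batch` are hypothetical names, and the conversion step is a placeholder that would be replaced by the toolkit's own upload-and-convert call.

```python
from pathlib import Path
from typing import Callable, Iterator


def find_documents(folder: Path, suffix: str = ".pdf") -> list[Path]:
    """Return every file under `folder` with the given suffix, in a stable order."""
    return sorted(
        p for p in folder.rglob("*") if p.is_file() and p.suffix.lower() == suffix
    )


def batches(items: list[Path], size: int) -> Iterator[list[Path]]:
    """Yield consecutive slices of `items` holding at most `size` entries each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def convert_folder(
    folder: Path,
    convert_batch: Callable[[list[Path]], list[dict]],
    batch_size: int = 10,
) -> dict[str, dict]:
    """Convert all PDFs in `folder`, keyed by their path relative to `folder`.

    `convert_batch` is an injected, hypothetical conversion function: it takes
    a list of paths and returns one result per path, in the same order.
    """
    results: dict[str, dict] = {}
    for batch in batches(find_documents(folder), batch_size):
        for path, converted in zip(batch, convert_batch(batch)):
            results[path.relative_to(folder).as_posix()] = converted
    return results


def fake_convert_batch(paths: list[Path]) -> list[dict]:
    """Placeholder for a real conversion service (assumption, not IBM's API)."""
    return [{"source": p.name, "status": "converted"} for p in paths]
```

Passing the conversion step in as a function keeps the folder handling independent of any particular service, so the same loop could be pointed at a real converter without changes.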
The new toolkit interacts and integrates with existing services, and is available to data scientists and engineers through the deepsearch-toolkit Python package.
There is a lot of value in unstructured data for scientific research. Consider IBM’s Project Photoresist, for example: IBM used Deep Search in 2020 to find and synthesise a novel photoacid generator molecule for semiconductor manufacturing. Existing photoacid generators pose environmental risks, and IBM wanted to discover a better option. Deep Search can ingest data up to 1,000 times faster and screen the data up to 100 times faster than a manual alternative, which allowed IBM to identify three candidate photoacid generators by the end of 2020. With its end-to-end, AI-powered workflow, IBM scaled and handled the problem with a speed that human scientists simply cannot match, dramatically accelerating the discovery process.
Deep Search uses AI to collect, convert, curate, and ultimately search huge document collections for information that is too specific for standard search tools to handle. It collects data from public, private, structured, and unstructured sources and leverages state-of-the-art AI methods to convert PDF documents into an easily decipherable JSON format with a uniform schema, ideal for today’s data scientists. It then applies dedicated natural language processing and computer vision machine-learning algorithms to these documents and ultimately creates searchable knowledge graphs.
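To make the last step more concrete, here is a toy sketch of turning converted documents into a searchable graph. It rests on loud assumptions: the `"text"` field is a hypothetical layout rather than the DS4SD schema, and plain substring matching stands in for the natural language processing models that Deep Search actually applies.

```python
from collections import defaultdict


def build_graph(documents: dict[str, dict], vocabulary: set[str]) -> dict[str, set[str]]:
    """Link each vocabulary term to the IDs of documents whose text mentions it.

    Each document is assumed to carry a list of text passages under "text"
    (a hypothetical field name used only for this illustration).
    """
    graph: dict[str, set[str]] = defaultdict(set)
    for doc_id, doc in documents.items():
        text = " ".join(doc.get("text", [])).lower()
        for term in vocabulary:
            # Case-insensitive substring match: a crude stand-in for entity extraction.
            if term.lower() in text:
                graph[term].add(doc_id)
    return dict(graph)


def search(graph: dict[str, set[str]], *terms: str) -> set[str]:
    """Return the IDs of documents that mention every one of `terms`."""
    matches = [graph.get(term, set()) for term in terms]
    return set.intersection(*matches) if matches else set()
```

Querying with several terms narrows the result to documents that mention all of them, which is the kind of specific lookup a keyword search over raw PDFs struggles with.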
The resulting datasets can help businesses make models and identify key trends that inform their decisions. For example, they could match a target acquisition’s financial performance over the past five years, as well as executive turnover during that time. There are exciting applications for Deep Search in healthcare, climate science, and materials research — anywhere large document collections have to be searched — and Deep Search makes it easier to get started.
Deep Search previously required users to provide their data or documents to be searched. IBM has now added more than 364 million public documents, such as patents and research papers. Commercial users of Deep Search can quickly get started searching this data, adding their own data incrementally.
The public release of the automatic document conversion service is only the first step for DS4SD. New capabilities, such as AI models and high-quality data sources, will be made available in the future.