Encord, the platform for data-centric computer vision, has released Encord Active, a free, open-source, industry-agnostic toolkit that enables machine learning (ML) engineers and data scientists to understand and improve the quality of their training data and boost model performance.
For many use cases, such as self-driving cars and diagnostic medical models, AI suffers from a “production gap” between successful proof-of-concept models and models capable of running “in the wild.” Proof-of-concept models perform well in research environments but struggle to make accurate, consistent predictions in real-world scenarios. This gap stems from problems with model robustness and reliability, which have hindered the widespread adoption of AI.
With Encord’s open-source toolkit, ML engineers can bridge this gap using a new approach for investigating the quality of their data, labels, and model performance. Data and label errors can severely impact a model’s performance, so continuously evaluating and improving training datasets is critical for ensuring high-quality predictions. Encord’s new tool helps machine learning teams find failure modes in their models, prioritize high-value data for labeling, and drive smart data curation to improve model performance.
Active learning, a process for training models in which the model asks for data that can help improve its performance, has gained traction as a theory among researchers, start-ups, and enterprises. Smaller AI companies, however, have not yet been able to implement usable active learning techniques. Encord Active allows companies of all sizes to move from theory to implementation by providing a new methodology based on “quality metrics.” Quality metrics are indices computed on top of a team’s data, labels, and models, based on human-explainable concepts.
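As a rough illustration of the active learning idea described above, the sketch below ranks unlabeled samples by the entropy of the model's predicted class probabilities and picks the most uncertain ones for labeling. This is generic uncertainty sampling, not Encord Active's API; the function names, sample ids, and probability values are all hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_labeling(predictions, budget):
    """Return the ids of the `budget` most uncertain unlabeled samples.

    `predictions` maps a sample id to the model's class probabilities.
    """
    ranked = sorted(
        predictions,
        key=lambda sample_id: entropy(predictions[sample_id]),
        reverse=True,
    )
    return ranked[:budget]


# Hypothetical model outputs for four unlabeled images (illustrative values).
predictions = {
    "img_001": [0.98, 0.01, 0.01],
    "img_002": [0.40, 0.35, 0.25],
    "img_003": [0.70, 0.20, 0.10],
    "img_004": [0.34, 0.33, 0.33],
}

selected = select_for_labeling(predictions, budget=2)
print(selected)
```

The samples the model is least sure about are sent to annotators first, so each new label is more likely to improve the model than a randomly chosen one.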
Current active learning methods rely on ML engineers building their own tools and creating their own versions of quality metrics, which makes the process time-consuming and expensive. Encord Active removes that work by automating the computation of an assortment of pre-built quality metrics across the data, labels, and model predictions.
“As many ML engineers know, the performance of all models depends on the quality of their training data. Encord Active is first and foremost a framework built to help machine learning engineers understand and improve their data quality iteratively and effectively,” said Eric Landau, Co-Founder and CEO at Encord. “We want to contribute to the progression of the computer vision space as much as possible, so making Encord Active open source was a no-brainer.”
The quality metrics approach focuses on the automatic calculation of characteristics of images, labels, model predictions, and metadata. ML teams are then presented with a breakdown of their data, label distribution, and model performance by each metric. These insights allow them to:
- Find unknown failure modes in their datasets.
- Inspect whether their dataset is balanced across the different metrics and balance their dataset based on the quality metrics prior to labeling or training a model.
- Identify potential outliers in their dataset that can then be removed if they are unnecessary for the use case.
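To make the quality metrics idea concrete, here is a minimal sketch of one metric and one way to flag outliers with it. It assumes mean brightness as the metric and the interquartile range (IQR) rule as the outlier test. Both are illustrative choices rather than Encord Active's implementation, and the function names and image data are hypothetical.

```python
import statistics


def brightness(image):
    """Mean pixel intensity of a grayscale image given as rows of 0-255 values."""
    pixels = [value for row in image for value in row]
    return sum(pixels) / len(pixels)


def find_outliers(scores, k=1.5):
    """Return ids whose score falls outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(scores.values(), n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [sample_id for sample_id, s in scores.items() if s < low or s > high]


# Hypothetical 2x2 grayscale frames; frame_08 is heavily overexposed.
images = {
    "frame_01": [[90, 100], [95, 95]],
    "frame_02": [[100, 100], [100, 100]],
    "frame_03": [[100, 104], [102, 102]],
    "frame_04": [[105, 105], [105, 105]],
    "frame_05": [[108, 108], [108, 108]],
    "frame_06": [[110, 110], [110, 110]],
    "frame_07": [[112, 112], [112, 112]],
    "frame_08": [[250, 250], [250, 250]],
}

# Compute the metric once per image, then analyze the resulting index.
scores = {name: brightness(img) for name, img in images.items()}
outliers = find_outliers(scores)
print(outliers)
```

The same pattern (compute one score per item, then examine the distribution of scores) extends to checking dataset balance or slicing model performance by metric value.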
Encord Active is also the first tool to provide actionable end-to-end active learning workflows to create an environment where models can continuously learn and improve, similar to how humans do. Within the Encord ecosystem, users can not only find valuable data to label and find label errors to re-label but also complete the workflow cycle to fix these issues.
Encord is backed by CRV, Y Combinator, WndrCo, and Crane Venture Partners and is trusted by world-leading healthcare institutions. At King's College London, it helped annotate pre-cancerous polyp videos, increasing efficiency by an average of 6.4x and automating 97% of labels, which ultimately made the most expensive clinician 16x more efficient at labeling medical images. It has also worked with Memorial Sloan Kettering Cancer Center and Stanford Medical Centre, where it has reduced experiment duration by 80% and processed 3x more images.
Encord Active is available on GitHub.