The NCI Imaging Data Commons as a platform for reproducible research in computational pathology

03/16/2023
by Daniela P. Schacherer, et al.

Objective: Reproducibility is critical for translating machine learning (ML)-based solutions in computational pathology (CompPath) into practice. However, an increasing number of studies report difficulties in reproducing ML results. The NCI Imaging Data Commons (IDC) is a public repository of >120 cancer image collections, including >38,000 whole-slide images (WSIs), that is designed to be used with cloud-based ML services. Here, we explore the potential of the IDC to facilitate reproducibility of CompPath research.

Materials and Methods: The IDC realizes the FAIR principles: all images are encoded according to the DICOM standard, persistently identified, discoverable via rich metadata, and accessible via open tools. Taking advantage of this, we implemented two experiments in which a representative ML-based method for classifying lung tumor tissue was trained and/or evaluated on different datasets from the IDC. To assess reproducibility, the experiments were run multiple times with independent but identically configured sessions of common ML services.

Results: The AUC values of different runs of the same experiment were generally consistent and in the same order of magnitude as those of a similar, previously published study. However, there were occasional small variations in AUC values of up to 0.044, indicating a practical limit to reproducibility.

Discussion and conclusion: By realizing the FAIR principles, the IDC enables other researchers to reuse exactly the same datasets. Cloud-based ML services enable others to run CompPath experiments in an identically configured computing environment without having to own high-performance hardware. The combination of both makes it possible to approach the reproducibility limit.
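The run-to-run comparison reported in the Results can be illustrated with a minimal Python sketch. It is not the authors' code: the function names `roc_auc` and `auc_spread` are hypothetical, the labels and scores are toy values invented for illustration, and AUC is computed with the standard pairwise (Mann-Whitney) definition. The sketch assumes each run scores the same evaluation set, and reports the spread as the largest minus the smallest AUC across runs.

```python
from typing import List, Sequence, Tuple


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs in which the
    positive sample receives the higher score; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auc_spread(
    labels: Sequence[int], runs: Sequence[Sequence[float]]
) -> Tuple[List[float], float]:
    """Return the AUC of every run and the max-minus-min spread."""
    aucs = [roc_auc(labels, run) for run in runs]
    return aucs, max(aucs) - min(aucs)


# Toy data (illustrative only): one evaluation set scored by two
# identically configured runs whose outputs differ slightly.
labels = [0, 0, 1, 1, 0, 1]
run_a = [0.10, 0.40, 0.35, 0.80, 0.20, 0.90]
run_b = [0.10, 0.30, 0.35, 0.80, 0.20, 0.90]

aucs, spread = auc_spread(labels, [run_a, run_b])
print("AUC per run:", aucs)
print("Spread:", spread)
```

In this toy case a single changed score shifts the AUC, which mirrors in miniature how small nondeterministic differences between runs can surface as variation in the reported metric.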


