Sample Size in Natural Language Processing within Healthcare Research

09/05/2023
by Jaya Chaturvedi, et al.

Sample size calculation is an essential step in most data-based disciplines. Sufficiently large samples help ensure that the sample is representative of the population, and they determine the precision of estimates. This is true for most quantitative studies, including those that employ machine learning methods such as natural language processing, where free text is used to generate predictions and classify instances of text. Within the healthcare domain, the lack of sufficient corpora of previously collected data can be a limiting factor when determining sample sizes for new studies. This paper addresses the issue by making recommendations on sample sizes for text classification tasks in the healthcare domain. Models trained on the MIMIC-III database of critical care records from Beth Israel Deaconess Medical Center were used to classify documents as having or not having Unspecified Essential Hypertension, the most common diagnosis code in the database. Simulations were performed using various classifiers on different sample sizes and class proportions. This was repeated for a comparatively less common diagnosis code within the database, diabetes mellitus without mention of complication. Smaller sample sizes gave better results with a K-nearest neighbours classifier, whereas larger sample sizes gave better results with support vector machines and BERT models. Overall, a sample size larger than 1000 was sufficient to provide decent performance metrics. The simulations conducted within this study provide guidelines that can be used as recommendations for selecting appropriate sample sizes and class proportions, and for predicting expected performance, when building classifiers for textual healthcare data. The methodology used here can be adapted to estimate sample sizes for other datasets.
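The simulation design described in the abstract (repeated draws at different sample sizes and class proportions) can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the function names `draw_sample` and `simulation_grid`, the binary 0/1 label encoding, and the number of repeats are all assumptions made for the example. It covers only the subsampling step; training and scoring a classifier on each draw is left out.

```python
import random


def draw_sample(labels, n, pos_fraction, seed=0):
    """Return shuffled indices of a subsample of size n in which
    round(n * pos_fraction) documents carry the positive label (1).

    `labels` is a hypothetical list of 0/1 document labels.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_pos = round(n * pos_fraction)
    n_neg = n - n_pos
    if n_pos > len(pos) or n_neg > len(neg):
        raise ValueError("not enough documents in one class for this setting")
    # Sample without replacement within each class, then mix the two classes.
    chosen = rng.sample(pos, n_pos) + rng.sample(neg, n_neg)
    rng.shuffle(chosen)
    return chosen


def simulation_grid(labels, sizes, proportions, repeats=3):
    """Yield (n, proportion, repeat, indices) for every feasible setting.

    Settings that ask for more documents than a class contains are skipped.
    """
    for n in sizes:
        for p in proportions:
            for r in range(repeats):
                try:
                    yield n, p, r, draw_sample(labels, n, p, seed=r)
                except ValueError:
                    continue
```

Each set of indices yielded by `simulation_grid` would then select the documents used to fit and evaluate one classifier, so that performance can be compared across sample sizes and class proportions.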

