Automating Document Classification with Distant Supervision to Increase the Efficiency of Systematic Reviews

by Xiaoxiao Li, et al.

Objective: Systematic reviews of scholarly documents often provide complete and exhaustive summaries of the literature relevant to a research question. However, well-done systematic reviews are expensive, time-consuming, and labor-intensive. Here, we propose an automatic document classification approach to significantly reduce the effort of reviewing documents.

Methods: We first describe a manual document classification procedure that is used to curate a pertinent training dataset, and then propose three classifiers: a keyword-guided method, a refined method based on cluster analysis, and a random forest approach that uses a large set of feature tokens. As an example, the approach is used to identify documents studying female sex workers that are assumed to contain content relevant to either HIV or violence. We compare the performance of the three classifiers by cross-validation and conduct a sensitivity analysis on the portion of the data used to train the model.

Results: The random forest approach provides the highest area under the curve (AUC) for both the receiver operating characteristic (ROC) curve and the precision/recall (PR) curve. Analyses of precision and recall suggest that, with the random forest classifier, manually reviewing 20% of the articles could capture 80% of the relevant cases. Finally, we found that a good classifier could be obtained with a relatively small training sample.

Conclusions: In sum, the automated document classification procedure presented here could improve both the precision and the efficiency of systematic reviews, and could facilitate live reviews, in which reviews are updated regularly.
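The keyword-guided idea and the "review 20% of the articles, capture 80% of the relevant cases" reading of the results can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `keyword_score` and `recall_at_fraction`, the substring matching rule, and the toy scores and labels are all assumptions made for illustration, and the numbers are not results from the paper.

```python
from typing import Sequence


def keyword_score(text: str, keywords: Sequence[str]) -> int:
    """Count how many keywords occur in the text (case-insensitive).

    Assumption: plain substring matching stands in for whatever
    matching rule a real keyword-guided classifier would use.
    """
    lowered = text.lower()
    return sum(1 for kw in keywords if kw.lower() in lowered)


def recall_at_fraction(
    scores: Sequence[float], labels: Sequence[int], fraction: float
) -> float:
    """Share of relevant documents found when only the top `fraction`
    of documents, ranked by descending score, is reviewed manually."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    total_relevant = sum(labels)
    if total_relevant == 0:
        return 0.0
    # Review at least one document, even for very small fractions.
    n_review = max(1, int(round(fraction * len(scores))))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    found = sum(label for _, label in ranked[:n_review])
    return found / total_relevant


# Hypothetical classifier scores and relevance labels (1 = relevant).
toy_scores = [0.9, 0.8, 0.7, 0.1, 0.2, 0.3, 0.05, 0.15, 0.25, 0.35]
toy_labels = [1, 1, 0, 0, 0, 1, 0, 0, 0, 0]

title = "HIV prevalence and violence among female sex workers"
print(keyword_score(title, ["hiv", "violence", "malaria"]))
print(recall_at_fraction(toy_scores, toy_labels, 0.2))
```

In this toy ranking, reviewing the top 20% of documents means reading two of ten, and the returned value is the fraction of all relevant documents that those two contain. The same function could be applied to the predicted probabilities of any classifier, including a random forest, to trace how recall grows as the reviewed fraction increases.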


