A Machine Learning Pipeline for Automatic Extraction of Statistic Reports and Experimental Conditions from Scientific Papers
A common writing style for statistical results follows the recommendations of the American Psychological Association, known as APA style. In practice, however, writing styles vary, as reports are not 100% [...] are not reported despite being mandatory. In addition, statistics are not reported in isolation but in the context of the experimental conditions investigated and the general topic. We address these challenges by proposing STEREO, a flexible pipeline based on active wrapper induction and unsupervised aspect extraction. We applied our pipeline to the more than 100,000 documents in the CORD-19 dataset. It required only 0.25% [...] learn statistics extraction rules that cover 95% [...] The statistic extraction has nearly 100% [...] precision on non-APA writing styles. In total, we were able to extract 113k reported statistics, of which only <1% [...] the correct conditions from APA-conform reports (30% [...] model for topic extraction achieves a precision of 75% [...] in APA style (73% [...] foundation for automatic statistic extraction and future developments in scientific paper analysis. In particular, the extraction of non-APA-conform reports is important, as it enables applications such as giving authors feedback on what is missing and what could be changed.
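To make the idea of a statistics extraction rule concrete, the following is a minimal sketch of a single hand-written rule for one APA-style report format, a t-test written as "t(28) = 2.31, p = .027". The pattern, the function name `extract_t_tests`, and the output fields are assumptions chosen for illustration. They are not the rules learned by STEREO, whose rules are induced interactively through active wrapper induction rather than written by hand.

```python
import re
from typing import Dict, List

# Hypothetical rule for an APA-style t-test report, e.g. "t(28) = 2.31, p = .027".
# Illustration only: this is not one of the rules learned by STEREO.
APA_T_TEST = re.compile(
    r"\bt\s*\(\s*(?P<df>\d+(?:\.\d+)?)\s*\)"   # t(df)
    r"\s*=\s*(?P<value>-?\d*\.?\d+)"           # = test statistic
    r"\s*,\s*p\s*(?P<op>[<>=])\s*"             # , p followed by <, > or =
    r"(?P<p>0?\.\d+)"                          # p-value, leading zero optional
)


def extract_t_tests(sentence: str) -> List[Dict[str, object]]:
    """Return one record per APA-style t-test report found in the sentence."""
    return [
        {
            "df": float(m.group("df")),
            "value": float(m.group("value")),
            "p_operator": m.group("op"),
            "p": float(m.group("p")),
        }
        for m in APA_T_TEST.finditer(sentence)
    ]


if __name__ == "__main__":
    # An APA-conform report matches the rule.
    print(extract_t_tests("The groups differed, t(28) = 2.31, p = .027."))
    # A report without degrees of freedom does not match this strict rule.
    print(extract_t_tests("The groups differed (t=2.31, p<0.05)."))
```

The second call returns an empty list, which illustrates why a single strict pattern is not enough for reports that deviate from APA style: each deviating format needs its own, looser rule.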