Addressing "Documentation Debt" in Machine Learning Research: A Retrospective Datasheet for BookCorpus

05/11/2021
by Jack Bandy, et al.

Recent literature has underscored the importance of dataset documentation work for machine learning, and part of this work involves addressing "documentation debt" for datasets that have been used widely but documented sparsely. This paper aims to help address documentation debt for BookCorpus, a popular text dataset for training large language models. Notably, researchers have used BookCorpus to train OpenAI's GPT-N models and Google's BERT models, even though little to no documentation exists about the dataset's motivation, composition, or collection process. We offer a preliminary datasheet that provides key context and information about BookCorpus, highlighting several notable deficiencies. In particular, we find evidence that (1) BookCorpus likely violates copyright restrictions for many books, (2) BookCorpus contains thousands of duplicated books, and (3) BookCorpus exhibits significant skews in genre representation. We also find hints of other potential deficiencies that call for future research, including problematic content, potential skews in religious representation, and lopsided author contributions. While more work remains, this initial effort to provide a datasheet for BookCorpus adds to the growing literature that urges more careful and systematic documentation for machine learning datasets.

