Datasheets for Datasets

03/23/2018
by Timnit Gebru, et al.

Currently there is no standard way to identify how a dataset was created, and what characteristics, motivations, and potential skews it represents. To begin to address this issue, we propose the concept of a datasheet for datasets, a short document to accompany public datasets, commercial APIs, and pretrained models. The goal of this proposal is to enable better communication between dataset creators and users, and help the AI community move toward greater transparency and accountability. By analogy, in computer hardware, it has become industry standard to accompany everything from the simplest components (e.g., resistors) to the most complex microprocessor chips with datasheets detailing standard operating characteristics, test results, recommended usage, and other information. We outline some of the questions a datasheet for datasets should answer. These questions focus on when, where, and how the training data was gathered, its recommended use cases, and, in the case of human-centric datasets, information regarding the subjects' demographics and consent as applicable. We develop prototypes of datasheets for two well-known datasets: Labeled Faces in the Wild (LFW) and the Pang & Lee Polarity Dataset.


