Mitigating dataset harms requires stewardship: Lessons from 1000 papers

08/06/2021
by Kenny Peng, et al.

Concerns about privacy, bias, and harmful applications have shone a light on the ethics of machine learning datasets, even leading to the retraction of prominent datasets including DukeMTMC, MS-Celeb-1M, TinyImages, and VGGFace2. In response, the machine learning community has called for higher ethical standards, transparency efforts, and technical fixes in the dataset creation process. The premise of our work is that these efforts can be more effective if informed by an understanding of how datasets are used in practice in the research community. We study three influential face and person recognition datasets, DukeMTMC, MS-Celeb-1M, and Labeled Faces in the Wild (LFW), by analyzing nearly 1000 papers that cite them. We find that the creation of derivative datasets and models, broader technological and social change, the lack of clarity of licenses, and dataset management practices can introduce a wide range of ethical concerns. We conclude by suggesting a distributed approach that can mitigate these harms, making recommendations to dataset creators, conference program committees, dataset users, and the broader research community.

