
Understanding the Representation and Representativeness of Age in AI Data Sets

by Joon Sung Park, et al.

A diverse representation of demographic groups in AI training data sets is important for ensuring that models work for a wide range of users. To this end, recent efforts in AI fairness and inclusion have advocated for creating AI data sets that are well balanced across race, gender, socioeconomic status, and disability status. In this paper, we contribute to this line of work by focusing on the representation of age, asking whether older adults are represented in AI data sets in proportion to the population at large. As a case study, we examine publicly available information about 92 face data sets to investigate how the subjects' ages are recorded and whether older generations are represented. We find that older adults are very under-represented: five data sets in the study that explicitly documented the closed age intervals of their subjects included older adults (defined as older than 65 years), while only one included oldest-old adults (defined as older than 85 years). Additionally, we find that only 24 of the data sets include any age-related information in their documentation or metadata, and that no consistent method is followed across these data sets to collect and record the subjects' ages. We recognize the unique difficulties in creating data sets that are representative in terms of age, but raise age as an important dimension that researchers and engineers interested in inclusive AI should consider.




Data Representativeness in Accessibility Datasets: A Meta-Analysis

As data-driven systems are increasingly deployed at scale, ethical conce...

Towards measuring fairness in AI: the Casual Conversations dataset

This paper introduces a novel dataset to help researchers evaluate their...

Investigating Fairness of Ocular Biometrics Among Young, Middle-Aged, and Older Adults

A number of studies suggest bias of the face biometrics, i.e., face reco...

Coswara: A respiratory sounds and symptoms dataset for remote screening of SARS-CoV-2 infection

This paper presents the Coswara dataset, a dataset containing diverse se...

A Study of Age and Sex Bias in Multiple Instance Learning based Classification of Acute Myeloid Leukemia Subtypes

Accurate classification of Acute Myeloid Leukemia (AML) subtypes is cruc...

Synthetic Attribute Data for Evaluating Consumer-side Fairness

When evaluating recommender systems for their fairness, it may be necess...

Seeing Science

The ability to represent scientific data and concepts visually is becomi...