Diverse Community Data for Benchmarking Data Privacy Algorithms

06/20/2023
by Aniruddha Sen, et al.

The Diverse Communities Data Excerpts are the core of a National Institute of Standards and Technology (NIST) program to strengthen understanding of tabular data deidentification technologies such as synthetic data. Synthetic data is an ambitious attempt to democratize the benefits of big data; it uses generative models to recreate sensitive personal data with new records for public release. However, it is vulnerable to the same bias and privacy issues that impact other machine learning applications, and can even amplify those issues. When deidentified data distributions introduce bias or artifacts, or leak sensitive information, they propagate these problems to downstream applications. Furthermore, real-world survey conditions such as diverse subpopulations, heterogeneous non-ordinal data spaces, and complex dependencies between features pose specific challenges for synthetic data algorithms. These observations motivate the need for real, diverse, and complex benchmark data to support a robust understanding of algorithm behavior. This paper makes four contributions: new theoretical work on the relationship between diverse populations and challenges for equitable deidentification; public benchmark data focused on diverse populations and challenging features curated from the American Community Survey; an open source suite of evaluation metrology for deidentified datasets; and an archive of evaluation results on a broad collection of deidentification techniques. The initial set of evaluation results demonstrates the suitability of these tools for investigations in this field.


