A Unified Framework for Quantifying Privacy Risk in Synthetic Data

by Matteo Giomi, et al.

Synthetic data is often presented as a method for sharing sensitive information in a privacy-preserving manner: it reproduces the global statistical properties of the original data without disclosing information about any individual. In practice, as with other anonymization methods, privacy risks cannot be entirely eliminated; the residual risks must instead be assessed ex post. We present Anonymeter, a statistical framework that jointly quantifies different types of privacy risk in synthetic tabular datasets. We equip this framework with attack-based evaluations of the singling-out, linkability, and inference risks, the three key indicators of factual anonymization under the European General Data Protection Regulation (GDPR). To the best of our knowledge, we are the first to introduce a coherent, legally aligned evaluation of these three privacy risks for synthetic data, and to design privacy attacks that directly model the singling-out and linkability risks. We demonstrate the effectiveness of our methods with an extensive set of experiments measuring the privacy risks of data with deliberately inserted privacy leaks, and of synthetic data generated with and without differential privacy. Our results show that the three privacy risks reported by our framework scale linearly with the amount of privacy leakage in the data. Furthermore, synthetic data exhibits the lowest vulnerability to linkability, indicating that one-to-one relationships between real and synthetic records are not preserved. Finally, we show quantitatively that Anonymeter outperforms existing synthetic-data privacy evaluation frameworks both in detecting privacy leaks and in computation speed. To contribute to a privacy-conscious usage of synthetic data, we open-source Anonymeter at https://github.com/statice/anonymeter.
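To make the "attack-based evaluation" idea concrete, the following is a minimal sketch of an inference-risk measurement in the spirit the abstract describes: an attacker who knows some attributes of a target finds the closest synthetic record and guesses the secret attribute from it, and the risk is reported as the attacker's advantage over a naive baseline guesser. All function names and the toy data here are illustrative assumptions, not Anonymeter's actual API.

```python
import random

def inference_attack(original, synthetic, known_cols, secret_col):
    """For each original record, pick the synthetic record that matches on
    the most known attributes and use its secret attribute as the guess.
    Returns the fraction of correct guesses."""
    hits = 0
    for target in original:
        closest = min(
            synthetic,
            key=lambda s: sum(s[c] != target[c] for c in known_cols),
        )
        hits += closest[secret_col] == target[secret_col]
    return hits / len(original)

def naive_baseline(original, synthetic, secret_col, rng):
    """Baseline attacker: guess a secret value drawn at random from the
    synthetic data, ignoring the known attributes entirely."""
    hits = sum(
        rng.choice(synthetic)[secret_col] == t[secret_col] for t in original
    )
    return hits / len(original)

# Toy data (hypothetical): two known attributes and one secret attribute.
original = [
    {"age": 30, "zip": "A", "disease": 1},
    {"age": 40, "zip": "B", "disease": 0},
    {"age": 30, "zip": "B", "disease": 1},
    {"age": 50, "zip": "A", "disease": 0},
]
synthetic = [
    {"age": 30, "zip": "A", "disease": 1},
    {"age": 40, "zip": "B", "disease": 0},
    {"age": 50, "zip": "A", "disease": 0},
]

rng = random.Random(0)
attack = inference_attack(original, synthetic, ["age", "zip"], "disease")
base = naive_baseline(original, synthetic, "disease", rng)
# Report risk as the attack's advantage over the baseline, clipped to [0, 1].
risk = max(0.0, (attack - base) / (1.0 - base)) if base < 1.0 else 0.0
```

Comparing against a baseline attacker, rather than reporting the raw success rate, separates genuine privacy leakage from what any guesser could achieve from the marginal distribution alone; this is the same reasoning that lets the framework report residual risk rather than absolute attack accuracy.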

