Enabling Synthetic Data adoption in regulated domains

04/13/2022
by Giorgio Visani, et al.

The switch from a Model-Centric to a Data-Centric mindset puts the emphasis on data and its quality rather than on algorithms, and brings new challenges with it. In particular, the sensitive nature of the information handled in highly regulated scenarios needs to be accounted for. Specific approaches to the privacy issue have been developed, known as Privacy Enhancing Technologies. However, they frequently cause a loss of information, which creates a crucial trade-off between data quality and privacy. A clever way to bypass this conundrum relies on Synthetic Data: data obtained from a generative process that learns the properties of the real data. Both Academia and Industry have realized the importance of evaluating synthetic data quality: without reliable, all-round metrics, the data generation task has no proper objective function to maximize. Despite that, the topic remains under-explored. For this reason, we systematically catalog the important traits of synthetic data quality and privacy, and devise a specific methodology to test them. The result is DAISYnt (aDoption of Artificial Intelligence SYnthesis): a comprehensive suite of advanced tests, which sets a de facto standard for synthetic data evaluation. As a practical use case, a variety of generative algorithms have been trained on real-world Credit Bureau Data, and the best model has been identified by applying DAISYnt to the different synthetic replicas. Further potential uses include auditing and fine-tuning generative models, or ensuring the high quality of a given synthetic dataset. Ultimately, from a prescriptive viewpoint, DAISYnt may pave the way to synthetic data adoption in highly regulated domains, ranging from Finance and Insurance to Healthcare and Education.
