Synthetic data, real errors: how (not) to publish and use synthetic data

05/16/2023
by Boris van Breugel, et al.

Generating synthetic data through generative models is gaining interest in the ML community and beyond, promising a future where datasets can be tailored to individual needs. Unfortunately, synthetic data is usually imperfect, which can introduce errors in downstream tasks. In this work, we explore how the generative process affects the downstream ML task. We show that the naive synthetic data approach – using synthetic data as if it were real – leads to downstream models and analyses that do not generalize well to real data. As a first step towards better ML in the synthetic data regime, we introduce Deep Generative Ensemble (DGE) – a framework inspired by Deep Ensembles that aims to implicitly approximate the posterior distribution over the parameters of the generative process model. DGE improves downstream model training, evaluation, and uncertainty quantification, vastly outperforming the naive approach on average. The largest improvements are achieved for minority classes and low-density regions of the original data, for which the generative uncertainty is largest.
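To make the ensemble idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a toy stand-in for the deep generative model (a per-class Gaussian fitted on a bootstrap resample, where the bootstrap replaces the diversity a deep model would get from different random initializations), a hand-written logistic regression as the downstream model, and hypothetical function names (`fit_toy_generator`, `sample_synthetic`, `dge_predict`). It fits K generators, trains one downstream model per synthetic dataset, and averages the predictions; the spread across members serves as a rough proxy for generative uncertainty.

```python
import numpy as np


def fit_toy_generator(X, y, rng):
    """Stand-in generative model (assumption): per-class Gaussian on a bootstrap resample."""
    idx = rng.integers(0, len(X), size=len(X))
    Xb, yb = X[idx], y[idx]
    params = {}
    for c in np.unique(y):
        Xc = Xb[yb == c]
        if len(Xc) < 2:  # bootstrap missed the class: fall back to the original rows
            Xc = X[y == c]
        params[c] = (Xc.mean(axis=0), Xc.std(axis=0) + 1e-6, max(len(Xc), 2))
    return params


def sample_synthetic(params, rng):
    """Draw one synthetic dataset from the fitted toy generator."""
    Xs, ys = [], []
    for c, (mu, sigma, n) in params.items():
        Xs.append(rng.normal(mu, sigma, size=(n, len(mu))))
        ys.append(np.full(n, c))
    return np.vstack(Xs), np.concatenate(ys)


def fit_logreg(X, y, steps=500, lr=0.1):
    """Binary logistic regression trained with plain gradient descent."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w


def predict_proba(w, X):
    """Predicted probability of class 1."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return 1.0 / (1.0 + np.exp(-Xb @ w))


def dge_predict(X_real, y_real, X_test, K=5, seed=0):
    """Train K generators, one downstream model per synthetic set, then aggregate."""
    rng = np.random.default_rng(seed)
    member_probs = []
    for _ in range(K):
        params = fit_toy_generator(X_real, y_real, rng)
        X_syn, y_syn = sample_synthetic(params, rng)
        w = fit_logreg(X_syn, y_syn)  # downstream model sees synthetic data only
        member_probs.append(predict_proba(w, X_test))
    member_probs = np.array(member_probs)
    # Mean is the ensemble prediction; std reflects disagreement between generators.
    return member_probs.mean(axis=0), member_probs.std(axis=0)


# Toy "real" data: a majority class (0) and a much smaller minority class (1).
data_rng = np.random.default_rng(42)
X_real = np.vstack([
    data_rng.normal(-2.0, 1.0, size=(200, 2)),
    data_rng.normal(2.0, 1.0, size=(20, 2)),
])
y_real = np.concatenate([np.zeros(200, dtype=int), np.ones(20, dtype=int)])
X_test = np.array([[-2.0, -2.0], [2.0, 2.0]])

mean_prob, spread = dge_predict(X_real, y_real, X_test, K=5, seed=0)
print("ensemble P(class 1):", mean_prob)
print("spread across members:", spread)
```

The naive approach corresponds to K=1 with the spread ignored. In a realistic setting the toy generator would be replaced by a deep generative model trained K times with different seeds, which this sketch does not attempt.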


research
04/07/2023

Beyond Privacy: Navigating the Opportunities and Challenges of Synthetic Data

Generating synthetic data through generative models is gaining interest ...
research
06/27/2023

On the Usefulness of Synthetic Tabular Data Generation

Despite recent advances in synthetic data generation, the scientific com...
research
07/27/2022

Towards Clear Expectations for Uncertainty Estimation

If Uncertainty Quantification (UQ) is crucial to achieve trustworthy Mac...
research
05/26/2023

On Consistent Bayesian Inference from Synthetic Data

Generating synthetic data, with or without differential privacy, has att...
research
04/24/2023

A Study on Improving Realism of Synthetic Data for Machine Learning

Synthetic-to-real data translation using generative adversarial learning...
research
09/10/2023

A supervised generative optimization approach for tabular data

Synthetic data generation has emerged as a crucial topic for financial i...
research
05/30/2023

How Generative Models Improve LOS Estimation in 6G Non-Terrestrial Networks

With the advent of 5G and the anticipated arrival of 6G, there has been ...
