An experimental study on Synthetic Tabular Data Evaluation

11/19/2022
by   Javier Marin, et al.

In this paper, we compare several methodologies for measuring how similar synthetic data is to the tabular data samples it was generated from. We focus on the case where the synthetic data has many more samples than the real data. This case is especially difficult: the reliability of the synthetic data must be validated even though it contains far more samples than the original. We evaluate the global metrics most commonly used in the literature and introduce a novel approach based on analysis of the data's topological signature. Topological data analysis has several advantages in addressing this challenge. It studies qualitative geometric information, focusing on geometric properties while neglecting the quantitative values of distance functions. This is especially useful with high-dimensional synthetic data whose sample size has been significantly increased, which is comparable to introducing new data points into the data space within the limits set by the original data. In such large synthetic data spaces, points are much more concentrated than in the original space, and their analysis becomes much more sensitive both to the metrics used and to noise. Qualitative geometric information relies instead on the concept of "closeness" between points. Finally, we suggest an approach based on the data's eigenvectors for evaluating the level of noise in synthetic data. This approach can also be used to assess the similarity of original and synthetic data.
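The abstract does not spell out the eigenvector procedure, so the following is only a minimal sketch of one possible reading, not the authors' method. It assumes the comparison is made on the leading eigenvector of each table's covariance matrix: the share of variance carried by that eigenvector serves as a rough noise indicator, and the absolute cosine between the real and synthetic eigenvectors serves as a rough similarity score. All function names and the toy data generator are hypothetical.

```python
import math
import random

# Illustrative sketch only: the source abstract does not define its
# eigenvector procedure. Every name below is a hypothetical helper.

def covariance(rows):
    """Sample covariance matrix of a list of equal-length numeric rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def leading_eigenpair(matrix, iterations=500):
    """Power iteration: estimate the dominant eigenvalue and unit eigenvector."""
    d = len(matrix)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v^T M v gives the eigenvalue for the unit vector v.
    mv = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
    return sum(v[i] * mv[i] for i in range(d)), v

def spectral_summary(rows):
    """Return (variance share of the leading eigenvector, the eigenvector)."""
    cov = covariance(rows)
    value, vector = leading_eigenpair(cov)
    total_variance = sum(cov[i][i] for i in range(len(cov)))
    return value / total_variance, vector

def alignment(u, v):
    """Absolute cosine between two unit vectors (1.0 means same direction)."""
    return abs(sum(a * b for a, b in zip(u, v)))

def make_sample(n, noise, rng):
    """Toy 3-column table: every column is a multiple of one latent factor plus noise."""
    rows = []
    for _ in range(n):
        t = rng.gauss(0.0, 1.0)
        rows.append([t + rng.gauss(0.0, noise),
                     2.0 * t + rng.gauss(0.0, noise),
                     -t + rng.gauss(0.0, noise)])
    return rows

rng = random.Random(0)
real = make_sample(200, 0.1, rng)
faithful = make_sample(2000, 0.1, rng)  # ten times more rows, same structure
noisy = make_sample(2000, 5.0, rng)     # same structure, heavy added noise

share_real, vec_real = spectral_summary(real)
share_faithful, vec_faithful = spectral_summary(faithful)
share_noisy, vec_noisy = spectral_summary(noisy)
score_faithful = alignment(vec_real, vec_faithful)
```

In this toy setting, a synthetic table that preserves the original structure keeps a leading-eigenvector variance share close to that of the real table, while added noise spreads variance across directions and lowers the share. Power iteration is used here only to keep the sketch free of external dependencies; a full eigendecomposition would allow comparing more than one eigenvector.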

research
03/24/2023

repliclust: Synthetic Data for Cluster Analysis

We present repliclust (from repli-cate and clust-er), a Python package f...
research
09/12/2022

Rule-adhering synthetic data – the lingua franca of learning

AI-generated synthetic data allows to distill the general patterns of ex...
research
11/02/2022

Web-based Elicitation of Human Perception on mixup Data

Synthetic data is proliferating on the web and powering many advances in...
research
09/16/2018

Testing SensoGraph, a geometric approach for fast sensory evaluation

This paper introduces SensoGraph, a novel approach for fast sensory eval...
research
12/26/2018

Group evolution patterns in running races

We address the problem of tracking and detecting interactions between th...
research
01/15/2022

Sample Summary with Generative Encoding

With increasing sample sizes, all algorithms require longer run times th...
research
03/15/2018

Strategies to facilitate access to detailed geocoding information using synthetic data

In this paper we investigate if generating synthetic data can be a viabl...
