Missing Data Imputation using Optimal Transport

by Boris Muzellec, et al.

Missing data is a crucial issue when applying machine learning algorithms to real-world datasets. Starting from the simple assumption that two batches drawn at random from the same dataset should share the same distribution, we use optimal transport (OT) distances to quantify that criterion and turn it into a loss function for imputing missing values. We propose practical methods to minimize these losses with end-to-end learning, with or without parametric assumptions on the underlying distributions of values. We evaluate our methods on datasets from the UCI repository in MCAR (missing completely at random), MAR (missing at random), and MNAR (missing not at random) settings. These experiments show that OT-based methods match or outperform state-of-the-art imputation methods, even for high percentages of missing values.
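The batch-matching idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation (the repository listed further down is in PyTorch); it is a pure standard-library Python approximation written under several assumptions: an entropy-regularized OT cost computed with log-domain Sinkhorn iterations stands in for the paper's loss, gradients are taken with the transport plan held fixed, and the names `sinkhorn_plan`, `impute`, `eps`, `lr`, and `batch_size` are hypothetical choices made for illustration.

```python
import math
import random

def logsumexp(values):
    # Numerically stable log(sum(exp(v))) for a list of floats.
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def sinkhorn_plan(X, Y, eps=1.0, n_iter=50):
    """Entropic OT plan between uniform measures on point lists X and Y.

    Returns (P, C): the transport plan and the squared-distance cost matrix.
    """
    n, m = len(X), len(Y)
    C = [[sum((a - b) ** 2 for a, b in zip(x, y)) for y in Y] for x in X]
    log_a, log_b = -math.log(n), -math.log(m)
    f, g = [0.0] * n, [0.0] * m  # dual potentials
    for _ in range(n_iter):
        f = [-eps * logsumexp([log_b + (g[j] - C[i][j]) / eps for j in range(m)])
             for i in range(n)]
        g = [-eps * logsumexp([log_a + (f[i] - C[i][j]) / eps for i in range(n)])
             for j in range(m)]
    P = [[math.exp(log_a + log_b + (f[i] + g[j] - C[i][j]) / eps)
          for j in range(m)] for i in range(n)]
    return P, C

def impute(data, mask, batch_size=8, steps=30, lr=0.1, eps=1.0, seed=0):
    """Fill entries where mask[i][k] is True by pulling random batches together.

    Assumes every column has at least one observed value and that
    len(data) >= 2 * batch_size. Observed entries are never modified.
    """
    rng = random.Random(seed)
    n, d = len(data), len(data[0])
    X = [row[:] for row in data]
    # Initialize missing entries with the observed column mean plus small noise.
    for k in range(d):
        observed = [data[i][k] for i in range(n) if not mask[i][k]]
        mean = sum(observed) / len(observed)
        for i in range(n):
            if mask[i][k]:
                X[i][k] = mean + 0.1 * rng.gauss(0, 1)
    for _ in range(steps):
        idx = rng.sample(range(n), 2 * batch_size)
        I, J = idx[:batch_size], idx[batch_size:]  # two disjoint random batches
        P, _ = sinkhorn_plan([X[i] for i in I], [X[j] for j in J], eps)
        # Gradient of sum_ij P_ij * ||x_i - y_j||^2 with P held fixed,
        # collected first so that all updates use the same snapshot of X.
        updates = []
        for r, i in enumerate(I):
            for k in range(d):
                if mask[i][k]:
                    grad = sum(2 * P[r][c] * (X[i][k] - X[j][k])
                               for c, j in enumerate(J))
                    updates.append((i, k, grad))
        for c, j in enumerate(J):
            for k in range(d):
                if mask[j][k]:
                    grad = sum(2 * P[r][c] * (X[j][k] - X[i][k])
                               for r, i in enumerate(I))
                    updates.append((j, k, grad))
        for i, k, grad in updates:
            # Rescale by batch_size because each row of P sums to about 1/batch_size.
            X[i][k] -= lr * batch_size * grad
    return X
```

Calling `impute(data, mask)` on a list of rows returns a completed copy in which only the masked entries have changed. A practical version would more likely rely on automatic differentiation, as the PyTorch repository suggests, rather than the hand-written fixed-plan gradient used here.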


Multiple Imputation Using Deep Denoising Autoencoders

Missing data is a well-recognized problem impacting all domains. State-o...

Minimax rate of consistency for linear models with missing values

Missing values arise in most real-world data sets due to the aggregation...

A Modulation Layer to Increase Neural Network Robustness Against Data Quality Issues

Data quality is a common problem in machine learning, especially in high...

Deep Distribution-preserving Incomplete Clustering with Optimal Transport

Clustering is a fundamental task in the computer vision and machine lear...

Using Undersampling with Ensemble Learning to Identify Factors Contributing to Preterm Birth

In this paper, we propose Ensemble Learning models to identify factors c...

Conditional expectation for missing data imputation

Missing data is common in datasets retrieved in various areas, such as m...

Code Repositories


A PyTorch implementation of missing data imputation using optimal transport.