An optimal transport approach for selecting a representative subsample with application in efficient kernel density estimation

05/31/2022
by Jingyi Zhang, et al.

Subsampling methods aim to select a subsample as a surrogate for the observed sample. Such methods have been used pervasively in large-scale data analytics, active learning, and privacy-preserving analysis in recent decades. In this paper, we study model-free rather than model-based subsampling methods; model-free methods aim to identify a subsample that is not confined by model assumptions. Existing model-free subsampling methods are usually built upon clustering techniques or kernel tricks. Most of these methods suffer from either a large computational burden or a theoretical weakness, namely that the empirical distribution of the selected subsample may not converge to the population distribution. These computational and theoretical limitations hinder the broad applicability of model-free subsampling methods in practice. We propose a novel model-free subsampling method based on optimal transport techniques, and we develop an efficient subsampling algorithm that is adaptive to the unknown probability density function. Theoretically, we show that the selected subsample can be used for efficient density estimation by deriving the convergence rate of the proposed subsample kernel density estimator. We also provide the optimal bandwidth for the proposed estimator. Numerical studies on synthetic and real-world datasets demonstrate the superior performance of the proposed method.
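
To make the idea of density estimation from a subsample concrete, here is a minimal one-dimensional sketch. It is not the algorithm proposed in the paper: the selection rule below simply picks observed points at evenly spaced ranks, a quantile-based stand-in that is only meaningful in one dimension, and it is compared against a uniform random subsample. The function names `kde_gaussian` and `quantile_subsample`, the sample and subsample sizes, the standard normal test data, and the normal-reference bandwidth are all assumptions made for illustration; in particular, the bandwidth is not the optimal bandwidth derived in the paper.

```python
import numpy as np


def kde_gaussian(points, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate built on `points` at `grid`."""
    z = (grid[:, None] - points[None, :]) / bandwidth
    norm = len(points) * bandwidth * np.sqrt(2.0 * np.pi)
    return np.exp(-0.5 * z**2).sum(axis=1) / norm


def quantile_subsample(sample, m):
    """Pick m observed points at evenly spaced ranks.

    Hypothetical 1-D stand-in for a representative subsample,
    not the optimal transport method described in the abstract.
    """
    ordered = np.sort(sample)
    ranks = np.floor((np.arange(m) + 0.5) * len(sample) / m).astype(int)
    return ordered[ranks]


rng = np.random.default_rng(0)
sample = rng.normal(size=20_000)   # assumed "observed sample"
m = 200                            # assumed subsample size
grid = np.linspace(-4.0, 4.0, 401)
dx = grid[1] - grid[0]

# Normal-reference rule of thumb, used here only as a convenient default.
h = 1.06 * sample.std() * m ** (-1 / 5)

sub_q = quantile_subsample(sample, m)
sub_r = rng.choice(sample, size=m, replace=False)

kde_q = kde_gaussian(sub_q, grid, h)
kde_r = kde_gaussian(sub_r, grid, h)

# Integrated squared error against the known standard normal density.
true_pdf = np.exp(-0.5 * grid**2) / np.sqrt(2.0 * np.pi)
ise_q = ((kde_q - true_pdf) ** 2).sum() * dx
ise_r = ((kde_r - true_pdf) ** 2).sum() * dx

print(f"ISE, rank-based subsample: {ise_q:.5f}")
print(f"ISE, random subsample:     {ise_r:.5f}")
```

Both estimators cost O(m) per evaluation point instead of O(n), which is the efficiency gain that motivates subsample-based density estimation. How much accuracy is retained depends on how the subsample is chosen, which is the question the paper addresses.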

research
11/01/2017

Bandwidth selection for nonparametric modal regression

In the context of estimating local modes of a conditional density based ...
research
03/15/2022

TAKDE: Temporal Adaptive Kernel Density Estimator for Real-Time Dynamic Density Estimation

Real-time density estimation is ubiquitous in many applications, includi...
research
08/07/2022

Kernel Biclustering algorithm in Hilbert Spaces

Biclustering algorithms partition data and covariates simultaneously, pr...
research
06/28/2021

Adaptive greedy algorithm for moderately large dimensions in kernel conditional density estimation

This paper studies the estimation of the conditional density f(x, ·) of...
research
12/27/2019

Efficient Data Analytics on Augmented Similarity Triplets

Many machine learning methods (classification, clustering, etc.) start w...
research
03/21/2021

A deep learning approach to data-driven model-free pricing and to martingale optimal transport

We introduce a novel and highly tractable supervised learning approach b...
research
01/28/2023

Decentralized Entropic Optimal Transport for Privacy-preserving Distributed Distribution Comparison

Privacy-preserving distributed distribution comparison measures the dist...
