Mahalanobis Distance Informed by Clustering

08/13/2017
by Almog Lahav, et al.

A fundamental question in data analysis, machine learning and signal processing is how to compare data points. The choice of the distance metric is especially challenging for high-dimensional data sets, where the problem of meaningfulness is more prominent (e.g. the Euclidean distance between images). In this paper, we propose to exploit a property of high-dimensional data that is usually ignored: the structure stemming from the relationships between the coordinates. Specifically, we show that organizing similar coordinates in clusters can be exploited for the construction of the Mahalanobis distance between samples. When the observable samples are generated by a nonlinear transformation of hidden variables, the Mahalanobis distance allows the recovery of the Euclidean distances in the hidden space. We illustrate the advantage of our approach on a synthetic example where the discovery of clusters of correlated coordinates improves the estimation of the principal directions of the samples. We also applied our method to real gene expression data for lung adenocarcinomas (lung cancer). Using the proposed metric, we found a partition of subjects into risk groups with good separation between their Kaplan-Meier survival plots.
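
As a point of reference for the abstract, the following is a minimal sketch of the standard Mahalanobis distance with a covariance estimated from the samples. It is not the authors' method: it omits the coordinate-clustering step entirely and covers only the simplified case where the observed samples are an invertible linear map of the hidden variables, not the nonlinear setting of the paper. The function name `mahalanobis_distance`, the matrix `mixing` and the toy data are all hypothetical choices made for illustration.

```python
import numpy as np


def mahalanobis_distance(x, y, cov):
    """Standard Mahalanobis distance between x and y for covariance `cov`."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    # The pseudo-inverse equals the ordinary inverse when cov is full rank.
    precision = np.linalg.pinv(cov)
    return float(np.sqrt(diff @ precision @ diff))


rng = np.random.default_rng(0)

# Hypothetical toy data: 500 hidden samples with 3 independent coordinates.
hidden = rng.normal(size=(500, 3))

# Assumption: the observation is an invertible LINEAR map of the hidden
# variables (a simplification of the nonlinear case in the abstract).
mixing = np.array([[2.0, 1.0, 0.0],
                   [0.0, 1.0, 1.0],
                   [1.0, 0.0, 3.0]])
observed = hidden @ mixing

# Sample covariances, with coordinates as columns.
cov_hidden = np.cov(hidden, rowvar=False)
cov_observed = np.cov(observed, rowvar=False)

# The distance computed in the observed space matches the one computed in
# the hidden space, because the linear map cancels against the covariance.
d_observed = mahalanobis_distance(observed[0], observed[1], cov_observed)
d_hidden = mahalanobis_distance(hidden[0], hidden[1], cov_hidden)
print(d_observed, d_hidden)
```

Because the hidden coordinates here are independent with unit variance, `d_hidden` is also close to the plain Euclidean distance between the two hidden samples, which illustrates in the linear case the recovery property the abstract describes.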


