Fixed and adaptive landmark sets for finite pseudometric spaces

by Jason Cory Brunson, et al.

Topological data analysis (TDA) is an expanding field that leverages principles and tools from algebraic topology to quantify structural features of data sets or transform them into more manageable forms. As its theoretical foundations have been developed, TDA has shown promise in extracting useful information from high-dimensional, noisy, and complex data such as those used in biomedicine. To improve efficiency, these techniques may employ landmark samplers. The heuristic maxmin procedure obtains a roughly even distribution of sample points by implicitly constructing a cover comprising sets of uniform radius. However, issues arise with data that vary in density or include points with multiplicities, as are common in biomedicine. We propose an analogous procedure, "lastfirst," based on ranked distances, which implies a cover comprising sets of uniform cardinality. We first rigorously define the procedure and prove that it obtains landmarks with desired properties. We then perform benchmark tests and compare its performance to that of maxmin on feature detection and class prediction tasks involving simulated and real-world biomedical data. Lastfirst is more general than maxmin in that it can be applied to any data on which arbitrary (and not necessarily symmetric) pairwise distances can be computed. Lastfirst is more computationally costly, but our implementation scales at the same rate as maxmin. We find that lastfirst achieves comparable performance on prediction tasks and outperforms maxmin on homology detection tasks. Where the numerical values of similarity measures are not meaningful, as in many biomedical contexts, lastfirst sampling may also improve interpretability.
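
For readers unfamiliar with maxmin, the following is a minimal sketch of the standard greedy farthest-point form of the procedure, not the authors' implementation. It assumes Euclidean distance on coordinate tuples; the function name `maxmin_landmarks` and its parameters are hypothetical, chosen for illustration only. Each step adds the point farthest from the landmarks chosen so far, and the largest remaining point-to-landmark distance is the uniform radius of the implied cover.

```python
import math


def maxmin_landmarks(points, k, seed_index=0):
    """Greedy maxmin (farthest-point) landmark selection.

    Hypothetical helper for illustration; assumes Euclidean distance.
    Returns (landmark indices, cover radius), where the cover radius is
    the largest distance from any point to its nearest landmark.
    """
    if not points or not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")

    landmarks = [seed_index]
    # min_dist[i]: distance from point i to its nearest landmark so far.
    min_dist = [math.dist(p, points[seed_index]) for p in points]

    while len(landmarks) < k:
        # Pick the point that is farthest from the current landmark set.
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        if min_dist[nxt] == 0:
            # Every remaining point coincides with a landmark
            # (points with multiplicities), so stop early.
            break
        landmarks.append(nxt)
        # Update nearest-landmark distances with the new landmark.
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d

    return landmarks, max(min_dist)


points = [(0, 0), (1, 0), (10, 0), (10, 1), (5, 0)]
landmarks, radius = maxmin_landmarks(points, k=3)
print(landmarks, radius)
```

Note the early exit when all remaining points duplicate existing landmarks: with repeated points, fewer than `k` distinct landmarks may be available, which is one way the multiplicity issue mentioned in the abstract shows up in practice.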
