Exact and Approximate Hierarchical Clustering Using A*

by Craig S. Greenberg, et al.

Hierarchical clustering is a critical task in numerous domains. Many approaches are based on heuristics, and the properties of the resulting clusterings are studied post hoc. However, in several applications, there is a natural cost function that can be used to characterize the quality of the clustering. In those cases, hierarchical clustering can be seen as a combinatorial optimization problem. To that end, we introduce a new approach based on A* search. We overcome the prohibitively large search space by combining A* with a novel trellis data structure. This combination results in an exact algorithm that scales beyond the previous state of the art, from a search space with 10^12 trees to 10^15 trees, and an approximate algorithm that improves over baselines, even in enormous search spaces that contain more than 10^1000 trees. We empirically demonstrate that our method achieves substantially higher quality results than baselines for a particle physics use case and other clustering benchmarks. We describe how our method provides significantly improved theoretical bounds on the time and space complexity of A* for clustering.
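
To give a rough sense of the search-space sizes quoted above, the sketch below counts candidate trees under one assumption that the abstract does not state: that each hierarchical clustering is a rooted binary tree over n labeled leaves, of which there are (2n - 3)!! in total. The function names `count_binary_trees` and `smallest_n_reaching` are illustrative only and are not taken from the paper.

```python
def count_binary_trees(n_leaves: int) -> int:
    """Count rooted binary trees over n labeled leaves: (2n - 3)!!.

    Assumption (not stated in the abstract): each hierarchical
    clustering is a rooted binary tree with labeled leaves.
    """
    if n_leaves < 1:
        raise ValueError("need at least one leaf")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 3 (empty product for n <= 2).
    for odd in range(3, 2 * n_leaves - 2, 2):
        count *= odd
    return count


def smallest_n_reaching(threshold: int) -> int:
    """Smallest number of leaves whose tree count is at least `threshold`."""
    n = 1
    while count_binary_trees(n) < threshold:
        n += 1
    return n


if __name__ == "__main__":
    for exponent in (12, 15):
        n = smallest_n_reaching(10 ** exponent)
        print(f"10^{exponent} trees is first reached at n = {n} leaves")
```

Under this assumption the count grows super-exponentially in n, which is why moving the exact regime from 10^12 to 10^15 trees corresponds to only a few additional leaves.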



Compact Representation of Uncertainty in Hierarchical Clustering

Hierarchical clustering is a fundamental task often used to discover mea...

From Trees to Continuous Embeddings and Back: Hyperbolic Hierarchical Clustering

Similarity-based Hierarchical Clustering (HC) is a classical unsupervise...

Nearly-Optimal Hierarchical Clustering for Well-Clustered Graphs

This paper presents two efficient hierarchical clustering (HC) algorithm...

TeraHAC: Hierarchical Agglomerative Clustering of Trillion-Edge Graphs

We introduce TeraHAC, a (1+ϵ)-approximate hierarchical agglomerative clu...

Hierarchical clustering in particle physics through reinforcement learning

Particle physics experiments often require the reconstruction of decay p...

Improving Quality of Hierarchical Clustering for Large Data Series

Brown clustering is a hard, hierarchical, bottom-up clustering of words ...

Clustering and Labelling Auction Fraud Data

Although shill bidding is a common auction fraud, it is however very tou...
