Adaptive MapReduce Similarity Joins

04/16/2018
by   Samuel McCauley, et al.

Similarity joins are a fundamental database operation. Given data sets S and R, the goal of a similarity join is to find all pairs of points x in S and y in R at distance at most r. Recent research has investigated how locality-sensitive hashing (LSH) can be used for similarity joins, and two recent lines of work in particular have made substantial progress on LSH-based join performance. Hu, Tao, and Yi (PODS '17) investigated joins in a massively parallel setting, showing strong results that adapt to the size of the output. Meanwhile, Ahle, Aumüller, and Pagh (SODA '17) gave a sequential algorithm that adapts to the structure of the data, matching classic bounds in the worst case but improving on them significantly for more structured data. We show that this adaptive strategy carries over to the parallel setting, combining the advantages of both approaches. In particular, we show that a simple modification to Hu et al.'s algorithm achieves bounds that depend on the density of points in the dataset as well as the total size of the output. Our algorithm uses no extra parameters over other LSH approaches (in particular, its execution does not depend on the structure of the dataset), and is likely to be efficient in practice.
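To make the problem statement concrete, the following is a minimal sequential sketch of a generic LSH-based similarity join. It is not the algorithm of this paper, nor that of Hu et al. or Ahle et al.; it only illustrates the basic idea of hashing points into buckets and verifying candidate pairs that collide. It assumes points are equal-length bit tuples compared under Hamming distance and uses bit sampling as the hash family. The function names and the parameters `k` (sampled bits per hash) and `repetitions` (number of hash tables) are hypothetical choices made for illustration.

```python
import random
from collections import defaultdict
from itertools import product


def hamming(x, y):
    """Number of positions where two equal-length bit tuples differ."""
    return sum(a != b for a, b in zip(x, y))


def brute_force_join(S, R, r):
    """Exact join: all index pairs (i, j) with hamming(S[i], R[j]) <= r."""
    return {
        (i, j)
        for (i, x), (j, y) in product(enumerate(S), enumerate(R))
        if hamming(x, y) <= r
    }


def lsh_join(S, R, r, k=4, repetitions=20, seed=0):
    """Approximate join via bit-sampling LSH (illustrative sketch only).

    Each repetition samples k coordinates, buckets S by those coordinates,
    probes the buckets with each point of R, and verifies every candidate.
    Reported pairs are always correct; some close pairs may be missed.
    Assumes S is non-empty and all points have the same length.
    """
    rng = random.Random(seed)
    dim = len(S[0])
    result = set()
    for _ in range(repetitions):
        coords = [rng.randrange(dim) for _ in range(k)]
        buckets = defaultdict(list)
        for i, x in enumerate(S):
            buckets[tuple(x[c] for c in coords)].append(i)
        for j, y in enumerate(R):
            key = tuple(y[c] for c in coords)
            for i in buckets.get(key, ()):
                # Verify the candidate so no false positives are reported.
                if hamming(S[i], y) <= r:
                    result.add((i, j))
    return result


S = [(0, 0, 0, 0, 0, 0, 0, 0), (1, 1, 1, 1, 0, 0, 0, 0)]
R = [(0, 0, 0, 0, 0, 0, 0, 0), (1, 1, 1, 1, 1, 1, 1, 1)]
exact = brute_force_join(S, R, r=1)
approx = lsh_join(S, R, r=1)
```

Because every candidate is verified, the LSH output is always a subset of the exact join; how many close pairs it recovers depends on `k` and `repetitions`, which this sketch leaves as hand-picked constants.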


