On the Problem of p_1^-1 in Locality-Sensitive Hashing
A Locality-Sensitive Hash (LSH) function is called (r,cr,p_1,p_2)-sensitive, if two data-points with a distance less than r collide with probability at least p_1 while data points with a distance greater than cr collide with probability at most p_2. These functions form the basis of the successful Indyk-Motwani algorithm (STOC 1998) for nearest neighbour problems. In particular one may build a c-approximate nearest neighbour data structure with query time Õ(n^ρ/p_1) where ρ=log1/p_1/log1/p_2∈(0,1). That is, sub-linear time, as long as p_1 is not too small. This is significant since most high dimensional nearest neighbour problems suffer from the curse of dimensionality, and can't be solved exact, faster than a brute force linear-time scan of the database. Unfortunately, the best LSH functions tend to have very low collision probabilities, p_1 and p_2. Including the best functions for Cosine and Jaccard Similarity. This means that the n^ρ/p_1 query time of LSH is often not sub-linear after all, even for approximate nearest neighbours! In this paper, we improve the general Indyk-Motwani algorithm to reduce the query time of LSH to Õ(n^ρ/p_1^1-ρ) (and the space usage correspondingly.) Since n^ρ p_1^ρ-1 < n ⇔ p_1 > n^-1, our algorithm always obtains sublinear query time, for any collision probabilities at least 1/n. For p_1 and p_2 small enough, our improvement over all previous methods can be up to a factor n in both query time and space. The improvement comes from a simple change to the Indyk-Motwani algorithm, which can easily be implemented in existing software packages.
READ FULL TEXT