Density-optimized Intersection-free Mapping and Matrix Multiplication for Join-Project Operations (extended version)

by Zichun Huang, et al.

A Join-Project operation is a join followed by a duplicate-eliminating projection. It is used in a wide variety of applications, including entity matching, set analytics, and graph analytics. Previous work proposes a hybrid design that uses the classical solution (i.e., join and deduplication) to process the sparse portion of the input data and MM (matrix multiplication) to process the dense portion. However, we observe three problems in the state-of-the-art solution: 1) the outputs of the sparse and dense portions overlap, requiring an extra deduplication step; 2) its table-to-matrix transformation makes an oversimplified assumption about the attribute values; and 3) there is a mismatch between the MM routines in BLAS packages and the characteristics of the Join-Project operation. In this paper, we propose DIM3, an optimized algorithm for the Join-Project operation. To address 1), we propose an intersection-free partition method that completely removes the final deduplication step. For 2), we develop an optimized design for mapping attribute values to natural numbers. For 3), we propose the DenseEC and SparseBMM algorithms, which exploit the structure of Join-Project for better efficiency. Moreover, we extend DIM3 to consider partial result caching and to support Join-op queries, including Join-Aggregate and MJP (Multi-way Joins with Projection). Experimental results on both real-world and synthetic data sets show that DIM3 outperforms previous Join-Project solutions by a factor of 2.3x-18x. Compared to RDBMSs, DIM3 achieves orders-of-magnitude speedups.
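To make the two strategies named in the abstract concrete, the following minimal Python sketch computes a Join-Project both ways: the classical join followed by deduplication, and a Boolean matrix multiplication over bit-packed rows. It is an illustration only and is not DIM3, DenseEC, or SparseBMM. The schema `R(x, y)` and `S(z, y)` joined on `y`, the function names, and the use of Python integers as bitsets are all assumptions made for this example.

```python
from collections import defaultdict


def join_project_classic(R, S):
    """Classical approach: hash join on y, then deduplicate (x, z) pairs.

    R is an iterable of (x, y) tuples and S an iterable of (z, y) tuples
    (hypothetical schema chosen for this sketch).
    """
    z_by_y = defaultdict(set)
    for z, y in S:
        z_by_y[y].add(z)
    out = set()  # the set performs the duplicate elimination
    for x, y in R:
        for z in z_by_y.get(y, ()):
            out.add((x, z))
    return out


def join_project_bmm(R, S):
    """Boolean matrix multiplication approach using integers as bit rows.

    Attribute values are first mapped to natural numbers (dense indices),
    then each y gets a bitmask of the z values it co-occurs with in S.
    OR-ing those masks over the y values of an x yields row x of the
    Boolean product, so no separate deduplication pass is needed.
    """
    R, S = list(R), list(S)
    # Only join-key values present in both tables can produce output.
    shared_y = {y for _, y in R} & {y for _, y in S}
    y_id = {y: i for i, y in enumerate(shared_y)}
    zs = list({z for z, y in S if y in y_id})
    z_id = {z: i for i, z in enumerate(zs)}

    # One bitmask per y: bit j is set when (zs[j], y) is in S.
    s_rows = [0] * len(y_id)
    for z, y in S:
        if y in y_id:
            s_rows[y_id[y]] |= 1 << z_id[z]

    # Row x of the product is the bitwise OR of the rows of its y values.
    x_rows = defaultdict(int)
    for x, y in R:
        if y in y_id:
            x_rows[x] |= s_rows[y_id[y]]

    # Decode set bits back to the original z values.
    out = set()
    for x, mask in x_rows.items():
        for j, z in enumerate(zs):
            if (mask >> j) & 1:
                out.add((x, z))
    return out


R_example = [(1, "a"), (1, "b"), (2, "b"), (3, "c")]
S_example = [(10, "a"), (11, "b"), (10, "b"), (12, "d")]
classic_result = join_project_classic(R_example, S_example)
bmm_result = join_project_bmm(R_example, S_example)
```

In this example the pair `(1, 10)` is produced twice by the join (through `"a"` and through `"b"`), which is the duplication that the projection step has to eliminate. The bitmask variant absorbs such duplicates in the OR operation, which hints at why a matrix formulation pays off on dense data.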


Fast Join Project Query Evaluation using Matrix Multiplication

In the last few years, much effort has been devoted to developing join a...

Enumeration Algorithms for Conjunctive Queries with Projection

We investigate the enumeration of query results for an important subset ...

APRIL: Approximating Polygons as Raster Interval Lists

The spatial intersection join is an important spatial query operation, due ...

Discovering Multi-Table Functional Dependencies Without Full Join Computation

In this paper, we study the problem of discovering join FDs, i.e., funct...

Scaling and Load-Balancing Equi-Joins

The task of joining two tables is fundamental for querying databases. In...

Distributed Subtrajectory Join on Massive Datasets

Joining trajectory datasets is a significant operation in mobility data ...

Efficiently Transforming Tables for Joinability

Data from different sources rarely conform to a single formatting even i...
