AutoBlock: A Hands-off Blocking Framework for Entity Matching

12/07/2019
by Wei Zhang, et al.

Entity matching seeks to identify data records across one or more data sources that refer to the same real-world entity. Virtually every entity matching task on large datasets requires blocking, a step that reduces the number of record pairs to be matched. However, most traditional blocking methods are learning-free and key-based, and their success depends largely on laborious human effort in cleaning data and designing blocking keys. In this paper, we propose AutoBlock, a novel hands-off blocking framework for entity matching, based on similarity-preserving representation learning and nearest neighbor search. Our contributions include: (a) Automation: AutoBlock frees users from laborious data cleaning and blocking key tuning. (b) Scalability: AutoBlock has sub-quadratic total time complexity and can be easily deployed for millions of records. (c) Effectiveness: AutoBlock outperforms a wide range of competitive baselines on multiple large-scale, real-world datasets, especially when datasets are dirty and/or unstructured.
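
The general idea named in the abstract (map each record to a vector, then treat records that land near each other as candidate pairs) can be illustrated with a minimal Python sketch. This is not AutoBlock's implementation, and every name and constant in it is an assumption made for illustration. Hashed character trigrams stand in for the learned similarity-preserving representation, and random-hyperplane locality-sensitive hashing stands in for the nearest neighbor search. The names `embed`, `signature`, `candidate_pairs`, `DIM`, `BITS`, and `TABLES` are hypothetical.

```python
import random
import zlib
from collections import defaultdict
from itertools import combinations

DIM = 64     # embedding dimension (assumed value)
BITS = 8     # random hyperplanes per hash table (assumed value)
TABLES = 4   # number of independent hash tables (assumed value)


def embed(text, dim=DIM):
    """Stand-in for a learned encoder: hashed character-trigram counts."""
    vec = [0.0] * dim
    padded = f"  {text.lower()} "
    for i in range(len(padded) - 2):
        gram = padded[i:i + 3]
        # crc32 is deterministic across runs, unlike the built-in hash().
        vec[zlib.crc32(gram.encode("utf-8")) % dim] += 1.0
    return vec


def make_tables(seed=0, dim=DIM, bits=BITS, tables=TABLES):
    """Draw Gaussian random hyperplanes for each hash table."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]
        for _ in range(tables)
    ]


def signature(vec, planes):
    """One bit per hyperplane: which side of the plane the vector lies on."""
    return tuple(
        sum(p * v for p, v in zip(plane, vec)) >= 0.0 for plane in planes
    )


def candidate_pairs(records, seed=0):
    """Return index pairs (i, j), i < j, that share a bucket in any table."""
    vectors = [embed(r) for r in records]
    pairs = set()
    for planes in make_tables(seed):
        buckets = defaultdict(list)
        for idx, vec in enumerate(vectors):
            buckets[signature(vec, planes)].append(idx)
        for ids in buckets.values():
            pairs.update(combinations(ids, 2))
    return pairs


records = ["Apple iPhone 12", "apple iphone 12", "Samsung Galaxy S21"]
pairs = candidate_pairs(records)
```

Each table costs one pass over the records, so the quadratic all-pairs comparison is avoided as long as buckets stay small. Records with a small angle between their vectors collide in a table with higher probability, and using several tables raises the chance that a true match shares at least one bucket. Only the candidate pairs would then be passed to a downstream matcher.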

research
09/20/2016

An Ensemble Blocking Scheme for Entity Resolution of Large and Sparse Datasets

Entity Resolution, also called record linkage or deduplication, refers t...
research
05/28/2020

Efficient and Effective ER with Progressive Blocking

Blocking is a mechanism to improve the efficiency of Entity Resolution (...
research
09/27/2017

Scaling Author Name Disambiguation with CNF Blocking

An author name disambiguation (AND) algorithm identifies a unique author...
research
02/25/2022

How to reduce the search space of Entity Resolution: with Blocking or Nearest Neighbor search?

Entity Resolution suffers from quadratic time complexity. To increase it...
research
08/19/2020

Scalable Blocking for Very Large Databases

In the field of database deduplication, the goal is to find approximatel...
research
03/06/2023

SC-Block: Supervised Contrastive Blocking within Entity Resolution Pipelines

The goal of entity resolution is to identify records in multiple dataset...
research
10/02/2017

DeepER -- Deep Entity Resolution

Entity Resolution (ER) is a fundamental problem with many applications. ...
