Deep Entity Matching with Pre-Trained Language Models

04/01/2020
by Yuliang Li, et al.

We present Ditto, a novel entity matching (EM) system based on pre-trained Transformer-based language models. We cast EM as a sequence-pair classification problem and fine-tune such models for it with a simple architecture. Our experiments show that a straightforward application of language models such as BERT, DistilBERT, or RoBERTa pre-trained on large text corpora already significantly improves matching quality and outperforms the previous state of the art (SOTA) by up to 29% of F1 score on benchmark datasets. We also developed three optimization techniques to further improve Ditto's matching capability. Ditto allows domain knowledge to be injected by highlighting important pieces of input information that may be of interest when making matching decisions. Ditto also summarizes strings that are too long so that only the essential information is retained and used for EM. Finally, Ditto adapts a SOTA data augmentation technique for text to EM, augmenting the training data with (difficult) examples. This way, Ditto is forced to learn "harder" to improve the model's matching capability. The optimizations we developed further boost the performance of Ditto by up to 9.8%. Perhaps more surprisingly, we establish that Ditto can achieve the previous SOTA results with at most half the amount of labeled data. Finally, we demonstrate Ditto's effectiveness on a real-world large-scale EM task. On matching two company datasets consisting of 789K and 412K records, Ditto achieves a high F1 score of 96.5%.
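To make the sequence-pair framing concrete, the following is a minimal sketch of how two entity records might be flattened into a pair of text segments for a pair classifier. The abstract does not specify a serialization format, so the `[COL]`/`[VAL]` markers, the function names `serialize` and `to_sequence_pair`, the `max_tokens` parameter, and the sample records are all illustrative assumptions. The whitespace-based clipping is only a crude stand-in for the summarization step the abstract mentions, not that technique itself.

```python
from typing import Dict, Tuple


def serialize(record: Dict[str, str]) -> str:
    """Flatten a record into one string of attribute/value pairs.

    The [COL]/[VAL] markers are an assumption made for illustration.
    """
    return " ".join(f"[COL] {key} [VAL] {value}" for key, value in record.items())


def to_sequence_pair(
    left: Dict[str, str], right: Dict[str, str], max_tokens: int = 64
) -> Tuple[str, str]:
    """Turn two records into the (segment A, segment B) input of a pair classifier.

    Each side is clipped to max_tokens whitespace-separated tokens. This naive
    truncation is a placeholder for a real summarization step.
    """

    def clip(text: str) -> str:
        return " ".join(text.split()[:max_tokens])

    return clip(serialize(left)), clip(serialize(right))


# Hypothetical records that may or may not refer to the same product.
left = {"title": "instant immersion spanish deluxe 2.0", "price": "49.99"}
right = {"title": "instant immers spanish dlux 2", "price": "36.11"}

pair = to_sequence_pair(left, right)
print(pair[0])
print(pair[1])
```

In a full pipeline, the two segments would be passed together to the tokenizer of a pre-trained model, and a binary classification head fine-tuned on labeled pairs would predict match or no match.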


