Robust Learning for Text Classification with Multi-source Noise Simulation and Hard Example Mining

07/15/2021
by   Guowei Xu, et al.

Many real-world applications use Optical Character Recognition (OCR) engines to transform handwritten images into transcripts, on which downstream Natural Language Processing (NLP) models are then applied. In this process, OCR engines may introduce errors, so the inputs to downstream NLP models become noisy. Although pre-trained models achieve state-of-the-art performance on many NLP benchmarks, we show that they are not robust to noisy texts generated by real OCR engines. This greatly limits the application of NLP models in real-world scenarios. To improve model performance on noisy OCR transcripts, it is natural to train the NLP model on labelled noisy texts. In most cases, however, only labelled clean texts are available. Since there are no handwritten images corresponding to these texts, a recognition model cannot be used directly to obtain noisy labelled data. Human annotators could transcribe the texts by hand and photograph them, but this is prohibitively expensive given the amount of data required for model training. Consequently, we are interested in making NLP models intrinsically robust to OCR errors in a low-resource manner. We propose a novel robust training framework that 1) employs simple but effective methods to directly simulate natural OCR noise from clean texts, 2) iteratively mines hard examples from a large number of simulated samples for optimal performance, and 3) employs a stability loss so that the model learns noise-invariant representations. Experiments on three real-world datasets show that the proposed framework boosts the robustness of pre-trained models by a large margin. Although the algorithm is simple and straightforward, we believe this work can greatly promote the application of NLP models in practical scenarios. We make our code and three datasets publicly available at https://github.com/tal-ai/Robust-learning-MSSHEM.
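To make the first component concrete, here is a minimal sketch of what character-level OCR noise simulation from clean text can look like. This is an illustrative assumption, not the paper's actual method: the confusion table, function name `simulate_ocr_noise`, and probability parameters are all hypothetical, chosen only to show substitution, deletion, and insertion noise of the kind OCR engines typically produce.

```python
import random

# Hypothetical confusion table: characters an OCR engine commonly
# misreads for one another (illustrative only, not from the paper).
CONFUSIONS = {
    "o": ["0", "a"],
    "l": ["1", "i"],
    "e": ["c"],
    "0": ["o"],
    "1": ["l"],
}

def simulate_ocr_noise(text, sub_p=0.05, del_p=0.02, ins_p=0.02, seed=None):
    """Inject OCR-like noise into clean text via character-level
    deletion, substitution with visually similar characters, and
    spurious insertion."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        r = rng.random()
        if r < del_p:
            continue  # simulate a dropped character
        if r < del_p + sub_p:
            # substitute with a visually similar character if one is known
            out.append(rng.choice(CONFUSIONS.get(ch, [ch])))
        else:
            out.append(ch)
        if rng.random() < ins_p:
            # simulate a spuriously inserted character
            out.append(rng.choice("abcdefghijklmnopqrstuvwxyz"))
    return "".join(out)

# A corpus of clean labelled texts can be expanded into many noisy
# variants, from which hard examples would then be mined.
noisy = simulate_ocr_noise("hello world example", seed=42)
```

Fixing the seed makes the augmentation reproducible, which matters when the same simulated pool is re-scored across iterations of hard example mining.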
