Augmentor or Filter? Reconsider the Role of Pre-trained Language Model in Text Classification Augmentation

10/06/2022
by   Heng Yang, et al.

Text augmentation is one of the most effective techniques for addressing the critical problem of insufficient training data in text classification. Existing text augmentation methods achieve promising performance in few-shot settings, but they often degrade performance on public datasets because they produce low-quality augmentation instances. Our study shows that, even when pre-trained language models are employed, existing text augmentation methods generate numerous low-quality instances and introduce a feature space shift in the augmented data. We observe, however, that a pre-trained language model is good at identifying low-quality instances once it has been fine-tuned on the target dataset. To alleviate the feature space shift and performance degradation of existing text augmentation methods, we propose BOOSTAUG, which reconsiders the role of the language model in text augmentation and emphasizes filtering augmentation instances rather than generating them. We evaluate BOOSTAUG on both sentence-level text classification and aspect-based sentiment classification. Experimental results on seven commonly used text classification datasets show that our augmentation method achieves state-of-the-art performance. Moreover, BOOSTAUG is a flexible framework; we release the code, which can help improve existing augmentation methods.
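To illustrate the filter-over-generate idea described above, here is a minimal sketch of how a classifier fine-tuned on the target dataset can score candidate augmentations and discard low-quality ones. The model path, the `filter_augmentations` helper, and the confidence threshold are illustrative assumptions, not the authors' released BOOSTAUG API.

```python
# Sketch: use a fine-tuned text classifier as a filter over augmentation candidates.
# Candidates can come from any existing augmenter (EDA, back-translation, etc.).
from transformers import pipeline

# Any text-classification checkpoint fine-tuned on the target dataset (placeholder path).
clf = pipeline("text-classification", model="path/to/finetuned-checkpoint")

def filter_augmentations(original_label, candidates, min_conf=0.9):
    """Keep only candidates that the fine-tuned model still assigns the original
    label with high confidence; drop likely label-flipping or off-distribution ones."""
    kept = []
    for cand in candidates:
        pred = clf(cand)[0]  # {"label": ..., "score": ...}
        if pred["label"] == original_label and pred["score"] >= min_conf:
            kept.append(cand)
    return kept

# Usage: filter noisy paraphrases of a positive movie review.
candidates = ["The movie was absolutely great!", "Movie was great absolutely the!"]
augmented = filter_augmentations("POSITIVE", candidates)
```

The filter leaves the upstream augmenter untouched, which is why a scheme like this can be wrapped around existing augmentation methods.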

Related research

MiLMo: Minority Multilingual Pre-trained Language Model (12/04/2022)
Pre-trained language models are trained on large-scale unsupervised data...

Towards Agile Text Classifiers for Everyone (02/13/2023)
Text-based safety classifiers are widely used for content moderation and...

Text Smoothing: Enhance Various Data Augmentation Methods on Text Classification Tasks (02/28/2022)
Before entering the neural network, a token is generally converted to th...

PnPOOD: Out-Of-Distribution Detection for Text Classification via Plug and Play Data Augmentation (10/31/2021)
While Out-of-distribution (OOD) detection has been well explored in comp...

A Data Fusion Framework for Multi-Domain Morality Learning (04/04/2023)
Language models can be trained to recognize the moral sentiment of text,...

Token Classification for Disambiguating Medical Abbreviations (10/05/2022)
Abbreviations are unavoidable yet critical parts of the medical text. Us...

AuGPT: Dialogue with Pre-trained Language Models and Data Augmentation (02/09/2021)
Attention-based pre-trained language models such as GPT-2 brought consid...
