Self-Evolution Learning for Discriminative Language Model Pretraining

05/24/2023
by Qihuang Zhong, et al.

Masked language modeling, widely used in discriminative language model (e.g., BERT) pretraining, commonly adopts a random masking strategy. However, random masking does not consider the importance of different words to the sentence meaning, even though some words are more worth predicting than others. Various masking strategies (e.g., entity-level masking) have therefore been proposed, but most of them require expensive prior knowledge and generally train models from scratch without reusing existing model weights. In this paper, we present Self-Evolution learning (SE), a simple and effective token masking and learning method that fully and wisely exploits the knowledge in the data. SE focuses on learning the informative yet under-explored tokens and adaptively regularizes training by introducing a novel Token-specific Label Smoothing approach. Experiments on 10 tasks show that SE brings consistent and significant improvements (+1.43 to +2.12 average scores) over different pretrained language models (PLMs). In-depth analyses demonstrate that SE improves linguistic knowledge learning and generalization.
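The abstract does not spell out the formulation of Token-specific Label Smoothing, but the core idea of giving each masked token its own regularization strength can be sketched as follows. This is only an illustrative PyTorch sketch, assuming the per-token smoothing weight is derived from the model's normalized prediction entropy; the function name, the eps_max parameter, and the entropy heuristic are assumptions for illustration, not details taken from the paper.

```python
import math

import torch
import torch.nn.functional as F


def token_specific_label_smoothing_loss(logits, targets, eps_max=0.2):
    """Illustrative per-token label smoothing: harder (higher-entropy) predictions
    receive more smoothing. Shapes: logits (batch, seq, vocab), targets (batch, seq)."""
    vocab_size = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()

    # Normalized prediction entropy in [0, 1], used here as a stand-in for how
    # "under-explored" a token is; detached so it acts as a fixed per-token weight.
    entropy = -(probs * log_probs).sum(dim=-1).detach()
    eps = eps_max * entropy / math.log(vocab_size)

    # Blend the one-hot target with a uniform distribution, token by token.
    nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    uniform = -log_probs.mean(dim=-1)
    return ((1.0 - eps) * nll + eps * uniform).mean()
```

In a masked language modeling setup, such a loss would typically be applied only at the masked positions, e.g., by selecting the corresponding rows of logits and targets with the mask before calling the function.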


Related research

09/04/2021 · Frustratingly Simple Pretraining Alternatives to Masked Language Modeling
Masked language modeling (MLM), a self-supervised pretraining objective,...

05/24/2023 · Revisiting Token Dropping Strategy in Efficient BERT Pretraining
Token dropping is a recently-proposed strategy to speed up the pretraini...

10/21/2022 · InforMask: Unsupervised Informative Masking for Language Model Pretraining
Masked language modeling is widely used for pretraining large language m...

12/04/2022 · Toward Efficient Language Model Pretraining and Downstream Adaptation via Self-Evolution: A Case Study on SuperGLUE
This technical report briefly describes our JDExplore d-team's Vega v2 s...

02/16/2021 · COCO-LM: Correcting and Contrasting Text Sequences for Language Model Pretraining
We present COCO-LM, a new self-supervised learning framework that pretra...

11/17/2022 · Random-LTD: Random and Layerwise Token Dropping Brings Efficient Training for Large-scale Transformers
Large-scale transformer models have become the de-facto architectures fo...

10/05/2020 · PMI-Masking: Principled masking of correlated spans
Masking tokens uniformly at random constitutes a common flaw in the pret...
