InforMask: Unsupervised Informative Masking for Language Model Pretraining

10/21/2022
by Nafis Sadeq et al.

Masked language modeling is widely used for pretraining large language models for natural language understanding (NLU). However, random masking is suboptimal, as it allocates an equal masking rate to all tokens. In this paper, we propose InforMask, a new unsupervised masking strategy for training masked language models. InforMask exploits Pointwise Mutual Information (PMI) to select the most informative tokens to mask. We further propose two optimizations for InforMask to improve its efficiency. With a one-off preprocessing step, InforMask outperforms random masking and previously proposed masking strategies on the factual recall benchmark LAMA and the question answering benchmarks SQuAD v1 and v2.
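The abstract only outlines the idea of PMI-based informative masking. The sketch below illustrates one plausible reading of it: score each token in a sentence by the sum of its PMI with the other tokens, then mask the highest-scoring ones. This is a minimal illustration, not the paper's actual algorithm or its two efficiency optimizations; the window size, scoring rule, masking rate, and names such as `informative_mask` are assumptions made for this example.

```python
import math
from collections import Counter

def build_counts(corpus, window=5):
    """Count unigrams and within-window co-occurrences over tokenized sentences."""
    unigrams, pairs, total = Counter(), Counter(), 0
    for sent in corpus:
        total += len(sent)
        unigrams.update(sent)
        for i, w in enumerate(sent):
            for v in sent[i + 1:i + 1 + window]:
                pairs[(w, v)] += 1
                pairs[(v, w)] += 1
    return unigrams, pairs, total

def pmi(w, v, unigrams, pairs, total):
    """Pointwise mutual information of a token pair under simple corpus frequencies."""
    joint = pairs.get((w, v), 0)
    if joint == 0:
        return 0.0
    return math.log((joint / total) / ((unigrams[w] / total) * (unigrams[v] / total)))

def informative_mask(sentence, unigrams, pairs, total, mask_rate=0.15, mask_token="[MASK]"):
    """Score each token by the sum of its PMI with the other tokens in the sentence,
    then mask the highest-scoring tokens up to the target mask rate (an assumed heuristic)."""
    scores = []
    for i, w in enumerate(sentence):
        score = sum(pmi(w, v, unigrams, pairs, total) for j, v in enumerate(sentence) if j != i)
        scores.append((score, i))
    n_mask = max(1, int(round(mask_rate * len(sentence))))
    to_mask = {i for _, i in sorted(scores, reverse=True)[:n_mask]}
    return [mask_token if i in to_mask else w for i, w in enumerate(sentence)]

# Toy usage: counts would normally come from the full pretraining corpus.
corpus = [["the", "eiffel", "tower", "is", "in", "paris"],
          ["the", "cat", "sat", "on", "the", "mat"]]
uni, pairs, total = build_counts(corpus)
print(informative_mask(["the", "eiffel", "tower", "is", "in", "paris"], uni, pairs, total, mask_rate=0.3))
```

Under this heuristic, content-bearing tokens that co-occur strongly with their context (e.g. "eiffel", "paris") receive higher scores than frequent function words, so the masked positions carry more factual signal than uniformly random ones.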

Related research

04/04/2023 · Unsupervised Improvement of Factual Knowledge in Language Models
07/29/2023 · GeneMask: Fast Pretraining of Gene Sequences to Enable Few-Shot Learning
12/10/2022 · Uniform Masking Prevails in Vision-Language Pretraining
03/24/2023 · Accelerating Vision-Language Pretraining with Free Language Modeling
10/05/2020 · PMI-Masking: Principled masking of correlated spans
02/16/2021 · COCO-LM: Correcting and Contrasting Text Sequences for Language Model Pretraining
05/24/2023 · Self-Evolution Learning for Discriminative Language Model Pretraining