GENIUS: Sketch-based Language Model Pre-training via Extreme and Selective Masking for Text Generation and Augmentation

by Biyang Guo, et al.

We introduce GENIUS: a conditional text generation model that takes sketches as input and fills in the missing context for a given sketch (key information consisting of textual spans, phrases, or words, concatenated by mask tokens). GENIUS is pre-trained on a large-scale textual corpus with a novel sketch-based reconstruction objective using an extreme and selective masking strategy, enabling it to generate diverse and high-quality texts from sketches. Comparison with other competitive conditional language models (CLMs) reveals the superiority of GENIUS's text generation quality. We further show that GENIUS can serve as a strong, ready-to-use data augmentation tool for various natural language processing (NLP) tasks. Most existing textual data augmentation methods are either too conservative, making only small changes to the original text, or too aggressive, creating entirely new samples. With GENIUS, we propose GeniusAug, which first extracts target-aware sketches from the original training set and then generates new samples based on those sketches. Empirical experiments on 6 text classification datasets show that GeniusAug significantly improves the models' performance in both in-distribution (ID) and out-of-distribution (OOD) settings. We also demonstrate the effectiveness of GeniusAug on named entity recognition (NER) and machine reading comprehension (MRC) tasks. (Code and models are publicly available.)
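To make the sketch format concrete, here is a minimal, illustrative Python sketch of how a text might be reduced to key spans joined by mask tokens. This is a simplified stand-in for the paper's target-aware sketch extraction (the `make_sketch` function, the keyword set, and the `<mask>` token choice are all assumptions for illustration, not the authors' actual implementation):

```python
def make_sketch(text, keywords, mask_token="<mask>"):
    """Keep words found in `keywords` and collapse every run of
    other words into a single mask token, producing a sketch like
    '<mask> movie <mask> wonderful <mask> moving'."""
    sketch = []
    prev_masked = False
    for word in text.split():
        if word.lower().strip(".,!?") in keywords:
            sketch.append(word)
            prev_masked = False
        elif not prev_masked:
            # start of a masked span: emit one mask token for the whole run
            sketch.append(mask_token)
            prev_masked = True
    return " ".join(sketch)


# Hypothetical usage: keywords here would come from a target-aware
# extractor in the real pipeline, not a hand-written set.
sketch = make_sketch(
    "The movie was absolutely wonderful and moving",
    {"movie", "wonderful", "moving"},
)
print(sketch)  # → <mask> movie <mask> wonderful <mask> moving
```

A sketch in this shape would then be passed to the pre-trained GENIUS model, which generates full texts consistent with the preserved spans.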
