Self-training Improves Pre-training for Natural Language Understanding

10/05/2020
by   Jingfei Du, et al.

Unsupervised pre-training has led to much recent progress in natural language understanding. In this paper, we study self-training as another way to leverage unlabeled data through semi-supervised learning. To obtain additional data for a specific task, we introduce SentAugment, a data augmentation method which computes task-specific query embeddings from labeled data to retrieve sentences from a bank of billions of unlabeled sentences crawled from the web. Unlike previous semi-supervised methods, our approach does not require in-domain unlabeled data and is therefore more generally applicable. Experiments show that self-training is complementary to strong RoBERTa baselines on a variety of tasks. Our augmentation approach leads to scalable and effective self-training with improvements of up to 2.6%. Finally, we also show strong gains on knowledge distillation and few-shot learning.
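The retrieval step described above can be illustrated with a minimal sketch: build a task-specific query vector from labeled examples (here, by mean-pooling their embeddings, one possible reading of the abstract), then pull the nearest sentences from a large unlabeled bank by cosine similarity. The toy random vectors, function names, and pooling choice below are assumptions for illustration, not the paper's exact SentAugment implementation.

```python
import numpy as np

def cosine_retrieve(query, bank, k):
    """Return indices of the k bank rows most similar to the query vector."""
    q = query / np.linalg.norm(query)
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    sims = b @ q                      # cosine similarity against every bank row
    return np.argsort(-sims)[:k]      # top-k most similar, best first

# Toy example: 5 "unlabeled" sentence embeddings and a query built by
# mean-pooling 2 labeled-example embeddings (a hypothetical stand-in for
# the task-specific query embedding described in the abstract).
rng = np.random.default_rng(0)
bank = rng.normal(size=(5, 8))        # embeddings of the unlabeled sentence bank
labeled = rng.normal(size=(2, 8))     # embeddings of labeled task examples
query = labeled.mean(axis=0)

idx = cosine_retrieve(query, bank, k=3)
print(idx)  # indices of the 3 most task-similar bank sentences
```

Retrieved sentences would then be pseudo-labeled by a teacher model and used as extra training data for self-training; at web scale the brute-force similarity above would be replaced by an approximate nearest-neighbor index.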


Related research

09/14/2021  Task-adaptive Pre-training and Self-training are Complementary for Natural Language Understanding
            Task-adaptive pre-training (TAPT) and Self-training (ST) have emerged as...

06/11/2021  Generate, Annotate, and Learn: Generative Models Advance Self-Training and Knowledge Distillation
            Semi-Supervised Learning (SSL) has seen success in many application doma...

10/09/2019  Efficient Semi-Supervised Learning for Natural Language Understanding by Optimizing Diversity
            Expanding new functionalities efficiently is an ongoing challenge for si...

04/26/2020  Dual Learning for Semi-Supervised Natural Language Understanding
            Natural language understanding (NLU) converts sentences into structured ...

12/16/2016  Machine Reading with Background Knowledge
            Intelligent systems capable of automatically understanding natural langu...

03/29/2021  Industry Scale Semi-Supervised Learning for Natural Language Understanding
            This paper presents a production Semi-Supervised Learning (SSL) pipeline...

11/13/2018  Unsupervised Transfer Learning for Spoken Language Understanding in Intelligent Agents
            User interaction with voice-powered agents generates large amounts of un...
