Distilling Transformers into Simple Neural Networks with Unlabeled Transfer Data

10/04/2019
by Subhabrata Mukherjee, et al.

Recent advances in pre-training huge models on large amounts of text through self-supervision have yielded state-of-the-art results on various natural language processing tasks. However, these huge and expensive models are difficult to use in practice for downstream tasks. Some recent efforts use knowledge distillation to compress these models. However, a performance gap remains between the smaller student models and the large teacher. In this work, we leverage large amounts of in-domain unlabeled transfer data, in addition to a limited amount of labeled training instances, to bridge this gap. We show that simple RNN-based student models, even with hard distillation, can perform on par with the huge teachers given the transfer set. The student performance can be further improved with soft distillation and by leveraging the teacher's intermediate representations. We show that our student models can compress the huge teacher by up to 26x while still matching, or even marginally exceeding, the teacher performance in low-resource settings with a small amount of labeled data.
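To make the setup concrete, below is a minimal PyTorch sketch, not the authors' code, of distilling a transformer teacher into a BiLSTM student. It assumes the teacher's logits and an intermediate representation (e.g. the [CLS] embedding) have been precomputed for each batch, including unlabeled transfer examples; all names and hyperparameters (BiLSTMStudent, distillation_step, temperature, alpha, beta) are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiLSTMStudent(nn.Module):
    """Simple RNN student: embedding -> BiLSTM -> projection -> classifier."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256,
                 teacher_dim=768, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                               bidirectional=True)
        # Projection so the student representation can be compared
        # against the teacher's intermediate representation.
        self.project = nn.Linear(2 * hidden_dim, teacher_dim)
        self.classifier = nn.Linear(teacher_dim, num_classes)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)
        outputs, _ = self.encoder(embedded)
        pooled = outputs.mean(dim=1)          # mean-pool over time steps
        rep = self.project(pooled)
        return self.classifier(rep), rep

def distillation_step(student, token_ids, teacher_logits, teacher_rep,
                      labels=None, temperature=2.0, alpha=0.5, beta=0.1):
    """One training step: soft distillation on teacher logits, matching of the
    teacher's intermediate representation, and (for labeled instances only)
    the usual hard cross-entropy loss."""
    student_logits, student_rep = student(token_ids)

    # Soft distillation: KL divergence between temperature-scaled distributions.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # Match the teacher's intermediate representation.
    rep_loss = F.mse_loss(student_rep, teacher_rep)

    loss = alpha * soft_loss + beta * rep_loss
    if labels is not None:  # labeled training instances also get the hard loss
        loss = loss + (1.0 - alpha) * F.cross_entropy(student_logits, labels)
    return loss
```

Hard distillation corresponds to training the student with only the cross-entropy term, using the teacher's argmax predictions on the unlabeled transfer set as pseudo-labels; the soft and representation terms above sketch the further improvements described in the abstract.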

