Marginal Utility Diminishes: Exploring the Minimum Knowledge for BERT Knowledge Distillation

by   Yuanxin Liu, et al.

Recently, knowledge distillation (KD) has shown great success in BERT compression. Instead of only learning from the teacher's soft label as in conventional KD, researchers find that the rich information contained in the hidden layers of BERT is conducive to the student's performance. To better exploit the hidden knowledge, a common practice is to force the student to deeply mimic the teacher's hidden states of all the tokens in a layer-wise manner. In this paper, however, we observe that although distilling the teacher's hidden state knowledge (HSK) is helpful, the performance gain (marginal utility) diminishes quickly as more HSK is distilled. To understand this effect, we conduct a series of analysis. Specifically, we divide the HSK of BERT into three dimensions, namely depth, length and width. We first investigate a variety of strategies to extract crucial knowledge for each single dimension and then jointly compress the three dimensions. In this way, we show that 1) the student's performance can be improved by extracting and distilling the crucial HSK, and 2) using a tiny fraction of HSK can achieve the same performance as extensive HSK distillation. Based on the second finding, we further propose an efficient KD paradigm to compress BERT, which does not require loading the teacher during the training of student. For two kinds of student models and computing devices, the proposed KD paradigm gives rise to training speedup of 2.7x   3.4x.


page 2

page 4

page 5

page 7

page 8

page 9

page 10

page 13


Patient Knowledge Distillation for BERT Model Compression

Pre-trained language models such as BERT have proven to be highly effect...

One Teacher is Enough? Pre-trained Language Model Distillation from Multiple Teachers

Pre-trained language models (PLMs) achieve great success in NLP. However...

On student-teacher deviations in distillation: does it pay to disobey?

Knowledge distillation has been widely-used to improve the performance o...

How to Distill your BERT: An Empirical Study on the Impact of Weight Initialisation and Distillation Objectives

Recently, various intermediate layer distillation (ILD) objectives have ...

Less is More: Task-aware Layer-wise Distillation for Language Model Compression

Layer-wise distillation is a powerful tool to compress large models (i.e...

Partial to Whole Knowledge Distillation: Progressive Distilling Decomposed Knowledge Boosts Student Better

Knowledge distillation field delicately designs various types of knowled...

An Efficient Federated Distillation Learning System for Multi-task Time Series Classification

This paper proposes an efficient federated distillation learning system ...

Please sign up or login with your details

Forgot password? Click here to reset