SKDBERT: Compressing BERT via Stochastic Knowledge Distillation

11/26/2022
by Zixiang Ding, et al.

In this paper, we propose Stochastic Knowledge Distillation (SKD) to obtain a compact BERT-style language model dubbed SKDBERT. In each iteration, SKD samples a teacher model from a pre-defined teacher ensemble, which consists of multiple teacher models with multi-level capacities, and transfers its knowledge into the student model in a one-to-one manner. The sampling distribution plays an important role in SKD, and we heuristically present three types of sampling distributions to assign appropriate probabilities to the multi-level teacher models. SKD has two advantages: 1) it preserves the diversity of the multi-level teacher models by stochastically sampling a single teacher model in each iteration, and 2) it improves the efficacy of knowledge distillation via multi-level teacher models when a large capacity gap exists between the teacher model and the student model. Experimental results on the GLUE benchmark show that SKDBERT reduces the size of a BERT_BASE model by 40% while retaining competitive language understanding performance and being 100% faster.
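A minimal sketch of a single SKD training step may help illustrate the idea described above: one teacher is drawn from the multi-level ensemble according to the sampling distribution, and only that teacher distills into the student for the current iteration. The soft-label (temperature-scaled KL) loss, the loss weighting, and all function names below are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def skd_step(student, teachers, probs, batch, optimizer, temperature=4.0):
    """One hypothetical SKD iteration: sample a single teacher from the
    pre-defined multi-level ensemble with probabilities `probs`, then
    distill its softened logits into the student (one-to-one)."""
    inputs, labels = batch

    # Stochastically pick one teacher per iteration from the sampling distribution.
    idx = torch.multinomial(torch.tensor(probs, dtype=torch.float), num_samples=1).item()
    teacher = teachers[idx]

    with torch.no_grad():
        teacher_logits = teacher(inputs)
    student_logits = student(inputs)

    # Temperature-scaled soft-label distillation loss plus the supervised loss
    # (an assumed combination, shown only for completeness of the sketch).
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    ce_loss = F.cross_entropy(student_logits, labels)
    loss = kd_loss + ce_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```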
