SKDBERT: Compressing BERT via Stochastic Knowledge Distillation

by   Zixiang Ding, et al.
NetEase, Inc

In this paper, we propose Stochastic Knowledge Distillation (SKD) to obtain compact BERT-style language model dubbed SKDBERT. In each iteration, SKD samples a teacher model from a pre-defined teacher ensemble, which consists of multiple teacher models with multi-level capacities, to transfer knowledge into student model in an one-to-one manner. Sampling distribution plays an important role in SKD. We heuristically present three types of sampling distributions to assign appropriate probabilities for multi-level teacher models. SKD has two advantages: 1) it can preserve the diversities of multi-level teacher models via stochastically sampling single teacher model in each iteration, and 2) it can also improve the efficacy of knowledge distillation via multi-level teacher models when large capacity gap exists between the teacher model and the student model. Experimental results on GLUE benchmark show that SKDBERT reduces the size of a BERT_ BASE model by 40 language understanding and being 100


page 5

page 6

page 7

page 14

page 15

page 16


Adaptive Multi-Teacher Multi-level Knowledge Distillation

Knowledge distillation (KD) is an effective learning paradigm for improv...

AutoDisc: Automatic Distillation Schedule for Large Language Model Compression

Driven by the teacher-student paradigm, knowledge distillation is one of...

Multi-Level Knowledge Distillation for Out-of-Distribution Detection in Text

Self-supervised representation learning has proved to be a valuable comp...

CDFKD-MFS: Collaborative Data-free Knowledge Distillation via Multi-level Feature Sharing

Recently, the compression and deployment of powerful deep neural network...

Multi-level Distillation of Semantic Knowledge for Pre-training Multilingual Language Model

Pre-trained multilingual language models play an important role in cross...

Distilling Multi-Level X-vector Knowledge for Small-footprint Speaker Verification

Deep speaker models yield low error rates in speaker verification. Nonet...

Please sign up or login with your details

Forgot password? Click here to reset