LadaBERT: Lightweight Adaptation of BERT through Hybrid Model Compression

by Yihuan Mao et al.

BERT is a cutting-edge language representation model pre-trained on a large corpus, which achieves superior performance on various natural language understanding tasks. However, a major obstacle to deploying BERT in online services is that it is memory-intensive and leads to unsatisfactory request latency, raising the need for model compression. Existing solutions leverage the knowledge distillation framework to learn a smaller model that imitates the behaviors of BERT. However, the training procedure of knowledge distillation is itself expensive, since it requires sufficient training data to imitate the teacher model. In this paper, we address this issue by proposing a hybrid solution named LadaBERT (Lightweight adaptation of BERT through hybrid model compression), which combines the advantages of different model compression methods, including weight pruning, matrix factorization, and knowledge distillation. LadaBERT achieves state-of-the-art accuracy on various public datasets while reducing training overhead by an order of magnitude.
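To make the three ingredients of the hybrid approach concrete, the following is a minimal PyTorch sketch, not the authors' implementation: it shows magnitude-based weight pruning, low-rank matrix factorization via SVD, and a standard soft-target distillation loss. The helper names, the sparsity/rank parameters, and the way the pieces are combined are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    # Hypothetical helper: zero out the smallest-magnitude entries of a weight matrix.
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight * (weight.abs() > threshold).float()

def low_rank_factorize(weight: torch.Tensor, rank: int):
    # Approximate a (out x in) weight matrix by two low-rank factors via truncated SVD.
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # shape (out, rank)
    B = Vh[:rank, :]             # shape (rank, in)
    return A, B

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Blend soft-target KL distillation against the teacher with ordinary cross-entropy.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```

In a hybrid scheme along these lines, pruning and factorization shrink the student model's weight matrices, and the distillation loss is then used during fine-tuning so the compressed student recovers accuracy by matching the teacher's output distribution.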


