SDBERT: SparseDistilBERT, a faster and smaller BERT model

by   Devaraju Vinoda, et al.

In this work we introduce a new transformer architecture called SparseDistilBERT (SDBERT), which is a combination of sparse attention and knowledge distillantion (KD). We implemented sparse attention mechanism to reduce quadratic dependency on input length to linear. In addition to reducing computational complexity of the model, we used knowledge distillation (KD). We were able to reduce the size of BERT model by 60 performance and it only took 40


Improving Knowledge Distillation for BERT Models: Loss Functions, Mapping Methods, and Weight Tuning

The use of large transformer-based models such as BERT, GPT, and T5 has ...

Attentive Student Meets Multi-Task Teacher: Improved Knowledge Distillation for Pretrained Models

In this paper, we explore the knowledge distillation approach under the ...

Big Bird: Transformers for Longer Sequences

Transformers-based models, such as BERT, have been one of the most succe...

QuaLA-MiniLM: a Quantized Length Adaptive MiniLM

Limited computational budgets often prevent transformers from being used...

Using Prior Knowledge to Guide BERT's Attention in Semantic Textual Matching Tasks

We study the problem of incorporating prior knowledge into a deep Transf...

FAN-Trans: Online Knowledge Distillation for Facial Action Unit Detection

Due to its importance in facial behaviour analysis, facial action unit (...

Larger Probes Tell a Different Story: Extending Psycholinguistic Datasets Via In-Context Learning

Language model probing is often used to test specific capabilities of th...

Please sign up or login with your details

Forgot password? Click here to reset