Distilling Linguistic Context for Language Model Compression

09/17/2021
by Geondo Park, et al.

A computationally expensive and memory-intensive neural network lies behind the recent success of language representation learning. Knowledge distillation, a major technique for deploying such vast language models in resource-scarce environments, transfers the knowledge of individual word representations learned without restrictions. In this paper, inspired by recent observations that language representations are relatively positioned and carry more semantic knowledge as a whole, we present a new knowledge distillation objective for language representation learning that transfers contextual knowledge via two types of relationships across representations: Word Relation and Layer Transforming Relation. Unlike other recent distillation techniques for language models, our contextual distillation places no restrictions on architectural differences between teacher and student. We validate the effectiveness of our method on challenging language understanding benchmarks, not only across architectures of various sizes, but also in combination with DynaBERT, a recently proposed adaptive size-pruning method.
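As an illustration only (the abstract does not spell out the exact formulation), a relation-based distillation loss of the kind described above might be sketched as follows. The function names, the choice of cosine similarity, the MSE matching criterion, and the assumption that teacher and student layers have already been mapped to equal-length lists are all assumptions for this sketch, not the authors' stated method.

```python
import torch
import torch.nn.functional as F


def word_relation_loss(teacher_hidden, student_hidden):
    """Hypothetical sketch: match pairwise token-to-token relations
    (cosine similarities) between teacher and student hidden states.

    teacher_hidden: [batch, seq_len, d_teacher]
    student_hidden: [batch, seq_len, d_student]

    Because only relations *between* tokens are compared, the teacher and
    student hidden sizes need not match, so no projection layer is required.
    """
    t = F.normalize(teacher_hidden, dim=-1)
    s = F.normalize(student_hidden, dim=-1)
    # Pairwise cosine-similarity matrices: [batch, seq_len, seq_len]
    t_rel = torch.bmm(t, t.transpose(1, 2))
    s_rel = torch.bmm(s, s.transpose(1, 2))
    return F.mse_loss(s_rel, t_rel)


def layer_transform_relation_loss(teacher_layers, student_layers):
    """Hypothetical sketch: match how each token's representation changes
    across layers. `teacher_layers` and `student_layers` are lists of
    hidden states of equal length (already mapped layer-to-layer), each
    of shape [batch, seq_len, dim]."""
    loss, num_pairs = 0.0, 0
    for i in range(len(teacher_layers)):
        for j in range(i + 1, len(teacher_layers)):
            # Per-token cosine similarity between layer i and layer j.
            t_rel = F.cosine_similarity(teacher_layers[i], teacher_layers[j], dim=-1)
            s_rel = F.cosine_similarity(student_layers[i], student_layers[j], dim=-1)
            loss = loss + F.mse_loss(s_rel, t_rel)
            num_pairs += 1
    return loss / max(num_pairs, 1)
```

In this sketch, both losses compare only relational quantities (similarities between tokens or between layers), which is what allows the teacher and student to differ in hidden size and depth without any architectural constraint.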


