Efficient Knowledge Distillation from Model Checkpoints

10/12/2022
by Chaofei Wang, et al.

Knowledge distillation is an effective approach for learning compact models (students) under the supervision of large, strong models (teachers). Since there is empirically a strong correlation between the performance of teacher and student models, it is commonly believed that a high-performing teacher is preferable; consequently, practitioners tend to use a well-trained network, or an ensemble of them, as the teacher. In this paper, we make the intriguing observation that an intermediate model, i.e., a checkpoint from the middle of the training procedure, often serves as a better teacher than the fully converged model, even though the former has much lower accuracy. More surprisingly, a weak snapshot ensemble of several intermediate models from the same training trajectory can outperform a strong ensemble of independently trained, fully converged models when they are used as teachers. We show that this phenomenon can be partially explained by the information bottleneck principle: the feature representations of intermediate models retain higher mutual information with the input, and thus contain more "dark knowledge" for effective distillation. We further propose an optimal intermediate teacher selection algorithm based on maximizing the total task-related mutual information. Experiments verify its effectiveness and applicability.
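To make the setup concrete, the sketch below (PyTorch) shows the standard logit-matching distillation loss with the teacher's weights loaded from an intermediate checkpoint rather than the final converged model. It is a minimal illustration, not the authors' code: the toy model definitions, the checkpoint path, and the temperature/alpha values are illustrative assumptions, and the paper's mutual-information-based teacher selection algorithm is not shown.

import torch
import torch.nn.functional as F
from torch import nn


def distillation_step(student: nn.Module,
                      teacher: nn.Module,
                      x: torch.Tensor,
                      y: torch.Tensor,
                      temperature: float = 4.0,
                      alpha: float = 0.9) -> torch.Tensor:
    """One KD loss evaluation: soft targets from the frozen teacher plus hard labels."""
    with torch.no_grad():
        teacher_logits = teacher(x)
    student_logits = student(x)

    # KL divergence between temperature-softened distributions ("dark knowledge").
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * temperature ** 2

    # Ordinary cross-entropy on the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, y)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss


if __name__ == "__main__":
    # Toy stand-ins for the teacher and student architectures (assumption).
    teacher = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32 * 3, 10))
    student = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32 * 3, 10))

    # Central idea of the paper: load an *intermediate* checkpoint of the teacher,
    # saved partway through its training, instead of the fully converged weights.
    # teacher.load_state_dict(torch.load("teacher_epoch_120.pt"))  # hypothetical path
    teacher.eval()

    x = torch.randn(8, 3, 32, 32)
    y = torch.randint(0, 10, (8,))
    loss = distillation_step(student, teacher, x, y)
    loss.backward()
    print(f"KD loss: {loss.item():.4f}")

For the snapshot-ensemble setting described in the abstract, the same loss would be computed after averaging the temperature-softened teacher probabilities over several checkpoints taken from one training run.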

Related research:

- Learn From the Past: Experience Ensemble Knowledge Distillation (02/25/2022)
- Estimating and Maximizing Mutual Information for Knowledge Distillation (10/29/2021)
- On the Efficacy of Knowledge Distillation (10/03/2019)
- Distiller: A Systematic Study of Model Distillation Methods in Natural Language Processing (09/23/2021)
- Transferring Learning Trajectories of Neural Networks (05/23/2023)
- Exploring new ways: Enforcing representational dissimilarity to learn new features and reduce error consistency (07/05/2023)
- Progressive Knowledge Distillation: Building Ensembles for Efficient Inference (02/20/2023)
