Even your Teacher Needs Guidance: Ground-Truth Targets Dampen Regularization Imposed by Self-Distillation

02/25/2021
by   Kenneth Borup, et al.

Knowledge distillation is classically a procedure in which a neural network is trained on the outputs of another network, together with the original targets, in order to transfer knowledge between architectures. The special case of self-distillation, where the network architectures are identical, has been observed to improve generalization accuracy. In this paper, we consider an iterative variant of self-distillation in a kernel regression setting, in which successive steps incorporate both model outputs and the ground-truth targets. This allows us to provide the first theoretical results on the importance of weighting the ground-truth targets in self-distillation. Our focus is on fitting nonlinear functions to training data with a weighted mean squared error objective suitable for distillation, subject to ℓ_2 regularization of the model parameters. We show that any such function obtained with self-distillation can be calculated directly as a function of the initial fit, and that in the limit of infinitely many distillation steps the procedure reduces to the original optimization problem with amplified regularization. Finally, we examine empirically, both in a regression setting and with ResNet networks, how the choice of weighting parameter influences the generalization performance after self-distillation.
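As a rough illustration of the setting described in the abstract (not the authors' exact formulation), the sketch below iterates kernel ridge regression, where each distillation step fits a convex combination of the ground-truth targets and the previous model's predictions. The weighting parameter alpha, the RBF kernel, the regularization strength lam, and the number of steps are all illustrative assumptions.

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    # Pairwise squared distances -> RBF kernel matrix (illustrative kernel choice).
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def iterative_self_distillation(X, y, alpha=0.5, lam=1e-2, steps=5, gamma=1.0):
    """Iterative self-distillation with kernel ridge regression (sketch).

    At each step the fitting target is a convex combination of the
    ground-truth labels y and the previous model's predictions:
        t_{k+1} = alpha * y + (1 - alpha) * f_k(X)
    and the closed-form ell_2-regularized (ridge) solution is reused.
    """
    K = rbf_kernel(X, X, gamma)
    n = len(y)
    targets = y.copy()
    for _ in range(steps):
        # Closed-form kernel ridge fit to the current distillation targets.
        coef = np.linalg.solve(K + lam * np.eye(n), targets)
        preds = K @ coef
        # Mix the ground-truth labels back in for the next step.
        targets = alpha * y + (1 - alpha) * preds
    return coef

# Usage: fit a noisy sine curve (hypothetical data).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)
coef = iterative_self_distillation(X, y, alpha=0.3, lam=0.1, steps=10)
```

Setting alpha = 0 recovers pure self-distillation (each step fits only the previous predictions), while alpha = 1 simply refits the original targets at every step; intermediate values correspond to the weighted ground-truth targets studied in the paper.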

