Distilling Double Descent

02/13/2021
by Andrew Cotter, et al.

Distillation is the technique of training a "student" model on examples that are labeled by a separate "teacher" model, which is itself trained on a labeled dataset. The most common explanations for why distillation "works" are predicated on the assumption that the student is provided with soft labels (probabilities or confidences) from the teacher model. In this work, we show that, even when the teacher model is highly overparameterized and provides only hard labels, using a very large held-out unlabeled dataset to train the student model can result in a model that outperforms more "traditional" approaches. Our explanation for this phenomenon is based on recent work on "double descent": it has been observed that, once a model's complexity roughly exceeds the amount required to memorize the training data, increasing the complexity further can, counterintuitively, improve generalization. Researchers have identified several settings in which double descent takes place, and others have attempted to explain it (thus far, with only partial success). In contrast, we sidestep these questions and instead seek to exploit the phenomenon, demonstrating that a highly overparameterized teacher can avoid overfitting via double descent, while a student trained on a larger, independent dataset labeled by this teacher avoids overfitting due to the size of its training set.
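
The pipeline the abstract describes can be summarized concretely: fit an overparameterized teacher on the labeled data, use its hard (argmax) predictions to label a large unlabeled pool, and train the student on those pseudo-labels. The sketch below illustrates this with scikit-learn; the synthetic dataset, model classes, and sizes are illustrative assumptions on my part, not the paper's experimental setup.

```python
# Minimal sketch of hard-label distillation with a large unlabeled pool.
# Dataset, architectures, and split sizes are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Small labeled set for the teacher, large unlabeled pool for the student,
# and a held-out test set for comparison.
X, y = make_classification(n_samples=60_000, n_features=20, random_state=0)
X_labeled, X_rest, y_labeled, y_rest = train_test_split(
    X, y, train_size=2_000, random_state=0)
X_unlabeled, X_test, _, y_test = train_test_split(
    X_rest, y_rest, train_size=50_000, random_state=0)

# Highly overparameterized teacher fit on the small labeled set.
teacher = MLPClassifier(hidden_layer_sizes=(1024, 1024), max_iter=300,
                        random_state=0).fit(X_labeled, y_labeled)

# Hard labels only: the teacher's argmax predictions, no soft probabilities.
pseudo_labels = teacher.predict(X_unlabeled)

# Student trained on the large teacher-labeled dataset.
student = MLPClassifier(hidden_layer_sizes=(256,), max_iter=300,
                        random_state=0).fit(X_unlabeled, pseudo_labels)

# Baseline: the same student architecture trained directly on the labeled data.
baseline = MLPClassifier(hidden_layer_sizes=(256,), max_iter=300,
                         random_state=0).fit(X_labeled, y_labeled)

print("teacher accuracy: ", teacher.score(X_test, y_test))
print("student accuracy: ", student.score(X_test, y_test))
print("baseline accuracy:", baseline.score(X_test, y_test))
```

The design point the sketch is meant to show: the student never sees ground-truth labels beyond what the teacher encodes, yet it trains on a pool many times larger than the labeled set, which is the mechanism the abstract credits for avoiding overfitting on the student side.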
