ALM-KD: Knowledge Distillation with noisy labels via adaptive loss mixing
Knowledge distillation is a technique where the outputs of a pretrained model, often known as the teacher model is used for training a student model in a supervised setting. The teacher model outputs being a richer distribution over labels should improve the student model's performance as opposed to training with the usual hard labels. However, the label distribution imposed by the logits of the teacher network may not be always informative and may lead to poor student performance. We tackle this problem via the use of an adaptive loss mixing scheme during KD. Specifically, our method learns an instance-specific convex combination of the teacher-matching and label supervision objectives, using meta learning on a validation metric signalling to the student `how much' of KD is to be used. Through a range of experiments on controlled synthetic data and real-world datasets, we demonstrate performance gains obtained using our approach in the standard KD setting as well as in multi-teacher and self-distillation settings.
READ FULL TEXT