Exploring the Connection between Knowledge Distillation and Logits Matching
Knowledge distillation is a generalized logits matching technique for model compression. Their equivalence was previously established only under the conditions of infinite temperature and zero-mean normalization. In this paper, we prove that with infinite temperature alone, the effect of knowledge distillation equals that of logits matching with an extra regularization term. Furthermore, we reveal that an even weaker condition, equal-mean initialization rather than the original zero-mean normalization, already suffices to establish the equivalence. The key to our proof is the observation that in modern neural networks with the cross-entropy loss and softmax activation, the mean of the back-propagated gradient on the logits always remains zero.
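As a quick sanity check of the zero-mean gradient claim, the following is a minimal sketch of the standard softmax cross-entropy gradient; the notation (logits z, temperature T, softmax outputs p, one-hot targets y) is chosen here for illustration and is not taken from the paper.

```latex
% Softmax with temperature T:  p_i = exp(z_i / T) / sum_j exp(z_j / T)
% Cross-entropy loss with one-hot target y:  L = - sum_i y_i log p_i
\[
\frac{\partial L}{\partial z_i} = \frac{1}{T}\,(p_i - y_i),
\qquad
\sum_i \frac{\partial L}{\partial z_i}
  = \frac{1}{T}\Big(\sum_i p_i - \sum_i y_i\Big)
  = \frac{1}{T}\,(1 - 1) = 0 .
\]
```

Since the probabilities and the one-hot target each sum to one, the per-sample gradient on the logits sums (and hence averages) to zero, regardless of the temperature.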