Empirical Analysis of Knowledge Distillation Technique for Optimization of Quantized Deep Neural Networks
Knowledge distillation (KD) is a widely used method for model size reduction. Recently, the technique has been exploited for training quantized deep neural networks (QDNNs) as a way to restore the performance sacrificed by word-length reduction. KD, however, introduces additional hyper-parameters for QDNN training, such as the temperature, the distillation coefficient, and the size of the teacher network. We analyze the effect of these hyper-parameters on QDNN optimization with KD. We find that these hyper-parameters are inter-related, and we also introduce a simple and effective technique that gradually reduces the distillation coefficient during training. With KD employing the proposed hyper-parameters, we achieve a test accuracy of 92.7% on the CIFAR-10 data set, with corresponding results reported for the CIFAR-100 data set.
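For reference, below is a minimal PyTorch-style sketch of the hyper-parameters discussed above: a KD loss with a temperature T and a distillation coefficient alpha, plus a schedule that decays the coefficient over training. The function names (kd_loss, alpha_schedule) and the linear decay shape are illustrative assumptions; the abstract only states that the coefficient is reduced during training, not how.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Standard KD objective: a weighted sum of the hard-label cross-entropy
    and the KL divergence between temperature-softened teacher and student
    distributions (scaled by T*T to keep gradient magnitudes comparable)."""
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    return (1.0 - alpha) * ce + alpha * kl

def alpha_schedule(epoch, total_epochs, alpha_start=0.9, alpha_end=0.0):
    """Linearly decay the distillation coefficient so the hard-label loss
    dominates in later epochs (an assumed schedule, not the paper's exact one)."""
    frac = min(epoch / max(total_epochs - 1, 1), 1.0)
    return alpha_start + frac * (alpha_end - alpha_start)
```

In a training loop, alpha would be recomputed each epoch via alpha_schedule and passed to kd_loss alongside the (quantized) student's logits and the full-precision teacher's logits.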