Efficient Kernel Transfer in Knowledge Distillation

by Qi Qian, et al.

Knowledge distillation is an effective approach to model compression in deep learning. Given a large model (i.e., the teacher model), it aims to improve the performance of a compact model (i.e., the student model) by transferring information from the teacher. An essential challenge in knowledge distillation is identifying the appropriate information to transfer. In early work, only the final output of the teacher model was used, as a soft label to help train the student model. More recently, information from intermediate layers has also been adopted for better distillation. In this work, we aim to optimize the process of knowledge distillation from the perspective of the kernel matrix. The output of each layer in a neural network can be considered a new feature space generated by applying a kernel function to the original images. Hence, we propose to transfer the corresponding kernel matrix (i.e., Gram matrix) from the teacher model to the student model for distillation. However, the size of the full kernel matrix is quadratic in the number of examples. To improve efficiency, we decompose the original kernel matrix with the Nyström method and then transfer the partial matrix obtained with landmark points, whose size is linear in the number of examples. More importantly, our theoretical analysis shows that the difference between the original kernel matrices of the teacher and the student can be well bounded by the difference between their corresponding partial matrices. Finally, a new strategy for generating appropriate landmark points is proposed for better distillation. An empirical study on benchmark data sets demonstrates the effectiveness of the proposed algorithm. Code will be released.
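
The idea of transferring a partial kernel matrix rather than the full one can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the linear (inner-product) kernel, the mean-squared loss, the function names, and the random choice of landmark points are all assumptions made for illustration, and the abstract's own landmark-generation strategy is not reproduced here.

```python
import numpy as np

def partial_kernel(features, landmarks):
    """n x m matrix of inner products between n examples and m landmark points.

    Assumption: a linear kernel on layer outputs; the paper may use another kernel.
    """
    return features @ landmarks.T

def nystrom_approx(features, landmarks, rcond=1e-10):
    """Nystrom approximation of the full n x n Gram matrix: C W^+ C^T."""
    C = partial_kernel(features, landmarks)   # n x m, linear in n
    W = landmarks @ landmarks.T               # m x m kernel among landmarks
    return C @ np.linalg.pinv(W, rcond=rcond) @ C.T

def partial_transfer_loss(student_feats, teacher_feats,
                          student_landmarks, teacher_landmarks):
    """Hypothetical distillation loss: mean squared gap between partial matrices."""
    diff = (partial_kernel(student_feats, student_landmarks)
            - partial_kernel(teacher_feats, teacher_landmarks))
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
n, m = 200, 10                                # examples, landmark points
teacher = rng.normal(size=(n, 64))            # stand-in for teacher layer outputs
student = rng.normal(size=(n, 16))            # stand-in for student layer outputs

# Assumption: landmarks are the same randomly sampled examples for both models.
idx = rng.choice(n, size=m, replace=False)
loss = partial_transfer_loss(student, teacher, student[idx], teacher[idx])

# The loss touches only n * m entries per model instead of n * n.
print(partial_kernel(teacher, teacher[idx]).shape, loss)
```

Each partial matrix has `n * m` entries, so the cost of comparing teacher and student grows linearly with the number of examples when `m` is fixed, whereas comparing full Gram matrices would need `n * n` entries.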




Learning Student-Friendly Teacher Networks for Knowledge Distillation

We propose a novel knowledge distillation approach to facilitate the tra...

Knowledge Distillation in Wide Neural Networks: Risk Bound, Data Efficiency and Imperfect Teacher

Knowledge distillation is a strategy of training a student network with ...

Kernel Distillation for Gaussian Processes

Gaussian processes (GPs) are flexible models that can capture complex st...

Adam: Dense Retrieval Distillation with Adaptive Dark Examples

To improve the performance of the dual-encoder retriever, one effective ...

Data-Free Knowledge Distillation with Soft Targeted Transfer Set Synthesis

Knowledge distillation (KD) has proved to be an effective approach for d...

Parameter-Efficient and Student-Friendly Knowledge Distillation

Knowledge distillation (KD) has been extensively employed to transfer th...

Spherical Knowledge Distillation

Knowledge distillation aims at obtaining a small but effective deep mode...
