Efficient Kernel Transfer in Knowledge Distillation

09/30/2020
by Qi Qian, et al.

Knowledge distillation is an effective approach to model compression in deep learning. Given a large model (i.e., the teacher), it aims to improve the performance of a compact model (i.e., the student) by transferring information from the teacher. An essential challenge in knowledge distillation is identifying the appropriate information to transfer. Early works used only the final output of the teacher model as a soft label to guide the training of the student; more recently, information from intermediate layers has also been adopted for better distillation. In this work, we optimize the knowledge distillation process from the perspective of the kernel matrix. The output of each layer in a neural network can be viewed as a new feature space generated by applying a kernel function to the original images. Hence, we propose to transfer the corresponding kernel matrix (i.e., the Gram matrix) from the teacher to the student for distillation. However, the size of the full kernel matrix is quadratic in the number of examples. To improve efficiency, we decompose the original kernel matrix with the Nyström method and transfer only the partial matrix obtained from landmark points, whose size is linear in the number of examples. More importantly, our theoretical analysis shows that the difference between the original kernel matrices of the teacher and the student is well bounded by the difference between their corresponding partial matrices. Finally, a new strategy for generating appropriate landmark points is proposed for better distillation. An empirical study on benchmark data sets demonstrates the effectiveness of the proposed algorithm. Code will be released.
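To make the idea concrete, below is a minimal PyTorch sketch (not the authors' released implementation) of a Nyström-style kernel-transfer loss: it compares only the landmark columns of the student and teacher Gram matrices, so each matched matrix is n x m rather than n x n. The names (kernel_transfer_loss, linear_kernel, landmark_idx) and the random choice of landmarks are illustrative assumptions; the paper proposes its own strategy for generating landmark points.

```python
# Illustrative sketch of partial (Nyström-style) kernel matrix transfer.
# Assumes intermediate features for a batch from both models are available.
import torch
import torch.nn.functional as F


def linear_kernel(x, z):
    """Gram matrix between row-wise feature sets x (n x d) and z (m x d)."""
    return x @ z.t()


def kernel_transfer_loss(f_student, f_teacher, landmark_idx):
    """Match the landmark columns K[:, landmarks] of the student and teacher
    kernel (Gram) matrices; each partial matrix is n x m, linear in n."""
    # Normalize features so kernel values are on a comparable scale
    # even though student and teacher feature dimensions differ.
    f_s = F.normalize(f_student, dim=1)
    f_t = F.normalize(f_teacher, dim=1)

    k_s = linear_kernel(f_s, f_s[landmark_idx])           # (n, m)
    k_t = linear_kernel(f_t, f_t[landmark_idx]).detach()  # teacher is fixed

    return F.mse_loss(k_s, k_t)


if __name__ == "__main__":
    n, d_s, d_t, m = 64, 128, 512, 8          # batch size, feature dims, landmarks
    f_student = torch.randn(n, d_s, requires_grad=True)
    f_teacher = torch.randn(n, d_t)
    landmark_idx = torch.randperm(n)[:m]      # random landmarks, for illustration only
    loss = kernel_transfer_loss(f_student, f_teacher, landmark_idx)
    loss.backward()
    print(loss.item())
```

In practice, a loss of this form would be added to the student's usual training objective (e.g., cross-entropy, possibly with soft labels from the teacher).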


Related research

02/12/2021 · Learning Student-Friendly Teacher Networks for Knowledge Distillation
We propose a novel knowledge distillation approach to facilitate the tra...

10/20/2020 · Knowledge Distillation in Wide Neural Networks: Risk Bound, Data Efficiency and Imperfect Teacher
Knowledge distillation is a strategy of training a student network with ...

01/31/2018 · Kernel Distillation for Gaussian Processes
Gaussian processes (GPs) are flexible models that can capture complex st...

12/20/2022 · Adam: Dense Retrieval Distillation with Adaptive Dark Examples
To improve the performance of the dual-encoder retriever, one effective ...

04/10/2021 · Data-Free Knowledge Distillation with Soft Targeted Transfer Set Synthesis
Knowledge distillation (KD) has proved to be an effective approach for d...

05/28/2022 · Parameter-Efficient and Student-Friendly Knowledge Distillation
Knowledge distillation (KD) has been extensively employed to transfer th...

10/15/2020 · Spherical Knowledge Distillation
Knowledge distillation aims at obtaining a small but effective deep mode...
