Local Region Knowledge Distillation
Knowledge distillation (KD) is an effective technique for transferring knowledge from one neural network (the teacher) to another (the student), thereby improving the student's performance. Existing work trains the student to mimic the teacher's outputs on the training data. We argue that transferring knowledge only at sparse training data points does not enable the student to capture the local shape of the teacher's function. To address this issue, we propose locally linear region knowledge distillation (L^2RKD), which transfers knowledge in local, linear regions from a teacher to a student. L^2RKD forces the student to mimic the local shape of the teacher's function within linear regions. Extensive experiments with various network architectures demonstrate that L^2RKD outperforms state-of-the-art approaches by a large margin and is more data-efficient. Moreover, L^2RKD is compatible with existing distillation methods and further improves their performance significantly.
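To make the contrast concrete, the sketch below compares standard KD (matching softened teacher outputs only at the training points) with a local-region variant that also matches teacher and student on points sampled near each input, so the student mimics the teacher's local behavior rather than isolated values. This is only a minimal illustration of the idea, not the paper's exact L^2RKD algorithm: the abstract does not specify how linear regions are identified, and the neighborhood-sampling scheme, `radius`, and function names here are assumptions for illustration.

```python
# Minimal sketch: standard KD vs. a hypothetical local-region KD variant.
# NOTE: the actual L^2RKD construction of linear regions is not described
# in the abstract; neighborhood sampling below is an illustrative stand-in.
import torch
import torch.nn.functional as F


def kd_loss(student_logits, teacher_logits, T=4.0):
    """Standard KD: KL divergence between softened teacher and student outputs."""
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)


def local_region_kd_loss(student, teacher, x, n_samples=4, radius=0.05, T=4.0):
    """Sketch of local-region KD: also match teacher and student on points
    sampled in a small neighborhood of each input, so the student learns the
    local shape of the teacher's function around the training data."""
    with torch.no_grad():
        t_logits = teacher(x)
    loss = kd_loss(student(x), t_logits, T)
    for _ in range(n_samples):
        # Hypothetical neighborhood sampling (small Gaussian perturbation).
        x_near = x + radius * torch.randn_like(x)
        with torch.no_grad():
            t_near = teacher(x_near)
        loss = loss + kd_loss(student(x_near), t_near, T)
    return loss / (n_samples + 1)
```

In practice, the local-region loss would typically be combined with the usual cross-entropy on ground-truth labels, exactly as standard KD is; only the distillation term changes.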