Distribution Shift Matters for Knowledge Distillation with Webly Collected Images

by Jialiang Tang, et al.
Nanjing University

Knowledge distillation aims to learn a lightweight student network from a pre-trained teacher network. In practice, existing knowledge distillation methods are usually infeasible when the original training data is unavailable due to privacy issues and data management considerations. Therefore, data-free knowledge distillation approaches have been proposed that collect training instances from the Internet. However, most of them ignore the common distribution shift between instances from the original training data and the webly collected data, which affects the reliability of the trained student network. To solve this problem, we propose a novel method dubbed “Knowledge Distillation between Different Distributions” (KD^3), which consists of three components. Specifically, we first dynamically select useful training instances from the webly collected data according to the combined predictions of the teacher network and the student network. Subsequently, we align both the weighted features and the classifier parameters of the two networks for knowledge memorization. Meanwhile, we also build a new contrastive learning block called MixDistribution to generate perturbed data with a new distribution for instance alignment, so that the student network can further learn a distribution-invariant representation. Extensive experiments on various benchmark datasets demonstrate that our proposed KD^3 can outperform state-of-the-art data-free knowledge distillation approaches.
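As a rough illustration of the first component (selecting webly collected instances from the combined predictions of the two networks), the following is a minimal sketch and not the authors' implementation. The abstract does not state the selection rule, so everything here is an assumption: the function names `softmax` and `select_instances`, the mixing weight `alpha`, the confidence `threshold`, and the choice to keep an instance when the maximum of the averaged class probabilities is high enough.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_instances(teacher_logits, student_logits, threshold=0.6, alpha=0.5):
    """Return indices of instances judged useful for distillation.

    Hypothetical rule (an assumption, not taken from the paper): mix the
    teacher and student class probabilities with weight `alpha`, then keep
    an instance if the largest mixed probability reaches `threshold`.
    """
    selected = []
    for i, (t, s) in enumerate(zip(teacher_logits, student_logits)):
        p_teacher = softmax(t)
        p_student = softmax(s)
        combined = [alpha * a + (1 - alpha) * b
                    for a, b in zip(p_teacher, p_student)]
        if max(combined) >= threshold:
            selected.append(i)
    return selected


# Toy logits for two web images over three classes.
teacher = [[4.0, 0.0, 0.0], [0.1, 0.0, 0.2]]
student = [[3.0, 0.0, 0.0], [0.0, 0.1, 0.0]]
kept = select_instances(teacher, student)
```

In this toy input both networks are confident about the first image and close to uniform on the second, so only the first index is kept. Because the student changes during training, a rule of this kind would be re-evaluated as training proceeds, which is one way to read "dynamically select" in the abstract.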




Zero-Shot Knowledge Distillation in Deep Networks

Knowledge distillation deals with the problem of training a smaller mode...

How to Teach: Learning Data-Free Knowledge Distillation from Curriculum

Data-free knowledge distillation (DFKD) aims at training lightweight stu...

Few Shot Network Compression via Cross Distillation

Model compression has been widely adopted to obtain light-weighted deep ...

On-Device Domain Generalization

We present a systematic study of domain generalization (DG) for tiny neu...

Imitation networks: Few-shot learning of neural networks from scratch

In this paper, we propose imitation networks, a simple but effective met...

Web Content Filtering through knowledge distillation of Large Language Models

We introduce a state-of-the-art approach for URL categorization that lev...

Up to 100x Faster Data-free Knowledge Distillation

Data-free knowledge distillation (DFKD) has recently been attracting inc...
