Fine-tuning can cripple your foundation model; preserving features may be the solution

by Jishnu Mukhoti et al.

Pre-trained foundation models, owing primarily to their enormous capacity and their exposure to vast amounts of training data scraped from the internet, store knowledge about a large number of real-world concepts. Such models are typically fine-tuned on downstream datasets, where they achieve remarkable state-of-the-art performance. While many fine-tuning methods have been devised and shown to be highly effective, we observe that a fine-tuned model's ability to recognize concepts on tasks other than the downstream one is significantly reduced compared to its pre-trained counterpart. This is clearly undesirable, as a huge amount of time and money went into learning those very concepts in the first place. We call this undesirable phenomenon "concept forgetting" and show experimentally that most end-to-end fine-tuning approaches suffer heavily from this side effect. We also propose a rather simple fix: a method called LDIFS (short for ℓ_2 distance in feature space) that simply preserves the features of the original foundation model during fine-tuning. We show that LDIFS significantly reduces concept forgetting without a noticeable impact on downstream task performance.
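To make the idea concrete, here is a minimal sketch (not the authors' code) of an ℓ_2 feature-distance penalty: a frozen copy of the pre-trained backbone provides reference features, and the fine-tuning objective adds the mean squared ℓ_2 distance between the fine-tuned and pre-trained features. The linear feature extractor, the weighting `lam`, and all names here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear feature extractor standing in for a foundation
# model's backbone: features(x) = x @ W. The pre-trained weights W_pre
# are kept frozen; W_ft are the weights being fine-tuned.
W_pre = rng.normal(size=(16, 8))
W_ft = W_pre.copy()  # fine-tuning starts from the pre-trained weights

def features(x, W):
    return x @ W

def ldifs_penalty(x, W_ft, W_pre):
    """Mean squared l2 distance between fine-tuned and pre-trained features."""
    diff = features(x, W_ft) - features(x, W_pre)
    return float(np.mean(np.sum(diff ** 2, axis=1)))

def total_loss(task_loss, x, W_ft, W_pre, lam=0.1):
    """Fine-tuning objective: downstream task loss plus the
    feature-preservation term, weighted by lam (an assumed hyperparameter)."""
    return task_loss + lam * ldifs_penalty(x, W_ft, W_pre)

x = rng.normal(size=(4, 16))
print(ldifs_penalty(x, W_ft, W_pre))        # 0.0 before any update
W_ft += 0.01 * rng.normal(size=W_ft.shape)  # a fine-tuning step drifts the features
print(ldifs_penalty(x, W_ft, W_pre) > 0)    # penalty is now positive
```

Minimizing `total_loss` pulls the fine-tuned features back toward the pre-trained ones, which is how the penalty discourages concept forgetting while still allowing the task loss to shape the model.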




