Make Your Pre-trained Model Reversible: From Parameter to Memory Efficient Fine-Tuning

by   Baohao Liao, et al.

Parameter-efficient fine-tuning (PEFT) of pre-trained language models (PLMs) has emerged as a highly successful approach, with training only a small number of parameters without sacrificing performance and becoming the de-facto learning paradigm with the increasing size of PLMs. However, existing PEFT methods are not memory-efficient, because they still require caching most of the intermediate activations for the gradient calculation, akin to fine-tuning. One effective way to reduce the activation memory is to apply a reversible model, so the intermediate activations are not necessary to be cached and can be recomputed. Nevertheless, modifying a PLM to its reversible variant with PEFT is not straightforward, since the reversible model has a distinct architecture from the currently released PLMs. In this paper, we first investigate what is a key factor for the success of existing PEFT methods, and realize that it's essential to preserve the PLM's starting point when initializing a PEFT method. With this finding, we propose memory-efficient fine-tuning (MEFT) that inserts adapters into a PLM, preserving the PLM's starting point and making it reversible without additional pre-training. We evaluate MEFT on the GLUE benchmark and five question-answering tasks with various backbones, BERT, RoBERTa, BART and OPT. MEFT significantly reduces the activation memory up to 84 trainable parameters. Moreover, MEFT achieves the same score on GLUE and a comparable score on the question-answering tasks as full fine-tuning.


page 1

page 2

page 3

page 4


LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning

Fine-tuning large pre-trained models on downstream tasks has been adopte...

LoRA-FA: Memory-efficient Low-rank Adaptation for Large Language Models Fine-tuning

The low-rank adaptation (LoRA) method can largely reduce the amount of t...

AutoFreeze: Automatically Freezing Model Blocks to Accelerate Fine-tuning

With the rapid adoption of machine learning (ML), a number of domains no...

Parameter-efficient is not sufficient: Exploring Parameter, Memory, and Time Efficient Adapter Tuning for Dense Predictions

Pre-training fine-tuning is a prevalent paradigm in computer vision ...

PaReprop: Fast Parallelized Reversible Backpropagation

The growing size of datasets and deep learning models has made faster an...

Re^2TAL: Rewiring Pretrained Video Backbones for Reversible Temporal Action Localization

Temporal action localization (TAL) requires long-form reasoning to predi...

Memory-Efficient Backpropagation through Large Linear Layers

In modern neural networks like Transformers, linear layers require signi...

Please sign up or login with your details

Forgot password? Click here to reset