Integer Fine-tuning of Transformer-based Models

09/20/2022
by Mohammadreza Tayaranian, et al.

Transformer-based models achieve state-of-the-art performance on a wide range of deep learning tasks. Because these models have large numbers of parameters, fine-tuning them on downstream tasks is computationally intensive and energy-hungry. Automatic mixed-precision FP32/FP16 fine-tuning has previously been used to lower the compute resource requirements. However, recent advances in low-bit integer back-propagation make it possible to reduce the computation and memory footprint even further. In this work, we explore a novel integer training method that uses integer arithmetic for both the forward propagation and the gradient computation of the linear, convolutional, layer-norm, and embedding layers of transformer-based models. Furthermore, we study the effect of various integer bit-widths to find the minimum bit-width required for integer fine-tuning of transformer-based models. We fine-tune BERT and ViT models on popular downstream tasks using integer layers. We show that 16-bit integer models match the floating-point baseline performance. Reducing the bit-width to 10 yields an average score drop of 0.5 points, and further reducing it to 8 yields an average score drop of 1.7 points.
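
The abstract does not give implementation details, but the general idea of running both the forward pass and the gradient computation of a layer in integer arithmetic can be sketched as below. This is only a minimal PyTorch illustration assuming symmetric per-tensor quantization with a configurable bit-width; the names `quantize` and `IntLinear` are hypothetical and the paper's actual quantization scheme may differ.

```python
import torch


def quantize(x, bits):
    # Symmetric per-tensor quantization of a float tensor to signed
    # `bits`-bit integers. Returns the integer tensor and its scale.
    x = x.detach()
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(x / scale), -qmax, qmax).to(torch.int64)
    return q, scale


class IntLinear(torch.autograd.Function):
    # Hypothetical linear layer: the forward matmul and both gradient
    # matmuls are accumulated in integer arithmetic; only the rescaling
    # back to floating point uses FP32.

    @staticmethod
    def forward(ctx, x, weight, bits):
        ctx.save_for_backward(x, weight)
        ctx.bits = bits
        qx, sx = quantize(x, bits)
        qw, sw = quantize(weight, bits)
        return (qx @ qw.t()).float() * (sx * sw)

    @staticmethod
    def backward(ctx, grad_out):
        x, weight = ctx.saved_tensors
        bits = ctx.bits
        qg, sg = quantize(grad_out, bits)
        qx, sx = quantize(x, bits)
        qw, sw = quantize(weight, bits)
        grad_x = (qg @ qw).float() * (sg * sw)      # dL/dx via integer matmul
        grad_w = (qg.t() @ qx).float() * (sg * sx)  # dL/dW via integer matmul
        return grad_x, grad_w, None


# Toy usage: one 8-bit forward/backward pass for a single linear layer.
x = torch.randn(4, 16, requires_grad=True)
w = torch.randn(32, 16, requires_grad=True)
y = IntLinear.apply(x, w, 8)
y.sum().backward()
print(w.grad.shape)  # torch.Size([32, 16])
```

In this sketch, sweeping the `bits` argument from 16 down to 10 or 8 corresponds to the bit-width study described above, where 16-bit integer fine-tuning matches the floating-point baseline and lower bit-widths trade accuracy for a smaller compute and memory footprint.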

Related research

I-BERT: Integer-only BERT Quantization (01/05/2021)
SlimFit: Memory-Efficient Fine-Tuning of Transformer-based Models Using Training Dynamics (05/29/2023)
Exploring Robustness of Prefix Tuning in Noisy Data: A Case Study in Financial Sentiment Analysis (10/26/2022)
TiltedBERT: Resource Adjustable Version of BERT (01/10/2022)
The Expressive Power of Tuning Only the Norm Layers (02/15/2023)
How to Fine-Tune Vision Models with SGD (11/17/2022)
Transformer-based Models for Long-Form Document Matching: Challenges and Empirical Analysis (02/07/2023)
