How to Fine-Tune Vision Models with SGD

11/17/2022
by Ananya Kumar, et al.

SGD (with momentum) and AdamW are the two most commonly used optimizers for fine-tuning large neural networks in computer vision. When the two methods perform the same, SGD is preferable because it uses less memory (12 bytes/parameter) than AdamW (16 bytes/parameter). However, on a suite of downstream tasks, especially those with distribution shifts, we show that fine-tuning with AdamW performs substantially better than SGD on modern Vision Transformer and ConvNeXt models. We find that large gaps in performance between SGD and AdamW occur when the fine-tuning gradients in the first "embedding" layer are much larger than in the rest of the model. Our analysis suggests an easy fix that works consistently across datasets and models: merely freezing the embedding layer (less than 1% of the parameters) leads to SGD performing competitively with AdamW while using less memory. Our insights result in state-of-the-art accuracies on five popular distribution shift benchmarks: WILDS-FMoW, WILDS-Camelyon, Living-17, Waterbirds, and DomainNet.
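To make the recipe concrete, here is a minimal PyTorch sketch of "freeze the embedding layer, then fine-tune with SGD + momentum". It assumes a timm Vision Transformer whose patch-embedding module is exposed as `model.patch_embed`; the model name, class count, learning rate, and random batch are illustrative placeholders, not the paper's exact setup.

```python
import timm
import torch
import torch.nn.functional as F

# Load a pretrained ViT from timm (any modern ViT/ConvNeXt checkpoint works similarly).
# num_classes=10 is a placeholder for the downstream task's label count.
model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=10)

# Freeze the first "embedding" layer (the patch embedding, <1% of the parameters),
# which is where the unusually large fine-tuning gradients are observed.
for p in model.patch_embed.parameters():
    p.requires_grad = False

# Fine-tune everything else with SGD + momentum: 12 bytes/parameter
# (4 for the weight, 4 for the gradient, 4 for the momentum buffer),
# versus 16 bytes/parameter for AdamW.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=1e-3, momentum=0.9)

# One illustrative update on a random batch (stand-in for a real dataloader).
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 10, (8,))
loss = F.cross_entropy(model(images), labels)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

Note that only the patch embedding is frozen, so nearly all parameters are still updated; the point is that this small change lets plain SGD match AdamW on these fine-tuning tasks.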
