Stable and low-precision training for large-scale vision-language models

04/25/2023
by   Mitchell Wortsman, et al.
0

We introduce new methods for 1) accelerating and 2) stabilizing training for large language-vision models. 1) Towards accelerating training, we introduce SwitchBack, a linear layer for int8 quantized training which provides a speed-up of 13-25 0.1 percentage points for the 1B parameter CLIP ViT-Huge – the largest int8 training to date. Our main focus is int8 as GPU support for float8 is rare, though we also analyze float8 training through simulation. While SwitchBack proves effective for float8, we show that standard techniques are also successful if the network is trained and initialized so that large feature magnitudes are discouraged, which we accomplish via layer-scale initialized with zeros. 2) Towards stable training, we analyze loss spikes and find they consistently occur 1-8 iterations after the squared gradients become under-estimated by their AdamW second moment estimator. As a result, we recommend an AdamW-Adafactor hybrid, which we refer to as StableAdamW because it avoids loss spikes when training a CLIP ViT-Huge model and outperforms gradient clipping.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
02/08/2022

Exploring the Limits of Domain-Adaptive Training for Detoxifying Large-Scale Language Models

Pre-trained language models (LMs) are shown to easily generate toxic lan...
research
04/19/2023

A Theory on Adam Instability in Large-Scale Machine Learning

We present a theory for the previously unexplained divergent behavior no...
research
03/24/2023

Accelerating Vision-Language Pretraining with Free Language Modeling

The state of the arts in vision-language pretraining (VLP) achieves exem...
research
05/27/2023

The Curse of Recursion: Training on Generated Data Makes Models Forget

Stable Diffusion revolutionised image creation from descriptive text. GP...
research
06/16/2023

ZeRO++: Extremely Efficient Collective Communication for Giant Model Training

Zero Redundancy Optimizer (ZeRO) has been used to train a wide range of ...
research
06/04/2023

Fine-Tuning Language Models with Advantage-Induced Policy Alignment

Reinforcement learning from human feedback (RLHF) has emerged as a relia...
research
06/25/2020

MTAdam: Automatic Balancing of Multiple Training Loss Terms

When training neural models, it is common to combine multiple loss terms...

Please sign up or login with your details

Forgot password? Click here to reset