PipeFisher: Efficient Training of Large Language Models Using Pipelining and Fisher Information Matrices

11/25/2022
by Kazuki Osawa, et al.

Pipeline parallelism enables efficient training of Large Language Models (LLMs) on large-scale distributed accelerator clusters. Yet, pipeline bubbles during startup and tear-down reduce the utilization of accelerators. Although efficient pipeline schemes with micro-batching and bidirectional pipelines have been proposed to maximize utilization, a significant number of bubbles cannot be filled using synchronous forward and backward passes. To address this problem, we suggest that extra work be assigned to the bubbles to gain auxiliary benefits in LLM training. As an example in this direction, we propose PipeFisher, which assigns the work of K-FAC, a second-order optimization method based on the Fisher information matrix, to the bubbles to accelerate convergence. In Phase 1 pretraining of BERT-Base and -Large models, PipeFisher reduces the (simulated) training time to 50-75% compared to training with a first-order optimizer, by greatly improving accelerator utilization and benefiting from the improved convergence of K-FAC.
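For intuition on the extra work being scheduled into the bubbles, here is a minimal NumPy sketch of the K-FAC preconditioning step for a single linear layer. The shapes, damping value, and variable names are illustrative assumptions for this sketch, not PipeFisher's actual implementation; constructing and inverting the Kronecker factors is the kind of curvature computation the method overlaps with pipeline bubbles.

```python
import numpy as np

# Illustrative shapes for a single linear layer with weight W of shape
# (d_out, d_in); all names and values here are assumptions for the sketch.
rng = np.random.default_rng(0)
d_in, d_out, batch = 8, 4, 32

a = rng.standard_normal((batch, d_in))   # layer inputs (activations)
g = rng.standard_normal((batch, d_out))  # backpropagated output gradients
grad_W = g.T @ a / batch                 # ordinary first-order gradient

# K-FAC approximates the layer's Fisher block as a Kronecker product
#   F ~= A (x) G,  with A = E[a a^T] and G = E[g g^T],
# so applying F^{-1} reduces to two small solves: G^{-1} (grad_W) A^{-1}.
damping = 1e-3  # Tikhonov damping, an assumed hyperparameter
A = a.T @ a / batch + damping * np.eye(d_in)
G = g.T @ g / batch + damping * np.eye(d_out)

precond_grad = np.linalg.solve(G, grad_W) @ np.linalg.inv(A)
print(precond_grad.shape)  # (4, 8), same shape as grad_W
```

Because A and G are per-layer and much smaller than the full Fisher matrix, their construction and inversion decompose into independent chunks, which is what makes them schedulable into otherwise idle bubble slots.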


