Pipelined Backpropagation at Scale: Training Large Models without Batches

03/25/2020
by Atli Kosson, et al.

Parallelism is crucial for accelerating the training of deep neural networks. Pipeline parallelism can provide an efficient alternative to traditional data parallelism by allowing workers to specialize. Performing mini-batch SGD using pipeline parallelism has the overhead of filling and draining the pipeline. Pipelined Backpropagation updates the model parameters without draining the pipeline. This removes the overhead but introduces stale gradients and inconsistency between the weights used on the forward and backward passes, reducing final accuracy and the stability of training. We introduce Spike Compensation and Linear Weight Prediction to mitigate these effects. Analysis on a convex quadratic shows that both methods effectively counteract staleness. We train multiple convolutional networks at a batch size of one, completely replacing batch parallelism with fine-grained pipeline parallelism. With our methods, Pipelined Backpropagation achieves full accuracy on CIFAR-10 and ImageNet without hyperparameter tuning.

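The abstract does not spell out the update rules, but the weight-prediction idea can be illustrated with a minimal sketch. Assuming SGD with momentum (v <- momentum * v + g, w <- w - lr * v) and that no new gradient arrives during the `delay` stale steps of a pipeline stage, the momentum buffer simply decays geometrically, and the stage can extrapolate its weights along that direction. The function name and the exact extrapolation below are illustrative assumptions, not the paper's verbatim formulation.

```python
import numpy as np


def predict_weights(weights, velocity, lr, momentum, delay):
    """Extrapolate weights `delay` steps ahead along the momentum direction.

    Sketch only: assumes plain SGD with momentum and no new gradient
    contributions during the delayed steps, so the velocity decays
    geometrically by `momentum` each step.
    """
    # Geometric sum of the momentum decay over the `delay` missing updates.
    decay_sum = sum(momentum ** i for i in range(1, delay + 1))
    # Apply the accumulated (predicted) updates to the current weights.
    return weights - lr * decay_sum * velocity


# Example usage with hypothetical values for one pipeline stage.
w = np.zeros(10)
v = np.ones(10)
w_pred = predict_weights(w, v, lr=0.1, momentum=0.9, delay=4)
```

A deeper pipeline stage sees a larger `delay`, so under this sketch it extrapolates further, which is consistent with the abstract's goal of compensating for stale gradients without draining the pipeline.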