signSGD with Majority Vote is Communication Efficient And Byzantine Fault Tolerant

10/11/2018
by Jeremy Bernstein, et al.

Training neural networks on large datasets can be accelerated by distributing the workload over a network of machines. As datasets grow ever larger, networks of hundreds or thousands of machines become economically viable. The time cost of communicating gradients limits the effectiveness of using such large machine counts, as may the increased chance of network faults. We explore a particularly simple algorithm for robust, communication-efficient learning---signSGD. Workers transmit only the sign of their gradient vector to a server, and the overall update is decided by a majority vote. This algorithm uses 32× less communication per iteration than full-precision, distributed SGD. Under natural conditions verified by experiment, we prove that signSGD converges in the large-batch and mini-batch settings, establishing convergence for a parameter regime of Adam as a byproduct. We model adversaries as workers who may compute a stochastic gradient estimate and manipulate it, but may not coordinate with other adversaries. Aggregating sign gradients by majority vote means that no individual worker has too much power. We prove that, unlike SGD, majority vote is robust when up to 50% of workers behave adversarially. On the practical side, we built our distributed training system in Pytorch. Benchmarking against the state-of-the-art collective communications library (NCCL), our framework---with the parameter server housed entirely on one machine---led to a 25% reduction in training time when using 15 AWS p3.2xlarge machines.
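
The aggregation rule itself is simple enough to sketch in a few lines. The snippet below is a minimal, single-machine simulation of the sign-and-majority-vote update, assuming Pytorch tensors; the function name majority_vote_step, the learning rate, and the toy worker gradients are illustrative assumptions, not the authors' distributed parameter-server implementation.

import torch

def majority_vote_step(params, worker_grads, lr=1e-3):
    # Each worker transmits only the sign of its stochastic gradient
    # (1 bit per coordinate instead of a 32-bit float, hence the 32x saving).
    sign_votes = [torch.sign(g) for g in worker_grads]
    # The server aggregates by majority vote: it sums the sign vectors and
    # takes the sign of the result, so no single worker can dominate.
    vote = torch.sign(torch.stack(sign_votes).sum(dim=0))
    # Every coordinate moves by exactly +/- lr (ties contribute zero).
    return params - lr * vote

# Toy usage: 5 simulated workers voting on a 4-dimensional parameter vector.
params = torch.zeros(4)
worker_grads = [torch.randn(4) for _ in range(5)]
params = majority_vote_step(params, worker_grads)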

research · 02/13/2018
signSGD: compressed optimisation for non-convex problems
Training large neural networks requires distributing learning across mul...

research · 06/17/2020
Communication-Efficient Robust Federated Learning Over Heterogeneous Datasets
This work investigates fault-resilient federated learning when the data ...

research · 02/15/2023
Sparse-SignSGD with Majority Vote for Communication-Efficient Distributed Learning
The training efficiency of complex deep learning models can be significa...

research · 08/22/2018
Don't Use Large Mini-Batches, Use Local SGD
Mini-batch stochastic gradient methods are the current state of the art ...

research · 12/30/2020
Crossover-SGD: A gossip-based communication in distributed deep learning for alleviating large mini-batch problem and enhancing scalability
Distributed deep learning is an effective way to reduce the training tim...

research · 06/03/2020
Local SGD With a Communication Overhead Depending Only on the Number of Workers
We consider speeding up stochastic gradient descent (SGD) by parallelizi...

research · 11/15/2020
Echo-CGC: A Communication-Efficient Byzantine-tolerant Distributed Machine Learning Algorithm in Single-Hop Radio Network
In this paper, we focus on a popular DML framework – the parameter serve...
