Task allocation for decentralized training in heterogeneous environment

11/16/2021
by Yongyue Chao, et al.

The demand for large-scale deep learning keeps growing, and distributed training is the current mainstream solution. Ring AllReduce is widely used as a decentralized data-parallel algorithm. However, in a heterogeneous environment it assigns every worker the same amount of data, so faster workers spend considerable time waiting for slower ones; the algorithm therefore adapts poorly to heterogeneous clusters, and resources are under-utilized. In this paper, we design and implement a static allocation algorithm: the dataset is partitioned among workers by manually specified proportions, and each worker draws samples from its share for training, which speeds up training in a heterogeneous environment. We verify the convergence of the network model and the effect on training speed under this algorithm in both single-machine multi-GPU and multi-machine multi-GPU settings. Building on this feasibility result, we propose a self-adaptive allocation algorithm that lets each machine find the data share suited to its current environment. The self-adaptive allocation algorithm reduces training time by roughly one-third to one-half compared with equal-proportion allocation. To further demonstrate the algorithm's applicability in heterogeneous clusters, we replace a poorly performing worker with a well-performing one, or add a poorly performing worker to the cluster. Experimental results show that training time decreases as overall cluster performance improves, indicating that resources are fully utilized. Moreover, the algorithm is not only suited to straggler problems but also to most heterogeneous situations, and it can be used as a plug-in for AllReduce and its variant algorithms.
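The abstract does not include code; as a rough illustration of the static, proportional allocation idea, the sketch below assumes each worker's throughput (samples per second) has already been measured and splits the dataset indices in proportion to those throughputs, so that faster workers receive more samples per epoch. The helper name `proportional_shards` and the throughput values are hypothetical, not taken from the paper.

```python
# Minimal sketch (not the authors' implementation): static proportional
# allocation of a dataset across heterogeneous workers. Shard sizes are
# chosen in proportion to each worker's measured throughput, so faster
# workers process more samples and all workers finish at roughly the same time.

def proportional_shards(num_samples, throughputs):
    """Return one index range per worker, sized in proportion to the
    given per-worker throughputs (e.g. samples/second)."""
    total = sum(throughputs)
    # Provisional shard sizes, rounded down.
    sizes = [int(num_samples * t / total) for t in throughputs]
    # Hand the rounding remainder to the fastest workers first.
    remainder = num_samples - sum(sizes)
    fastest_first = sorted(range(len(throughputs)), key=lambda i: -throughputs[i])
    for i in fastest_first[:remainder]:
        sizes[i] += 1
    # Convert sizes to contiguous index ranges.
    shards, start = [], 0
    for size in sizes:
        shards.append(range(start, start + size))
        start += size
    return shards

if __name__ == "__main__":
    # Hypothetical heterogeneous cluster: two fast GPUs and one slow one
    # (assumed throughput values in samples/second).
    measured_throughputs = [1000.0, 950.0, 400.0]
    for rank, shard in enumerate(proportional_shards(50_000, measured_throughputs)):
        print(f"worker {rank}: {len(shard)} samples ({shard.start}..{shard.stop - 1})")
```

A self-adaptive variant would periodically re-measure throughput during training and recompute the shards, rather than fixing the proportions once before the run.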
