DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing

12/07/2022
by   Conglong Li, et al.

Recent advances in deep learning models come at the price of formidable training cost. Increasing model size is one of the root causes, but a less-emphasized fact is that data scale is growing at a similar pace to model scale, and training cost is proportional to both. Compared to the rapidly evolving model architecture, how to use training data efficiently (especially for expensive foundation model pretraining) is both less explored and difficult to realize, owing to the lack of a convenient framework that focuses on data efficiency capabilities. To this end, we present the DeepSpeed Data Efficiency library, a framework that makes better use of data, increases training efficiency, and improves model quality. Specifically, it provides efficient data sampling via curriculum learning, and efficient data routing via random layerwise token dropping. DeepSpeed Data Efficiency takes extensibility, flexibility, and composability into consideration, so that users can easily use the framework to compose multiple techniques and apply customized strategies. By applying our solution to GPT-3 1.3B and BERT-Large language model pretraining, we can achieve similar model quality with up to 2x less data and 2x less time, or achieve better model quality with a similar amount of data and time.
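To make the two techniques named in the abstract more concrete, here is a minimal NumPy sketch of the general ideas. It is not the DeepSpeed API, and every name in it is hypothetical. It assumes that curriculum difficulty is measured by sequence length with a linear pacing schedule (the abstract does not specify a metric or schedule), and that token dropping means each layer processes only a random subset of tokens while the remaining tokens bypass that layer unchanged.

```python
import numpy as np


def curriculum_seqlen(step, total_steps, min_len=64, max_len=2048):
    """Hypothetical linear pacing: sequence length grows from min_len to max_len."""
    frac = min(step / total_steps, 1.0)
    return int(min_len + frac * (max_len - min_len))


def random_token_drop(hidden, keep, rng):
    """Choose `keep` random token positions; return the kept rows and their indices."""
    seq_len = hidden.shape[0]
    idx = np.sort(rng.choice(seq_len, size=keep, replace=False))
    return hidden[idx], idx


def layer_with_token_drop(hidden, layer_fn, keep, rng):
    """Apply layer_fn to a random subset of tokens; the rest skip the layer."""
    kept, idx = random_token_drop(hidden, keep, rng)
    out = hidden.copy()          # skipped tokens pass through unchanged
    out[idx] = layer_fn(kept)    # processed tokens are written back in place
    return out


rng = np.random.default_rng(0)
hidden = np.ones((8, 4))         # 8 tokens, hidden size 4 (toy values)
out = layer_with_token_drop(hidden, lambda x: x * 2.0, keep=3, rng=rng)
num_processed = int((out[:, 0] == 2.0).sum())  # tokens the toy layer touched
```

In this toy setup the layer touches only 3 of the 8 tokens, which is where the compute saving would come from. Calling `layer_with_token_drop` once per layer with a fresh random subset gives the layerwise behavior, and `curriculum_seqlen(step, total_steps)` could be used to truncate each batch early in training.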


