Language Modeling at Scale

10/23/2018
by Mostofa Patwary, et al.

We show how Zipf's Law can be used to scale up language modeling (LM) to take advantage of more training data and more GPUs. LM plays a key role in many important natural language applications such as speech recognition and machine translation. Scaling up LM is important since it is widely accepted by the community that there is no data like more data. Eventually, we would like to train on terabytes (TBs) of text (trillions of words). Modern training methods are far from this goal because of various bottlenecks, especially memory (within GPUs) and communication (across GPUs). This paper shows how Zipf's Law can address these bottlenecks by grouping parameters for common words and character sequences, exploiting the fact that U ≪ N, where U is the number of unique words (types) and N is the size of the training set (tokens). For a local batch size K, G GPUs, and a D-dimensional embedding matrix, we reduce the per-GPU asymptotic memory and communication complexity from Θ(GKD) to Θ(GK + UD). Empirically, we find U ∝ (GK)^0.64 on four publicly available large datasets. When we scale up the number of GPUs to 64, a factor of 8, training time speeds up by factors of up to 6.7× (for character LMs) and 6.3× (for word LMs) with negligible loss of accuracy. Our weak-scaling experiment on 192 GPUs with the Tieba dataset shows a 35% improvement in LM prediction accuracy when training on 93 GB of data (2.5× larger than the largest publicly available SOTA dataset), with only a 1.25× increase in training time compared to training on 3 GB of the same dataset on 6 GPUs.
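The intuition behind the Θ(GKD) to Θ(GK + UD) reduction can be sketched as follows: because token frequencies follow Zipf's Law, the G·K tokens in a global batch contain far fewer unique types U, so each GPU only needs to hold and exchange one D-dimensional embedding row per unique type plus an integer index per token. The NumPy sketch below illustrates this payload accounting; the vocabulary size, batch shape, and Zipf exponent are illustrative assumptions, not values from the paper.

```python
# Minimal sketch (NumPy only) of the Zipf-aware payload accounting described
# above. All concrete numbers below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

V = 100_000      # vocabulary size (assumed for illustration)
D = 512          # embedding dimension (assumed for illustration)
G, K = 64, 256   # number of GPUs and local batch size in tokens (assumed)

# Draw G*K token IDs from a Zipf-like distribution over the vocabulary,
# so a few frequent types dominate the batch.
ranks = np.arange(1, V + 1)
probs = 1.0 / ranks
probs /= probs.sum()
tokens = rng.choice(V, size=G * K, p=probs)

# Dense scheme: exchange one D-dimensional gradient row per token
# -> Theta(GKD) values per step.
dense_values = tokens.size * D

# Zipf-aware scheme: deduplicate the batch and exchange one row per unique
# type, plus the integer index mapping each token back to its unique row
# -> Theta(GK + UD) values per step.
unique_ids, inverse = np.unique(tokens, return_inverse=True)
U = unique_ids.size
sparse_values = inverse.size + U * D

print(f"tokens GK = {tokens.size}, unique types U = {U}")
print(f"dense payload  ~ {dense_values:,} values  (Theta(GKD))")
print(f"sparse payload ~ {sparse_values:,} values  (Theta(GK + UD))")
print(f"reduction ~ {dense_values / sparse_values:.1f}x")
```

With a Zipf-like token distribution, U comes out well below GK, which is the regime where Θ(GK + UD) is much cheaper than Θ(GKD); the paper's empirical finding U ∝ (GK)^0.64 indicates that U grows sublinearly in GK, so the relative saving grows as more GPUs (or larger batches) are added.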

research · 12/01/2017 · Deep Learning Scaling is Predictable, Empirically
Deep learning (DL) creates impactful advances following a virtuous recip...

research · 08/03/2018 · Large Scale Language Modeling: Converging on 40GB of Text in Four Hours
Recent work has shown how to train Convolutional Neural Networks (CNNs) ...

research · 05/25/2023 · Scaling Data-Constrained Language Models
The current trend of scaling language models involves increasing both pa...

research · 05/17/2022 · Moving Stuff Around: A study on efficiency of moving documents into memory for Neural IR models
When training neural rankers using Large Language Models, it's expected ...

research · 09/19/2019 · A Random Gossip BMUF Process for Neural Language Modeling
LSTM language model is an essential component of industrial ASR systems....

research · 07/24/2018 · An argument in favor of strong scaling for deep neural networks with small datasets
In recent years, with the popularization of deep learning frameworks and...

research · 10/20/2021 · Monitoring Collective Communication Among GPUs
Communication among devices in multi-GPU systems plays an important role...
