HET: Scaling out Huge Embedding Model Training via Cache-enabled Distributed Framework

12/14/2021
by Xupeng Miao, et al.

Embedding models have been an effective learning paradigm for high-dimensional data. However, one open issue of embedding models is that their representations (latent factors) often result in a large parameter space. We observe that existing distributed training frameworks face a scalability issue with embedding models, since updating and retrieving the shared embedding parameters from servers usually dominates the training cycle. In this paper, we propose HET, a new system framework that significantly improves the scalability of huge embedding model training. We embrace the skewed popularity distribution of embeddings as a performance opportunity and leverage it to address the communication bottleneck with an embedding cache. To ensure consistency across the caches, we incorporate a new consistency model into the HET design, which provides fine-grained consistency guarantees on a per-embedding basis. Compared to previous work that only allows staleness for read operations, HET also utilizes staleness for write operations. Evaluations on six representative tasks show that HET achieves up to 88% embedding communication reduction and up to 20.68x performance speedup over the state-of-the-art baselines.
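The caching idea is easiest to see in miniature. Below is a minimal sketch, not HET's actual API: the ParameterServer stub, the pull/push/clock methods, CachedEmbedding, STALENESS_BOUND, and the learning rate are all illustrative assumptions. It shows a worker-side cache that tolerates bounded staleness on both reads (serving a cached vector until it falls too many versions behind the server) and writes (applying updates locally and flushing them to the server lazily).

```python
# Sketch of a per-embedding cache with bounded staleness for both reads
# and writes, in the spirit of HET. All names here are assumptions for
# illustration, not HET's actual interfaces.

import numpy as np

STALENESS_BOUND = 4  # max version gap tolerated per embedding (assumed)


class ParameterServer:
    """In-memory stand-in for the shared embedding storage on servers."""

    def __init__(self, num_embeddings, dim):
        self.table = np.random.randn(num_embeddings, dim).astype(np.float32)
        self.clocks = np.zeros(num_embeddings, dtype=np.int64)

    def pull(self, key):
        # Return a copy of the row together with its current version.
        return self.table[key].copy(), int(self.clocks[key])

    def push(self, key, update):
        # Apply a buffered delta and bump the row's version.
        self.table[key] += update
        self.clocks[key] += 1
        return int(self.clocks[key])

    def clock(self, key):
        return int(self.clocks[key])


class CachedEmbedding:
    """Worker-side cache entry for one (typically hot) embedding row."""

    def __init__(self, server, key):
        self.server = server
        self.key = key
        self.vector, self.version = server.pull(key)
        self.pending = np.zeros_like(self.vector)  # buffered local writes
        self.local_steps = 0

    def flush(self):
        # Push buffered updates to the server and sync our version.
        if self.local_steps:
            self.version = self.server.push(self.key, self.pending)
            self.pending[:] = 0
            self.local_steps = 0

    def read(self):
        # Stale read: re-pull only when the cached copy falls more than
        # STALENESS_BOUND versions behind the server for this embedding.
        if self.server.clock(self.key) - self.version > STALENESS_BOUND:
            self.flush()
            self.vector, self.version = self.server.pull(self.key)
        return self.vector

    def write(self, grad, lr=0.01):
        # Stale write: apply the SGD update locally and buffer the delta;
        # flush only every STALENESS_BOUND steps, so hot embeddings skip
        # the server round-trip on most iterations.
        update = -lr * grad
        self.vector += update
        self.pending += update
        self.local_steps += 1
        if self.local_steps >= STALENESS_BOUND:
            self.flush()


if __name__ == "__main__":
    server = ParameterServer(num_embeddings=1000, dim=8)
    emb = CachedEmbedding(server, key=42)
    for _ in range(10):
        vec = emb.read()
        emb.write(grad=np.ones_like(vec))
    print(server.table[42][:4], emb.vector[:4])
```

The per-embedding versions are what makes the guarantee fine-grained: a rarely touched embedding can sit in the cache indefinitely, while a hot, frequently updated one is refreshed and flushed often enough to stay within the bound, which is where allowing staleness on writes, not just reads, saves communication.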



Related research

02/24/2022 · BagPipe: Accelerating Deep Recommendation Model Training
Deep learning based recommendation models (DLRM) are widely used in seve...

10/18/2021 · EmbRace: Accelerating Sparse Communication for Distributed Training of NLP Neural Networks
Distributed data-parallel training has been widely used for natural lang...

04/17/2021 · ScaleFreeCTR: MixCache-based Distributed Training System for CTR Models with Huge Embedding Table
Because of the superior feature representation ability of deep learning,...

10/17/2022 · Merlin HugeCTR: GPU-accelerated Recommender System Training and Inference
In this talk, we introduce Merlin HugeCTR. Merlin HugeCTR is an open sou...

05/25/2023 · Automated Tensor Model Parallelism with Overlapped Communication for Efficient Foundation Model Training
Deep learning is experiencing a rise in foundation models that are expec...

07/21/2020 · A Framework for Consistency Algorithms
We present a framework that provides deterministic consistency algorithm...
