Building a Performance Model for Deep Learning Recommendation Model Training on GPUs

01/19/2022
by Zhongyi Lin, et al.

We devise a performance model for GPU training of Deep Learning Recommendation Models (DLRM), whose GPU utilization is low compared to other well-optimized CV and NLP models. We show that both the device active time (the sum of kernel runtimes) and the device idle time are important components of the overall device time. We therefore tackle them separately: (1) we flexibly adopt heuristic-based and ML-based kernel performance models for the operators that dominate the device active time, and (2) we categorize operator overheads into five types to quantify their contribution to the device idle time. Combining these two parts, we propose a critical-path-based algorithm that predicts the per-batch training time of DLRM by traversing its execution graph. We achieve less than 10% prediction error in all kernel performance modeling, and a 4.61% geomean prediction error for device active time and overall E2E per-batch training time when overheads are extracted from individual workloads. A slight increase of 2.19% in E2E prediction error when overheads are instead shared across workloads suggests the feasibility of using shared overheads for large-scale prediction. We show that our general performance model not only achieves low prediction error on DLRM, which has highly customized configurations and whose runtime is dominated by multiple factors, but also yields comparable accuracy on the compute-bound ML models targeted by most previous methods. Using this performance model together with graph-level data and task-dependency analysis, we show that our system can support more general model-system co-design than previous methods.
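To make the critical-path idea concrete, below is a minimal sketch of how a per-batch time prediction could be composed by traversing an operator execution graph, charging each operator its predicted kernel time plus a per-operator overhead. This is an illustrative simplification, not the authors' implementation: the `Op` structure, `predict_batch_time`, and the toy DLRM-like graph are all hypothetical, and the sketch ignores overlap between host-side overheads and device execution.

```python
# Sketch: critical-path prediction of per-batch training time over an
# operator execution graph. Illustrative only; per-op times are made up.
from dataclasses import dataclass, field
from graphlib import TopologicalSorter

@dataclass
class Op:
    name: str
    kernel_time_us: float  # predicted device active time for this op
    overhead_us: float     # predicted host-side overhead (idle-time contribution)
    deps: list = field(default_factory=list)  # names of upstream ops

def predict_batch_time(ops: dict) -> float:
    """Length of the longest (critical) path through the op graph,
    where each op costs its predicted kernel time plus its overhead."""
    finish = {}
    order = TopologicalSorter({n: op.deps for n, op in ops.items()}).static_order()
    for name in order:
        op = ops[name]
        start = max((finish[d] for d in op.deps), default=0.0)
        finish[name] = start + op.overhead_us + op.kernel_time_us
    return max(finish.values())

# Toy example: embedding lookup and bottom MLP proceed independently,
# then a feature-interaction op joins them before the top MLP.
ops = {
    "emb":      Op("emb",      kernel_time_us=120.0, overhead_us=15.0),
    "bot_mlp":  Op("bot_mlp",  kernel_time_us=80.0,  overhead_us=10.0),
    "interact": Op("interact", kernel_time_us=30.0,  overhead_us=8.0,
                   deps=["emb", "bot_mlp"]),
    "top_mlp":  Op("top_mlp",  kernel_time_us=90.0,  overhead_us=10.0,
                   deps=["interact"]),
}
print(f"predicted per-batch time: {predict_batch_time(ops):.1f} us")
```

In the paper's actual system, the per-op kernel times come from fitted heuristic or ML-based performance models and the five overhead types are measured from profiled traces; the sketch above only shows how such per-op estimates would compose along the critical path of the execution graph.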
