Understanding and Co-designing the Data Ingestion Pipeline for Industry-Scale RecSys Training

08/20/2021
by Mark Zhao, et al.

The data ingestion pipeline, responsible for storing and preprocessing training data, is an important component of any machine learning training job. At Facebook, we use recommendation models extensively across our services. The data ingestion requirements to train these models are substantial. In this paper, we present an extensive characterization of the data ingestion challenges for industry-scale recommendation model training. First, dataset storage requirements are massive and variable, exceeding local storage capacities. Second, reading and preprocessing data is computationally expensive, requiring substantially more compute, memory, and network resources than are available on the trainers themselves. These demands result in drastically reduced training throughput, and thus wasted GPU resources, when current on-trainer preprocessing solutions are used. To address these challenges, we present a disaggregated data ingestion pipeline. It includes a central data warehouse built on distributed storage nodes. We introduce Data PreProcessing Service (DPP), a fully disaggregated preprocessing service that scales to hundreds of nodes, eliminating data stalls that can reduce training throughput by 56%. We implement important optimizations across storage and preprocessing, increasing throughput by 1.9x and 2.3x, respectively, addressing the substantial power requirements of data ingestion. We close with lessons learned and cover the important remaining challenges and opportunities surrounding data ingestion at scale.
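To make the notion of a data stall concrete, the following is a minimal back-of-the-envelope sketch, not the paper's method. It assumes a simple model in which training throughput is capped by whichever is slower: the rate at which preprocessing supplies samples or the rate at which trainers consume them. The function names and all rates are hypothetical and chosen only for illustration.

```python
import math


def stall_fraction(supply_rate: float, demand_rate: float) -> float:
    """Fraction of trainer throughput lost to data stalls.

    Assumes (hypothetically) that trainers idle whenever preprocessing
    supplies fewer samples per second than they can consume.
    """
    if demand_rate <= 0:
        raise ValueError("demand_rate must be positive")
    return max(0.0, 1.0 - supply_rate / demand_rate)


def workers_needed(demand_rate: float, per_worker_rate: float) -> int:
    """Number of disaggregated preprocessing workers needed to meet demand.

    Assumes throughput scales linearly with worker count, which is an
    idealization; real services see overheads and skew.
    """
    if per_worker_rate <= 0:
        raise ValueError("per_worker_rate must be positive")
    return math.ceil(demand_rate / per_worker_rate)


if __name__ == "__main__":
    # Hypothetical rates in samples per second.
    demand = 100_000.0      # what the trainers could consume
    on_trainer = 44_000.0   # what on-trainer preprocessing can supply
    print(f"stall fraction: {stall_fraction(on_trainer, demand):.2f}")
    print(f"workers needed: {workers_needed(demand, 3_000.0)}")
```

Under this simplified model, moving preprocessing off the trainers removes the stall once enough workers are provisioned to match trainer demand, which is the intuition behind a disaggregated service.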


