An Experimental Evaluation of Large Scale GBDT Systems

by Fangcheng Fu, et al.
Beijing University of Posts and Telecommunications
Peking University

Gradient boosting decision tree (GBDT) is a widely used machine learning algorithm in both data analytics competitions and real-world industrial applications. Driven by the rapid increase in data volume, efforts have been made to train GBDT in a distributed setting to support large-scale workloads. Surprisingly, although existing systems manage the training dataset in different ways, none of them has studied the impact of data management. To that end, this paper studies the pros and cons of different data management methods with respect to the performance of distributed GBDT. We first introduce a quadrant categorization of data management policies based on data partitioning and data storage. We then conduct an in-depth systematic analysis and summarize the scenarios in which each quadrant is advantageous. Based on this analysis, we propose a novel distributed GBDT system named Vero, which adopts the previously unexplored combination of vertical partitioning and row-store and suits many large-scale cases. To validate our analysis empirically, we implement the different quadrants in the same code base and compare them under extensive workloads, and finally compare Vero with other state-of-the-art systems over a wide range of datasets. Our theoretical and experimental results provide a guideline for choosing a proper data management policy for a given workload.
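To make the quadrant categorization concrete, the following is a minimal sketch of how two partitioning choices and two storage choices combine. It is an illustration only, not code from Vero or from any system the paper evaluates. The abstract names only vertical partitioning and row-store explicitly; horizontal partitioning and column-store are assumed here as their usual counterparts, and every function name and the round-robin assignment are hypothetical.

```python
from typing import List

Matrix = List[List[float]]  # outer list = rows (instances), inner = feature values


def partition_horizontal(data: Matrix, num_workers: int) -> List[Matrix]:
    """Assumed counterpart: each worker holds a subset of rows with all features."""
    return [data[w::num_workers] for w in range(num_workers)]


def partition_vertical(data: Matrix, num_workers: int) -> List[Matrix]:
    """Each worker holds every row, but only a subset of the features."""
    return [[row[w::num_workers] for row in data] for w in range(num_workers)]


def to_column_store(shard: Matrix) -> Matrix:
    """Assumed counterpart of row-store: transpose so each inner list is one feature."""
    return [list(col) for col in zip(*shard)]


# Toy dataset: 3 instances, 4 features (hypothetical values).
data: Matrix = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
]

# The four quadrants, each for 2 workers.
horizontal_row = partition_horizontal(data, 2)
horizontal_col = [to_column_store(s) for s in horizontal_row]
vertical_row = partition_vertical(data, 2)  # the combination the abstract attributes to Vero
vertical_col = [to_column_store(s) for s in vertical_row]
```

In this sketch, a worker under vertical partitioning with row-store sees every instance but only its own features, and keeps the values of each instance together. The round-robin assignment is chosen purely for brevity; the abstract does not say how a real system assigns features or rows to workers.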




