Data Driven Resource Allocation for Distributed Learning

12/15/2015
by Travis Dick, et al.

In distributed machine learning, data is dispatched to multiple machines for processing. Motivated by the fact that similar data points often belong to the same or similar classes and, more generally, that high-accuracy classification rules tend to be "locally simple but globally complex" (Vapnik & Bottou 1993), we propose data-dependent dispatching that takes advantage of such structure. We present an in-depth analysis of this model, providing new algorithms with provable worst-case guarantees, an analysis showing that existing scalable heuristics perform well under natural non-worst-case conditions, and techniques for extending a dispatching rule from a small sample to the entire distribution. We overcome novel technical challenges to satisfy important conditions for accurate distributed learning, including fault tolerance and balancedness. We empirically compare our approach with baselines based on random partitioning, balanced partition trees, and locality-sensitive hashing, showing that it achieves significantly higher accuracy on both synthetic and real-world image and advertising datasets. We also demonstrate that our technique scales strongly with the available computing power.
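To make the idea of data-dependent dispatching concrete, the following is a minimal illustrative sketch, not the paper's algorithm. It assumes each machine is represented by a fixed center, sends every point to its nearest centers, replicates each point on several machines (a stand-in for fault tolerance), and caps the load per machine (a stand-in for balancedness). The function name `dispatch`, its parameters, and the default capacity rule are all hypothetical choices made for this example.

```python
import math


def dispatch(points, centers, replication=2, capacity=None):
    """Greedy, illustrative dispatcher (hypothetical, not the paper's method).

    Each point goes to its `replication` nearest centers that still have
    room, so similar points tend to land on the same machines while no
    machine receives more than `capacity` points.
    """
    k = len(centers)
    if capacity is None:
        # Assumed default: twice the perfectly even share per machine.
        capacity = math.ceil(2 * replication * len(points) / k)

    loads = [0] * k          # number of points currently on each machine
    assignment = []          # assignment[i] = machines holding points[i]

    for p in points:
        # Machines ordered from nearest center to farthest.
        order = sorted(range(k), key=lambda j: math.dist(p, centers[j]))
        # Keep the nearest machines that are not yet full.
        chosen = [j for j in order if loads[j] < capacity][:replication]
        if len(chosen) < replication:
            raise ValueError("not enough machines with spare capacity")
        for j in chosen:
            loads[j] += 1
        assignment.append(chosen)

    return assignment, loads
```

A random-partitioning baseline would instead pick the machines for each point uniformly at random, ignoring the distances altogether.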

research
12/09/2020

Data-driven Competitive Algorithms for Online Knapsack and Set Cover

The design of online algorithms has tended to focus on algorithms with w...
research
05/26/2021

A data-driven approach to beating SAA out-of-sample

While solutions of Distributionally Robust Optimization (DRO) problems c...
research
08/11/2021

Learning to Hash Robustly, with Guarantees

The indexing algorithms for the high-dimensional nearest neighbor search...
research
09/24/2022

The Online Knapsack Problem with Departures

The online knapsack problem is a classic online resource allocation prob...
research
02/13/2017

Is a Data-Driven Approach still Better than Random Choice with Naive Bayes classifiers?

We study the performance of data-driven, a priori and random approaches ...
research
01/27/2018

Variance-Optimal Offline and Streaming Stratified Random Sampling

Stratified random sampling (SRS) is a fundamental sampling technique tha...
research
07/16/2014

In Defense of MinHash Over SimHash

MinHash and SimHash are the two widely adopted Locality Sensitive Hashin...