SpotTune: Leveraging Transient Resources for Cost-efficient Hyper-parameter Tuning in the Public Cloud

12/07/2020
by Yan Li, et al.

Hyper-parameter tuning (HPT) is crucial for many machine learning (ML) algorithms, but because of the large search space, HPT is usually time-consuming and resource-intensive. Many researchers now train machine learning models on public cloud resources, which is convenient yet expensive. Speeding up the HPT process while also reducing its cost is therefore very important for cloud ML users. In this paper, we propose SpotTune, an approach that exploits transient revocable resources in the public cloud with tailored strategies to perform HPT in a parallel and cost-efficient manner. Orchestrating the HPT process on transient servers, SpotTune uses two main techniques, fine-grained cost-aware resource provisioning and ML training trend prediction, to reduce the monetary cost and runtime of HPT processes. Our evaluations show that SpotTune can reduce the cost by up to 90%.
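The two techniques named in the abstract can be illustrated with a small Python sketch. This is not SpotTune's actual algorithm, which the abstract does not specify. The `SpotOffer` fields, the revocation model (an hour that ends in revocation is treated as lost work), and the linear loss extrapolation are all assumptions made only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SpotOffer:
    """A hypothetical transient-server offer; all fields are illustrative."""
    name: str
    price_per_hour: float    # assumed current spot price in USD
    steps_per_hour: float    # assumed measured training throughput
    revocation_prob: float   # assumed chance of revocation within the hour (< 1)


def expected_cost_per_step(offer: SpotOffer) -> float:
    # Simplifying assumption: work done in an hour that ends in a
    # revocation is lost, so only the surviving fraction is useful.
    useful_steps = offer.steps_per_hour * (1.0 - offer.revocation_prob)
    return offer.price_per_hour / useful_steps


def pick_offer(offers: List[SpotOffer]) -> SpotOffer:
    """Cost-aware provisioning: choose the cheapest offer per useful step."""
    return min(offers, key=expected_cost_per_step)


def predict_final_loss(losses: List[float], horizon: int) -> float:
    """Crude trend prediction: extend the last observed slope linearly.

    Needs at least two observations; the result is floored at zero.
    """
    slope = losses[-1] - losses[-2]
    return max(0.0, losses[-1] + slope * horizon)


offers = [
    SpotOffer("a", price_per_hour=0.30, steps_per_hour=1000, revocation_prob=0.10),
    SpotOffer("b", price_per_hour=0.20, steps_per_hour=500, revocation_prob=0.05),
]
best = pick_offer(offers)
projected = predict_final_loss([1.0, 0.8, 0.7], horizon=2)
```

In a tuner built along these lines, a trial whose projected loss is clearly worse than the best trial so far could be stopped early, and the remaining trials placed on whichever offer is currently cheapest per useful step.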

research
05/31/2020

Cloud-scale VM Deflation for Running Interactive Applications On Transient Servers

Transient computing has become popular in public cloud environments for ...
research
01/28/2021

Machine learning for cloud resources management – An overview

Nowadays, an important topic that is considered a lot is how to integrat...
research
11/09/2020

TrimTuner: Efficient Optimization of Machine Learning Jobs in the Cloud via Sub-Sampling

This work introduces TrimTuner, the first system for optimizing machine ...
research
08/27/2021

Machine Learning for Performance Prediction of Spark Cloud Applications

Big data applications and analytics are employed in many sectors for a v...
research
06/12/2022

MLLess: Achieving Cost Efficiency in Serverless Machine Learning Training

Function-as-a-Service (FaaS) has raised a growing interest in how to "ta...
research
06/01/2022

Good Intentions: Adaptive Parameter Servers via Intent Signaling

Parameter servers (PSs) ease the implementation of distributed training ...
research
03/16/2018

Snap Machine Learning

We describe an efficient, scalable machine learning library that enables...
