Sequence-to-sequence models for workload interference

by David Buchaca Prats, et al.

Co-scheduling jobs in data centers is challenging: jobs can compete for resources, leading to severe slowdowns or failed executions. Efficient job placement in environments where resources are shared requires awareness of how jobs interfere during execution, going well beyond ineffective resource overbooking techniques. Current techniques, most of which already involve machine learning and job modeling, summarize workload behavior across time instead of focusing on the effective job requirements at each instant of the execution. In this work we propose a methodology for modeling the co-scheduling of jobs in data centers, based on their behavior towards resources and execution time, using sequence-to-sequence models based on recurrent neural networks. The goal is to forecast the resource footprint of co-executed jobs along their execution time, from the profiles shown by the individual jobs, in order to improve the placement decisions of resource managers and schedulers. The methods presented here are validated using High Performance Computing benchmarks based on different frameworks (such as Hadoop and Spark) and applications (CPU bound, IO bound, machine learning, SQL queries...). Experiments show that the model can correctly identify resource usage trends for previously seen and even unseen co-scheduled jobs.
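
To make the data flow concrete, the following is a minimal sketch of a recurrent encoder-decoder that reads the per-step resource traces of two jobs run in isolation and emits a forecast trace for their co-execution. It is not the authors' architecture: the plain tanh recurrent cell, the hidden size, the zero-padding of the shorter trace, the sigmoid output and the function name `forecast_coscheduled` are all assumptions made for illustration, and the weights are random and untrained, so the numbers it produces carry no meaning.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Small random weight matrix (untrained, for illustration only)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    """Matrix-vector product on plain Python lists."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def rnn_step(w_in, w_h, x, h):
    """One simple recurrent step: h' = tanh(W_in x + W_h h)."""
    a = matvec(w_in, x)
    b = matvec(w_h, h)
    return [math.tanh(ai + bi) for ai, bi in zip(a, b)]


def forecast_coscheduled(trace_a, trace_b, out_len, hidden=8, seed=0):
    """Encode two solo-run traces, then decode a co-run trace of out_len steps.

    Each trace is a list of time steps; each step is a list of normalized
    resource readings (hypothetical layout: [cpu, io]).
    """
    rng = random.Random(seed)
    n_feat = len(trace_a[0])
    enc_in = make_matrix(hidden, 2 * n_feat, rng)
    enc_h = make_matrix(hidden, hidden, rng)
    dec_in = make_matrix(hidden, n_feat, rng)
    dec_h = make_matrix(hidden, hidden, rng)
    w_out = make_matrix(n_feat, hidden, rng)

    # Encoder: read both traces step by step, padding the shorter one
    # with zeros (assumed to mean "job already finished").
    steps = max(len(trace_a), len(trace_b))
    idle = [0.0] * n_feat
    h = [0.0] * hidden
    for t in range(steps):
        xa = trace_a[t] if t < len(trace_a) else idle
        xb = trace_b[t] if t < len(trace_b) else idle
        h = rnn_step(enc_in, enc_h, xa + xb, h)

    # Decoder: start from the encoder state and feed each prediction
    # back in as the next input.
    y = [0.0] * n_feat
    forecast = []
    for _ in range(out_len):
        h = rnn_step(dec_in, dec_h, y, h)
        # Sigmoid keeps each predicted reading in (0, 1).
        y = [1.0 / (1.0 + math.exp(-z)) for z in matvec(w_out, h)]
        forecast.append(y)
    return forecast


job_a = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.1]]  # CPU-heavy job, 3 steps
job_b = [[0.2, 0.7], [0.3, 0.8]]              # IO-heavy job, 2 steps
pred = forecast_coscheduled(job_a, job_b, out_len=4)
```

The output length is a separate argument because a co-scheduled run need not last as long as either isolated run; an encoder-decoder model allows input and output sequences of different lengths, which is one plausible reason to choose it for this problem.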


