SOL: Safe On-Node Learning in Cloud Platforms

01/25/2022
by Yawen Wang, et al.

Cloud platforms run many software agents on each server node. These agents manage all aspects of node operation, and in some cases frequently collect data and make decisions. Unfortunately, their behavior is typically based on pre-defined static heuristics or offline analysis; they do not leverage on-node machine learning (ML). In this paper, we first characterize the spectrum of node agents in Azure, and identify the classes of agents that are most likely to benefit from on-node ML. We then propose SOL, an extensible framework for designing ML-based agents that are safe and robust to the range of failure conditions that occur in production. SOL provides a simple API to agent developers and manages the scheduling and running of the agent-specific functions they write. We illustrate the use of SOL by implementing three ML-based agents that manage CPU cores, node power, and memory placement. Our experiments show that (1) ML substantially improves our agents, and (2) SOL ensures that agents operate safely under a variety of failure conditions. We conclude that ML-based agents show significant potential and that SOL can help build them.
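
The idea of an ML-based agent that stays safe under failure conditions can be illustrated with a minimal sketch. It is not SOL's API: the class `SafeAgent`, its fields (`collect`, `predict`, `is_valid`, `fallback`, `actuate`), and the example core-count numbers are all hypothetical names and values chosen for illustration. The sketch only assumes that an agent collects data, asks a model for a decision, and reverts to a static heuristic when the model fails or returns an implausible value.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SafeAgent:
    """Hypothetical agent wrapper; names are illustrative, not SOL's API."""
    collect: Callable[[], float]                 # gathers one telemetry sample
    predict: Callable[[float], Optional[float]]  # ML decision, or None if unavailable
    is_valid: Callable[[float], bool]            # sanity check on the ML decision
    fallback: Callable[[float], float]           # static heuristic used when ML is unsafe
    actuate: Callable[[float], None]             # applies the chosen decision

    def step(self) -> str:
        """Run one collect/decide/act cycle and report which path was taken."""
        sample = self.collect()
        try:
            decision = self.predict(sample)
        except Exception:
            # A crashing model is treated the same as a missing prediction.
            decision = None
        if decision is None or not self.is_valid(decision):
            self.actuate(self.fallback(sample))
            return "fallback"
        self.actuate(decision)
        return "ml"


# Assumed example: a core-allocation agent on a node with 16 cores.
applied = []
agent = SafeAgent(
    collect=lambda: 0.9,               # pretend CPU utilization sample
    predict=lambda util: 64.0,         # broken model asks for 64 cores
    is_valid=lambda cores: 1 <= cores <= 16,
    fallback=lambda util: 8.0,         # static heuristic: keep 8 cores
    actuate=applied.append,
)
path = agent.step()
```

Here the model's request for 64 cores fails the range check, so the agent applies the static heuristic instead. A production framework would need to handle more failure conditions than this sketch covers, such as stale data or delayed scheduling.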

