Challenges and Opportunities of Machine Learning for Monitoring and Operational Data Analytics in Quantitative Codesign of Supercomputers

09/15/2022
by Thomas Jakobsche, et al.

This work examines the challenges and opportunities of Machine Learning (ML) for Monitoring and Operational Data Analytics (MODA) in the context of Quantitative Codesign of Supercomputers (QCS). MODA is employed to gain insights into the behavior of current High Performance Computing (HPC) systems in order to improve system efficiency, performance, and reliability (e.g., by optimizing cooling infrastructure and job scheduling and by tuning application parameters). We take the position that QCS in general, and MODA in particular, require close exchange with the ML community to realize the full potential of data-driven analysis for the benefit of existing and future HPC systems. This exchange will help identify the appropriate ML methods for gaining insights into current HPC systems and for going beyond expert-based knowledge and rules of thumb.


