A Taxonomy of Error Sources in HPC I/O Machine Learning Models

04/18/2022
by Mihailo Isakov, et al.

I/O efficiency is crucial to productivity in scientific computing, but the increasing complexity of systems and applications makes it difficult for practitioners to understand and optimize I/O behavior at scale. Data-driven, machine learning-based I/O throughput models offer a solution: they can be used to identify bottlenecks, automate I/O tuning, or optimize job scheduling with minimal human intervention. Unfortunately, current state-of-the-art I/O models are not robust enough for production use and underperform once deployed. We analyze multiple years of application, scheduler, and storage system logs on two leadership-class HPC platforms to understand why I/O models underperform in practice. We propose a taxonomy consisting of five categories of I/O modeling errors: poor application modeling, poor system modeling, inadequate dataset coverage, I/O contention, and I/O noise. We develop litmus tests to quantify each category, allowing researchers to narrow down failure modes, enhance I/O throughput models, and improve future generations of HPC logging and analysis tools.


