tf.data: A Machine Learning Data Processing Framework

by Derek G. Murray, et al.

Training machine learning models requires feeding input data for models to ingest. Input pipelines for machine learning jobs are often challenging to implement efficiently, as they require reading large volumes of data, applying complex transformations, and transferring data to hardware accelerators while overlapping computation and communication to achieve optimal performance. We present tf.data, a framework for building and executing efficient input pipelines for machine learning jobs. The tf.data API provides operators that can be parameterized with user-defined computation, composed, and reused across different machine learning domains. These abstractions allow users to focus on the application logic of data processing, while tf.data's runtime ensures that pipelines run efficiently. We demonstrate that input pipeline performance is critical to the end-to-end training time of state-of-the-art machine learning models. tf.data delivers the high performance required while avoiding the need for manual tuning of performance knobs. We show that features such as parallelism, caching, static optimizations, and non-deterministic execution are essential for high performance. Finally, we characterize machine learning input pipelines for millions of jobs that ran in Google's fleet, showing that input data processing is highly diverse and consumes a significant fraction of job resources. Our analysis motivates future research directions, such as sharing computation across jobs and pushing data projection to the storage layer.
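
To make the idea of composable operators parameterized with user-defined computation concrete, here is a minimal sketch in plain Python. It is not the framework's actual API: the `Pipeline` class and its `map`, `filter`, `batch`, and `cache` methods are hypothetical names chosen for illustration, and the sketch omits parallelism, static optimization, and accelerator transfer entirely.

```python
from typing import Callable, Iterable, Iterator, List


class Pipeline:
    """Hypothetical lazy pipeline; every operator returns a new Pipeline."""

    def __init__(self, source: Callable[[], Iterable]):
        # `source` is a zero-argument factory, so the pipeline can be
        # iterated more than once (for example, once per training epoch).
        self._source = source

    @classmethod
    def from_iterable(cls, items: Iterable) -> "Pipeline":
        data = list(items)
        return cls(lambda: data)

    def map(self, fn: Callable) -> "Pipeline":
        # Operator parameterized with a user-defined function.
        src = self._source
        return Pipeline(lambda: (fn(x) for x in src()))

    def filter(self, pred: Callable) -> "Pipeline":
        src = self._source
        return Pipeline(lambda: (x for x in src() if pred(x)))

    def batch(self, size: int) -> "Pipeline":
        src = self._source

        def gen() -> Iterator[List]:
            buf: List = []
            for x in src():
                buf.append(x)
                if len(buf) == size:
                    yield buf
                    buf = []
            if buf:  # emit the final, possibly smaller, batch
                yield buf

        return Pipeline(gen)

    def cache(self) -> "Pipeline":
        # Stores upstream results after the first complete pass, so later
        # passes skip the upstream user-defined computation.
        src = self._source
        store: List = []
        filled = False

        def gen() -> Iterator:
            nonlocal filled
            if filled:
                yield from store
                return
            tmp = []
            for x in src():
                tmp.append(x)
                yield x
            store[:] = tmp
            filled = True

        return Pipeline(gen)

    def __iter__(self) -> Iterator:
        return iter(self._source())


calls = []  # records each invocation of the user-defined function


def square(x: int) -> int:
    calls.append(x)
    return x * x


pipeline = (
    Pipeline.from_iterable(range(6))
    .map(square)
    .filter(lambda x: x % 2 == 0)
    .cache()
    .batch(2)
)

first_epoch = list(pipeline)
second_epoch = list(pipeline)
```

In this sketch, the second pass over `pipeline` is served from the cache, so `square` runs only during the first pass. A production runtime would additionally decide where to parallelize and prefetch, which is the part the abstract attributes to the framework rather than to the user.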


