Optimizing Streaming Parallelism on Heterogeneous Many-Core Architectures: A Machine Learning Based Approach

by Peng Zhang, et al.

This article presents an automatic approach to quickly derive a good solution for hardware resource partition and task granularity for task-based parallel applications on heterogeneous many-core architectures. Our approach employs a performance model to estimate the resulting performance of the target application under a given resource partition and task granularity configuration. The model is used as a utility to quickly search for a good configuration at runtime. Instead of hand-crafting an analytical model that requires expert insights into low-level hardware details, we employ machine learning techniques to automatically learn it. We achieve this by first learning a predictive model offline using training programs. The learnt model can then be used to predict the performance of any unseen program at runtime. We apply our approach to 39 representative parallel applications and evaluate it on two representative heterogeneous many-core platforms: a CPU-XeonPhi platform and a CPU-GPU platform. Compared to the single-stream version, our approach achieves, on average, a 1.6x and 1.1x speedup on the XeonPhi and the GPU platform, respectively. These results translate to over 93% of the performance delivered by a theoretically perfect predictor.
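The model-guided search described in the abstract can be illustrated with a minimal sketch. It assumes the learnt model is exposed as a function that maps program features and a candidate configuration to a predicted score, and that the candidate space is small enough to enumerate. The names `search_best_config` and `toy_model`, the feature keys, and the candidate values are all hypothetical and are not taken from the article; `toy_model` is only a stand-in for a model trained offline.

```python
from itertools import product
from typing import Callable, Dict, Iterable, Tuple

# A configuration is (number of resource partitions, number of tasks).
Config = Tuple[int, int]


def search_best_config(
    predict: Callable[[Dict[str, float], Config], float],
    features: Dict[str, float],
    partitions: Iterable[int],
    tasks: Iterable[int],
) -> Tuple[Config, float]:
    """Score every candidate configuration with the model and keep the best."""
    best_config, best_score = None, float("-inf")
    for config in product(partitions, tasks):
        score = predict(features, config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


def toy_model(features: Dict[str, float], config: Config) -> float:
    """Hypothetical stand-in for a trained predictor (NOT the article's model).

    The score peaks at the configuration named in `features` and falls off
    quadratically as the candidate moves away from it.
    """
    n_partitions, n_tasks = config
    return (
        -((n_partitions - features["peak_partitions"]) ** 2)
        - (n_tasks - features["peak_tasks"]) ** 2
    )


if __name__ == "__main__":
    program_features = {"peak_partitions": 4, "peak_tasks": 16}
    config, score = search_best_config(
        toy_model, program_features, [1, 2, 4, 8], [4, 8, 16, 32]
    )
    print("chosen configuration:", config, "predicted score:", score)
```

Because each candidate costs only one model evaluation rather than a profiling run of the application, a search of this shape is cheap enough to run at runtime. An exhaustive loop is used here purely for clarity; the article may use a different search strategy.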




Tuning Streamed Applications on Intel Xeon Phi: A Machine Learning Based Approach

Many-core accelerators, as represented by the XeonPhi coprocessors and G...

Efficient executions of Pipelined Conjugate Gradient Method on Heterogeneous Architectures

The Preconditioned Conjugate Gradient (PCG) method is widely used for so...

HEP-BNN: A Framework for Finding Low-Latency Execution Configurations of BNNs on Heterogeneous Multiprocessor Platforms

Binarized Neural Networks (BNNs) significantly reduce the computation an...

Optimizing Sparse Matrix-Vector Multiplication on Emerging Many-Core Architectures

Sparse matrix vector multiplication (SpMV) is one of the most common ope...

Machine Learning-Driven Adaptive OpenMP For Portable Performance on Heterogeneous Systems

Heterogeneity has become a mainstream architecture design choice for bui...

A Graph-based Model for GPU Caching Problems

Modeling data sharing in GPU programs is a challenging task because of t...

Toward Accurate Platform-Aware Performance Modeling for Deep Neural Networks

In this paper, we provide a fine-grain machine learning-based method, Pe...
