Collectives in hybrid MPI+MPI code: design, practice and performance

by Huan Zhou et al.

Hybrid schemes that combine a message-passing programming model for inter-node parallelism with a shared-memory programming model for node-level parallelism are in widespread use. The extensive body of existing practice with hybrid Message Passing Interface (MPI) plus Open Multi-Processing (OpenMP) programming accounts for the popularity of this approach. Nevertheless, substantial programming effort is required to gain performance benefits from MPI+OpenMP code. An emerging hybrid method that combines MPI with the MPI shared memory model (MPI+MPI) is promising. However, writing an efficient hybrid MPI+MPI program, especially when collective communication operations are involved, cannot be taken for granted. In this paper, we propose a new design method for implementing collective communication operations in the hybrid MPI+MPI context. Our method avoids the on-node memory replication (and the associated on-node communication overhead) that the semantics of pure MPI require. We also offer wrapper primitives that hide all the design details from users, together with guidance on how to structure hybrid MPI+MPI code with these primitives. Micro-benchmarks show that our collectives perform comparably to or better than their pure MPI counterparts. We have further validated the effectiveness of the hybrid MPI+MPI model (using our wrapper primitives) in three computational kernels, by comparison with the pure MPI and hybrid MPI+OpenMP models.
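
To make the replication argument concrete, here is a minimal back-of-the-envelope sketch in Python. It is not the paper's implementation. It only counts result-buffer bytes on one node, assuming that in pure MPI every rank holds a private copy of a broadcast-like result, while in the hybrid MPI+MPI scheme all on-node ranks read a single copy placed in an MPI-3 shared-memory window (for instance one created with `MPI_Win_allocate_shared`). The function name, the rank count, and the message size are hypothetical values chosen for illustration.

```python
def result_buffer_bytes_per_node(ranks_per_node: int,
                                 message_bytes: int,
                                 shared: bool) -> int:
    """Bytes of result buffer held on one node after a broadcast-like collective.

    shared=False models pure MPI: each on-node rank owns a private copy.
    shared=True models hybrid MPI+MPI: one copy lives in a shared window
    and every on-node rank reads it directly.
    """
    copies = 1 if shared else ranks_per_node
    return copies * message_bytes


# Hypothetical node configuration, for illustration only.
RANKS_PER_NODE = 24
MESSAGE_BYTES = 8 * 1024 * 1024  # 8 MiB result

pure_mpi = result_buffer_bytes_per_node(RANKS_PER_NODE, MESSAGE_BYTES, shared=False)
hybrid = result_buffer_bytes_per_node(RANKS_PER_NODE, MESSAGE_BYTES, shared=True)

print(f"pure MPI : {pure_mpi} bytes per node")
print(f"MPI+MPI  : {hybrid} bytes per node")
print(f"reduction: {pure_mpi // hybrid}x")
```

Under these assumptions the saving scales with the number of ranks per node. The sketch covers memory footprint only and makes no claim about the synchronization cost that on-node ranks pay before reading the shared copy.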


