Using Hierarchical Parallelism to Accelerate the Solution of Many Small Partial Differential Equations
This paper presents efforts to improve the hierarchical parallelism of a two scale simulation code. Two methods to improve the GPU parallel performance were developed and compared. The first used the NVIDIA Multi-Process Service and the second moved the entire sub-problem loop into a single kernel using Kokkos hierarchical parallelism and a PackedView data structure. Both approaches improved parallel performance with the second method providing the greatest improvements.
READ FULL TEXT