Vectorization and Minimization of Memory Footprint for Linear High-Order Discontinuous Galerkin Schemes

by Jean-Matthieu Gallard et al.

We present a sequence of optimizations to the performance-critical compute kernels of the high-order discontinuous Galerkin solver of the hyperbolic PDE engine ExaHyPE, successively tackling bottlenecks due to SIMD operations, cache hierarchies and restrictions in the software design. Starting from a generic scalar implementation of the numerical scheme, our first optimized variant applies state-of-the-art optimization techniques: it vectorizes loops, improves the data layout and uses Loop-over-GEMM to perform tensor contractions via highly optimized matrix multiplication functions provided by the LIBXSMM library. We show that memory stalls, caused by a memory footprint exceeding our L2 cache size, limited the gains from vectorization. We therefore introduce a new kernel that applies a sum factorization approach to reduce the kernel's memory footprint and improve its cache locality. With the L2 cache bottleneck removed, we were able to exploit additional vectorization opportunities by introducing a hybrid Array-of-Structure-of-Array data layout that resolves the data layout conflict between the matrix multiplication kernels and the point-wise functions that implement PDE-specific terms. With this last kernel, evaluated in a benchmark simulation at high polynomial order, only 2% of the floating point operations are still performed using scalar instructions and 22.5% of the available performance is achieved.
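To illustrate the Loop-over-GEMM idea mentioned in the abstract, the following is a minimal sketch, not the ExaHyPE kernels themselves. It contracts an operator matrix `D` with the x index of a tensor `Q[z][y][x][v]`, first with plain scalar loops and then as a loop over (z, y) slices in which each slice is one small matrix product. The names `D`, `Q`, `matmul` and the index order are assumptions made for this example, and the pure-Python `matmul` only stands in for an optimized routine such as those LIBXSMM provides.

```python
def matmul(a, b):
    """Dense product of two nested-list matrices (hypothetical stand-in for a GEMM call)."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]


def contract_x_reference(D, Q):
    """Scalar loops: R[z][y][x][v] = sum_k D[x][k] * Q[z][y][k][v]."""
    nz, ny, nx, nv = len(Q), len(Q[0]), len(Q[0][0]), len(Q[0][0][0])
    R = [[[[0] * nv for _ in range(nx)] for _ in range(ny)] for _ in range(nz)]
    for z in range(nz):
        for y in range(ny):
            for x in range(nx):
                for v in range(nv):
                    for k in range(nx):
                        R[z][y][x][v] += D[x][k] * Q[z][y][k][v]
    return R


def contract_x_loop_over_gemm(D, Q):
    """Same contraction: each slice Q[z][y] is an (nx x nv) matrix, so one GEMM per slice."""
    return [[matmul(D, Q[z][y]) for y in range(len(Q[0]))] for z in range(len(Q))]


# Small integer test data (assumed sizes: 2 x 2 x 2 nodes, 3 variables per node).
D = [[1, 2], [3, 4]]
Q = [[[[z + 2 * y + 4 * x + v for v in range(3)] for x in range(2)]
      for y in range(2)] for z in range(2)]

R_ref = contract_x_reference(D, Q)
R_gemm = contract_x_loop_over_gemm(D, Q)
```

Both functions compute the same tensor. The second form moves all arithmetic into the matrix product, which is the part a tuned library routine can vectorize.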




Yet Another Tensor Toolbox for discontinuous Galerkin methods and other applications

The numerical solution of partial differential equations is at the heart...

Monolithic convex limiting in discontinuous Galerkin discretizations of hyperbolic conservation laws

In this work we present a framework for enforcing discrete maximum princ...

Fourier Continuation Discontinuous Galerkin Methods for Linear Hyperbolic Problems

Fourier continuation is an approach used to create periodic extensions o...

Role-Oriented Code Generation in an Engine for Solving Hyperbolic PDE Systems

The development of a high performance PDE solver requires the combined e...

Design of a high-performance GEMM-like Tensor-Tensor Multiplication

We present "GEMM-like Tensor-Tensor multiplication" (GETT), a novel appr...

ECM modeling and performance tuning of SpMV and Lattice QCD on A64FX

The A64FX CPU is arguably the most powerful Arm-based processor design t...

Forest Packing: Fast, Parallel Decision Forests

Machine learning has an emerging critical role in high-performance compu...
