Studies on the energy and deep memory behaviour of a cache-oblivious, task-based hyperbolic PDE solver

by Dominic E. Charrier, et al.

We study the performance behaviour of a seismic simulation using the ExaHyPE engine with a specific focus on memory characteristics and energy needs. ExaHyPE combines dynamically adaptive mesh refinement (AMR) with ADER-DG. It is parallelized using tasks, and it is cache efficient. AMR plus ADER-DG yields a task graph which is highly dynamic in nature and comprises both arithmetically expensive tasks and tasks which challenge the memory's latency. The expensive tasks and thus the whole code benefit from AVX vectorization, though we suffer from memory access bursts. A frequency reduction of the chip improves the code's energy-to-solution. Yet, it does not mitigate burst effects. The bursts' latency penalty becomes worse once we add Intel Optane technology, increase the core count significantly, or make individual, computationally heavy tasks fall out of close caches. Thread overbooking to hide away these latency penalties is counterproductive with non-inclusive caches, as it destroys the cache and vectorization character. In cases where memory-intense and computationally expensive tasks overlap, ExaHyPE's cache-oblivious implementation can exploit deep, non-inclusive, heterogeneous memory effectively, as main memory misses arise infrequently and slow down only a few cores. We thus propose that upcoming supercomputing simulation codes with dynamic, inhomogeneous task graphs are actively supported by thread runtimes in intermixing tasks of different compute character, and we propose that future hardware actively allows codes to downclock the cores running particular task types.




