ALP: Alleviating CPU-Memory Data Movement Overheads in Memory-Centric Systems

by Nika Mansouri-Ghiasi, et al.

Partitioning applications between near-data processing (NDP) cores and host CPU cores incurs inter-segment data movement overhead: data generated by one segment (e.g., instructions, functions) and used by subsequent segments must be moved between the cores. Prior works take two approaches to this problem. The first class maps segments to NDP or host cores based on the properties of each segment, neglecting the inter-segment data movement overhead. The second class partitions applications based on the overall memory bandwidth saving of each segment, and does not offload a segment to its best-fitting core if doing so incurs high inter-segment data movement. We show that 1) ideally mapping each segment to its best-fitting core can provide substantial benefits, and 2) inter-segment data movement reduces this benefit significantly. To this end, we introduce ALP, a new programmer-transparent technique that leverages the performance benefits of NDP by alleviating the inter-segment data movement overhead between host and memory, enabling efficient partitioning of applications. ALP alleviates this overhead by proactively and accurately transferring the required data between segments. This is based on the key observation that the instructions that generate the inter-segment data stay the same across different executions of a program on different inputs. ALP uses a compiler pass to identify these instructions and specialized hardware to transfer data between the host and NDP cores at runtime. ALP efficiently maps application segments to either host or NDP cores considering 1) the properties of each segment, 2) the inter-segment data movement overhead, and 3) whether this overhead can be alleviated in a timely manner. We evaluate ALP across a wide range of workloads and show on average 54.3 respectively.
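The trade-off described above can be illustrated with a toy cost model. The sketch below is not ALP's compiler pass or hardware mechanism; it is a minimal dynamic program over a linear chain of segments, and every name and number in it (`Segment`, `host_time`, `ndp_time`, `transfer_time`, `hidden_fraction`) is a hypothetical chosen for illustration. It assumes a transfer cost is paid only when consecutive segments run on different cores, and that some fraction of that cost can be hidden by moving data proactively.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    host_time: float      # hypothetical execution time on a host core
    ndp_time: float       # hypothetical execution time on an NDP core
    transfer_time: float  # hypothetical cost of moving this segment's output to the other side


def partition(segments, hidden_fraction=0.0):
    """Return (total_time, mapping) with minimal time for a chain of segments.

    hidden_fraction is the assumed share of each inter-segment transfer that is
    overlapped with execution (0.0 = fully exposed, 1.0 = fully hidden).
    """
    cores = ("host", "ndp")
    first = segments[0]
    # best[c] = (time, mapping) for the prefix whose last segment runs on core c
    best = {"host": (first.host_time, ["host"]), "ndp": (first.ndp_time, ["ndp"])}
    for prev, seg in zip(segments, segments[1:]):
        # Only the part of the transfer that is not hidden adds to the runtime.
        exposed = prev.transfer_time * (1.0 - hidden_fraction)
        run = {"host": seg.host_time, "ndp": seg.ndp_time}
        new_best = {}
        for cur in cores:
            candidates = []
            for before in cores:
                time_so_far, mapping = best[before]
                penalty = exposed if before != cur else 0.0
                candidates.append((time_so_far + run[cur] + penalty, mapping + [cur]))
            new_best[cur] = min(candidates, key=lambda c: c[0])
        best = new_best
    return min(best.values(), key=lambda c: c[0])


# Made-up example: the middle segment fits the host best, the others fit NDP.
chain = [Segment(10, 4, 5), Segment(3, 8, 5), Segment(9, 2, 0)]
exposed_result = partition(chain, hidden_fraction=0.0)
hidden_result = partition(chain, hidden_fraction=1.0)
```

With these made-up numbers, fully exposed transfers make it cheapest to keep all three segments on NDP cores (total time 14), whereas fully hidden transfers let each segment run on its best-fitting core (NDP, host, NDP, total time 9). This mirrors the abstract's two observations: best-fit mapping helps in the ideal case, and inter-segment data movement erodes that benefit unless it is alleviated.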




