Hermes: Accelerating Long-Latency Load Requests via Perceptron-Based Off-Chip Load Prediction

by Rahul Bera, et al.

Long-latency load requests continue to limit the performance of high-performance processors. To increase the latency tolerance of a processor, architects have primarily relied on two key techniques: sophisticated data prefetchers and large on-chip caches. In this work, we show that: 1) even a sophisticated state-of-the-art prefetcher can only predict half of the off-chip load requests on average across a wide range of workloads, and 2) due to the increasing size and complexity of on-chip caches, a large fraction of the latency of an off-chip load request is spent accessing the on-chip cache hierarchy. The goal of this work is to accelerate off-chip load requests by removing the on-chip cache access latency from their critical path. To this end, we propose a new technique called Hermes, whose key idea is to: 1) accurately predict which load requests might go off-chip, and 2) speculatively fetch the data required by the predicted off-chip loads directly from the main memory, while also concurrently accessing the cache hierarchy for such loads. To enable Hermes, we develop a new lightweight, perceptron-based off-chip load prediction technique that learns to identify off-chip load requests using multiple program features (e.g., sequence of program counters). For every load request, the predictor observes a set of program features to predict whether or not the load would go off-chip. If the load is predicted to go off-chip, Hermes issues a speculative request directly to the memory controller once the load's physical address is generated. If the prediction is correct, the load eventually misses the cache hierarchy and waits for the ongoing speculative request to finish, thus hiding the on-chip cache hierarchy access latency from the critical path of the off-chip load. Our evaluation shows that Hermes significantly improves performance of a state-of-the-art baseline. We open-source Hermes.
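The predict-then-train loop described above can be illustrated with a minimal hashed-perceptron sketch. This is not the authors' implementation. The three features (PC, PC XOR cacheline offset, a short PC sequence), the XOR-fold index function, the table size, the weight range, and both thresholds are assumptions chosen for illustration; the abstract only states that multiple program features such as a sequence of program counters are used.

```python
from collections import deque
from dataclasses import dataclass


def fold(value: int, bits: int) -> int:
    """XOR-fold a non-negative integer down to `bits` bits (hypothetical index hash)."""
    mask = (1 << bits) - 1
    index = 0
    while value:
        index ^= value & mask
        value >>= bits
    return index


@dataclass
class Prediction:
    off_chip: bool   # True if the load is predicted to go off-chip
    total: int       # sum of the selected weights, used as confidence
    indices: tuple   # one table index per feature, reused when training


class OffChipPredictor:
    """Illustrative perceptron-style off-chip load predictor (all parameters assumed)."""

    def __init__(self, index_bits=10, act_threshold=2, train_threshold=8,
                 history_len=4, w_min=-16, w_max=15):
        self.index_bits = index_bits
        self.act_threshold = act_threshold      # predict off-chip at or above this sum
        self.train_threshold = train_threshold  # keep training while confidence is below this
        self.w_min, self.w_max = w_min, w_max   # saturating weight range
        self.tables = [[0] * (1 << index_bits) for _ in range(3)]  # one table per feature
        self.prev_pcs = deque(maxlen=history_len - 1)

    def _features(self, pc, addr):
        # Assumes 64-byte cachelines and 4 KiB pages.
        line_offset = (addr >> 6) & 0x3F
        pc_seq = 0
        for p in (*self.prev_pcs, pc):          # recent load PCs plus the current one
            pc_seq = ((pc_seq << 1) ^ p) & 0xFFFFFFFF
        return (pc, pc ^ line_offset, pc_seq)

    def predict(self, pc, addr):
        indices = tuple(fold(f, self.index_bits) for f in self._features(pc, addr))
        total = sum(table[i] for table, i in zip(self.tables, indices))
        return Prediction(total >= self.act_threshold, total, indices)

    def train(self, pc, pred, went_off_chip):
        # Update on a misprediction or while the confidence is still low.
        if pred.off_chip != went_off_chip or abs(pred.total) < self.train_threshold:
            step = 1 if went_off_chip else -1
            for table, i in zip(self.tables, pred.indices):
                table[i] = max(self.w_min, min(self.w_max, table[i] + step))
        self.prev_pcs.append(pc)
```

In this sketch, a caller would invoke `predict` once the load's physical address is available, issue the speculative memory request if `off_chip` is true, and call `train` with the actual outcome once the load has resolved in the cache hierarchy.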

