Scalable and Efficient Virtual Memory Sharing in Heterogeneous SoCs with TLB Prefetching and MMU-Aware DMA Engine

by   Andreas Kurth, et al.

Shared virtual memory (SVM) is key in heterogeneous systems on chip (SoCs), which combine a general-purpose host processor with a many-core accelerator, both for programmability and to avoid data duplication. However, SVM can bring a significant run time overhead when translation lookaside buffer (TLB) entries are missing. Moreover, allowing DMA burst transfers to write SVM traditionally requires buffers to absorb transfers that miss in the TLB. These buffers have to be overprovisioned for the maximum burst size, wasting precious on-chip memory, and stall all SVM accesses once they are full, hampering the scalability of parallel accelerators. In this work, we present our SVM solution that avoids the majority of TLB misses with prefetching, supports parallel burst DMA transfers without additional buffers, and can be scaled with the workload and number of parallel processors. Our solution is based on three novel concepts: To minimize the rate of TLB misses, the TLB is proactively filled by compiler-generated Prefetching Helper Threads, which use run-time information to issue timely prefetches. To reduce the latency of TLB misses, misses are handled by a variable number of parallel Miss Handling Helper Threads. To support parallel burst DMA transfers to SVM without additional buffers, we add lightweight hardware to a standard DMA engine to detect and react to TLB misses. Compared to the state of the art, our work improves accelerator performance for memory-intensive kernels by up to 4x and by up to 60 respectively.


page 1

page 2

page 3

page 4


SPARTA: A Divide and Conquer Approach to Address Translation for Accelerators

Virtual memory (VM) is critical to the usability and programmability of ...

On Linear Learning with Manycore Processors

A new generation of manycore processors is on the rise that offers dozen...

HEROv2: Full-Stack Open-Source Research Platform for Heterogeneous Computing

Heterogeneous computers integrate general-purpose host processors with d...

Sidebar: Scratchpad Based Communication Between CPUs and Accelerators

Hardware accelerators for neural networks have shown great promise for b...

Cohmeleon: Learning-Based Orchestration of Accelerator Coherence in Heterogeneous SoCs

One of the most critical aspects of integrating loosely-coupled accelera...

MacLeR: Machine Learning-based Run-Time Hardware Trojan Detection in Resource-Constrained IoT Edge Devices

Traditional learning-based approaches for run-time Hardware Trojan detec...

Virtual-Threading: Advanced General Purpose Processors Architecture

The paper describes the new computers architecture, the main features of...

Please sign up or login with your details

Forgot password? Click here to reset