Compiling Halide Programs to Push-Memory Accelerators

by   Qiaoyi Liu, et al.

Image processing and machine learning applications benefit tremendously from hardware acceleration, but existing compilers target either FPGAs, which sacrifice power and performance for flexible hardware, or ASICs, which rapidly become obsolete as applications change. Programmable domain-specific accelerators have emerged as a promising middle-ground between these two extremes, but such architectures have traditionally been difficult compiler targets. The main obstacle is that these accelerators often use a different memory abstraction than CPUs and GPUs: push memories that send a data stream from one computation kernel to other kernels, possibly reordered. To address the compilation challenges caused by push memories, we propose that the representation of memory in the middle and backend of the compiler be altered to combine storage with address generation and control logic in a single structure – a unified buffer. We show that this compiler abstraction can be implemented efficiently on a programmable accelerator, and design a memory mapping algorithm that combines polyhedral analysis and software vectorization techniques to target our accelerator. Our evaluation shows that the compiler supports programmability while maintaining high performance. It can compile a wide range of image processing and machine learning applications to our accelerator with 4.7x better runtime and 4.3x better energy-efficiency as compared to an FPGA.


page 1

page 9


Compiler Infrastructure for Specializing Domain-Specific Memory Templates

Specialized hardware accelerators are becoming important for more and mo...

Domain-Specific Computational Storage for Serverless Computing

While (1) serverless computing is emerging as a popular form of cloud ex...

A Domain-Extensible Compiler with Controllable Automation of Optimisations

In high performance domains like image processing, physics simulation or...

ImaGen: A General Framework for Generating Memory- and Power-Efficient Image Processing Accelerators

Image processing algorithms are prime targets for hardware acceleration ...

Union: A Unified HW-SW Co-Design Ecosystem in MLIR for Evaluating Tensor Operations on Spatial Accelerators

To meet the extreme compute demands for deep learning across commercial ...

Instruction-Level Abstraction (ILA): A Uniform Specification for System-on-Chip (SoC) Verification

Modern Systems-on-Chip (SoC) designs are increasingly heterogeneous and ...

SESAME: Software defined Enclaves to Secure Inference Accelerators with Multi-tenant Execution

Hardware-enclaves that target complex CPU designs compromise both securi...

Please sign up or login with your details

Forgot password? Click here to reset