A Compilation Flow for the Generation of CNN Inference Accelerators on FPGAs

by Seung-Hun Chung, et al.

We present a compilation flow for generating CNN inference accelerators on FPGAs. The flow translates a frozen model into OpenCL kernels with the TVM compiler and uses the Intel OpenCL SDK to compile the kernels to an FPGA bitstream. We improve the quality of the generated hardware by applying optimizations to the base OpenCL kernels generated by TVM. These optimizations increase parallelism, reduce memory access latency, increase concurrency and save on-chip resources. We automate these optimizations in TVM and evaluate them by generating accelerators for LeNet-5, MobileNetV1 and ResNet-34 on an Intel Stratix 10SX. The optimizations improve the performance of the generated accelerators by up to 846X over the base accelerators. The optimized accelerators perform up to 4.57X better than TensorFlow on CPU and 3.83X better than single-threaded TVM, but reach only 0.34X the performance of TVM with 56 threads. Our optimized kernels also outperform those generated by a similar approach (one that also uses high-level synthesis) while providing more functionality and flexibility. However, our approach underperforms one that uses hand-optimized designs. Thus, we view our approach as useful in pre-production environments that benefit from increased performance and fast prototyping, realizing the benefits of FPGAs without hardware design expertise.
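The relative-performance figures above mix values greater and less than 1, which can be easy to misread. The following is a minimal sketch of how such ratios can be interpreted, assuming each figure is the baseline's execution time divided by the accelerator's execution time (the abstract does not state the exact metric, so this definition is an assumption). The function names and the `reported` dictionary are hypothetical and used only for illustration.

```python
def relative_performance(baseline_time: float, accelerator_time: float) -> float:
    """Assumed definition: a ratio above 1 means the accelerator is faster."""
    return baseline_time / accelerator_time


def baseline_advantage(ratio: float) -> float:
    """Invert a ratio below 1 to express how much faster the baseline is."""
    return 1.0 / ratio


# Ratios quoted in the abstract (optimized accelerator relative to each baseline).
reported = {
    "TensorFlow on CPU": 4.57,
    "single-threaded TVM": 3.83,
    "TVM with 56 threads": 0.34,
}

for baseline, ratio in reported.items():
    if ratio >= 1.0:
        print(f"{baseline}: accelerator is {ratio:.2f}X faster")
    else:
        print(f"{baseline}: baseline is {baseline_advantage(ratio):.2f}X faster")
```

Under this assumed definition, a ratio of 0.34X corresponds to the 56-thread TVM baseline running roughly 2.9X faster than the optimized accelerator.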

