NTX: An Energy-efficient Streaming Accelerator for Floating-point Generalized Reduction Workloads in 22nm FD-SOI

12/01/2018
by Fabian Schuiki, et al.

Specialized coprocessors for Multiply-Accumulate (MAC) intensive workloads such as Deep Learning are becoming widespread in SoC platforms, from GPUs to mobile SoCs. In this paper we revisit NTX, an efficient accelerator developed for training Deep Neural Networks at scale, as a generalized MAC and reduction streaming engine. The architecture consists of a set of 32-bit floating-point streaming co-processors that are loosely coupled to a RISC-V core in charge of orchestrating data movement and computation. Post-layout results of a recent silicon implementation in 22 nm FD-SOI technology show the accelerator's capability to deliver up to 20 Gflop/s at 1.25 GHz and 168 mW. Based on these results we show that a version of NTX scaled down to 14 nm can achieve a 3x energy-efficiency improvement over contemporary GPUs at 10.4x less silicon area, and a compute performance of 1.4 Tflop/s for training large state-of-the-art networks with full floating-point precision. An extended evaluation of MAC-intensive kernels shows that NTX can consistently achieve up to 87% of its peak performance across general reduction workloads beyond machine learning. Its modular architecture enables deployment at different scales, ranging from high-performance GPU-class to low-power embedded scenarios.
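The core operation NTX streams through its floating-point datapath is a generalized multiply-accumulate reduction over operand streams. As a minimal illustrative sketch in plain Python (this is the abstract computational pattern, not the actual NTX command interface or ISA), it looks like:

```python
def mac_reduce(a, b, init=0.0):
    """Generalized MAC reduction: acc = init + sum(a[i] * b[i]).

    Dot products, matrix multiplications, and convolution inner loops
    all reduce to this pattern, which a streaming accelerator can
    execute as one fused multiply-accumulate per element pair while a
    control core orchestrates the data movement.
    """
    acc = init
    for x, y in zip(a, b):
        acc += x * y  # one multiply-accumulate per streamed pair
    return acc

# A dot product is the simplest instance of the pattern:
result = mac_reduce([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(result)  # → 32.0
```

In hardware such a reduction runs at one fused multiply-add per cycle per lane; the loop body above is what gets offloaded, while address generation and loop bounds stay with the orchestrating core.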


Related research

03/02/2020 - A New MRAM-based Process In-Memory Accelerator for Efficient Neural Network Training with Floating Point Precision
The excellent performance of modern deep neural networks (DNNs) comes at...

02/19/2018 - A Scalable Near-Memory Architecture for Training Deep Neural Networks on Large In-Memory Datasets
Most investigations into near-memory hardware accelerators for deep neur...

08/14/2020 - Manticore: A 4096-core RISC-V Chiplet Architecture for Ultra-efficient Floating-point Computing
Data-parallel problems demand ever growing floating-point (FP) operation...

02/17/2022 - MATCHA: A Fast and Energy-Efficient Accelerator for Fully Homomorphic Encryption over the Torus
Fully Homomorphic Encryption over the Torus (TFHE) allows arbitrary comp...

02/13/2020 - Improving Efficiency in Neural Network Accelerator Using Operands Hamming Distance Optimization
Neural network accelerator is a key enabler for the on-device AI inferen...

02/01/2021 - Understanding Cache Boundness of ML Operators on ARM Processors
Machine Learning compilers like TVM allow a fast and flexible deployment...

01/31/2023 - Tricking AI chips into Simulating the Human Brain: A Detailed Performance Analysis
Challenging the Nvidia monopoly, dedicated AI-accelerator chips have beg...
