HAWQV3: Dyadic Neural Network Quantization

11/20/2020
by Zhewei Yao, et al.

Quantization is one of the key techniques used to make Neural Networks (NNs) faster and more energy efficient. However, current low-precision quantization algorithms often carry the hidden cost of converting back and forth between floating-point and quantized integer values, which limits the latency improvement realized by quantizing NNs. To address this, we present HAWQV3, a novel dyadic quantization framework. The contributions of HAWQV3 are the following. (i) The entire inference process consists of only integer multiplication, addition, and bit shifting in INT4/INT8 mixed precision, without any floating-point operations or casting, or even integer division. (ii) We pose mixed-precision quantization as an integer linear programming (ILP) problem, in which the bit-precision setting is computed to minimize model perturbation while observing application-specific constraints on memory footprint, latency, and total bit operations (BOPS). (iii) To verify our approach, we develop the first open-source 4-bit mixed-precision quantization in TVM, and we deploy the quantized models directly to T4 GPUs using only the Turing Tensor Cores. For ResNet50, we observe an average speedup of 1.45× for uniform INT4 as compared to uniform INT8. (iv) We extensively test the proposed dyadic quantization approach on multiple NNs, including ResNet18/50 and InceptionV3, for various model-compression levels with and without mixed precision. For instance, we achieve an accuracy of 78.50% with dyadic INT8 quantization of InceptionV3, more than 4% higher than prior integer-only work. Furthermore, we show that mixed INT4/INT8 quantization can achieve higher speedups than INT8 inference with minimal impact on accuracy; for ResNet50, mixed precision reduces INT8 latency by 23% while still achieving 76.73% accuracy.
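Contribution (i) hinges on dyadic arithmetic: every scale factor is constrained to a dyadic number b/2^c (integer b, non-negative integer c), so rescaling a layer's INT32 accumulator back to INT8 needs only one integer multiply and one bit shift. The sketch below is a minimal NumPy illustration of that idea; the function names and the example scale value are illustrative, not taken from the HAWQV3 code base.

```python
import numpy as np

# Minimal sketch of dyadic requantization (illustrative, not the HAWQV3
# implementation). In integer-only inference, the INT32 accumulator of a
# conv/GEMM layer must be rescaled by the real-valued ratio
# S_w * S_x / S_out before casting back to INT8. Constraining that ratio
# to a dyadic number b / 2^c turns the rescaling into one integer
# multiply and one arithmetic right shift -- no floats, no division.

def dyadic_approx(real_scale, num_bits=32):
    """Approximate a real scale factor as b / 2^c with integer b."""
    c = num_bits - 1
    b = int(round(real_scale * (1 << c)))
    return b, c

def requantize(acc_int32, real_scale):
    """Rescale an INT32 accumulator to INT8 using only integer ops.

    Assumes real_scale < 1 (a downscaling ratio), so the int64 product
    cannot overflow.
    """
    b, c = dyadic_approx(real_scale)
    # Integer multiply, add half-ulp for round-to-nearest, then shift:
    rounded = (acc_int32.astype(np.int64) * b + (1 << (c - 1))) >> c
    return np.clip(rounded, -128, 127).astype(np.int8)

# Example with a made-up combined scale ratio of ~1/255:
acc = np.array([12345, -6789, 31000], dtype=np.int32)
print(requantize(acc, 0.00392))  # -> [ 48 -27 122]
```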
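Contribution (ii) can be reproduced in miniature with any off-the-shelf ILP solver. The toy below uses the open-source PuLP package; the layer names, perturbation values, parameter counts, and memory budget are invented for illustration and do not come from the paper. For each layer i and candidate bit-width b, a binary variable selects that precision; the objective is the summed (Hessian-based) perturbation, subject to a total weight-memory budget.

```python
import pulp

layers = ["conv1", "conv2", "fc"]
bits = [4, 8]
# Hypothetical per-layer perturbation of choosing each bit-width
# (in HAWQV3 these come from second-order/Hessian sensitivity):
perturbation = {("conv1", 4): 0.9, ("conv1", 8): 0.1,
                ("conv2", 4): 0.5, ("conv2", 8): 0.05,
                ("fc", 4):    0.3, ("fc", 8):    0.02}
params = {"conv1": 1.0e6, "conv2": 2.0e6, "fc": 0.5e6}  # parameter counts
size_budget_bits = 22e6  # total weight-memory budget, in bits (made up)

prob = pulp.LpProblem("bit_allocation", pulp.LpMinimize)
x = pulp.LpVariable.dicts("x", (layers, bits), cat="Binary")

# Objective: minimize total perturbation of the chosen configuration.
prob += pulp.lpSum(perturbation[(l, b)] * x[l][b]
                   for l in layers for b in bits)

# Each layer gets exactly one bit-width.
for l in layers:
    prob += pulp.lpSum(x[l][b] for b in bits) == 1

# Memory constraint: sum of params * bits must fit the budget.
prob += pulp.lpSum(params[l] * b * x[l][b]
                   for l in layers for b in bits) <= size_budget_bits

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print({l: next(b for b in bits if x[l][b].value() == 1) for l in layers})
# With these toy numbers: conv2 drops to INT4, conv1 and fc stay INT8.
```

The paper adds analogous linear constraints for latency and BOPS; they slot in exactly like the memory constraint above, one inequality per budget.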


Related research

04/18/2023
DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables
A lot of recent progress has been made in ultra low-bit quantization, pr...

10/14/2022
Accelerating RNN-based Speech Enhancement on a Multi-Core MCU with Mixed FP16-INT8 Post-Training Quantization
This paper presents an optimized methodology to design and deploy Speech...

09/19/2022
SAMP: A Toolkit for Model Inference with Self-Adaptive Mixed-Precision
The latest industrial inference engines, such as FasterTransformer and ...

07/07/2023
INT-FP-QSim: Mixed Precision and Formats For Large Language Models and Vision Transformers
The recent rise of large language models (LLMs) has resulted in increase...

01/07/2019
DSConv: Efficient Convolution Operator
We introduce a variation of the convolutional layer called DSConv (Distr...

05/30/2019
Memory-Driven Mixed Low Precision Quantization For Enabling Deep Network Inference On Microcontrollers
This paper presents a novel end-to-end methodology for enabling the depl...

01/20/2022
Neural Network Quantization with AI Model Efficiency Toolkit (AIMET)
While neural networks have advanced the frontiers in many machine learni...
