FullPack: Full Vector Utilization for Sub-Byte Quantized Inference on General Purpose CPUs

11/13/2022
by Hossein Katebi, et al.

Although prior art has demonstrated negligible accuracy drop with sub-byte quantization – where weights and/or activations are represented by fewer than 8 bits – the popular SIMD instructions of CPUs do not natively support these datatypes. Recent methods such as ULPPACK already use sub-byte quantization on general-purpose CPUs with vector units, but they leave several empty bits between the sub-byte values in memory and in vector registers to avoid overflow into neighbouring values during the operations. This wastes memory footprint and bandwidth and yields suboptimal performance. In this paper, we present memory layouts for storing, and mechanisms for processing, sub-byte (4-, 2-, or 1-bit) models that use every bit in memory and in the vector registers for actual data. We provide compute kernels for the proposed layout for GEMV (GEneral Matrix-Vector multiplication) operations between weights and activations of different datatypes (e.g., 8-bit activations and 4-bit weights). For evaluation, we extended the TFLite package with our methods and ran the models on the cycle-accurate gem5 simulator to compare the detailed memory and CPU cycles of each method. We compare against nine other methods that are actively used in production, including GEMLOWP, Ruy, XNNPack, and ULPPACK. Furthermore, we explore the effect of different input and output sizes of deep-learning layers on the performance of our proposed method. Experimental results show a 0.96-2.1x speedup for small sizes and a 1.2-6.7x speedup for mid to large sizes. Applying our proposal to a real-world speech recognition model, Mozilla DeepSpeech, we show that our method achieves a 1.56-2.11x end-to-end speedup over the state of the art, depending on the bit-width employed.
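To make the packing idea concrete, below is a minimal sketch (not FullPack's actual layout or kernels) of dense 4-bit storage with a scalar GEMV against 8-bit activations: two signed 4-bit weights share each byte with no guard bits, and the kernel unpacks nibbles on the fly. The function names (pack_int4, gemv_int4_int8) and the row-major layout are illustrative assumptions; the paper's vectorized unpacking and SIMD kernels are more involved.

```cpp
// Sketch only: dense 4-bit weight packing (no guard bits) and a scalar GEMV
// with 8-bit activations and 32-bit accumulators.
#include <cstdint>
#include <cstdio>
#include <vector>

// Pack signed 4-bit weights (values in [-8, 7]) two per byte.
std::vector<uint8_t> pack_int4(const std::vector<int8_t>& w) {
    std::vector<uint8_t> packed((w.size() + 1) / 2, 0);
    for (size_t i = 0; i < w.size(); ++i) {
        uint8_t nibble = static_cast<uint8_t>(w[i]) & 0x0F;
        packed[i / 2] |= (i % 2 == 0) ? nibble : (nibble << 4);
    }
    return packed;
}

// Sign-extend a 4-bit nibble to int8.
static inline int8_t sext4(uint8_t n) { return static_cast<int8_t>(n << 4) >> 4; }

// y = W * x, where W is rows x cols with densely packed row-major 4-bit weights.
void gemv_int4_int8(const std::vector<uint8_t>& packed, const std::vector<int8_t>& x,
                    std::vector<int32_t>& y, int rows, int cols) {
    for (int r = 0; r < rows; ++r) {
        int32_t acc = 0;
        for (int c = 0; c < cols; ++c) {
            size_t idx = static_cast<size_t>(r) * cols + c;
            uint8_t byte = packed[idx / 2];
            int8_t w = sext4((idx % 2 == 0) ? (byte & 0x0F) : (byte >> 4));
            acc += static_cast<int32_t>(w) * x[c];
        }
        y[r] = acc;
    }
}

int main() {
    const int rows = 2, cols = 4;
    std::vector<int8_t> w = {1, -2, 3, -4, 5, -6, 7, -8};  // row-major 4-bit weights
    std::vector<int8_t> x = {10, 20, 30, 40};              // 8-bit activations
    std::vector<int32_t> y(rows, 0);
    gemv_int4_int8(pack_int4(w), x, y, rows, cols);
    std::printf("%d %d\n", y[0], y[1]);  // expected: -100 -180
    return 0;
}
```

In this toy layout the weight matrix occupies exactly half the bytes of an 8-bit layout, whereas schemes that keep guard bits between sub-byte values give up part of that saving; the vectorized version of such a kernel must also separate the interleaved low and high nibbles before multiply-accumulate, which is where the paper's full-vector-utilization mechanisms come in.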


