Vertical Layering of Quantized Neural Networks for Heterogeneous Inference

by   Hai Wu, et al.

Although considerable progress has been obtained in neural network quantization for efficient inference, existing methods are not scalable to heterogeneous devices as one dedicated model needs to be trained, transmitted, and stored for one specific hardware setting, incurring considerable costs in model training and maintenance. In this paper, we study a new vertical-layered representation of neural network weights for encapsulating all quantized models into a single one. With this representation, we can theoretically achieve any precision network for on-demand service while only needing to train and maintain one model. To this end, we propose a simple once quantization-aware training (QAT) scheme for obtaining high-performance vertical-layered models. Our design incorporates a cascade downsampling mechanism which allows us to obtain multiple quantized networks from one full precision source model by progressively mapping the higher precision weights to their adjacent lower precision counterparts. Then, with networks of different bit-widths from one source model, multi-objective optimization is employed to train the shared source model weights such that they can be updated simultaneously, considering the performance of all networks. By doing this, the shared weights will be optimized to balance the performance of different quantized models, thus making the weights transferable among different bit widths. Experiments show that the proposed vertical-layered representation and developed once QAT scheme are effective in embodying multiple quantized networks into a single one and allow one-time training, and it delivers comparable performance as that of quantized models tailored to any specific bit-width. Code will be available.


Quantized Neural Networks for Low-Precision Accumulation with Guaranteed Overflow Avoidance

We introduce a quantization-aware training algorithm that guarantees avo...

Switchable Precision Neural Networks

Instantaneous and on demand accuracy-efficiency trade-off has been recen...

One Model for All Quantization: A Quantized Network Supporting Hot-Swap Bit-Width Adjustment

As an effective technique to achieve the implementation of deep neural n...

Mixed-Precision Quantized Neural Network with Progressively Decreasing Bitwidth For Image Classification and Object Detection

Efficient model inference is an important and practical issue in the dep...

Combinatorial optimization for low bit-width neural networks

Low-bit width neural networks have been extensively explored for deploym...

Bit-Mixer: Mixed-precision networks with runtime bit-width selection

Mixed-precision networks allow for a variable bit-width quantization for...

Batch Normalization in Quantized Networks

Implementation of quantized neural networks on computing hardware leads ...

Please sign up or login with your details

Forgot password? Click here to reset