ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers

06/04/2022
by Zhewei Yao, et al.

How to efficiently serve ever-larger trained natural language models in practice has become exceptionally challenging even for powerful cloud servers, due to their prohibitive memory/computation requirements. In this work, we present an efficient and affordable post-training quantization approach to compress large Transformer-based models, termed ZeroQuant. ZeroQuant is an end-to-end quantization and inference pipeline with three main components: (1) a fine-grained, hardware-friendly quantization scheme for both weights and activations; (2) a novel, affordable layer-by-layer knowledge distillation algorithm (LKD) that works even without access to the original training data; (3) highly optimized quantization system backend support that removes the quantization/dequantization overhead. As a result, we show that: (1) ZeroQuant can reduce the precision of weights and activations to INT8 in a cost-free way for both BERT and GPT-3-style models with minimal accuracy impact, yielding up to 5.19x/4.16x speedup on those models compared to FP16 inference; (2) ZeroQuant plus LKD affordably quantizes the weights in the fully connected modules to INT4, together with INT8 weights in the attention modules and INT8 activations, resulting in a 3x memory footprint reduction compared to the FP16 model; (3) ZeroQuant can be applied directly to two of the largest open-source language models, GPT-J6B and GPT-NeoX20B, for which our INT8 model achieves accuracy similar to the FP16 model while delivering up to 5.2x better efficiency.
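The fine-grained scheme in component (1) can be pictured with a short sketch: per-group scales for the weight matrix and per-token scales for the activations, both with symmetric INT8 rounding. The PyTorch code below is an illustrative assumption of how such a scheme might look; the function names, the group size of 128, and the FP32 dequantized matmul used for the error check are ours, not ZeroQuant's optimized kernels, which would avoid the explicit dequantization step.

# Minimal sketch (assumed, illustrative): fine-grained symmetric INT8
# quantization with per-group weight scales and per-token activation scales.
import torch

def quantize_weight_groupwise(weight: torch.Tensor, num_bits: int = 8, group_size: int = 128):
    # One scale per group of `group_size` consecutive weights in each output row,
    # so a single outlier only degrades its own group.
    qmax = 2 ** (num_bits - 1) - 1                      # 127 for INT8, 7 for INT4
    out_f, in_f = weight.shape
    assert in_f % group_size == 0
    w = weight.reshape(out_f, in_f // group_size, group_size)
    scale = (w.abs().amax(dim=-1, keepdim=True) / qmax).clamp(min=1e-8)
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale                                     # q: (out_f, groups, group_size)

def dequantize_weight_groupwise(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    out_f, groups, group_size = q.shape
    return (q.float() * scale).reshape(out_f, groups * group_size)

def quantize_activation_tokenwise(x: torch.Tensor, num_bits: int = 8):
    # One scale per token (row), computed dynamically at runtime, so a token with
    # a large activation range does not force a coarse scale onto the whole batch.
    qmax = 2 ** (num_bits - 1) - 1
    scale = (x.abs().amax(dim=-1, keepdim=True) / qmax).clamp(min=1e-8)
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale

# Usage: check the error of a simulated INT8 x INT8 matmul against FP32.
w = torch.randn(1024, 1024)
x = torch.randn(8, 1024)                                # 8 tokens, hidden size 1024
qw, w_scale = quantize_weight_groupwise(w)
qx, x_scale = quantize_activation_tokenwise(x)
y_fp = x @ w.t()
y_int8 = (qx.float() * x_scale) @ dequantize_weight_groupwise(qw, w_scale).t()
print((y_fp - y_int8).abs().max())                      # small residual quantization error

The point of the per-group and per-token scales is that the rounding error stays local to each group and each token, which is what lets a post-training INT8 setting like the one above remain essentially cost-free in accuracy without any retraining.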


research · 11/18/2022
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
Large language models (LLMs) show excellent performance but are compute-...

research · 01/27/2023
Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases
Improving the deployment efficiency of transformer-based language models...

research · 09/30/2021
Towards Efficient Post-training Quantization of Pre-trained Language Models
Network quantization has gained increasing attention with the rapid grow...

research · 08/16/2023
FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs
Large Language Models (LLMs) have achieved state-of-the-art performance ...

research · 07/19/2018
Statistical Model Compression for Small-Footprint Natural Language Understanding
In this paper we investigate statistical model compression applied to na...

research · 01/02/2023
Massive Language Models Can Be Accurately Pruned in One-Shot
We show for the first time that large-scale generative pretrained transf...

research · 09/14/2022
Analysis of Quantization on MLP-based Vision Models
Quantization is wildly taken as a model compression technique, which obt...
