Empirical Evaluation of Post-Training Quantization Methods for Language Tasks

by   Ting Hu, et al.

Transformer-based architectures like BERT have achieved great success in a wide range of Natural Language tasks. Despite their decent performance, the models still have numerous parameters and high computational complexity, impeding their deployment in resource-constrained environments. Post-Training Quantization (PTQ), which enables low-bit computations without extra training, could be a promising tool. In this work, we conduct an empirical evaluation of three PTQ methods on BERT-Base and BERT-Large: Linear Quantization (LQ), Analytical Clipping for Integer Quantization (ACIQ), and Outlier Channel Splitting (OCS). OCS theoretically surpasses the others in minimizing the Mean Square quantization Error and avoiding distorting the weights' outliers. That is consistent with the evaluation results of most language tasks of GLUE benchmark and a reading comprehension task, SQuAD. Moreover, low-bit quantized BERT models could outperform the corresponding 32-bit baselines on several small language tasks, which we attribute to the alleviation of over-parameterization. We further explore the limit of quantization bit and show that OCS could quantize BERT-Base and BERT-Large to 3-bits and retain 98 and 96 conduct quantization on the whole BERT family, i.e., BERT models in different configurations, and comprehensively evaluate their performance on the GLUE benchmark and SQuAD, hoping to provide valuable guidelines for their deployment in various computation environments.


page 1

page 2

page 3

page 4


Understanding and Overcoming the Challenges of Efficient Transformer Quantization

Transformer-based architectures have become the de-facto standard models...

KDLSQ-BERT: A Quantized Bert Combining Knowledge Distillation with Learned Step Size Quantization

Recently, transformer-based language models such as BERT have shown trem...

Self-Distilled Quantization: Achieving High Compression Rates in Transformer-Based Language Models

We investigate the effects of post-training quantization and quantizatio...

Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling

Quantization of transformer language models faces significant challenges...

Outlier Suppression: Pushing the Limit of Low-bit Transformer Language Models

Transformer architecture has become the fundamental element of the wides...

RPTQ: Reorder-based Post-training Quantization for Large Language Models

Large-scale language models (LLMs) have demonstrated outstanding perform...

Improving Neural Network Quantization without Retraining using Outlier Channel Splitting

Quantization can improve the execution latency and energy efficiency of ...

Please sign up or login with your details

Forgot password? Click here to reset