Understanding and Overcoming the Challenges of Efficient Transformer Quantization

09/27/2021
by Yelysei Bondarenko, et al.

Transformer-based architectures have become the de facto standard models for a wide range of Natural Language Processing tasks. However, their memory footprint and high latency are prohibitive for efficient deployment and inference on resource-limited devices. In this work, we explore quantization for transformers. We show that transformers have unique quantization challenges – namely, high dynamic activation ranges that are difficult to represent with a low-bit fixed-point format. We establish that these activations contain structured outliers in the residual connections that encourage specific attention patterns, such as attending to the special separator token. To combat these challenges, we present three solutions based on post-training quantization and quantization-aware training, each with a different set of compromises for accuracy, model size, and ease of use. In particular, we introduce a novel quantization scheme – per-embedding-group quantization. We demonstrate the effectiveness of our methods on the GLUE benchmark using BERT, establishing state-of-the-art results for post-training quantization. Finally, we show that transformer weights and embeddings can be quantized to ultra-low bit-widths, leading to significant memory savings with minimal accuracy loss. Our source code is available at <https://github.com/qualcomm-ai-research/transformer-quantization>.
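
To make the range problem and the per-embedding-group idea concrete, below is a minimal NumPy sketch. It is an illustration under our own assumptions, not the implementation from the linked repository: it applies asymmetric uniform quantization with a separate (scale, zero-point) pair for each group of embedding dimensions, and the toy tensor with a few inflated channels mimics the structured outliers discussed in the abstract. The group count, bit-width, tensor shape, and outlier pattern are all hypothetical choices made for the example.

```python
import numpy as np


def quantize_per_embedding_group(x, n_groups=16, n_bits=8):
    """Asymmetric uniform (de)quantization with one (scale, zero-point) pair
    per group of embedding dimensions.

    x: float array of shape (..., hidden_dim), hidden_dim divisible by n_groups.
    Returns the de-quantized tensor so the quantization error can be inspected.
    """
    *lead, hidden = x.shape
    assert hidden % n_groups == 0, "hidden dim must split evenly into groups"
    xg = x.reshape(*lead, n_groups, hidden // n_groups)

    # Collect min/max over every axis except the group axis, so each embedding
    # group gets its own quantization range.
    group_axis = xg.ndim - 2
    reduce_axes = tuple(i for i in range(xg.ndim) if i != group_axis)
    x_min = xg.min(axis=reduce_axes, keepdims=True)
    x_max = xg.max(axis=reduce_axes, keepdims=True)

    scale = np.maximum((x_max - x_min) / (2 ** n_bits - 1), 1e-8)
    zero_point = np.round(-x_min / scale)

    q = np.clip(np.round(xg / scale) + zero_point, 0, 2 ** n_bits - 1)
    return ((q - zero_point) * scale).reshape(x.shape)


# Toy activation tensor: mostly well-behaved values plus a few outlier embedding
# dimensions, mimicking the structured outliers described in the abstract.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(8, 128, 768)).astype(np.float32)
x[..., :4] *= 50.0  # a handful of channels dominate the dynamic range

per_tensor = quantize_per_embedding_group(x, n_groups=1)   # n_groups=1 == per-tensor
per_group = quantize_per_embedding_group(x, n_groups=16)
print("per-tensor MSE:", float(np.mean((x - per_tensor) ** 2)))
print("per-group  MSE:", float(np.mean((x - per_group) ** 2)))
```

In this toy setup the per-group variant yields a much lower reconstruction error, because only the group containing the outlier channels has to pay for their wide range; a single per-tensor range stretches the 8-bit grid over the outliers and wastes resolution on the well-behaved dimensions, which is the intuition behind why high dynamic activation ranges are hard to capture with one fixed-point format.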

Related research

Post-Training Quantization for Vision Transformer (06/27/2021)
Recently, transformer has achieved remarkable performance on a variety o...

Outlier Suppression: Pushing the Limit of Low-bit Transformer Language Models (09/27/2022)
Transformer architecture has become the fundamental element of the wides...

Q-HyViT: Post-Training Quantization for Hybrid Vision Transformer with Bridge Block Reconstruction (03/22/2023)
Recently, vision transformers (ViT) have replaced convolutional neural n...

Empirical Evaluation of Post-Training Quantization Methods for Language Tasks (10/29/2022)
Transformer-based architectures like BERT have achieved great success in...

Qu-ANTI-zation: Exploiting Quantization Artifacts for Achieving Adversarial Outcomes (10/26/2021)
Quantization is a popular technique that transforms the parameter repres...

Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing (06/22/2023)
Transformer models have been widely adopted in various domains over the ...

SwiftTron: An Efficient Hardware Accelerator for Quantized Transformers (04/08/2023)
Transformers' compute-intensive operations pose enormous challenges for ...
