GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

10/31/2022
by Elias Frantar, et al.

Generative Pre-trained Transformer (GPT) models set themselves apart through breakthrough performance across complex language modelling tasks, but also by their extremely high computational and storage costs. Specifically, due to their massive size, even inference for large, highly accurate GPT models may require multiple performant GPUs to execute, which limits the usability of such models. While there is emerging work on relieving this pressure via model compression, the applicability and performance of existing compression techniques are limited by the scale and complexity of GPT models. In this paper, we address this challenge and propose GPTQ, a new one-shot weight quantization method based on approximate second-order information that is both highly accurate and highly efficient. Specifically, GPTQ can quantize GPT models with 175 billion parameters in approximately four GPU hours, reducing the bitwidth down to 3 or 4 bits per weight, with negligible accuracy degradation relative to the uncompressed baseline. Our method more than doubles the compression gains relative to previously proposed one-shot quantization methods while preserving accuracy, allowing us for the first time to execute a 175-billion-parameter model inside a single GPU. We show experimentally that these improvements can be leveraged for end-to-end inference speedups over FP16 of around 2x when using high-end GPUs (NVIDIA A100) and 4x when using more cost-effective ones (NVIDIA A6000). The implementation is available at https://github.com/IST-DASLab/gptq.
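
To make the idea of one-shot quantization with approximate second-order information concrete, the NumPy sketch below illustrates a simplified, unblocked version of the column-by-column weight update the paper describes, applied to a single linear layer: quantize one weight column, then compensate the not-yet-quantized columns using the Cholesky factor of the inverse Hessian built from calibration inputs. The quantizer, function names (quantize_rtn, gptq_quantize), damping factor, and toy shapes are illustrative assumptions; for the reference implementation, see the linked repository.

```python
# Minimal sketch of GPTQ-style column-by-column quantization for one linear
# layer, using a toy symmetric round-to-nearest quantizer. Not the reference
# implementation; names and hyperparameters here are illustrative.
import numpy as np

def quantize_rtn(w, scale, maxq):
    """Symmetric round-to-nearest quantization of one weight column."""
    q = np.clip(np.round(w / scale), -maxq, maxq)
    return q * scale

def gptq_quantize(W, X, bits=4, damp=0.01):
    """Quantize W (rows x cols) using calibration inputs X (cols x n_samples)."""
    rows, cols = W.shape
    W = W.astype(np.float64).copy()
    Q = np.zeros_like(W)

    # Per-row scales for a symmetric low-bit grid (e.g. 4 bits -> [-7, 7]).
    maxq = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / maxq

    # Layer-wise Hessian of the squared reconstruction error: H = 2 X X^T,
    # plus a small damping term for numerical stability.
    H = 2.0 * X @ X.T
    H += damp * np.mean(np.diag(H)) * np.eye(cols)

    # Upper-triangular Cholesky factor of the inverse Hessian.
    Hinv = np.linalg.cholesky(np.linalg.inv(H)).T

    for j in range(cols):
        # Quantize one column, then spread its error onto the remaining columns.
        q = quantize_rtn(W[:, j], scale[:, 0], maxq)
        err = (W[:, j] - q) / Hinv[j, j]
        W[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])
        Q[:, j] = q
    return Q

# Toy usage: a 16x32 layer with 128 random calibration samples.
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 32))
X = rng.standard_normal((32, 128))
Q = gptq_quantize(W, X)
print("mean reconstruction error:", np.mean((W @ X - Q @ X) ** 2))
```

Compared with plain round-to-nearest, the error-compensation step is what lets this kind of one-shot method hold accuracy at 3-4 bits; the actual GPTQ implementation additionally processes columns in blocks with lazy batched updates to make the procedure fast enough for 175B-parameter models.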
