ZipLM: Hardware-Aware Structured Pruning of Language Models

02/07/2023
by   Eldar Kurtic, et al.

The breakthrough performance of large language models (LLMs) comes with large computational footprints and high deployment costs. In this paper, we progress towards resolving this problem by proposing a new structured compression approach for LLMs, called ZipLM, which provides state-of-the-art compression-vs-accuracy results while guaranteeing to match a set of (achievable) target speedups on any given target hardware. Specifically, given a task, a model, an inference environment, and a set of speedup targets, ZipLM identifies and removes redundancies in the model through iterative structured shrinking of the model's weight matrices. Importantly, ZipLM works in both the post-training/one-shot and the gradual compression settings, where it produces a set of accurate models in a single run, making it highly efficient in practice. Our approach is based on new structured pruning and knowledge distillation techniques, and consistently outperforms prior structured compression methods in terms of accuracy-versus-speedup in experiments on BERT- and GPT-family models. In particular, when compressing the GPT2 model, it outperforms DistilGPT2 while being 60% smaller and 30% faster. Further, ZipLM matches the performance of the heavily optimized MobileBERT model, obtained via extensive architecture search, by simply pruning the baseline BERT-large architecture, and outperforms all prior BERT-base compression techniques such as CoFi, MiniLM and TinyBERT.
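
To make the idea of shrinking a weight matrix until a set of speedup targets is met more concrete, here is a minimal toy sketch. It is not the ZipLM algorithm: the abstract does not specify how structures are scored, so column L2 norm is used as a stand-in saliency. The function names, the `latency_ms` lookup table (assumed to hold runtimes measured on the target hardware for each number of kept columns), and all numbers are hypothetical.

```python
import math
from typing import Dict, List

Matrix = List[List[float]]


def column_l2_norms(weight: Matrix) -> List[float]:
    """L2 norm of each column; a stand-in saliency score (an assumption)."""
    n_cols = len(weight[0])
    return [math.sqrt(sum(row[j] ** 2 for row in weight)) for j in range(n_cols)]


def shrink_to_targets(
    weight: Matrix,
    latency_ms: Dict[int, float],
    speedup_targets: List[float],
) -> Dict[float, List[int]]:
    """Remove whole columns until each speedup target is reached.

    latency_ms maps "number of kept columns" to a runtime measured on the
    target hardware (hypothetical values here). Targets are processed from
    smallest to largest, so one pass yields one column set per target.
    """
    dense_latency = latency_ms[len(weight[0])]
    norms = column_l2_norms(weight)
    kept = list(range(len(weight[0])))
    results: Dict[float, List[int]] = {}
    for target in sorted(speedup_targets):
        # Keep dropping the least salient column until the measured
        # speedup over the dense layer reaches the current target.
        while dense_latency / latency_ms[len(kept)] < target:
            if len(kept) == 1:
                raise ValueError(f"target {target}x is not achievable")
            weakest = min(kept, key=lambda j: norms[j])
            kept.remove(weakest)
        results[target] = list(kept)
    return results


# Toy 2x4 weight matrix and a made-up latency table. Note that dropping
# a single column gives no speedup here, which is why latency is looked
# up rather than assumed proportional to parameter count.
toy_weight = [[3.0, 0.1, 1.0, 0.0],
              [4.0, 0.0, 0.0, 2.0]]
toy_latency_ms = {4: 8.0, 3: 8.0, 2: 4.0, 1: 2.0}

print(shrink_to_targets(toy_weight, toy_latency_ms, [2.0, 4.0]))
# → {2.0: [0, 3], 4.0: [0]}
```

The latency table is what makes the sketch hardware-aware in spirit: pruning decisions stop when a measured runtime target is hit, not when a fixed sparsity level is reached. A real system would also need an accuracy-aware saliency criterion and weight updates after each removal, which this toy omits.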


