Pruning Attention Heads of Transformer Models Using A* Search: A Novel Approach to Compress Big NLP Architectures

by Archit Parnami, et al.

Recent years have seen a growing adoption of Transformer models such as BERT in Natural Language Processing and even in Computer Vision. However, due to their size, there has been limited adoption of such models within resource-constrained computing environments. This paper proposes novel pruning algorithms to compress Transformer models by eliminating redundant attention heads. We apply the A* search algorithm to obtain a pruned model with minimal accuracy guarantees. Our results indicate that the method could eliminate as much as 40% of the attention heads with almost no loss in accuracy.
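
To make the idea of searching over sets of pruned heads concrete, the following is a minimal sketch of A* search on a toy problem. It is not the paper's algorithm: the abstract does not give the cost or heuristic functions, so everything here is an assumption made for illustration. In particular, `astar_prune`, the `importance` table, and the additive accuracy model (each pruned head subtracts a fixed, non-negative amount from a baseline accuracy) are hypothetical stand-ins for evaluating a real pruned model.

```python
import heapq
from itertools import count


def astar_prune(importance, target, baseline=1.0, min_accuracy=0.0):
    """Choose `target` heads to prune with the smallest toy accuracy drop.

    importance: dict mapping a head id, e.g. (layer, head), to the accuracy
        lost when that head is pruned. ASSUMPTION: drops are non-negative
        and additive; a real model would have to be re-evaluated instead.
    Returns (sorted tuple of pruned heads, resulting accuracy), or None if
    no set of `target` heads keeps accuracy at or above `min_accuracy`.
    """
    heads = sorted(importance)
    cheapest = min(importance.values())  # lower bound on any single step
    tie = count()  # unique tiebreaker so the heap never compares states
    start = frozenset()
    # Heap entries are (f = g + h, tiebreak, g, state).
    frontier = [(target * cheapest, next(tie), 0.0, start)]
    best_g = {start: 0.0}

    while frontier:
        _, _, g, state = heapq.heappop(frontier)
        if g > best_g.get(state, float("inf")):
            continue  # stale entry: a cheaper path to this state exists
        if len(state) == target:
            return tuple(sorted(state)), baseline - g
        for head in heads:
            if head in state:
                continue
            new_g = g + importance[head]
            if baseline - new_g < min_accuracy:
                continue  # would violate the accuracy floor
            nxt = state | {head}
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                # Admissible heuristic: every remaining head to prune
                # costs at least `cheapest`.
                h = (target - len(nxt)) * cheapest
                heapq.heappush(frontier, (new_g + h, next(tie), new_g, nxt))
    return None


if __name__ == "__main__":
    # Hypothetical per-head accuracy drops for a 2-layer, 2-head toy model.
    toy = {(0, 0): 0.001, (0, 1): 0.05, (1, 0): 0.002, (1, 1): 0.03}
    print(astar_prune(toy, target=2, baseline=0.90, min_accuracy=0.88))
```

Because each step costs at least `cheapest`, the heuristic never overestimates the remaining cost, so the first complete state popped from the queue has the smallest total drop under this toy model. The accuracy floor is enforced by refusing to expand states that fall below `min_accuracy`.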




