Constraint-aware and Ranking-distilled Token Pruning for Efficient Transformer Inference

by   Junyan Li, et al.

Deploying pre-trained transformer models like BERT on downstream tasks in resource-constrained scenarios is challenging due to their high inference cost, which grows rapidly with input sequence length. In this work, we propose a constraint-aware and ranking-distilled token pruning method ToP, which selectively removes unnecessary tokens as input sequence passes through layers, allowing the model to improve online inference speed while preserving accuracy. ToP overcomes the limitation of inaccurate token importance ranking in the conventional self-attention mechanism through a ranking-distilled token distillation technique, which distills effective token rankings from the final layer of unpruned models to early layers of pruned models. Then, ToP introduces a coarse-to-fine pruning approach that automatically selects the optimal subset of transformer layers and optimizes token pruning decisions within these layers through improved L_0 regularization. Extensive experiments on GLUE benchmark and SQuAD tasks demonstrate that ToP outperforms state-of-the-art token pruning and model compression methods with improved accuracy and speedups. ToP reduces the average FLOPs of BERT by 8.1x while achieving competitive accuracy on GLUE, and provides a real latency speedup of up to 7.4x on an Intel CPU.


page 1

page 2

page 3

page 4


Learned Token Pruning for Transformers

A major challenge in deploying transformer models is their prohibitive i...

Accelerating BERT Inference for Sequence Labeling via Early-Exit

Both performance and efficiency are crucial factors for sequence labelin...

How do Decisions Emerge across Layers in Neural Models? Interpretation with Differentiable Masking

Attribution methods assess the contribution of inputs (e.g., words) to t...

Magic Pyramid: Accelerating Inference with Early Exiting and Token Pruning

Pre-training and then fine-tuning large language models is commonly used...

A Study on Token Pruning for ColBERT

The ColBERT model has recently been proposed as an effective BERT based ...

HeatViT: Hardware-Efficient Adaptive Token Pruning for Vision Transformers

While vision transformers (ViTs) have continuously achieved new mileston...

Pyramid-BERT: Reducing Complexity via Successive Core-set based Token Selection

Transformer-based language models such as BERT have achieved the state-o...

Please sign up or login with your details

Forgot password? Click here to reset