The Devil in Linear Transformer

10/19/2022
by Zhen Qin, et al.

Linear transformers aim to reduce the quadratic space-time complexity of vanilla transformers. However, they usually suffer from degraded performance on various tasks and corpora. In this paper, we examine existing kernel-based linear transformers and identify two key issues that lead to such performance gaps: 1) unbounded gradients in the attention computation adversely impact the convergence of linear transformer models; 2) attention dilution, which trivially distributes attention scores over long sequences while neglecting neighbouring structures. To address these issues, we first identify that the scaling of attention matrices is the devil behind the unbounded gradients, and show theoretically and empirically that it is unnecessary in linear attention. We therefore propose a new linear attention that replaces the scaling operation with a normalization to stabilize gradients. For the issue of attention dilution, we leverage diagonal attention to confine attention to neighbouring tokens in early layers. Benefiting from the stable gradients and improved attention, our new linear transformer model, TransNormer, demonstrates superior performance on text classification and language modeling tasks, as well as on the challenging Long-Range Arena benchmark, surpassing the vanilla transformer and existing linear variants by a clear margin while being significantly more space-time efficient. The code is available at https://github.com/OpenNLPLab/Transnormer.
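
The linked repository contains the authors' implementation; the following is only a minimal PyTorch sketch of the two ideas described above. The kernel feature map (elu + 1), the use of LayerNorm as the output normalizer, the non-causal formulation, and the block size are illustrative assumptions, not details taken from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F


def norm_linear_attention(q, k, v, norm):
    """Linear attention without the usual denominator-based scaling.

    q, k, v: (batch, heads, seq_len, head_dim)
    Instead of dividing by q_n . sum(phi(k)) as in standard linear
    attention, the raw output is passed through a normalization layer,
    which is the gradient-stabilizing replacement the abstract describes.
    """
    q = F.elu(q) + 1  # positive kernel feature map (assumed choice)
    k = F.elu(k) + 1
    kv = torch.einsum("bhnd,bhne->bhde", k, v)    # sum_n phi(k_n) v_n^T
    out = torch.einsum("bhnd,bhde->bhne", q, kv)  # apply phi(q_n) to the summary
    return norm(out)                              # normalization instead of scaling


def diag_attention(q, k, v, block_size):
    """Block-local ("diagonal") attention: each token attends only to
    tokens in its own block, countering attention dilution in early layers.
    """
    b, h, n, d = q.shape
    assert n % block_size == 0, "pad the sequence to a multiple of block_size"
    # Reshape into non-overlapping blocks and run ordinary softmax attention
    # within each block.
    q = q.reshape(b, h, n // block_size, block_size, d)
    k = k.reshape(b, h, n // block_size, block_size, d)
    v = v.reshape(b, h, n // block_size, block_size, d)
    scores = torch.einsum("bhgid,bhgjd->bhgij", q, k) / d ** 0.5
    attn = scores.softmax(dim=-1)
    out = torch.einsum("bhgij,bhgjd->bhgid", attn, v)
    return out.reshape(b, h, n, d)


if __name__ == "__main__":
    b, h, n, d = 2, 4, 64, 32
    q, k, v = (torch.randn(b, h, n, d) for _ in range(3))
    norm = nn.LayerNorm(d)
    print(norm_linear_attention(q, k, v, norm).shape)   # torch.Size([2, 4, 64, 32])
    print(diag_attention(q, k, v, block_size=16).shape) # torch.Size([2, 4, 64, 32])

Per the abstract, the block-local attention would be used in early layers to keep attention on neighbouring tokens, while the normalized linear attention replaces the scaled attention elsewhere; the sketch only shows the two operations in isolation.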

Related research

02/17/2022 · cosFormer: Rethinking Softmax in Attention
Transformer has shown great successes in natural language processing, co...

07/18/2023 · Linearized Relative Positional Encoding
Relative positional encoding is widely used in vanilla and linear transf...

11/08/2020 · Long Range Arena: A Benchmark for Efficient Transformers
Transformers do not scale very well to long sequence lengths largely bec...

05/04/2023 · On the Expressivity Role of LayerNorm in Transformers' Attention
Layer Normalization (LayerNorm) is an inherent component in all Transfor...

05/08/2023 · Toeplitz Neural Network for Sequence Modeling
Sequence modeling has important applications in natural language process...

10/15/2021 · On Learning the Transformer Kernel
In this work we introduce KERNELIZED TRANSFORMER, a generic, scalable, d...

05/18/2021 · Relative Positional Encoding for Transformers with Linear Complexity
Recent advances in Transformer models allow for unprecedented sequence l...
