Cure the headache of Transformers via Collinear Constrained Attention

09/15/2023
by Shiyi Zhu, et al.

As practical applications built on Large Language Models continue to advance rapidly, length extrapolation has become an increasingly important research topic. In our study, we identified an anomalous behavior in Transformer models that had previously been overlooked, leading to chaotic attention around the closest tokens, which carry the most important information. We have coined this discovery the "headache of Transformers". To address it at its root, we introduce a novel self-attention structure named Collinear Constrained Attention (CoCA). This structure can be seamlessly integrated with existing extrapolation and interpolation methods, as well as other optimization strategies designed for traditional Transformer models. Without any fine-tuning, our model achieves excellent extrapolation performance at inference on sequences 16 to 24 times longer than those seen during training. We have also improved CoCA's computational and spatial efficiency to ensure its practicality. We plan to open-source CoCA shortly; in the meantime, our code is available in the appendix for reproducing the experiments.
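To make the abstract's core idea more concrete, below is a minimal, naive sketch of a collinear-constraint attention score computation. It assumes (our reading, not stated verbatim above) that the key seen by query i at position j is forced to be collinear with q_i before rotary position embedding, e.g. k_{i,j} = q_i ⊙ t_j with t_j elementwise non-negative, so the pre-rotation angle between query and key is zero. The names and the O(n²·d) loop below are illustrative only; the paper's exact formulation and its efficient implementation may differ.

```python
# Toy sketch (assumption-based, not the authors' reference implementation):
# attention logits where the key for query i is collinear with q_i before RoPE.

import math
import torch


def rope_angles(seq_len: int, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Standard RoPE angles, shape (seq_len, dim // 2)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    positions = torch.arange(seq_len, dtype=torch.float32)
    return torch.outer(positions, inv_freq)


def apply_rope(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Rotate channel pairs of x (seq, dim) by position-dependent angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


def collinear_constrained_scores(q: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Causal attention logits under the (assumed) collinear constraint.

    q: (seq, dim) queries; t: (seq, dim) non-negative scaling factors.
    """
    seq_len, dim = q.shape
    angles = rope_angles(seq_len, dim)
    q_rot = apply_rope(q, angles)
    scores = torch.full((seq_len, seq_len), float("-inf"))
    for i in range(seq_len):
        # Key at position j <= i, as seen by query i: collinear with q_i pre-rotation.
        k_i = q[i] * t[: i + 1]                      # (i + 1, dim)
        k_i_rot = apply_rope(k_i, angles[: i + 1])
        scores[i, : i + 1] = q_rot[i] @ k_i_rot.T / math.sqrt(dim)
    return scores


# Usage on random data, with a ReLU gate keeping t non-negative.
q = torch.randn(8, 16)
t = torch.relu(torch.randn(8, 16))
attn = collinear_constrained_scores(q, t).softmax(dim=-1)
```

Because the pre-rotation angle between query and key is zero under this constraint, the rotary-induced decay of attention with relative distance behaves monotonically for the closest tokens, which is the anomaly the abstract refers to as the "headache of Transformers".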


