Cure the headache of Transformers via Collinear Constrained Attention

by Shiyi Zhu, et al.

As practical applications based on Large Language Models progress rapidly, extrapolation performance has become increasingly important in research. In our study, we identified a previously overlooked anomalous behavior in Transformer models that causes chaos around the closest tokens, which carry the most important information. We have named this discovery the "headache of Transformers". To address it at its core, we introduce a novel self-attention structure named Collinear Constrained Attention (CoCA). This structure can be seamlessly integrated with existing extrapolation and interpolation methods, as well as other optimization strategies designed for traditional Transformer models. Without any fine-tuning, our model achieves excellent extrapolation performance even at 16 to 24 times the sequence length during inference. We have also improved CoCA's computational and space efficiency to ensure its practicality. We plan to open-source CoCA shortly. In the meantime, we have made our code available in the appendix for reproducing the experiments.




