Landmark Attention: Random-Access Infinite Context Length for Transformers

by Amirkeivan Mohtashami, et al.

While transformers have shown remarkable success in natural language processing, their attention mechanism's large memory requirements have limited their ability to handle longer contexts. Prior approaches, such as recurrent memory or retrieval-based augmentation, have either compromised the random-access flexibility of attention (i.e., the capability to select any token in the entire context) or relied on separate mechanisms for relevant context retrieval, which may not be compatible with the model's attention. In this paper, we present a novel approach that allows access to the complete context while retaining random-access flexibility, closely resembling running attention on the entire context. Our method uses a landmark token to represent each block of the input and trains the attention to use it for selecting relevant blocks, enabling retrieval of blocks directly through the attention mechanism instead of by relying on a separate mechanism. Our approach seamlessly integrates with specialized data structures and the system's memory hierarchy, enabling processing of arbitrarily long context lengths. We demonstrate that our method can obtain comparable performance with Transformer-XL while significantly reducing the number of retrieved tokens in each step. Finally, we show that fine-tuning LLaMA 7B with our method successfully extends its context length capacity up to 32k tokens, allowing for inference at the context lengths of GPT-4.
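The block-selection idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each block already has a single landmark key vector, replaces the trained attention-based selection with a hard top-k over query-landmark dot products, and then runs ordinary softmax attention over the tokens of the chosen blocks only. The function name `landmark_retrieve` and all parameter names are hypothetical, chosen for illustration.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def landmark_retrieve(query, block_keys, block_values, landmark_keys, top_k):
    """Pick top_k blocks by landmark score, then attend over their tokens.

    Assumption: hard top-k selection stands in for the trained
    landmark attention described in the abstract.
    """
    # Score every block through its landmark key only.
    scores = [dot(query, lk) for lk in landmark_keys]
    # Indices of the top_k highest-scoring blocks, restored to input order.
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    chosen.sort()
    # Gather keys and values from the retrieved blocks only.
    keys = [k for i in chosen for k in block_keys[i]]
    values = [v for i in chosen for v in block_values[i]]
    # Scaled dot-product attention restricted to the retrieved tokens.
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return chosen, output


# Toy data: three blocks of two tokens each, with 2-d keys and 1-d values.
block_keys = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0], [0.1, 0.9]],
    [[-1.0, 0.0], [-0.9, -0.1]],
]
block_values = [[[1.0], [2.0]], [[3.0], [4.0]], [[5.0], [6.0]]]
landmark_keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

chosen, output = landmark_retrieve(
    [0.0, 2.0], block_keys, block_values, landmark_keys, top_k=1
)
print(chosen, output)
```

In this toy setup the query aligns with the second block's landmark, so only that block's two tokens enter the softmax. The per-step cost therefore scales with the number of landmarks plus the retrieved tokens rather than with the full context, which is the reduction in retrieved tokens the abstract refers to.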




ETC: Encoding Long and Structured Data in Transformers

Transformer-based models have pushed the state of the art in many natura...

Make A Long Image Short: Adaptive Token Length for Vision Transformers

The vision transformer splits each image into a sequence of tokens with ...

∞-former: Infinite Memory Transformer

Transformers struggle when attending to long contexts, since the amount ...

Big Bird: Transformers for Longer Sequences

Transformers-based models, such as BERT, have been one of the most succe...

Scaling Transformer to 1M tokens and beyond with RMT

This technical report presents the application of a recurrent memory to ...

Chunk, Align, Select: A Simple Long-sequence Processing Method for Transformers

Although dominant in natural language processing, transformer-based mode...

Focused Transformer: Contrastive Training for Context Scaling

Large language models have an exceptional capability to incorporate new ...
