A Unified View of Long-Sequence Models towards Modeling Million-Scale Dependencies

by Hongyu He et al.

Ever since their conception, Transformers have supplanted traditional sequence models in many tasks, such as NLP, image classification, and video/audio processing, owing to their fast training and superior performance. Much of this merit is attributable to positional encoding and multi-head attention. However, Transformers fall short in learning long-range dependencies, mainly due to their quadratic time and space complexity in context length. Consequently, over the past five years, a myriad of methods have been proposed to make Transformers more efficient. In this work, we first take a step back and study and compare existing solutions to long-sequence modeling in terms of their pure mathematical formulation. Specifically, we summarize them using a unified template, given their shared nature of token mixing. Through benchmarks, we then demonstrate that long context length does yield better performance, albeit application-dependent, and that traditional Transformer models fall short in taking advantage of long-range dependencies. Next, inspired by emerging sparse models of huge capacity, we propose a machine learning system for handling million-scale dependencies. As a proof of concept, we evaluate the performance of one essential component of this system, namely, distributed multi-head attention. We show that our algorithm can scale up attention computation by almost 40× using four GeForce RTX 4090 GPUs, compared to the vanilla multi-head attention mechanism. We believe this study is an instrumental step towards modeling million-scale dependencies.
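The paper's distributed multi-head attention implementation is not reproduced here, but the property it exploits is standard: attention heads are computed independently, so they can be sharded across devices and only the per-head outputs need to be gathered for the final concatenation. The following is a minimal NumPy sketch of that idea; the device loop is a stand-in for per-GPU dispatch, and all function names are illustrative assumptions, not the authors' code.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # scaled dot-product attention for a single head; q, k, v: (T, d_head)
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    return softmax(scores) @ v

def sharded_multi_head_attention(q, k, v, num_heads, num_devices):
    # q, k, v: (T, num_heads * d_head). Heads are partitioned into
    # num_devices groups; each group could run on a separate GPU, since
    # no communication is needed until the final concatenation.
    T, d_model = q.shape
    d_head = d_model // num_heads
    heads_per_dev = num_heads // num_devices
    outputs = []
    for dev in range(num_devices):  # stand-in for per-GPU dispatch
        for h in range(dev * heads_per_dev, (dev + 1) * heads_per_dev):
            sl = slice(h * d_head, (h + 1) * d_head)
            outputs.append(attention(q[:, sl], k[:, sl], v[:, sl]))
    return np.concatenate(outputs, axis=-1)  # (T, d_model)
```

Because the shards are mathematically independent, the result is identical regardless of how many devices the heads are split over; the speedup comes purely from running the shards concurrently.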


Adaptive Multi-Resolution Attention with Linear Complexity

Transformers have improved the state-of-the-art across numerous tasks in...

Mega: Moving Average Equipped Gated Attention

The design choices in the Transformer attention mechanism, including wea...

Horizontal and Vertical Attention in Transformers

Transformers are built upon multi-head scaled dot-product attention and ...

GoodBye WaveNet – A Language Model for Raw Audio with Context of 1/2 Million Samples

Modeling long-term dependencies for audio signals is a particularly chal...

Causal Transformers Perform Below Chance on Recursive Nested Constructions, Unlike Humans

Recursive processing is considered a hallmark of human linguistic abilit...

Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers

Transformer models have achieved state-of-the-art results across a diver...

Wake Word Detection with Streaming Transformers

Modern wake word detection systems usually rely on neural networks for a...
