Adaptive Multi-Resolution Attention with Linear Complexity

08/10/2021
by   Yao Zhang, et al.

Transformers have improved the state of the art across numerous sequence-modeling tasks. However, besides their quadratic computational and memory complexity with respect to the sequence length, the self-attention mechanism processes information only at a single scale, i.e., all attention heads operate at the same resolution, which limits the Transformer's expressive power. To remedy this, we propose a novel and efficient structure named Adaptive Multi-Resolution Attention (AdaMRA for short), which scales linearly with sequence length in both time and space. Specifically, we leverage a multi-resolution multi-head attention mechanism that enables attention heads to capture long-range contextual information in a coarse-to-fine fashion. Moreover, to capture the potential relations between the query representation and clues at different attention granularities, we leave the decision of which attention resolution to use to the query, which further improves the model's capacity compared to the vanilla Transformer. To reduce complexity, we adopt kernel attention without degrading performance. Extensive experiments on several benchmarks demonstrate the effectiveness and efficiency of our model, which achieves a state-of-the-art performance-efficiency-memory trade-off. To facilitate the use of AdaMRA by the scientific community, the code will be made publicly available.
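The abstract combines two ideas that are easy to illustrate in code: keys and values pooled to several resolutions (one per head, coarse to fine) and kernel (linear) attention so that cost grows linearly with sequence length, plus a query-dependent choice of resolution. The sketch below is only an illustration under assumptions, not the authors' implementation: the module name MultiResolutionKernelAttention, the pooling sizes, the ELU+1 feature map, and the per-query softmax router standing in for the paper's query-adaptive resolution selection are all hypothetical.

import torch
import torch.nn.functional as F
from torch import nn


def elu_feature_map(x):
    # Positive kernel feature map phi(x) = ELU(x) + 1, common in linear attention.
    return F.elu(x) + 1.0


class MultiResolutionKernelAttention(nn.Module):
    # Illustrative module (hypothetical name): each head attends over keys/values
    # average-pooled to a different resolution; a per-query softmax router weights
    # the heads, standing in for the paper's query-adaptive resolution choice.
    def __init__(self, dim, num_heads=4, pool_sizes=(1, 2, 4, 8)):
        super().__init__()
        assert dim % num_heads == 0 and num_heads == len(pool_sizes)
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.pool_sizes = pool_sizes
        self.q_proj, self.k_proj = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.v_proj, self.out_proj = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.router = nn.Linear(dim, num_heads)  # soft per-query resolution weights

    def forward(self, x):
        # x: (batch, seq_len, dim)
        b, n, _ = x.shape
        q = self.q_proj(x).view(b, n, self.num_heads, self.head_dim)
        k = self.k_proj(x).view(b, n, self.num_heads, self.head_dim)
        v = self.v_proj(x).view(b, n, self.num_heads, self.head_dim)
        gates = torch.softmax(self.router(x), dim=-1)  # (b, n, num_heads)

        outputs = []
        for h, p in enumerate(self.pool_sizes):
            qh = elu_feature_map(q[:, :, h])  # (b, n, d)
            kh, vh = k[:, :, h], v[:, :, h]
            if p > 1:
                # Coarsen keys/values along the sequence axis (coarse-to-fine heads).
                kh = F.avg_pool1d(kh.transpose(1, 2), p, ceil_mode=True).transpose(1, 2)
                vh = F.avg_pool1d(vh.transpose(1, 2), p, ceil_mode=True).transpose(1, 2)
            kh = elu_feature_map(kh)  # (b, m, d) with m = ceil(n / p)
            # Kernel attention: phi(Q) (phi(K)^T V), normalized by phi(Q) sum_j phi(K_j),
            # costs O(n * d^2) instead of the O(n^2 * d) of softmax attention.
            kv = torch.einsum("bmd,bme->bde", kh, vh)
            denom = torch.einsum("bnd,bd->bn", qh, kh.sum(dim=1)) + 1e-6
            out = torch.einsum("bnd,bde->bne", qh, kv) / denom.unsqueeze(-1)
            outputs.append(out * gates[:, :, h].unsqueeze(-1))

        return self.out_proj(torch.cat(outputs, dim=-1))


# Example: a 128-token batch processed in time and memory linear in seq_len.
attn = MultiResolutionKernelAttention(dim=64, num_heads=4)
y = attn(torch.randn(2, 128, 64))  # -> (2, 128, 64)

In this sketch the routing only reweights head outputs; the paper's mechanism may route queries to resolutions differently, but the linear-complexity structure (pool, featurize, accumulate key-value summaries, then query them) is the part the abstract describes.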


