SMYRF: Efficient Attention using Asymmetric Clustering

10/11/2020
by Giannis Daras, et al.

We propose a novel type of balanced clustering algorithm to approximate attention. Attention complexity is reduced from O(N^2) to O(N log N), where N is the sequence length. Our algorithm, SMYRF, uses Locality Sensitive Hashing (LSH) in a novel way by defining new asymmetric transformations and an adaptive scheme that produces balanced clusters. The biggest advantage of SMYRF is that it can be used as a drop-in replacement for dense attention layers without any retraining. In contrast, prior fast attention methods impose constraints (e.g., queries and keys share the same vector representations) and require re-training from scratch. We apply our method to pre-trained state-of-the-art Natural Language Processing and Computer Vision models and report significant memory and speed benefits. Notably, SMYRF-BERT slightly outperforms BERT on GLUE, while using 50% less memory. We also show that SMYRF can be used interchangeably with dense attention before and after training. Finally, we use SMYRF to train GANs with attention at high resolutions. Using a single TPU, we were able to scale attention to 128x128=16k and 256x256=65k tokens for BigGAN on CelebA-HQ.
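To make the bucketing idea concrete, below is a minimal sketch of LSH-clustered attention in the spirit of SMYRF: queries and keys are passed through a simplified asymmetric transform, hashed with a shared random projection, sorted by hash value, and split into equal-size clusters, with softmax attention computed only within each cluster. This is not the authors' implementation; the transform, the single-projection hash, and the names asymmetric_transform, smyrf_like_attention, and cluster_size are illustrative assumptions.

```python
import numpy as np

def asymmetric_transform(q, k):
    # Simplified MIPS-style transform (an assumption, not the paper's exact maps):
    # rescale keys to at most unit norm and pad both sides with one extra
    # coordinate so queries and keys can share a single hash function.
    max_k_norm = np.sqrt(np.max(np.sum(k ** 2, axis=-1)))
    k = k / max_k_norm
    q_pad = np.zeros((q.shape[0], 1))
    k_pad = np.sqrt(np.clip(1.0 - np.sum(k ** 2, axis=-1, keepdims=True), 0.0, None))
    return np.concatenate([q, q_pad], axis=-1), np.concatenate([k, k_pad], axis=-1)

def smyrf_like_attention(q, k, v, cluster_size=64, seed=0):
    # LSH-bucketed attention: hash transformed queries/keys with one random
    # projection, sort by hash value, cut into equal-size (balanced) clusters,
    # and attend only within each cluster.
    n, d = q.shape
    qt, kt = asymmetric_transform(q, k)
    rng = np.random.default_rng(seed)
    a = rng.normal(size=qt.shape[-1])
    q_order = np.argsort(qt @ a)   # queries sorted by hash value
    k_order = np.argsort(kt @ a)   # keys sorted by the same hash
    out = np.zeros((n, v.shape[-1]))
    for start in range(0, n, cluster_size):
        qi = q_order[start:start + cluster_size]
        ki = k_order[start:start + cluster_size]
        scores = q[qi] @ k[ki].T / np.sqrt(d)
        scores -= scores.max(axis=-1, keepdims=True)
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        out[qi] = weights @ v[ki]
    return out

# Example: 1024 tokens, 64-dim heads, clusters of 128 queries/keys.
q, k, v = (np.random.randn(1024, 64) for _ in range(3))
approx = smyrf_like_attention(q, k, v, cluster_size=128)
print(approx.shape)  # (1024, 64)
```

Setting cluster_size equal to the sequence length recovers dense attention in this sketch; the actual method additionally uses multiple hash rounds and the paper's specific asymmetric transformations, which the sketch omits.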

Related research

06/11/2019 - What Does BERT Look At? An Analysis of BERT's Attention
Large pre-trained neural networks such as BERT have had great recent suc...

06/13/2021 - Memory-efficient Transformers via Top-k Attention
Following the success of dot-product attention in Transformers, numerous...

05/28/2021 - Linear-Time Self Attention with Codeword Histogram for Efficient Recommendation
Self-attention has become increasingly popular in a variety of sequence ...

03/14/2023 - Input-length-shortening and text generation via attention values
Identifying words that impact a task's performance more than others is a...

06/05/2020 - DeBERTa: Decoding-enhanced BERT with Disentangled Attention
Recent progress in pre-trained neural language models has significantly ...

07/16/2020 - Translate Reverberated Speech to Anechoic Ones: Speech Dereverberation with BERT
Single channel speech dereverberation is considered in this work. Inspir...

05/17/2023 - Exploring the Space of Key-Value-Query Models with Intention
Attention-based models have been a key element of many recent breakthrou...
