DARTFormer: Finding The Best Type Of Attention

10/02/2022
by Jason Ross Brown, et al.

Given the wide and ever-growing range of efficient Transformer attention mechanisms, it is important to identify which attention type is most effective for a given task. In this work, we are also interested in combining different attention types to build heterogeneous Transformers. We first propose a DARTS-like Neural Architecture Search (NAS) method to find the best attention for a given task; in this setup, all heads use the same attention type (homogeneous models). Our results suggest that NAS is highly effective on this task, and that it identifies the best attention mechanisms for IMDb byte-level text classification and for Listops. We then extend our framework to search for and build Transformers with multiple different attention types, which we call heterogeneous Transformers. We show that whilst these heterogeneous Transformers are better than the average homogeneous model, they cannot outperform the best homogeneous models. We explore why heterogeneous attention makes sense in principle, and why it ultimately fails here.
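The abstract describes a DARTS-like search in which each attention operation is treated as a learnable mixture over candidate attention types, with the strongest candidate retained afterwards. The sketch below is a minimal, hypothetical illustration of that general idea in PyTorch; the class name `MixedAttention`, the placeholder candidate modules, and the surrounding training procedure are assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MixedAttention(nn.Module):
    """DARTS-style mixed operation over candidate attention mechanisms.

    Each candidate is an nn.Module mapping (batch, seq, dim) -> (batch, seq, dim).
    A learnable architecture weight per candidate is softmax-normalised and used
    to mix the candidate outputs; after search, the argmax candidate is kept.
    """

    def __init__(self, candidates: nn.ModuleList):
        super().__init__()
        self.candidates = candidates
        # One architecture parameter (alpha) per candidate attention type.
        self.alpha = nn.Parameter(torch.zeros(len(candidates)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = F.softmax(self.alpha, dim=0)
        # Weighted sum of the candidate attention outputs.
        return sum(w * op(x) for w, op in zip(weights, self.candidates))

    def best_candidate(self) -> nn.Module:
        # After the search phase, keep only the highest-weighted attention type.
        return self.candidates[int(self.alpha.argmax())]


# Hypothetical usage with placeholder candidates (stand-ins, not real attention types):
dim = 64
candidates = nn.ModuleList([
    nn.Identity(),        # stand-in for e.g. vanilla softmax attention
    nn.Linear(dim, dim),  # stand-in for e.g. a linear/kernel attention
])
mixed = MixedAttention(candidates)
out = mixed(torch.randn(2, 128, dim))  # (batch, seq, dim)
```

In a full DARTS-style setup, the model weights and the architecture weights (`alpha`) would typically be updated in alternation on separate data splits; that loop is omitted here for brevity.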

