Wide Attention Is The Way Forward For Transformers

by Jason Ross Brown, et al.

The Transformer is an extremely powerful and prominent deep learning architecture. In this work, we challenge the commonly held belief in deep learning that going deeper is better, and show an alternative design approach: building Transformers with wider attention. We demonstrate that wide single layer Transformer models can compete with or outperform deeper ones in a variety of Natural Language Processing (NLP) tasks when both are trained from scratch. The impact of changing the model aspect ratio on Transformers is then studied systematically. This ratio balances the number of layers against the number of attention heads per layer while keeping the total number of attention heads and all other hyperparameters constant. On average, across 4 NLP tasks and 10 attention types, single layer wide models perform 0.3% better than their deep counterparts. We present an in-depth evaluation and demonstrate that wide models require a far smaller memory footprint and can run faster on commodity hardware. These wider models are also more interpretable. For example, a single layer Transformer on IMDb byte level text classification has 3.1x faster inference latency on a CPU than its equally accurate deeper counterpart, and is half the size. Our results suggest that the critical direction for building better Transformers for NLP is their width, and that their depth is less relevant.
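As a rough illustration of the aspect ratio described above, the sketch below enumerates every way to split a fixed total number of attention heads into layers and heads per layer. The function name `aspect_ratio_configs` and the total of 48 heads are assumptions chosen for illustration only; neither is taken from the paper.

```python
def aspect_ratio_configs(total_heads: int) -> list[tuple[int, int]]:
    """Return every (num_layers, heads_per_layer) pair whose product is total_heads.

    Hypothetical helper: it only illustrates trading depth for width while
    the total number of attention heads stays constant.
    """
    return [
        (layers, total_heads // layers)
        for layers in range(1, total_heads + 1)
        if total_heads % layers == 0
    ]


# Illustrative head budget (an assumption, not a value from the paper).
configs = aspect_ratio_configs(48)

for layers, heads in configs:
    # The first entry is the widest model (a single layer);
    # the last entry is the deepest (one head per layer).
    print(f"{layers:>2} layer(s) x {heads:>2} head(s) per layer")
```

The first configuration in the list corresponds to the single layer wide model, and the last to the deepest model with one head per layer.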




Optimizing Transformers with Approximate Computing for Faster, Smaller and more Accurate NLP Models

Transformer models have garnered a lot of interest in recent years by de...

HAT: Hardware-Aware Transformers for Efficient Natural Language Processing

Transformers are ubiquitous in Natural Language Processing (NLP) tasks, ...

Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers

Since hardware resources are limited, the objective of training deep lea...

Are More Layers Beneficial to Graph Transformers?

Despite that going deep has proven successful in many neural architectur...

Transformers: State-of-the-art Natural Language Processing

Recent advances in modern Natural Language Processing (NLP) research hav...

DARTFormer: Finding The Best Type Of Attention

Given the wide and ever growing range of different efficient Transformer...

Which transformer architecture fits my data? A vocabulary bottleneck in self-attention

After their successful debut in natural language processing, Transformer...
