SHAQ: Single Headed Attention with Quasi-Recurrence

08/18/2021
by Nashwin Bharwani, et al.

Natural Language Processing research has recently been dominated by large-scale transformer models. Although they achieve state-of-the-art results on many important language tasks, transformers often require expensive compute resources and days to weeks of training. This is feasible for researchers at big tech companies and leading research universities, but not for scrappy start-up founders, students, and independent researchers. Stephen Merity's SHA-RNN, a compact hybrid attention-RNN model, is designed for consumer-grade modeling: it requires significantly fewer parameters and less training time to reach near-state-of-the-art results. We analyze Merity's model here through an exploratory study of several units of its architecture, considering both training time and overall quality in our assessment. Ultimately, we combine these findings into a new architecture, which we call SHAQ: Single Headed Attention Quasi-recurrent Neural Network. With our new architecture we achieved accuracy similar to that of the SHA-RNN while obtaining a 4x speed-up in training.
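As a rough illustration of what the name suggests, the sketch below shows one way a SHAQ-style block could look in PyTorch: a quasi-recurrent (QRNN-style) layer whose gates come from a causal convolution, so that only a cheap element-wise pooling step is sequential, followed by a single-headed causal self-attention layer. This is a hedged sketch under those assumptions, not the authors' implementation; the class and parameter names (QRNNLayer, SHAQBlock, d_model, kernel_size) are illustrative only.

# Minimal sketch of a SHAQ-style block (illustrative, not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class QRNNLayer(nn.Module):
    """Quasi-recurrent layer: gates come from a causal 1D convolution,
    so only the element-wise pooling step below is sequential."""

    def __init__(self, d_model: int, kernel_size: int = 2):
        super().__init__()
        self.kernel_size = kernel_size
        # One convolution produces both the candidate (z) and the forget gate (f).
        self.conv = nn.Conv1d(d_model, 2 * d_model, kernel_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        xc = x.transpose(1, 2)                     # (batch, d_model, seq_len)
        xc = F.pad(xc, (self.kernel_size - 1, 0))  # left-pad for causality
        z, f = self.conv(xc).chunk(2, dim=1)
        z, f = torch.tanh(z), torch.sigmoid(f)
        # Sequential "f-pooling": h_t = f_t * h_{t-1} + (1 - f_t) * z_t
        h, outputs = torch.zeros_like(z[..., 0]), []
        for t in range(z.size(-1)):
            h = f[..., t] * h + (1.0 - f[..., t]) * z[..., t]
            outputs.append(h)
        return torch.stack(outputs, dim=1)         # (batch, seq_len, d_model)


class SHAQBlock(nn.Module):
    """QRNN layer followed by single-headed causal self-attention."""

    def __init__(self, d_model: int):
        super().__init__()
        self.qrnn = QRNNLayer(d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x + self.qrnn(x))
        seq_len = h.size(1)
        # Boolean mask: True above the diagonal blocks attention to future tokens.
        causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                       device=h.device), diagonal=1)
        a, _ = self.attn(h, h, h, attn_mask=causal)
        return self.norm2(h + a)


if __name__ == "__main__":
    block = SHAQBlock(d_model=64)
    tokens = torch.randn(2, 16, 64)                # (batch, seq_len, d_model)
    print(block(tokens).shape)                     # torch.Size([2, 16, 64])

The intuition behind the reported speed-up is consistent with how quasi-recurrence works in general: the convolution computes all gates for the whole sequence in parallel, leaving only the cheap element-wise pooling loop as the sequential part, which is far less work per timestep than stepping a full LSTM cell.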

Related research

02/24/2021 · When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute
02/27/2020 · Compressing Large-Scale Transformer-Based Models: A Case Study on BERT
11/26/2019 · Single Headed Attention RNN: Stop Thinking With Your Head
09/05/2019 · Accelerating Transformer Decoding via a Hybrid of Self-attention and Recurrent Neural Network
11/26/2019 · An Optimized and Energy-Efficient Parallel Implementation of Non-Iteratively Trained Recurrent Neural Networks
09/19/2023 · MBR and QE Finetuning: Training-time Distillation of the Best and Most Expensive Decoding Methods
