Simple Recurrence Improves Masked Language Models

05/23/2022
by   Tao Lei, et al.

In this work, we explore whether incorporating recurrence into the Transformer architecture can be both beneficial and efficient, by building an extremely simple recurrent module into the Transformer. We compare our model to baselines following the training and evaluation recipe of BERT. Our results confirm that recurrence can indeed improve Transformer models by a consistent margin, without requiring low-level performance optimizations, and while keeping the number of parameters constant. For example, our base model achieves an absolute improvement of 2.1 points averaged across 10 tasks and also demonstrates increased stability in fine-tuning over a range of learning rates.
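The abstract describes inserting a simple recurrent module into a standard Transformer encoder. Below is a minimal PyTorch sketch of one way such a module could be added; the exact gating, placement, and directionality used in the paper may differ, and the names SimpleRecurrence and RecurrentEncoderLayer are illustrative rather than taken from the paper.

```python
# A minimal sketch of adding a simple elementwise recurrence to a Transformer
# encoder layer. This is an illustration of the general idea, not the paper's
# exact module; module names and the placement before attention are assumptions.
import torch
import torch.nn as nn


class SimpleRecurrence(nn.Module):
    """Elementwise recurrence with a forget gate, applied over the sequence."""

    def __init__(self, d_model: int):
        super().__init__()
        # One projection produces both the candidate state and the forget gate.
        self.proj = nn.Linear(d_model, 2 * d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        candidate, forget = self.proj(x).chunk(2, dim=-1)
        forget = torch.sigmoid(forget)
        state = torch.zeros_like(x[:, 0])  # (batch, d_model)
        outputs = []
        for t in range(x.size(1)):
            # c_t = f_t * c_{t-1} + (1 - f_t) * candidate_t
            state = forget[:, t] * state + (1.0 - forget[:, t]) * candidate[:, t]
            outputs.append(state)
        return torch.stack(outputs, dim=1)


class RecurrentEncoderLayer(nn.Module):
    """Transformer encoder layer with a recurrence inserted before attention."""

    def __init__(self, d_model: int = 256, nhead: int = 4):
        super().__init__()
        self.recurrence = SimpleRecurrence(d_model)
        self.norm = nn.LayerNorm(d_model)
        self.attn_block = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection around the recurrence, then the usual
        # self-attention + feed-forward block.
        x = self.norm(x + self.recurrence(x))
        return self.attn_block(x)


if __name__ == "__main__":
    layer = RecurrentEncoderLayer()
    tokens = torch.randn(2, 16, 256)  # (batch, seq_len, d_model)
    print(layer(tokens).shape)        # torch.Size([2, 16, 256])
```

The elementwise gate keeps the recurrence cheap relative to attention, which is in the spirit of the abstract's claim that recurrence can help without low-level performance optimizations or extra parameters; a production variant would need to account for masking, bidirectionality, and parameter budgeting.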



research
12/04/2020

Fine-tuning BERT for Low-Resource Natural Language Understanding via Active Learning

Recently, leveraging pre-trained Transformer-based language models in do...
research
10/24/2020

Rethinking embedding coupling in pre-trained language models

We re-evaluate the standard practice of sharing weights between input an...
research
07/03/2023

Trainable Transformer in Transformer

Recent works attribute the capability of in-context learning (ICL) in la...
research
06/25/2022

Adversarial Self-Attention for Language Understanding

An ultimate language system aims at the high generalization and robustne...
research
02/01/2023

An Empirical Study on the Transferability of Transformer Modules in Parameter-Efficient Fine-Tuning

Parameter-efficient fine-tuning approaches have recently garnered a lot ...
research
06/11/2019

CUED@WMT19: EWC&LMs

Two techniques provide the fabric of the Cambridge University Engineerin...
research
08/02/2020

The Chess Transformer: Mastering Play using Generative Language Models

This work demonstrates that natural language transformers can support mo...
