Efficient Transformers with Dynamic Token Pooling

11/17/2022
by Piotr Nawrot, et al.

Transformers achieve unrivalled performance in modelling language, but remain inefficient in terms of memory and time complexity. A possible remedy is to reduce the sequence length in the intermediate layers by pooling fixed-length segments of tokens. Nevertheless, natural units of meaning, such as words or phrases, display varying sizes. To address this mismatch, we equip language models with a dynamic-pooling mechanism, which predicts segment boundaries in an autoregressive fashion. We compare several methods to infer boundaries, including end-to-end learning through stochastic re-parameterisation, supervised learning (based on segmentations from subword tokenizers or spikes in conditional entropy), as well as linguistically motivated boundaries. We perform character-level evaluation on texts from multiple datasets and morphologically diverse languages. The results demonstrate that dynamic pooling, which jointly segments and models language, is often both faster and more accurate than vanilla Transformers and fixed-length pooling within the same computational budget.
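To make the mechanism described in the abstract concrete, below is a minimal PyTorch sketch of the general idea, not the paper's released code. It assumes a hypothetical linear boundary head, a Gumbel-sigmoid (binary-concrete) relaxation with a straight-through estimator as one possible reading of "stochastic re-parameterisation", and mean-pooling of tokens within each predicted segment; all names (DynamicPooler, boundary_head, temperature) are illustrative.

```python
import torch
import torch.nn as nn


class DynamicPooler(nn.Module):
    """Illustrative sketch of dynamic token pooling: a small head predicts
    segment boundaries over character-level states, and tokens sharing a
    predicted segment are mean-pooled, shortening the sequence passed to
    the intermediate Transformer layers."""

    def __init__(self, d_model, temperature=0.5):
        super().__init__()
        self.boundary_head = nn.Linear(d_model, 1)  # logit that a new segment starts here
        self.temperature = temperature

    def forward(self, hidden):
        # hidden: (batch, seq_len, d_model) character-level representations
        logits = self.boundary_head(hidden).squeeze(-1)            # (B, T)

        # Gumbel-sigmoid relaxation with a straight-through estimator:
        # hard 0/1 boundaries in the forward pass, gradients through the soft ones.
        u = torch.rand_like(logits).clamp(1e-6, 1 - 1e-6)
        logistic_noise = torch.log(u) - torch.log1p(-u)
        soft = torch.sigmoid((logits + logistic_noise) / self.temperature)
        hard = (soft > 0.5).float()
        boundaries = hard + soft - soft.detach()                   # straight-through

        # The first position always opens a segment.
        boundaries = torch.cat([torch.ones_like(boundaries[:, :1]), boundaries[:, 1:]], dim=1)

        # Segment id of each token = number of boundaries seen so far, minus one.
        segment_ids = boundaries.detach().cumsum(dim=1).long() - 1  # (B, T)

        # Mean-pool all tokens that share a segment id.
        B, T, D = hidden.shape
        n_seg = int(segment_ids.max()) + 1
        pooled = hidden.new_zeros(B, n_seg, D)
        counts = hidden.new_zeros(B, n_seg, 1)
        pooled.scatter_add_(1, segment_ids.unsqueeze(-1).expand(B, T, D), hidden)
        counts.scatter_add_(1, segment_ids.unsqueeze(-1), hidden.new_ones(B, T, 1))
        return pooled / counts.clamp(min=1.0), boundaries


# Usage (shapes only): pooled, b = DynamicPooler(d_model=256)(torch.randn(2, 128, 256))
# pooled has shape (2, number_of_predicted_segments, 256).
```

In the full model, the shortened sequence would be processed by the intermediate layers and would then need to be brought back to full resolution for character-level prediction; in the supervised variants described in the abstract, the boundary predictor would additionally be trained against segmentations from subword tokenizers, spikes in conditional entropy, or linguistic annotations.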

Related research

10/08/2021  Token Pooling in Vision Transformers
12/03/2021  Make A Long Image Short: Adaptive Token Length for Vision Transformers
09/13/2023  Keep It SimPool: Who Said Supervised Transformers Suffer from Attention Deficit?
08/13/2018  Neural Semi-Markov Conditional Random Fields for Robust Character-Based Part-of-Speech Tagging
01/09/2019  Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
03/22/2022  Self-supervision through Random Segments with Autoregressive Coding (RandSAC)
06/23/2021  Charformer: Fast Character Transformers via Gradient-based Subword Tokenization
