The crystallization of modeling methods around the Transformer architecture...
We investigate the ability of language models to perform compositional reasoning...
Transformers typically require some form of positional encoding, such as...
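This excerpt breaks off before naming examples; for concreteness, the most common fixed scheme is the sinusoidal encoding of Vaswani et al. (2017). A minimal NumPy sketch follows (the function name and shapes are mine, chosen for illustration):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Fixed sinusoidal positional encodings (Vaswani et al., 2017).

    Returns an array of shape (seq_len, d_model), assuming even d_model:
    even dimensions hold sin(pos / 10000^(2i/d_model)) and odd dimensions
    hold the matching cosine, giving each position a unique signature.
    """
    positions = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)  # broadcast
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# The encoding is simply added to the token embeddings before layer 1:
# x = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```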
Since the introduction of the transformer model by Vaswani et al. (2017)...
We explore the benefits of decreasing the input length of transformers. ...
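The rest of this abstract is cut off; purely to illustrate the setup, decreasing input length during training amounts to splitting the token stream into shorter subsequences. A toy sketch (the helper name, dummy corpus, and chunk sizes are assumptions, not from the paper):

```python
def chunk_tokens(token_ids: list[int], max_len: int) -> list[list[int]]:
    """Split one long token stream into non-overlapping subsequences of
    at most max_len tokens, the usual way to shorten transformer inputs."""
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids), max_len)]

corpus_ids = list(range(10_000))                    # stand-in for a tokenized corpus
short_inputs = chunk_tokens(corpus_ids, max_len=128)   # many short training inputs
long_inputs = chunk_tokens(corpus_ids, max_len=3072)   # fewer, longer inputs
```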
Multilayer transformer networks consist of interleaved self-attention an...
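A minimal PyTorch sketch of the interleaving this sentence describes (hyperparameters and module layout are illustrative): each layer applies a self-attention sublayer (s) and then a feedforward sublayer (f), each with a residual connection, and the network stacks the pattern sfsf...

```python
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    """One self-attention sublayer (s) followed by one feedforward
    sublayer (f), each wrapped in a residual connection + LayerNorm."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attn(x, x, x)   # s: self-attention sublayer
        x = self.norm1(x + attn_out)
        x = self.norm2(x + self.ff(x))     # f: feedforward sublayer
        return x

# A multilayer network is just this pattern repeated: sfsfsf...
model = nn.Sequential(*[TransformerLayer() for _ in range(6)])
```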
Although SGD requires shuffling the training data between epochs, curren...
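To make the requirement concrete (this example is mine, not from the truncated abstract), per-epoch shuffling in SGD means drawing a fresh permutation of example indices at the start of every epoch:

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = np.arange(100).reshape(50, 2), np.arange(50)  # toy dataset

for epoch in range(3):
    order = rng.permutation(len(X))   # fresh permutation every epoch
    for i in range(0, len(X), 10):    # mini-batches of 10
        batch_idx = order[i:i + 10]
        xb, yb = X[batch_idx], y[batch_idx]
        # ... one SGD step on (xb, yb) ...
```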
In NMT, how far can we get without attention and without separate encoding...
Generative Adversarial Networks (GANs) have shown great promise recently...
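Since this abstract is truncated, the sketch below only restates the generic GAN setup it builds on, not the paper's method (all sizes and names are illustrative): a generator G and a discriminator D are trained adversarially, D to separate real from generated samples and G to fool D.

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))  # noise -> sample
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))   # sample -> logit
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(64, 2) + 3.0  # stand-in for real data

# Discriminator step: label real samples 1 and generated samples 0.
fake = G(torch.randn(64, 16)).detach()
loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator step: push the discriminator to label fakes as real.
fake = G(torch.randn(64, 16))
loss_g = bce(D(fake), torch.ones(64, 1))
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```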
We study the topmost weight matrix of neural network language models. We...
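The findings are cut off here; for context, the topmost weight matrix is the output projection that maps final hidden states to vocabulary logits. One practice associated with this line of work is weight tying, sharing that matrix with the input embedding; a minimal PyTorch sketch (the model layout is an assumption for illustration, not the paper's architecture):

```python
import torch.nn as nn

class TinyLM(nn.Module):
    """Minimal language model showing the 'topmost weight matrix':
    the output projection from hidden states to vocabulary logits."""
    def __init__(self, vocab_size: int = 10_000, d_model: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)          # input embedding
        self.rnn = nn.LSTM(d_model, d_model, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size, bias=False)   # topmost matrix
        # Weight tying: share the output projection with the input
        # embedding, so one (vocab_size, d_model) matrix plays both roles.
        self.out.weight = self.embed.weight

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return self.out(h)  # logits over the vocabulary
```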