We revisit the design choices in Transformers and propose methods to address...
We present a combined scaling method called BASIC that achieves 85.7% zero-shot...
Large Transformer models have been central to recent advances in natural language processing...
With recent progress in joint modeling of visual and textual representations...
Transformers provide a class of expressive architectures that are extremely...
Transformers have attracted increasing interest in computer vision, but...
Transformers have become one of the most important architectural innovations...
With a large amount of parallel data, neural machine translation systems...
With the success of language pretraining, it is highly desirable to develop...
Many training algorithms of a deep neural network can be interpreted as...
We show state-of-the-art word representation learning methods maximize a...
With the capability of modeling bidirectional contexts, denoising autoencoding...
Despite its success, deep learning still needs large labeled datasets to...
With latent variables, stochastic recurrent models have achieved state-of-the-art...
Transformer networks have the potential to learn longer-term dependencies...
When labeled data is scarce for a specific target task, transfer learning...
Mixture of Softmaxes (MoS) has been shown to be effective at addressing ...
In this work, we examine methods for data augmentation for text-based tasks...
In this work, we study the credit assignment problem in reward augmented...
We formulate language modeling as a matrix factorization problem, and show...
Cloze tests are widely adopted in language exams to evaluate students' language...
Learning meaningful representations that maintain the content necessary ...
Semi-supervised learning methods based on generative adversarial networks...
Knowledge bases are important resources for a variety of natural language...
How can we enable computers to automatically answer questions like "Who...