Improved Language Modeling by Decoding the Past
Highly regularized LSTMs that model the auto-regressive conditional factorization of the joint probability distribution of words achieve state-of-the-art results in language modeling. These models have an implicit bias towards predicting the next word from a given context. We propose a new regularization term based on decoding words in the context from the predicted distribution of the next word. With relatively few additional parameters, our model achieves absolute improvements of 1.7% and 2.3% over the current state-of-the-art results on the Penn Treebank and WikiText-2 datasets.
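To make the idea concrete, the following is a minimal sketch in Python (PyTorch) of a "past-decoding" auxiliary loss: the predicted next-word distribution is mapped to an expected embedding and a small decoder is trained to recover a word from the context, with a cross-entropy penalty added to the usual language-modeling loss. The class name `PastDecodeRegularizer`, the weight `pdr_weight`, and the choice of decoding only the immediately preceding word are illustrative assumptions, not the authors' exact formulation.

```python
# Sketch of a past-decoding regularizer; architecture details are assumptions,
# not the paper's published formulation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PastDecodeRegularizer(nn.Module):
    """Auxiliary loss: decode a word from the context out of the
    model's predicted next-word distribution."""
    def __init__(self, embedding: nn.Embedding):
        super().__init__()
        self.embedding = embedding                 # (vocab, emb), e.g. tied input embedding
        vocab_size, emb_dim = embedding.weight.shape
        # Small decoder mapping an expected embedding back to vocabulary logits.
        self.decoder = nn.Linear(emb_dim, vocab_size)

    def forward(self, next_word_logits: torch.Tensor, context_words: torch.Tensor) -> torch.Tensor:
        # next_word_logits: (batch, seq, vocab) - predicted next-word distribution
        # context_words:    (batch, seq)        - words already seen in the context
        probs = F.softmax(next_word_logits, dim=-1)
        # Expected embedding under the predicted next-word distribution.
        expected_emb = probs @ self.embedding.weight           # (batch, seq, emb)
        past_logits = self.decoder(expected_emb)                # (batch, seq, vocab)
        # Cross-entropy against the context words: the regularization term.
        return F.cross_entropy(past_logits.reshape(-1, past_logits.size(-1)),
                               context_words.reshape(-1))
```

In training, such a term would simply be added to the standard next-word cross-entropy with a weighting coefficient, e.g. `loss = lm_loss + pdr_weight * regularizer(logits, context_words)`, so the only extra parameters are those of the small past decoder.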