Transformer Grammars: Augmenting Transformer Language Models with Syntactic Inductive Biases at Scale

by   Laurent Sartran, et al.

Transformer language models that are trained on vast amounts of data have achieved remarkable success at various NLP benchmarks. Intriguingly, this success is achieved by models that lack an explicit modeling of hierarchical syntactic structures, which were hypothesized by decades of linguistic research to be necessary for good generalization. This naturally leaves a question: to what extent can we further improve the performance of Transformer language models, through an inductive bias that encourages the model to explain the data through the lens of recursive syntactic compositions? Although the benefits of modeling recursive syntax have been shown at the small data and model scales, it remains an open question whether – and to what extent – a similar design principle is still beneficial in the case of powerful Transformer language models that work well at scale. To answer these questions, we introduce Transformer Grammars – a novel class of Transformer language models that combine: (i) the expressive power, scalability, and strong performance of Transformers, and (ii) recursive syntactic compositions, which here are implemented through a special attention mask. We find that Transformer Grammars outperform various strong baselines on multiple syntax-sensitive language modeling evaluation metrics, in addition to sentence-level language modeling perplexity. Nevertheless, we find that the recursive syntactic composition bottleneck harms perplexity on document-level modeling, providing evidence that a different kind of memory mechanism – that works independently of syntactic structures – plays an important role in the processing of long-form text.


A Systematic Assessment of Syntactic Generalization in Neural Language Models

State-of-the-art neural network models have achieved dizzyingly low perp...

Oracle Linguistic Graphs Complement a Pretrained Transformer Language Model: A Cross-formalism Comparison

We examine the extent to which, in principle, linguistic graph represent...

Multi-scale Transformer Language Models

We investigate multi-scale transformer language models that learn repres...

Exposing Attention Glitches with Flip-Flop Language Modeling

Why do large language models sometimes output factual inaccuracies and e...

Sudden Drops in the Loss: Syntax Acquisition, Phase Transitions, and Simplicity Bias in MLMs

Most interpretability research in NLP focuses on understanding the behav...

What do Toothbrushes do in the Kitchen? How Transformers Think our World is Structured

Transformer-based models are now predominant in NLP. They outperform app...

Controlled Evaluation of Grammatical Knowledge in Mandarin Chinese Language Models

Prior work has shown that structural supervision helps English language ...

Please sign up or login with your details

Forgot password? Click here to reset