Do Transformers Parse while Predicting the Masked Word?

by Haoyu Zhao, et al.

Pre-trained language models have been shown to encode linguistic structures, e.g., dependency and constituency parse trees, in their embeddings while being trained on unsupervised loss functions like masked language modeling. Doubts have been raised about whether the models are actually parsing or only performing some computation weakly correlated with it. We study two questions: (a) Is it possible to explicitly describe transformers with realistic embedding dimension, number of heads, etc. that are capable of parsing, or even approximate parsing? (b) Why do pre-trained models capture parsing structure? This paper takes a step toward answering these questions in the context of generative modeling with PCFGs. We show that masked language models like BERT or RoBERTa of moderate sizes can approximately execute the Inside-Outside algorithm for the English PCFG [Marcus et al., 1993]. We also show that the Inside-Outside algorithm is optimal for the masked language modeling loss on PCFG-generated data. We also give a construction of transformers with 50 layers, 15 attention heads, and 1275-dimensional embeddings on average whose embeddings make it possible to do constituency parsing with >70% F1 score on the PTB dataset. We conduct probing experiments on models pre-trained on PCFG-generated data to show that this not only allows recovery of an approximate parse tree, but also recovers the marginal span probabilities computed by the Inside-Outside algorithm, which suggests an implicit bias of masked language modeling towards this algorithm.
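
To make the "marginal span probabilities computed by the Inside-Outside algorithm" concrete, here is a minimal Python sketch of the textbook algorithm for a PCFG in Chomsky normal form. It is not the paper's transformer construction. The toy grammar (a single nonterminal `X` with rules `X -> X X` and `X -> 'a'`), the rule probabilities, and the function name `inside_outside` are illustrative assumptions, not taken from the paper or from the PTB grammar.

```python
from collections import defaultdict


def inside_outside(words, binary, lexical, root):
    """Return (sentence probability, marginal probability of each span).

    binary:  {(A, B, C): p} for rules A -> B C
    lexical: {(A, w): p}    for rules A -> w
    Spans are half-open (i, j) and cover words[i:j].
    """
    n = len(words)
    inside = defaultdict(float)   # (i, j, A) -> P(A derives words[i:j])
    outside = defaultdict(float)  # (i, j, A) -> P(everything outside the span, with A on it)

    # Inside pass, base case: single words.
    for i, w in enumerate(words):
        for (A, word), p in lexical.items():
            if word == w:
                inside[i, i + 1, A] += p

    # Inside pass, recursion: build longer spans from shorter ones.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):
                for (A, B, C), p in binary.items():
                    inside[i, j, A] += p * inside[i, k, B] * inside[k, j, C]

    z = inside[0, n, root]
    if z == 0.0:
        raise ValueError("sentence has zero probability under this grammar")

    # Outside pass: push mass from parent spans down to their children,
    # longest parents first so each outside score is complete before use.
    outside[0, n, root] = 1.0
    for length in range(n, 1, -1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):
                for (A, B, C), p in binary.items():
                    out = outside[i, j, A]
                    outside[i, k, B] += out * p * inside[k, j, C]
                    outside[k, j, C] += out * p * inside[i, k, B]

    nonterminals = {A for (A, _, _) in binary} | {A for (A, _) in lexical}
    marginals = {}
    for i in range(n):
        for j in range(i + 1, n + 1):
            total = sum(inside[i, j, A] * outside[i, j, A] for A in nonterminals)
            marginals[i, j] = total / z
    return z, marginals


# Toy, deliberately ambiguous grammar (an assumption for illustration only).
toy_binary = {("X", "X", "X"): 0.4}
toy_lexical = {("X", "a"): 0.6}
z, marginals = inside_outside(["a", "a", "a"], toy_binary, toy_lexical, "X")
for span in sorted(marginals):
    print(span, round(marginals[span], 4))
```

The sentence "a a a" has two bracketings under this toy grammar, so the two length-2 spans share the probability mass while the single-word spans and the full sentence are certain. A probe of the kind the abstract describes would try to recover such per-span quantities from a model's embeddings.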




Impact of Gender Debiased Word Embeddings in Language Modeling

Gender, race and social biases have recently been detected as evident ex...

Unsupervised and Few-shot Parsing from Pretrained Language Models

Pretrained language models are generally acknowledged to be able to enco...

Investigating Pre-trained Language Models on Cross-Domain Datasets, a Step Closer to General AI

Pre-trained language models have recently emerged as a powerful tool for...

Probing for Understanding of English Verb Classes and Alternations in Large Pre-trained Language Models

We investigate the extent to which verb alternation classes, as describe...

Unsupervised Latent Tree Induction with Deep Inside-Outside Recursive Autoencoders

We introduce deep inside-outside recursive autoencoders (DIORA), a fully...

Physics of Language Models: Part 1, Context-Free Grammar

We design experiments to study how generative language models, like GPT,...

Iterated Piecewise Affine (IPA) Approximation for Language Modeling

In this work, we demonstrate the application of a simple first-order Tay...
