A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space

09/13/2021
by Alex Jones, et al.

In cross-lingual language models, representations for many different languages live in the same space. Here, we investigate the linguistic and non-linguistic factors affecting sentence-level alignment in cross-lingual pretrained language models for 101 languages and 5,050 language pairs. Using BERT-based LaBSE and BiLSTM-based LASER as our models, and the Bible as our corpus, we compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance, as well as four intrinsic measures of vector space alignment and isomorphism. We then examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics. The results of our analyses show that word order agreement and agreement in morphological complexity are two of the strongest linguistic predictors of cross-linguality. We also note in-family training data as a stronger predictor than language-specific training data across the board. We verify some of our linguistic findings by looking at the effect of morphological segmentation on English-Inuktitut alignment, in addition to examining the effect of word order agreement on isomorphism for 66 zero-shot language pairs from a different corpus. We make the data and code for our experiments publicly available.
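
As a rough illustration of the task-based measure mentioned above, bitext retrieval can be scored by checking how often a source sentence's nearest neighbor among the target-language embeddings is its true translation. The sketch below is a minimal version under stated assumptions: it uses plain cosine similarity and top-1 accuracy, the function names `cosine` and `retrieval_accuracy` are hypothetical, and the toy vectors stand in for real LaBSE or LASER sentence embeddings. It is not the scoring procedure used in the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieval_accuracy(src_vecs, tgt_vecs):
    """Fraction of source sentences whose most similar target sentence
    is the aligned one (assumed to sit at the same index)."""
    if len(src_vecs) != len(tgt_vecs):
        raise ValueError("source and target must contain the same number of sentences")
    if not src_vecs:
        return 0.0
    hits = 0
    for i, src in enumerate(src_vecs):
        # Index of the target sentence closest to this source sentence.
        best = max(range(len(tgt_vecs)), key=lambda j: cosine(src, tgt_vecs[j]))
        if best == i:
            hits += 1
    return hits / len(src_vecs)


# Toy embeddings (assumed values); row i of src is a translation of row i of tgt.
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt = [[0.9, 0.1], [0.1, 0.9], [0.2, 1.0]]
accuracy = retrieval_accuracy(src, tgt)
```

A higher score suggests that the two languages' sentence vectors are better aligned in the shared space. In practice one would likely swap the nested loop for a vectorized similarity matrix and might prefer a margin-based score over raw cosine.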


research
10/27/2021

When is BERT Multilingual? Isolating Crucial Ingredients for Cross-lingual Transfer

While recent work on multilingual language models has demonstrated their...
research
04/06/2020

A Systematic Analysis of Morphological Content in BERT Models for Multiple Languages

This work describes experiments which probe the hidden representations o...
research
09/07/2022

Improving the Cross-Lingual Generalisation in Visual Question Answering

While several benefits were realized for multilingual vision-language pr...
research
04/13/2021

Finding Concept-specific Biases in Form–Meaning Associations

This work presents an information-theoretic operationalisation of cross-...
research
09/10/2021

Examining Cross-lingual Contextual Embeddings with Orthogonal Structural Probes

State-of-the-art contextual embeddings are obtained from large language ...
research
03/16/2020

HELFI: a Hebrew-Greek-Finnish Parallel Bible Corpus with Cross-Lingual Morpheme Alignment

Twenty-five years ago, morphologically aligned Hebrew-Finnish and Greek-...
research
11/04/2020

Probing Multilingual BERT for Genetic and Typological Signals

We probe the layers in multilingual BERT (mBERT) for phylogenetic and ge...
