Does mBERT understand Romansh? Evaluating word embeddings using word alignment

06/14/2023
by Eyal Liron Dolev, et al.

We test similarity-based word alignment models (SimAlign and awesome-align) in combination with word embeddings from mBERT and XLM-R on parallel sentences in German and Romansh. Since Romansh is an unseen language, we are dealing with a zero-shot setting. Using embeddings from mBERT, both models reach an alignment error rate of 0.22, which outperforms fast_align, a statistical model, and is on par with similarity-based word alignment for seen languages. We interpret these results as evidence that mBERT contains information that can be meaningful and applicable to Romansh. To evaluate performance, we also present a new trilingual corpus, which we call the DERMIT (DE-RM-IT) corpus, containing press releases made by the Canton of Grisons in German, Romansh and Italian in the past 25 years. The corpus contains 4 547 parallel documents and approximately 100 000 sentence pairs in each language combination. We additionally present a gold standard for German-Romansh word alignment. The data is available at https://github.com/eyldlv/DERMIT-Corpus.
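The alignment error rate (AER) quoted above can be illustrated with a small sketch. It uses the standard definition, AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|), where A is the set of predicted links, S the set of sure gold links and P the set of possible gold links (with S contained in P). Whether the German-Romansh gold standard distinguishes sure from possible links is not stated here, so the `possible` argument is an assumption and is optional; the function name and the toy links below are hypothetical and are not taken from the DERMIT data.

```python
def alignment_error_rate(predicted, sure, possible=None):
    """Compute AER for one set of alignment links.

    Each link is a (source_index, target_index) tuple.
    `possible` is optional (an assumption: some gold standards only
    annotate sure links); sure links always count as possible too.
    """
    predicted = set(predicted)
    sure = set(sure)
    possible = (set(possible) | sure) if possible is not None else sure

    # Nothing predicted and nothing to find: treat as zero error.
    if not predicted and not sure:
        return 0.0

    hits = len(predicted & sure) + len(predicted & possible)
    return 1.0 - hits / (len(predicted) + len(sure))


# Toy example with made-up links (not real corpus data).
sure_links = {(0, 0), (1, 1), (2, 3)}
possible_links = {(2, 2)}
predicted_links = {(0, 0), (1, 1), (2, 2), (3, 3)}

aer = alignment_error_rate(predicted_links, sure_links, possible_links)
print(f"AER = {aer:.3f}")
```

Lower is better: a prediction identical to the sure links scores 0, so an AER of 0.22 means that roughly a fifth of the combined predicted and gold links are in error under this measure.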


