Training dataset and dictionary sizes matter in BERT models: the case of Baltic languages

12/20/2021
by Matej Ulčar, et al.

Large pretrained masked language models have become state-of-the-art solutions for many NLP problems. Studies have shown that monolingual models produce better results than multilingual models, but only when the training datasets are sufficiently large. We trained a trilingual LitLat BERT-like model for Lithuanian, Latvian, and English, and a monolingual Est-RoBERTa model for Estonian. We evaluate their performance on four downstream tasks: named entity recognition, dependency parsing, part-of-speech tagging, and word analogy. To analyze the importance of focusing on a single language and the importance of a large training set, we compare the created models with existing monolingual and multilingual BERT models for Estonian, Latvian, and Lithuanian. The results show that, in most settings, the newly created LitLat BERT and Est-RoBERTa models improve on the results of existing models across all tested tasks.


