MiLMo: Minority Multilingual Pre-trained Language Model

12/04/2022
by Hanru Shi, et al.

Pre-trained language models are trained on large-scale unsupervised data, and they can be fine-tuned on small labeled datasets to achieve good results. Multilingual pre-trained language models are trained on multiple languages and can understand those languages simultaneously. At present, research on pre-trained models mainly focuses on high-resource languages, while low-resource languages such as minority languages receive relatively little attention, and publicly available multilingual pre-trained language models do not work well for them. Therefore, this paper constructs a multilingual pre-trained language model named MiLMo that performs better on minority-language tasks, covering Mongolian, Tibetan, Uyghur, Kazakh and Korean. To address the scarcity of minority-language datasets and to verify the effectiveness of MiLMo, this paper also constructs a minority multilingual text classification dataset named MiTC and trains a word2vec model for each language. By comparing the word2vec models and the pre-trained model on the text classification task, this paper provides a reference scheme for downstream research on minority languages. The experimental results show that the pre-trained model outperforms the word2vec models and achieves the best results on minority multilingual text classification. The multilingual pre-trained language model MiLMo, the multilingual word2vec models, and the multilingual text classification dataset MiTC are published at https://milmo.cmli-nlp.com.
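As a concrete illustration of the comparison described in the abstract, the sketch below fine-tunes a multilingual pre-trained encoder on a text classification set and, for contrast, trains a word2vec-plus-linear-classifier baseline. This is a minimal sketch under stated assumptions: the model identifier "CMLI-NLP/MiLMo", the CSV file names, and the column layout are hypothetical placeholders for illustration; the actual checkpoints and the MiTC dataset are distributed at https://milmo.cmli-nlp.com.

```python
# Minimal sketch of the two approaches compared in the paper:
# (1) fine-tuning a multilingual pre-trained encoder for text classification, and
# (2) a word2vec + linear-classifier baseline.
# The hub name "CMLI-NLP/MiLMo" and the MiTC CSV layout are assumptions, not the
# released artifacts; see https://milmo.cmli-nlp.com for the actual resources.

import numpy as np
import pandas as pd
from datasets import Dataset
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Columns assumed: "text" (string), "label" (integer class id).
train_df = pd.read_csv("mitc_train.csv")
test_df = pd.read_csv("mitc_test.csv")
num_labels = train_df["label"].nunique()

# ---------- (1) Fine-tune the pre-trained model ----------
model_name = "CMLI-NLP/MiLMo"  # hypothetical model identifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=num_labels)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_ds = Dataset.from_pandas(train_df).map(tokenize, batched=True)
test_ds = Dataset.from_pandas(test_df).map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="milmo-mitc", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=train_ds,
    eval_dataset=test_ds,
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()

# ---------- (2) word2vec baseline ----------
# Whitespace tokenization is a simplification; the paper trains one word2vec
# model per language, which implies language-specific segmentation.
tokens = [t.split() for t in train_df["text"]]
w2v = Word2Vec(sentences=tokens, vector_size=300, window=5, min_count=1)

def embed(text):
    # Average the word vectors of the in-vocabulary tokens of one document.
    vecs = [w2v.wv[w] for w in text.split() if w in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

clf = LogisticRegression(max_iter=1000)
clf.fit(np.stack([embed(t) for t in train_df["text"]]), train_df["label"])
print("word2vec baseline accuracy:",
      clf.score(np.stack([embed(t) for t in test_df["text"]]), test_df["label"]))
```

The averaged-embedding baseline deliberately mirrors the paper's comparison: the same labeled data feeds both models, so any gap in accuracy reflects the representations rather than the classifier.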
