DPRK-BERT: The Supreme Language Model

12/01/2021
by Arda Akdemir, et al.

Deep language models have achieved remarkable success in the NLP domain. The standard way to train a deep language model is to employ unsupervised learning from scratch on a large unlabeled corpus. However, such large corpora are only available for widely-adopted and high-resource languages and domains. This study presents the first deep language model, DPRK-BERT, for the DPRK language. We achieve this by compiling the first unlabeled corpus for the DPRK language and fine-tuning a preexisting ROK language model. We compare the proposed model with existing approaches and show significant improvements on two DPRK datasets. We also present a cross-lingual version of this model which yields better generalization across the two Korean languages. Finally, we provide various NLP tools related to the DPRK language that would foster future research.
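The recipe described in the abstract is continued masked-language-model pretraining: take an existing ROK checkpoint and keep training it on the new unlabeled DPRK corpus. Below is a minimal sketch of that recipe using the Hugging Face transformers library. The base checkpoint name, corpus path, and hyperparameters are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch: continue masked-language-model pretraining of a preexisting
# ROK BERT checkpoint on an unlabeled DPRK corpus.
# NOTE: the checkpoint ID, corpus path, and hyperparameters below are
# assumptions for illustration, not the authors' exact setup.
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

BASE_MODEL = "snunlp/KR-BERT-char16424"  # hypothetical choice of ROK checkpoint

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForMaskedLM.from_pretrained(BASE_MODEL)

# Unlabeled DPRK corpus: one sentence per line in a plain-text file (assumed path).
dataset = load_dataset("text", data_files={"train": "dprk_corpus.txt"})

def tokenize(batch):
    # Truncate long lines; the 128-token limit is an illustrative choice.
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Randomly mask 15% of tokens, the standard BERT masking rate.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="dprk-bert",
        num_train_epochs=3,
        per_device_train_batch_size=16,
    ),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```

Because the ROK and DPRK varieties share most of their vocabulary and script, reusing the ROK tokenizer and weights gives the model a strong starting point, which is why fine-tuning can succeed where from-scratch pretraining on a small corpus would not.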


