Continual Mixed-Language Pre-Training for Extremely Low-Resource Neural Machine Translation

05/09/2021
by   Zihan Liu, et al.

The data scarcity in low-resource languages has become a bottleneck to building robust neural machine translation systems. Fine-tuning a multilingual pre-trained model (e.g., mBART (Liu et al., 2020)) on the translation task is an effective approach for low-resource languages; however, its performance is greatly limited when the translation pair contains languages unseen during pre-training. In this paper, we present a continual pre-training (CPT) framework on mBART to effectively adapt it to unseen languages. We first construct noisy mixed-language text from the monolingual corpus of the target language in the translation pair to cover both the source and target languages, and then we continue pre-training mBART to reconstruct the original monolingual text. Results show that our method consistently improves fine-tuning performance over the mBART baseline, as well as other strong baselines, across all tested low-resource translation pairs containing unseen languages. Furthermore, our approach also boosts performance on translation pairs where both languages are seen in the original mBART pre-training. The code is available at https://github.com/zliucr/cpt-nmt.
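The data construction step is simple enough to sketch. The following is a minimal, hypothetical illustration (not the authors' released implementation; see the repository linked above for that) of how a noisy mixed-language input can be built from a target-language monolingual sentence using a word-level bilingual lexicon. The function name, the lexicon, and the probabilities replace_prob and mask_prob are assumptions for illustration; the paper's actual noising details may differ.

```python
# Minimal sketch, assuming a word-level bilingual lexicon mapping
# target-language words to source-language words. Not the paper's code.
import random

MASK = "<mask>"

def make_mixed_language_example(sentence, lexicon, replace_prob=0.3, mask_prob=0.15):
    """Turn one target-language monolingual sentence into a
    (noisy mixed-language input, original sentence) training pair."""
    noisy = []
    for tok in sentence.split():
        # Code-switch: replace some target-language words with their
        # source-language translations from the bilingual lexicon.
        if tok in lexicon and random.random() < replace_prob:
            tok = lexicon[tok]
        # Denoising: additionally mask some tokens so the model must
        # reconstruct the original monolingual sentence.
        if random.random() < mask_prob:
            tok = MASK
        noisy.append(tok)
    return " ".join(noisy), sentence

# Toy usage with a hypothetical German->English lexicon:
lexicon = {"Katze": "cat", "Haus": "house"}
src, tgt = make_mixed_language_example("die Katze ist im Haus", lexicon)
# `src` is the noisy mixed-language input and `tgt` the reconstruction
# target; mBART is then continually pre-trained on such pairs with its
# usual sequence-to-sequence denoising objective.
```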
