Unsupervised Statistical Machine Translation

09/04/2018
by   Mikel Artetxe, et al.

While modern machine translation has relied on large parallel corpora, a recent line of work has managed to train Neural Machine Translation (NMT) systems from monolingual corpora only (Artetxe et al., 2018c; Lample et al., 2018). Despite the potential of this approach for low-resource settings, existing systems are far behind their supervised counterparts, limiting their practical interest. In this paper, we propose an alternative approach based on phrase-based Statistical Machine Translation (SMT) that significantly closes the gap with supervised systems. Our method profits from the modular architecture of SMT: we first induce a phrase table from monolingual corpora through cross-lingual embedding mappings, combine it with an n-gram language model, and fine-tune hyperparameters through an unsupervised MERT variant. In addition, iterative backtranslation improves results further, yielding, for instance, 14.08 and 26.22 BLEU points in WMT 2014 English-German and English-French, respectively, an improvement of more than 7-10 BLEU points over previous unsupervised systems, and closing the gap with supervised SMT (Moses trained on Europarl) down to 2-5 BLEU points. Our implementation is available at https://github.com/artetxem/monoses
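The phrase-table induction step mentioned above can be illustrated with a small sketch. This is not the authors' implementation (see the repository linked above for that). It assumes that source and target embeddings have already been mapped into a shared space, and that translation probabilities come from a temperature-scaled softmax over cosine similarities. The function names, the toy vectors, and the `temperature` and `top_k` parameters are all hypothetical, chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def induce_phrase_table(src_emb, tgt_emb, temperature=0.1, top_k=2):
    """Build a toy phrase table from embeddings in a shared space.

    Assumption: the score of a target entry given a source entry is a
    softmax over cosine similarities, restricted to the top_k nearest
    target entries. temperature and top_k are illustrative values.
    """
    table = {}
    for src, u in src_emb.items():
        sims = {tgt: cosine(u, v) for tgt, v in tgt_emb.items()}
        # Keep only the top_k most similar target entries as candidates.
        best = sorted(sims, key=sims.get, reverse=True)[:top_k]
        # Subtract the maximum before exponentiating for numerical stability.
        max_sim = max(sims[t] for t in best)
        weights = {t: math.exp((sims[t] - max_sim) / temperature) for t in best}
        total = sum(weights.values())
        table[src] = {t: w / total for t, w in weights.items()}
    return table


# Toy 2-dimensional embeddings, assumed to be already cross-lingually mapped.
src_emb = {"house": [1.0, 0.1], "dog": [0.1, 1.0]}
tgt_emb = {"Haus": [0.9, 0.2], "Hund": [0.2, 0.9], "Katze": [0.4, 0.8]}

phrase_table = induce_phrase_table(src_emb, tgt_emb)
for src, candidates in phrase_table.items():
    print(src, candidates)
```

In a full system, scores like these would be only one feature among several: the abstract describes combining the induced phrase table with an n-gram language model and tuning the weights without parallel data.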


research
02/04/2019

An Effective Approach to Unsupervised Machine Translation

While machine translation has traditionally relied on large amounts of p...
research
10/30/2017

Unsupervised Neural Machine Translation

In spite of the recent success of neural machine translation (NMT) in st...
research
07/29/2019

CUNI Systems for the Unsupervised News Translation Task in WMT 2019

In this paper we describe the CUNI translation system used for the unsup...
research
07/29/2016

Connecting Phrase based Statistical Machine Translation Adaptation

Although more additional corpora are now available for Statistical Machi...
research
10/09/2020

ChrEn: Cherokee-English Machine Translation for Endangered Language Revitalization

Cherokee is a highly endangered Native American language spoken by the C...
research
04/20/2018

Phrase-Based & Neural Unsupervised Machine Translation

Machine translation systems achieve near human-level performance on some...
research
06/03/2020

Multi-Agent Cross-Translated Diversification for Unsupervised Machine Translation

Recent unsupervised machine translation (UMT) systems usually employ thr...
