Simplify-then-Translate: Automatic Preprocessing for Black-Box Machine Translation

by   Sneha Mehta, et al.

Black-box machine translation systems have proven incredibly useful for a variety of applications yet by design are hard to adapt, tune to a specific domain, or build on top of. In this work, we introduce a method to improve such systems via automatic pre-processing (APP) using sentence simplification. We first propose a method to automatically generate a large in-domain paraphrase corpus through back-translation with a black-box MT system, which is used to train a paraphrase model that "simplifies" the original sentence to be more conducive for translation. The model is used to preprocess source sentences of multiple low-resource language pairs. We show that this preprocessing leads to better translation performance as compared to non-preprocessed source sentences. We further perform side-by-side human evaluation to verify that translations of the simplified sentences are better than the original ones. Finally, we provide some guidance on recommended language pairs for generating the simplification model corpora by investigating the relationship between ease of translation of a language pair (as measured by BLEU) and quality of the resulting simplification model from back-translations of this language pair (as measured by SARI), and tie this into the downstream task of low-resource translation.


page 1

page 2

page 3

page 4


BitextEdit: Automatic Bitext Editing for Improved Low-Resource Machine Translation

Mined bitexts can contain imperfect translations that yield unreliable t...

Improving a Multi-Source Neural Machine Translation Model with Corpus Extension for Low-Resource Languages

In machine translation, we often try to collect resources to improve its...

Automatic Testing and Improvement of Machine Translation

This paper presents TransRepair, a fully automatic approach for testing ...

Imitation Attacks and Defenses for Black-box Machine Translation Systems

We consider an adversary looking to steal or attack a black-box machine ...

Filling Gender & Number Gaps in Neural Machine Translation with Black-box Context Injection

When translating from a language that does not morphologically mark info...

Beyond Glass-Box Features: Uncertainty Quantification Enhanced Quality Estimation for Neural Machine Translation

Quality Estimation (QE) plays an essential role in applications of Machi...

uniblock: Scoring and Filtering Corpus with Unicode Block Information

The preprocessing pipelines in Natural Language Processing usually invol...

Please sign up or login with your details

Forgot password? Click here to reset