Don't Rule Out Monolingual Speakers: A Method For Crowdsourcing Machine Translation Data

06/12/2021
by Rajat Bhatnagar, et al.

High-performing machine translation (MT) systems can help overcome language barriers and make it possible for everyone to communicate and use language technologies in the language of their choice. However, such systems require large amounts of parallel sentences for training, and translators can be difficult to find and expensive to hire. Here, we present a data collection strategy for MT which, in contrast, is cheap and simple, as it does not require bilingual speakers. Based on the insight that humans pay specific attention to movements, we use Graphics Interchange Format files (GIFs) as a pivot to collect parallel sentences from monolingual annotators. We use our strategy to collect data in Hindi, Tamil and English. As a baseline, we also collect data using images as a pivot. We perform an intrinsic evaluation by manually evaluating a subset of the sentence pairs and an extrinsic evaluation by finetuning mBART on the collected data. We find that sentences collected via GIFs are indeed of higher quality than those collected via images.
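To make the pivot idea concrete, the following is a minimal sketch of how descriptions written independently by monolingual annotators could be aligned into candidate sentence pairs through a shared pivot identifier. The abstract does not specify the pairing procedure, so everything here is an assumption for illustration: the function name `pair_by_pivot`, the `(pivot_id, lang, sentence)` record layout, the choice to form every cross-language combination per pivot, and the placeholder sentences.

```python
from collections import defaultdict
from itertools import product


def pair_by_pivot(descriptions, src_lang, tgt_lang):
    """Align descriptions of the same pivot (e.g. a GIF) across two languages.

    `descriptions` is an iterable of (pivot_id, lang, sentence) tuples.
    This record layout is hypothetical, not taken from the paper.
    Returns a list of (src_sentence, tgt_sentence) candidate pairs.
    """
    # Group sentences first by pivot, then by language.
    by_pivot = defaultdict(lambda: defaultdict(list))
    for pivot_id, lang, sentence in descriptions:
        by_pivot[pivot_id][lang].append(sentence)

    pairs = []
    for pivot_id in sorted(by_pivot):
        langs = by_pivot[pivot_id]
        # Assumption: every source description is paired with every target
        # description of the same pivot. Pivots missing one of the two
        # languages contribute no pairs.
        pairs.extend(product(langs.get(src_lang, []), langs.get(tgt_lang, [])))
    return pairs


# Hypothetical toy records; the Hindi entries are placeholders, not real data.
toy = [
    ("gif_001", "en", "A dog jumps over a fence."),
    ("gif_001", "hi", "<Hindi description A of gif_001>"),
    ("gif_001", "hi", "<Hindi description B of gif_001>"),
    ("gif_002", "en", "A man waves at the camera."),
]

for src, tgt in pair_by_pivot(toy, "hi", "en"):
    print(src, "|||", tgt)
```

Pairs produced this way are only as parallel as the annotators' descriptions are similar, which is why the abstract reports both a manual evaluation of a subset of sentence pairs and a downstream evaluation by finetuning mBART.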

