A parallel evaluation data set of software documentation with document structure annotation

08/11/2020
by Bianka Buschbeck, et al.

This paper accompanies the software documentation data set for machine translation, a parallel evaluation data set originating from the SAP Help Portal, which we release to the machine translation community for research purposes. It makes it possible to tune and evaluate machine translation systems in the domain of corporate software documentation and contributes to the availability of a wider range of evaluation scenarios. The data set comprises the language pairs English to Hindi, Indonesian, Malay and Thai, and thus also increases the test coverage for low-resource language pairs. Unlike most evaluation data sets, which consist of plain parallel text, the segments in this data set come with additional metadata that describes structural information about the document context. We provide insights into the origin, creation, particularities and characteristics of the data set.
