MMCR4NLP: Multilingual Multiway Corpora Repository for Natural Language Processing

10/03/2017
by Raj Dabre, et al.

Multilinguality is becoming ubiquitous: a growing number of researchers have shown that using additional languages helps improve results in many Natural Language Processing tasks. Multilingual Multiway Corpora (MMC) contain the same sentence in multiple languages. Such corpora have primarily been used for Multi-Source and Pivot Language Machine Translation, but they are also useful for developing multilingual sequence taggers by transfer learning. Although these corpora are available, they are not organized for multilingual experiments, and researchers must write boilerplate code every time they want to use them. Moreover, because there is no official MMC collection, it is difficult to compare against existing approaches. We therefore present our work on creating a unified and systematically organized repository of MMC spanning a large number of languages. We also provide training, development and test splits for corpora where official splits are unavailable. We hope that this will speed up multilingual NLP research and help NLP researchers obtain more trustworthy results, since results can be compared easily. We indicate corpus sources, extraction procedures (if any) and relevant statistics. We also make our collection public for research purposes.

research
07/14/2021

ParCourE: A Parallel Corpus Explorer for a Massively Multilingual Corpus

With more than 7000 languages worldwide, multilingual natural language p...
research
03/22/2021

Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

With the success of large-scale pre-training and multilingual modeling i...
research
08/15/2023

A User-Centered Evaluation of Spanish Text Simplification

We present an evaluation of text simplification (TS) in Spanish for a pr...
research
05/16/2018

DINFRA: A One Stop Shop for Computing Multilingual Semantic Relatedness

This demonstration presents an infrastructure for computing multilingual...
research
03/26/2019

A New Approach for Semi-automatic Building and Extending a Multilingual Terminology Thesaurus

This paper describes a new system for semi-automatically building, exten...
research
03/07/2023

Preparing the Vuk'uzenzele and ZA-gov-multilingual South African multilingual corpora

This paper introduces two multilingual government themed corpora in vari...
research
07/03/2020

El Departamento de Nosotros: How Machine Translated Corpora Affects Language Models in MRC Tasks

Pre-training large-scale language models (LMs) requires huge amounts of ...
