NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages

05/31/2022
by Genta Indra Winata, et al.

Natural language processing (NLP) has a significant impact on society via technologies such as machine translation and search engines. Despite its success, NLP technology is widely available only for high-resource languages such as English and Chinese, and it remains inaccessible to many other languages because data resources and benchmarks are unavailable. In this work, we focus on developing resources for languages in Indonesia. Although Indonesia is the second most linguistically diverse country, most of its languages are categorized as endangered and some are even extinct. We develop the first-ever parallel resource for 10 low-resource languages in Indonesia. Our resource includes datasets, a multi-task benchmark, and lexicons, as well as a parallel Indonesian-English dataset. We provide extensive analyses and describe the challenges of creating such resources. We hope that our work can spark NLP research on Indonesian and other underrepresented languages.

