Tackling the Low-resource Challenge for Canonical Segmentation

10/06/2020
by Manuel Mager, et al.
Canonical morphological segmentation consists of dividing words into their standardized morphemes. Here, we are interested in approaches for the task when training data is limited. We compare model performance in a simulated low-resource setting for the high-resource languages German, English, and Indonesian to experiments on new datasets for the truly low-resource languages Popoluca and Tepehua. We explore two new models for the task, borrowing from the closely related area of morphological generation: an LSTM pointer-generator and a sequence-to-sequence model with hard monotonic attention trained with imitation learning. We find that, in the low-resource setting, the novel approaches outperform existing ones on all languages by up to 11.4%. However, while accuracy in emulated low-resource scenarios is over 50% for all languages, for the truly low-resource languages Popoluca and Tepehua, our best model only obtains 37.4% [...]. We conclude that canonical segmentation is still a challenging task for low-resource languages.
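
To make the task and the reported accuracy figures concrete, the following is a minimal sketch of how a canonical segmentation could be scored. It assumes that accuracy means exact match over whole segmented words, which the abstract does not spell out. The example words, the hyphen as morpheme separator, and the function name `exact_match_accuracy` are illustrative assumptions and are not taken from the paper's datasets or code.

```python
from typing import Sequence


def exact_match_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of words whose predicted segmentation equals the reference exactly.

    Assumption: a segmentation is a string with morphemes joined by "-".
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(pred == ref for pred, ref in zip(predictions, references))
    return correct / len(references)


# Hypothetical gold canonical segmentations. Note that the morphemes are
# restored to a standardized form, so they need not be substrings of the word
# ("achievability" contains neither "achieve" nor "able" verbatim).
gold = {
    "achievability": "achieve-able-ity",
    "funniest": "funny-est",
    "walked": "walk-ed",
}

# Hypothetical system output. "funni-est" is a plausible surface split, but it
# is not the canonical form, so it counts as an error.
predicted = {
    "achievability": "achieve-able-ity",
    "funniest": "funni-est",
    "walked": "walk-ed",
}

words = sorted(gold)
accuracy = exact_match_accuracy(
    [predicted[w] for w in words],
    [gold[w] for w in words],
)
print(f"{accuracy:.3f}")  # 0.667
```

Under this strict metric a single wrong character in any morpheme makes the whole word count as incorrect, which is one reason scores can drop sharply when training data is scarce.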

