Crowdsourced Phrase-Based Tokenization for Low-Resourced Neural Machine Translation: The Case of Fon Language

03/14/2021
by Bonaventure F. P. Dossou, et al.

Building effective neural machine translation (NMT) models for very low-resourced and morphologically rich African indigenous languages is an open challenge. Beyond the difficulty of finding available resources for these languages, considerable effort goes into preprocessing and tokenization. Recent studies have shown that standard tokenization methods do not always adequately handle the grammatical, diacritical, and tonal properties of some African languages. This, coupled with the extremely low availability of training samples, hinders the production of reliable NMT models. In this paper, using the Fon language as a case study, we revisit standard tokenization methods and introduce Word-Expressions-Based (WEB) tokenization, a human-involved super-words tokenization strategy that creates a more representative vocabulary for training. Furthermore, we compare our tokenization strategy to others on the Fon-French and French-Fon translation tasks.
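To make the idea of super-words tokenization concrete, the following is a minimal sketch of one possible reading of it: a greedy longest-match tokenizer that keeps multi-word expressions from a human-curated list together as single tokens. It is not the paper's WEB implementation, which the abstract does not specify. The function name `phrase_tokenize`, the matching strategy, and the small French phrase list are all assumptions made for illustration.

```python
def phrase_tokenize(sentence, phrases):
    """Greedy longest-match tokenization over whitespace-separated words.

    `phrases` is a hypothetical, human-curated list of multi-word
    expressions that should survive as single tokens ("super-words").
    This is an illustrative sketch, not the paper's WEB algorithm.
    """
    words = sentence.split()
    # Store each curated phrase as a tuple of words for exact matching.
    phrase_set = {tuple(p.split()) for p in phrases}
    max_len = max((len(p) for p in phrase_set), default=1)

    tokens = []
    i = 0
    while i < len(words):
        # Try the longest candidate span first, then shrink it.
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = tuple(words[i:i + n])
            # A single word is always accepted as a fallback token.
            if n == 1 or candidate in phrase_set:
                tokens.append(" ".join(candidate))
                i += n
                break
    return tokens


# Hypothetical curated expressions (French, for illustration only).
curated = ["pomme de terre", "s'il vous plaît"]
print(phrase_tokenize("une pomme de terre s'il vous plaît", curated))
```

In this sketch the curated list plays the role of the human-involved step: annotators decide which expressions count as single units, and the tokenizer only applies those decisions. Words outside the list fall back to ordinary whitespace tokens.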

research
01/02/2020

Morphological Word Segmentation on Agglutinative Languages for Neural Machine Translation

Neural machine translation (NMT) has achieved impressive performance on ...
research
05/18/2018

Combining Advanced Methods in Japanese-Vietnamese Neural Machine Translation

Neural machine translation (NMT) systems have recently obtained state-of...
research
10/15/2019

On the Importance of Word Boundaries in Character-level Neural Machine Translation

Neural Machine Translation (NMT) models generally perform translation us...
research
08/10/2022

How Effective is Byte Pair Encoding for Out-Of-Vocabulary Words in Neural Machine Translation?

Neural Machine Translation (NMT) is an open vocabulary problem. As a res...
research
10/30/2019

A Latent Morphology Model for Open-Vocabulary Neural Machine Translation

Translation into morphologically-rich languages challenges neural machin...
research
04/05/2020

Neural Machine Translation with Imbalanced Classes

We cast neural machine translation (NMT) as a classification task in an ...
research
04/28/2020

Assessing the Bilingual Knowledge Learned by Neural Machine Translation Models

Machine translation (MT) systems translate text between different langua...