Tone prediction and orthographic conversion for Basaa

10/13/2022
by Ilya Nikitin, et al.

In this paper, we present a seq2seq approach for transliterating missionary Basaa orthographies into the official orthography. Our model uses corpora of Basaa missionary and official orthographies pre-trained with BERT. Since Basaa is a low-resource language, we chose the mT5 model for our project. Before training, we pre-processed our corpora by eliminating one-to-one correspondences between spellings and by unifying characters variably written as either one or two characters into a single-character form. Our best mT5 model achieved a CER of 12.6747 and a WER of 40.1012.
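
The following is a minimal, hypothetical sketch of the pipeline the abstract describes, not the authors' code: Unicode NFC normalization stands in for the one-or-two-character unification step, a single fine-tuning step of mT5 (via Hugging Face transformers) stands in for training on missionary-to-official pairs, and CER/WER are computed by edit distance. The checkpoint, example pair, and hyperparameters are all illustrative assumptions.

```python
# Illustrative sketch only: character unification, one mT5 training step,
# and CER/WER scoring. Checkpoint, data, and hyperparameters are assumptions.
import unicodedata
import torch
from transformers import AutoTokenizer, MT5ForConditionalGeneration

def edit_distance(a, b):
    """Levenshtein distance between two sequences (characters or words)."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[-1]

def cer(hyp, ref):
    return 100.0 * edit_distance(hyp, ref) / len(ref)

def wer(hyp, ref):
    return 100.0 * edit_distance(hyp.split(), ref.split()) / len(ref.split())

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

# Placeholder pair (not attested Basaa). NFC folds base-letter plus
# combining-mark sequences into single precomposed characters where
# Unicode defines them, e.g. "e" + U+0300 -> "è".
src = unicodedata.normalize("NFC", "me\u0300 nlo\u0301")  # missionary spelling
tgt = "mè nló"                                            # official orthography

inputs = tokenizer(src, return_tensors="pt")
labels = tokenizer(tgt, return_tensors="pt").input_ids

# One fine-tuning step; a real run would loop over the full parallel corpus.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model.train()
optimizer.zero_grad()
loss = model(**inputs, labels=labels).loss
loss.backward()
optimizer.step()

# Decode a hypothesis and score it against the reference.
model.eval()
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64)
hyp = tokenizer.decode(out[0], skip_special_tokens=True)
print(f"CER {cer(hyp, tgt):.4f}  WER {wer(hyp, tgt):.4f}")
```

In practice, CER and WER would be averaged over a held-out test set of parallel sentences rather than computed on a single pair.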

