Inference-only sub-character decomposition improves translation of unseen logographic characters

11/12/2020
by   Danielle Saunders, et al.
0

Neural Machine Translation (NMT) on logographic source languages struggles when translating `unseen' characters, which never appear in the training data. One possible approach to this problem uses sub-character decomposition for training and test sentences. However, this approach involves complete retraining, and its effectiveness for unseen character translation to non-logographic languages has not been fully explored. We investigate existing ideograph-based sub-character decomposition approaches for Chinese-to-English and Japanese-to-English NMT, for both high-resource and low-resource domains. For each language pair and domain we construct a test set where all source sentences contain at least one unseen logographic character. We find that complete sub-character decomposition often harms unseen character translation, and gives inconsistent results generally. We offer a simple alternative based on decomposition before inference for unseen characters only. Our approach allows flexible application, achieving translation adequacy improvements and requiring no additional models or training.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
09/07/2018

Neural Machine Translation of Logographic Languages Using Sub-character Level Information

Recent neural machine translation (NMT) systems have been greatly improv...
research
05/03/2018

Apply Chinese Radicals Into Neural Machine Translation: Deeper Than Character Level

In neural machine translation (NMT), researchers face the challenge of u...
research
01/13/2021

Uzbek Cyrillic-Latin-Cyrillic Machine Transliteration

In this paper, we introduce a data-driven approach to transliterating Uz...
research
05/08/2018

Improving Character-level Japanese-Chinese Neural Machine Translation with Radicals as an Additional Input Feature

In recent years, Neural Machine Translation (NMT) has been proven to get...
research
11/23/2022

Breaking the Representation Bottleneck of Chinese Characters: Neural Machine Translation with Stroke Sequence Modeling

Existing research generally treats Chinese character as a minimum unit f...
research
09/05/2022

Rare but Severe Neural Machine Translation Errors Induced by Minimal Deletion: An Empirical Study on Chinese and English

We examine the inducement of rare but severe errors in English-Chinese a...

Please sign up or login with your details

Forgot password? Click here to reset