Word, Subword or Character? An Empirical Study of Granularity in Chinese-English NMT

11/13/2017
by Yining Wang, et al.

Neural machine translation (NMT), a new approach to machine translation, has been shown to outperform conventional statistical machine translation (SMT) across a variety of language pairs. Translation is an open-vocabulary problem, but most existing NMT systems operate with a fixed vocabulary, which leaves them unable to translate rare words. This problem can be alleviated by using different translation granularities, such as character, subword and hybrid word-character. Translation involving Chinese is one of the most difficult tasks in machine translation. However, to the best of our knowledge, no other work has explored which translation granularity is most suitable for Chinese in NMT. In this paper, we conduct an extensive comparison using Chinese-English NMT as a case study. Furthermore, we discuss the advantages and disadvantages of the various translation granularities in detail. Our experiments show that the subword model performs best for Chinese-to-English translation when the vocabulary is relatively small, while the hybrid word-character model is most suitable for English-to-Chinese translation. Moreover, experiments across granularities show that the Hybrid_BPE method achieves the best result on the Chinese-to-English translation task.
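
To make the granularities named in the abstract concrete, the following is a minimal sketch of how one pre-tokenized sentence might be segmented at the word, character and hybrid word-character levels. It is an illustration only, not the authors' implementation: the function names, the toy vocabulary, the `<unk>` and `</w>` symbols, and the `<B>`/`<M>`/`<E>` position tags are all assumptions chosen for this example.

```python
from typing import List, Set

UNK = "<unk>"  # assumed placeholder for out-of-vocabulary words


def word_level(tokens: List[str], vocab: Set[str]) -> List[str]:
    """Fixed-vocabulary word model: every rare word collapses to UNK."""
    return [t if t in vocab else UNK for t in tokens]


def char_level(tokens: List[str]) -> List[str]:
    """Character model: split every word, marking word ends with </w>."""
    out: List[str] = []
    for t in tokens:
        out.extend(t)       # one symbol per character
        out.append("</w>")  # assumed end-of-word marker
    return out


def hybrid_word_char(tokens: List[str], vocab: Set[str]) -> List[str]:
    """Hybrid model: keep in-vocabulary words, spell out the rest.

    Characters of a rare word carry an assumed position tag:
    <B> for the first, <E> for the last, <M> for those in between.
    """
    out: List[str] = []
    for t in tokens:
        if t in vocab:
            out.append(t)
            continue
        for i, c in enumerate(t):
            if i == 0:
                tag = "<B>"
            elif i == len(t) - 1:
                tag = "<E>"
            else:
                tag = "<M>"
            out.append(tag + c)
    return out


if __name__ == "__main__":
    vocab = {"we", "love"}          # toy vocabulary (assumption)
    sentence = ["we", "love", "nmt"]
    print(word_level(sentence, vocab))
    print(char_level(sentence))
    print(hybrid_word_char(sentence, vocab))
```

In this sketch the word-level model loses the rare word entirely, the character-level model keeps it at the cost of a much longer sequence, and the hybrid model lengthens the sequence only where a rare word occurs.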


research
11/07/2019

SubCharacter Chinese-English Neural Machine Translation with Wubi encoding

Neural machine translation (NMT) is one of the best methods for understa...
research
05/03/2018

Apply Chinese Radicals Into Neural Machine Translation: Deeper Than Character Level

In neural machine translation (NMT), researchers face the challenge of u...
research
09/05/2022

Rare but Severe Neural Machine Translation Errors Induced by Minimal Deletion: An Empirical Study on Chinese and English

We examine the inducement of rare but severe errors in English-Chinese a...
research
09/30/2022

Blur the Linguistic Boundary: Interpreting Chinese Buddhist Sutra in English via Neural Machine Translation

Buddhism is an influential religion with a long-standing history and pro...
research
10/02/2018

Optimally Segmenting Inputs for NMT Shows Preference for Character-Level Processing

Most modern neural machine translation (NMT) systems rely on presegmente...
research
11/23/2022

Breaking the Representation Bottleneck of Chinese Characters: Neural Machine Translation with Stroke Sequence Modeling

Existing research generally treats Chinese character as a minimum unit f...
research
10/02/2018

Learning to Segment Inputs for NMT Favors Character-Level Processing

Most modern neural machine translation (NMT) systems rely on presegmente...
