Sentence-level dialects identification in the greater China region

01/08/2017
by Fan Xu, et al.

Identifying different varieties of the same language is more challenging than distinguishing unrelated languages. In this paper, we propose an approach to discriminate between language varieties or dialects of Mandarin Chinese used in Mainland China, Hong Kong, Taiwan, Macao, Malaysia and Singapore, a.k.a. the Greater China Region (GCR). Applying it to dialect identification in the GCR, we find that the commonly used character-level or word-level unigram features are not very effective, because of specific problems such as the ambiguity and context dependence of words in the dialects of the GCR. To overcome these challenges, we use not only general features such as character-level n-grams, but also many new word-level features, including PMI-based and word alignment-based features. A series of evaluation results on both a news dataset and an open-domain dataset from Wikipedia show the effectiveness of the proposed approach.
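
To make the two feature families named in the abstract more concrete, here is a minimal sketch of character-level n-gram extraction and pointwise mutual information (PMI) between words. The abstract does not give the paper's exact feature definitions, so everything below is an assumption for illustration: the function names `char_ngrams` and `pmi_table` are hypothetical, and PMI is estimated from sentence-level co-occurrence over pre-segmented input, which may differ from what the authors did.

```python
import math
from collections import Counter
from itertools import combinations


def char_ngrams(sentence, n):
    """Return the overlapping character n-grams of a sentence, in order."""
    return [sentence[i:i + n] for i in range(len(sentence) - n + 1)]


def pmi_table(tokenized_sentences):
    """Compute PMI for every word pair that co-occurs in at least one sentence.

    Assumption: probabilities are estimated as the fraction of sentences
    that contain a word (or both words of a pair). Input must already be
    segmented into words; segmentation itself is out of scope here.
    """
    total = len(tokenized_sentences)
    word_counts = Counter()
    pair_counts = Counter()
    for tokens in tokenized_sentences:
        vocab = sorted(set(tokens))  # count each word once per sentence
        word_counts.update(vocab)
        pair_counts.update(combinations(vocab, 2))

    table = {}
    for (a, b), n_ab in pair_counts.items():
        p_ab = n_ab / total
        p_a = word_counts[a] / total
        p_b = word_counts[b] / total
        table[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return table


if __name__ == "__main__":
    # Character bigrams need no word segmentation, which suits Chinese text.
    print(char_ngrams("软件工程", 2))

    # Toy, made-up corpus of pre-segmented sentences (not data from the paper).
    corpus = [["a", "b"], ["a", "b"], ["c", "d"], ["c", "d"]]
    print(pmi_table(corpus))
```

In a classifier, the n-gram lists would typically be turned into count or TF-IDF vectors, while PMI scores could help pick out word pairs that are strongly associated within one regional variety. How the paper actually combines these signals is not stated in the abstract.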

research
10/08/2018

End-to-End Text Classification via Image-based Embedding using Character-level Networks

For analysing and/or understanding languages having no word boundaries b...
research
08/10/2016

Hierarchical Character-Word Models for Language Identification

Social media messages' brevity and unconventional spelling pose a challe...
research
08/22/2018

A syllable based model for handwriting recognition

In this paper, we introduce a new modeling approach of texts for handwri...
research
12/03/2018

Comparing Neural- and N-Gram-Based Language Models for Word Segmentation

Word segmentation is the task of inserting or deleting word boundary cha...
research
05/31/2019

Investigating an Effective Character-level Embedding in Korean Sentence Classification

Different from the writing systems of many Romance and Germanic language...
research
07/08/2015

What Your Username Says About You

Usernames are ubiquitous on the Internet, and they are often suggestive ...
research
12/25/2019

N-gram Statistical Stemmer for Bangla Corpus

Stemming is a process that can be utilized to trim inflected words to st...
