AttaCut: A Fast and Accurate Neural Thai Word Segmenter

11/16/2019
by   Pattarawat Chormai, et al.

Word segmentation is a fundamental pre-processing step for Thai Natural Language Processing. The current off-the-shelf solutions are not benchmarked consistently, so it is difficult to compare their trade-offs. We conducted a speed and accuracy comparison of the popular systems on three different domains and found that the state-of-the-art deep learning system is slow and moreover does not use sub-word structures to guide the model. Here, we propose a fast and accurate neural Thai Word Segmenter that uses dilated CNN filters to capture the environment of each character and uses syllable embeddings as features. Our system runs at least 5.6x faster and outperforms the previous state-of-the-art system on some domains. In addition, we develop the first ML-based Thai orthographical syllable segmenter, which yields syllable embeddings to be used as features by the word segmenter.
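To make the "dilated CNN filters" idea concrete, here is a minimal pure-Python sketch of a one-dimensional dilated convolution over a sequence of per-character values. It is an illustration only, not the authors' implementation: the function names `dilated_conv1d` and `receptive_positions` are hypothetical, the kernel weights are fixed rather than learned, and each character is reduced to a single number instead of an embedding vector. The point it shows is that increasing the dilation lets a filter of the same width look at characters farther from the current one.

```python
from typing import List, Sequence


def dilated_conv1d(
    signal: Sequence[float], kernel: Sequence[float], dilation: int = 1
) -> List[float]:
    """Zero-padded ('same' length) 1D dilated convolution.

    Hypothetical helper for illustration; assumes an odd kernel width so the
    filter is centred on the current position.
    """
    if dilation < 1:
        raise ValueError("dilation must be >= 1")
    if len(kernel) % 2 == 0:
        raise ValueError("this sketch assumes an odd kernel width")
    half = len(kernel) // 2
    output = []
    for i in range(len(signal)):
        total = 0.0
        for k, weight in enumerate(kernel):
            # Taps are spaced `dilation` positions apart around position i.
            j = i + (k - half) * dilation
            if 0 <= j < len(signal):  # positions outside the sequence count as zero
                total += weight * signal[j]
        output.append(total)
    return output


def receptive_positions(center: int, kernel_width: int, dilation: int) -> List[int]:
    """Indices a single filter reads when centred on `center`."""
    half = kernel_width // 2
    return [center + (k - half) * dilation for k in range(kernel_width)]


if __name__ == "__main__":
    values = [1, 2, 3, 4, 5]
    print(dilated_conv1d(values, [1, 1, 1], dilation=1))
    print(dilated_conv1d(values, [1, 1, 1], dilation=2))
    print(receptive_positions(center=5, kernel_width=3, dilation=2))
```

With a width-3 filter, a dilation of 1 reads the immediate neighbours of a character, while a dilation of 2 reads the characters two positions away on each side, so the surrounding context grows without adding filter weights.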


