Multilevel Text Normalization with Sequence-to-Sequence Networks and Multisource Learning

03/27/2019
by Tatyana Ruzsics, et al.

We define multilevel text normalization as sequence-to-sequence processing that transforms naturally noisy text into a sequence of normalized units of meaning (morphemes) in three steps: 1) writing normalization, 2) lemmatization, 3) canonical segmentation. These steps are traditionally considered separate NLP tasks, with diverse solutions, evaluation schemes and data sources. We exploit the fact that all these tasks involve sub-word sequence-to-sequence transformation to propose a systematic solution for all of them using neural encoder-decoder technology. The specific challenge that we tackle in this paper is integrating the traditional know-how on separate tasks into the neural sequence-to-sequence framework to improve the state of the art. We address this challenge by enriching the general framework with mechanisms that allow processing the information on multiple levels of text organization (characters, morphemes, words, sentences) in combination with structural information (multilevel language model, part-of-speech) and heterogeneous sources (text, dictionaries). We show that our solution consistently improves on the current methods in all three steps. In addition, we analyze the performance of our system to show the specific contribution of the integrating components to the overall improvement.
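To make the shared sub-word setup concrete, below is a minimal sketch (our illustration, not the authors' code) of the kind of character-level encoder-decoder that all three steps build on, shown here on canonical segmentation. The paper's full system adds attention, multilevel language modeling, part-of-speech information, and multisource inputs; the model class, hyperparameters, and toy example below are assumptions for illustration only.

```python
# Minimal character-level sequence-to-sequence sketch (assumed, illustrative):
# maps a surface word form to its canonical segmentation, e.g.
# "untreatable" -> "un treat able".
import torch
import torch.nn as nn

PAD, SOS, EOS = 0, 1, 2  # reserved symbol ids

class CharSeq2Seq(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hid_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=PAD)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, src, tgt_in):
        # Encode the source characters; the final hidden state initializes
        # the decoder (teacher forcing, no attention in this toy version).
        _, h = self.encoder(self.embed(src))
        dec_out, _ = self.decoder(self.embed(tgt_in), h)
        return self.out(dec_out)  # logits over target characters

    @torch.no_grad()
    def greedy_decode(self, src, max_len=40):
        # Greedy left-to-right generation starting from the SOS symbol.
        _, h = self.encoder(self.embed(src))
        tok = torch.full((src.size(0), 1), SOS, dtype=torch.long)
        out = []
        for _ in range(max_len):
            dec_out, h = self.decoder(self.embed(tok), h)
            tok = self.out(dec_out).argmax(-1)
            out.append(tok)
        return torch.cat(out, dim=1)

# Toy data: lowercase characters plus space; indices 0-2 are reserved.
chars = "abcdefghijklmnopqrstuvwxyz "
stoi = {c: i + 3 for i, c in enumerate(chars)}
encode = lambda s: torch.tensor([[SOS] + [stoi[c] for c in s] + [EOS]])

model = CharSeq2Seq(vocab_size=len(chars) + 3)
src = encode("untreatable")           # surface word form
tgt = encode("un treat able")         # canonical segmentation
logits = model(src, tgt[:, :-1])      # feed SOS + target prefix
loss = nn.CrossEntropyLoss(ignore_index=PAD)(
    logits.reshape(-1, logits.size(-1)), tgt[:, 1:].reshape(-1))
loss.backward()                       # an optimizer step would follow
print(model.greedy_decode(src).shape)  # (1, 40) predicted character ids
```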

Related research

07/22/2018 · Multi-scale Alignment and Contextual History for Attention Mechanism in Sequence-to-sequence Model
A sequence-to-sequence model is a neural network module for mapping two ...

04/12/2019 · Adapting Sequence to Sequence models for Text Normalization in Social Media
Social media offer an abundant source of valuable raw data, however info...

09/05/2018 · Utilizing Character and Word Embeddings for Text Normalization with Sequence-to-Sequence Models
Text normalization is an important enabling technology for several NLP t...

01/21/2019 · Chemical Names Standardization using Neural Sequence to Sequence Model
Chemical information extraction is to convert chemical knowledge in text...

11/29/2019 · Neural Chinese Word Segmentation as Sequence to Sequence Translation
Recently, Chinese word segmentation (CWS) methods using neural networks ...

07/04/2017 · Shakespearizing Modern Language Using Copy-Enriched Sequence-to-Sequence Models
Variations in writing styles are commonly used to adapt the content to a...
