Exploring Word Embeddings for Unsupervised Textual User-Generated Content Normalization

Text normalization techniques based on rules, lexicons or supervised training requiring large corpora are not scalable nor domain interchangeable, and this makes them unsuitable for normalizing user-generated content (UGC). Current tools available for Brazilian Portuguese make use of such techniques. In this work we propose a technique based on distributed representation of words (or word embeddings). It generates continuous numeric vectors of high-dimensionality to represent words. The vectors explicitly encode many linguistic regularities and patterns, as well as syntactic and semantic word relationships. Words that share semantic similarity are represented by similar vectors. Based on these features, we present a totally unsupervised, expandable and language and domain independent method for learning normalization lexicons from word embeddings. Our approach obtains high correction rate of orthographic errors and internet slang in product reviews, outperforming the current available tools for Brazilian Portuguese.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
10/08/2018

Word Embeddings from Large-Scale Greek Web content

Word embeddings are undoubtedly very useful components in many NLP tasks...
research
01/21/2020

Generating Sense Embeddings for Syntactic and Semantic Analogy for Portuguese

Word embeddings are numerical vectors which can represent words or conce...
research
07/04/2019

Morphological Word Embeddings

Linguistic similarity is multi-faceted. For instance, two words may be s...
research
12/19/2022

Independent Components of Word Embeddings Represent Semantic Features

Independent Component Analysis (ICA) is an algorithm originally develope...
research
12/05/2017

AWE-CM Vectors: Augmenting Word Embeddings with a Clinical Metathesaurus

In recent years, word embeddings have been surprisingly effective at cap...
research
08/10/2018

Unsupervised Keyphrase Extraction Based on Outlier Detection

We propose a novel unsupervised keyphrase extraction approach based on o...
research
08/10/2015

Measuring Word Significance using Distributed Representations of Words

Distributed representations of words as real-valued vectors in a relativ...

Please sign up or login with your details

Forgot password? Click here to reset