Embedding generation for text classification of Brazilian Portuguese user reviews: from bag-of-words to transformers

by Frederico Dias Souza, et al.

Text classification is a natural language processing (NLP) task relevant to many commercial applications, such as e-commerce and customer service. Classifying such texts accurately is often challenging because of intrinsic aspects of language, such as irony and nuance. To accomplish this task, one must provide a robust numerical representation for documents, a process known as embedding. Embedding is a key NLP field that has advanced significantly in the last decade, especially after the introduction of the word-to-vector concept and the popularization of Deep Learning models for solving NLP tasks, including Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Transformer-based Language Models (TLMs). Despite the impressive achievements in this field, the literature on generating embeddings for Brazilian Portuguese texts is scarce, especially for commercial user reviews. Therefore, this work aims to provide a comprehensive experimental study of embedding approaches for binary sentiment classification of user reviews in Brazilian Portuguese. The study covers models ranging from classical (Bag-of-Words) to state-of-the-art (Transformer-based). The methods are evaluated on five open-source databases with pre-defined data partitions, made available in an open digital repository to encourage reproducibility. The fine-tuned TLMs achieved the best results in all cases, followed by the feature-based TLM, LSTM, and CNN, whose relative ranks vary with the database under analysis.
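As a rough illustration of the classical end of this spectrum, the sketch below builds bag-of-words count vectors and classifies a review by cosine similarity to per-class centroids. The toy reviews, the function names, and the nearest-centroid rule are all assumptions made for illustration; they are not the datasets, code, or classifiers used in the study.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase and split on whitespace; a real pipeline would also
    # handle punctuation, accents, and stop words.
    return text.lower().split()


def build_vocabulary(documents):
    # Assign each distinct token a fixed column index (sorted for determinism).
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def bag_of_words(text, vocabulary):
    # Count vector with one dimension per vocabulary token;
    # tokens outside the vocabulary are ignored.
    vector = [0.0] * len(vocabulary)
    for tok, count in Counter(tokenize(text)).items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = float(count)
    return vector


def cosine(u, v):
    # Cosine similarity, defined as 0.0 when either vector is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    # Component-wise mean of a non-empty list of equal-length vectors.
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Hypothetical toy reviews (made up for this sketch): 1 = positive, 0 = negative.
train = [
    ("produto excelente chegou rápido", 1),
    ("adorei o produto muito bom", 1),
    ("produto péssimo chegou quebrado", 0),
    ("muito ruim não recomendo", 0),
]

vocabulary = build_vocabulary([text for text, _ in train])
centroids = {
    label: centroid(
        [bag_of_words(text, vocabulary) for text, y in train if y == label]
    )
    for label in (0, 1)
}


def predict(text):
    # Return the label whose centroid is most similar to the review's vector.
    vector = bag_of_words(text, vocabulary)
    return max(centroids, key=lambda label: cosine(vector, centroids[label]))


if __name__ == "__main__":
    print(predict("produto muito bom"))
    print(predict("péssimo não recomendo"))
```

The same `predict` interface could in principle sit on top of any embedding function, which is the comparison the study carries out: only `bag_of_words` would be swapped for a learned representation.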




