Text Classification based on Word Subspace with Term-Frequency

by Erica K. Shimomoto, et al.

Text classification has become indispensable due to the rapid increase of text in digital form. Over the past three decades, efforts have been made to approach this task using various learning algorithms and statistical models based on bag-of-words (BOW) features. Despite their simple implementation, BOW features do not represent semantic meaning. To solve this problem, neural networks have been employed to learn word vectors, as in word2vec. Word2vec embeds the semantic structure of words into vectors, where the angle between two vectors indicates the semantic similarity between the corresponding words. To measure the similarity between texts, we propose the novel concept of word subspace, which can represent the intrinsic variability of features in a set of word vectors. Through this concept, it is possible to model a text from its word vectors while retaining semantic information. To incorporate word frequency directly into the subspace model, we further extend the word subspace to the term-frequency (TF) weighted word subspace. Based on these new concepts, text classification can be performed under the mutual subspace method (MSM) framework. The validity of our modeling is shown through experiments on the Reuters text database, comparing the results to various state-of-the-art algorithms.
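
The pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not give the construction details, so the following are assumptions rather than the authors' exact method: the word subspace is taken as the span of the leading eigenvectors of the (optionally TF-weighted) autocorrelation matrix of a text's word vectors, and the MSM similarity is taken as the mean squared cosine of the canonical angles between two subspaces. The function names `word_subspace` and `subspace_similarity` and the parameter `dim` are hypothetical.

```python
import numpy as np


def word_subspace(vectors, counts=None, dim=2):
    """Return an orthonormal basis (d x dim) for a set of word vectors.

    Assumption: the basis is given by the top `dim` eigenvectors of the
    autocorrelation matrix (PCA without mean centering). If `counts` is
    given, each word vector is weighted by its term frequency.
    """
    X = np.asarray(vectors, dtype=float)          # shape (n_words, d)
    if counts is None:
        w = np.ones(len(X))
    else:
        w = np.asarray(counts, dtype=float)       # shape (n_words,)
    # Weighted autocorrelation matrix: sum_i w_i * x_i x_i^T / sum_i w_i
    R = (X.T * w) @ X / w.sum()
    eigvals, eigvecs = np.linalg.eigh(R)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:dim]       # indices of the largest ones
    return eigvecs[:, order]


def subspace_similarity(U, V):
    """Mean squared cosine of the canonical angles between two subspaces.

    The singular values of U^T V are the cosines of the canonical angles
    when U and V have orthonormal columns.
    """
    cosines = np.linalg.svd(U.T @ V, compute_uv=False)
    cosines = np.clip(cosines, 0.0, 1.0)          # guard against rounding error
    return float(np.mean(cosines ** 2))


if __name__ == "__main__":
    # Toy 4-dimensional "word vectors"; real use would load word2vec embeddings.
    class_a = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 1, 0, 0]]
    class_b = [[0, 0, 1, 0], [0, 0, 0, 1]]
    query = [[1, 0.1, 0, 0], [0.2, 1, 0, 0]]
    query_tf = [3, 1]                             # term frequencies of the query words

    references = {"a": word_subspace(class_a), "b": word_subspace(class_b)}
    q = word_subspace(query, counts=query_tf)
    scores = {name: subspace_similarity(q, ref) for name, ref in references.items()}
    print(max(scores, key=scores.get), scores)
```

In this sketch a query text is assigned to the class whose reference subspace yields the highest similarity. The subspace dimension is a free parameter that would need tuning on real data.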




