The Influence of Feature Representation of Text on the Performance of Document Classification

In this paper we perform a comparative analysis of three models for feature representation of text documents in the context of document classification. In particular, we consider the most commonly used family of models, bag-of-words; the recently proposed continuous-space models word2vec and doc2vec; and a model based on the representation of text documents as language networks. While bag-of-words models have been used extensively for document classification, the performance of the other two models on the same task is not well understood. This is especially true for the network-based model, which has rarely been considered for representing text documents for classification. In this study, we measure the performance of document classifiers trained with the random forest method on features generated by the three models and their variants. The results of the empirical comparison show that the commonly used bag-of-words model performs comparably to the emerging continuous-space model doc2vec. In particular, the low-dimensional variants of doc2vec, which generate up to 75 features, are among the top-performing document representation models. Finally, the results indicate that doc2vec shows superior performance when classifying large documents.
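To make the bag-of-words representation mentioned above concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the tokenizer, the raw term-count weighting, the function names (`tokenize`, `build_vocabulary`, `bag_of_words`), and the two toy documents are all assumptions chosen for illustration. The abstract does not say which bag-of-words variants or preprocessing steps were used.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumed preprocessing: lowercase, keep runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(documents):
    # Map each distinct term in the corpus to a fixed feature index.
    terms = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {term: idx for idx, term in enumerate(terms)}


def bag_of_words(document, vocabulary):
    # Represent a document as a vector of raw term counts.
    # Word order is discarded; terms outside the vocabulary are ignored.
    counts = Counter(tokenize(document))
    vector = [0] * len(vocabulary)
    for term, n in counts.items():
        idx = vocabulary.get(term)
        if idx is not None:
            vector[idx] = n
    return vector


# Hypothetical toy corpus, not data from the paper.
docs = ["the cat sat on the mat", "the dog chased the cat"]
vocab = build_vocabulary(docs)
features = [bag_of_words(d, vocab) for d in docs]
```

Each row of `features` is a fixed-length vector that could be passed to a classifier such as a random forest. The vector length equals the vocabulary size, which is why bag-of-words features are typically much higher-dimensional than the doc2vec variants with up to 75 features described above.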

research
01/28/2013

An alternative text representation to TF-IDF and Bag-of-Words

In text mining, information retrieval, and machine learning, text docume...
research
12/19/2014

N-gram-Based Low-Dimensional Representation for Document Classification

The bag-of-words (BOW) model is the common approach for classifying docu...
research
05/16/2014

Distributed Representations of Sentences and Documents

Many machine learning algorithms require the input to be represented as ...
research
09/02/2020

Identifying Documents In-Scope of a Collection from Web Archives

Web archive data usually contains high-quality documents that are very u...
research
10/04/2015

A Novel Approach to Document Classification using WordNet

Content based Document Classification is one of the biggest challenges i...
research
11/01/2021

Comparative Study of Long Document Classification

The amount of information stored in the form of documents on the interne...
research
12/23/2016

"What is Relevant in a Text Document?": An Interpretable Machine Learning Approach

Text documents can be described by a number of abstract concepts such as...
