Exploring the Relationship Between Algorithm Performance, Vocabulary, and Run-Time in Text Classification

04/08/2021
by Wilson Fearn, et al.

Text classification is a significant branch of natural language processing, with applications including document classification and sentiment analysis. Unsurprisingly, practitioners are concerned with the run-time of their algorithms, many of which depend on the size of the corpus's vocabulary because of their bag-of-words representation. Although many studies have examined the effect of preprocessing techniques on vocabulary size and accuracy, none have examined how these methods affect a model's run-time. To fill this gap, we provide a comprehensive study that examines how preprocessing techniques affect vocabulary size, model performance, and model run-time, evaluating ten techniques over four models and two datasets. We show that some individual methods can reduce run-time with no loss of accuracy, while some combinations of methods can trade 2-5% of accuracy for up to a 65% reduction in run-time. Furthermore, some combinations of preprocessing techniques can even provide a 15% reduction in run-time while simultaneously improving model accuracy.
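
To make the link between preprocessing and vocabulary size concrete, the following is a minimal sketch of how a bag-of-words vocabulary shrinks as preprocessing steps are stacked. The toy corpus, the stopword list, and the three steps shown (lowercasing, stopword removal, rare-word filtering) are illustrative assumptions; the abstract does not name the ten techniques studied, and the helper names `tokenize` and `vocabulary` are hypothetical.

```python
import re
from collections import Counter

# Toy corpus and stopword list: illustrative assumptions, not the paper's data.
DOCS = [
    "The movie was great and the acting was great",
    "The plot was dull and the pacing was slow",
    "Great acting, dull plot",
]
STOPWORDS = {"the", "was", "and"}


def tokenize(text, lowercase=False):
    """Split text into alphabetic tokens, optionally lowercasing first."""
    if lowercase:
        text = text.lower()
    return re.findall(r"[A-Za-z]+", text)


def vocabulary(docs, lowercase=False, stopwords=frozenset(), min_count=1):
    """Return the set of word types kept after the chosen preprocessing steps."""
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc, lowercase))
    return {
        word
        for word, count in counts.items()
        if count >= min_count and word not in stopwords
    }


# Stack the steps one at a time and record the resulting vocabulary.
raw = vocabulary(DOCS)
lowered = vocabulary(DOCS, lowercase=True)
no_stop = vocabulary(DOCS, lowercase=True, stopwords=STOPWORDS)
frequent = vocabulary(DOCS, lowercase=True, stopwords=STOPWORDS, min_count=2)

for name, vocab in [
    ("raw", raw),
    ("lowercased", lowered),
    ("stopwords removed", no_stop),
    ("min_count=2", frequent),
]:
    print(f"{name}: {len(vocab)} word types")
```

Because a bag-of-words model has one feature per word type, each reduction in vocabulary size shrinks the feature vectors the classifier must process. Whether that translates into a run-time gain without an accuracy loss is the empirical question the study addresses; this sketch only counts word types.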
