Taming Wild High Dimensional Text Data with a Fuzzy Lash

12/16/2017
by Amir Karami, et al.

The bag-of-words (BOW) model represents a corpus as a matrix whose elements are word frequencies. However, each row of the matrix is a very high-dimensional, sparse vector. Dimension reduction (DR) is a popular way to address sparsity and high dimensionality. Among the strategies for developing DR methods, Unsupervised Feature Transformation (UFT) is a popular one: it maps all words onto a new basis to represent the BOW. The recent growth of text data and the challenges it brings imply that DR still needs new perspectives. Although a wide range of UFT-based methods have been developed, the fuzzy approach has not been considered for DR under this strategy. This research investigates fuzzy clustering as a UFT-based DR method that collapses the BOW matrix into a lower-dimensional representation of documents, in place of the words in a corpus. The quantitative evaluation shows that fuzzy clustering produces better performance and features than Principal Component Analysis (PCA) and Singular Value Decomposition (SVD), two popular UFT-based DR methods.
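
To make the idea of collapsing a BOW matrix concrete, here is a minimal sketch that uses a plain NumPy implementation of standard fuzzy c-means. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which fuzzy clustering variant is used, so the choice of fuzzy c-means, the fuzzifier `m=2.0`, the number of clusters, the toy matrix `bow`, and the function name `fuzzy_c_means` are all assumptions made here. Each document (row) is replaced by its vector of cluster membership degrees, so a 4-word representation becomes a 2-dimensional one.

```python
import numpy as np


def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=200, tol=1e-6, seed=0):
    """Standard fuzzy c-means (illustrative; assumes n_iter >= 1).

    Returns (memberships, centers). memberships has shape
    (n_samples, n_clusters) and each row sums to 1.
    """
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    # Random initial memberships, normalized so each row sums to 1.
    U = rng.random((n_samples, n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # Cluster centers: membership-weighted means of the rows of X.
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Euclidean distance from every row to every center.
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.maximum(dist, 1e-12)  # avoid division by zero
        # Membership update: u_ij proportional to d_ij^(-2/(m-1)).
        inv = dist ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return U, centers


# Hypothetical toy BOW matrix: 4 documents (rows) x 4 words (columns).
bow = np.array([
    [3, 2, 0, 0],
    [2, 3, 1, 0],
    [0, 0, 3, 2],
    [0, 1, 2, 3],
], dtype=float)

# Each document is now described by 2 membership degrees instead of 4 counts.
reduced, centers = fuzzy_c_means(bow, n_clusters=2)
print(reduced.shape)
```

In this sketch the reduced features are soft assignments that sum to 1 per document, which differs from PCA and SVD, whose projected coordinates are unbounded and can be negative.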


research
09/21/2019

Application of Fuzzy Clustering for Text Data Dimensionality Reduction

Large textual corpora are often represented by the document-term frequen...
research
04/19/2018

Mathematical Analysis on Out-of-Sample Extensions

Let X=X∪Z be a data set in R^D, where X is the training set and Z is the...
research
05/23/2023

SNEkhorn: Dimension Reduction with Symmetric Entropic Affinities

Many approaches in machine learning rely on a weighted graph to encode t...
research
12/10/2022

Information retrieval in single cell chromatin analysis using TF-IDF transformation methods

Single-cell sequencing assay for transposase-accessible chromatin (scATA...
research
07/17/2019

Analysis of Word Embeddings using Fuzzy Clustering

In data dominated systems and applications, a concept of representing wo...
research
02/25/2022

Asyncval: A Toolkit for Asynchronously Validating Dense Retriever Checkpoints during Training

The process of model checkpoint validation refers to the evaluation of t...
research
03/09/2023

Entropic Wasserstein Component Analysis

Dimension reduction (DR) methods provide systematic approaches for analy...
