A new simple and effective measure for bag-of-word inter-document similarity measurement

02/09/2019
by   Sunil Aryal, et al.

To measure the similarity of two documents in the bag-of-words (BoW) vector representation, different term weighting schemes are used to improve the performance of cosine similarity, the most widely used inter-document similarity measure in text mining. In this paper, we identify the shortcomings of the assumptions underlying term weighting in the inter-document similarity measurement task and provide a more fit-for-purpose alternative. Based on this new assumption, we introduce a simple but effective similarity measure that does not require explicit term weighting. The proposed measure employs a more nuanced probabilistic approach than those used in term weighting to measure the similarity of two documents with respect to each term occurring in the two documents. Our empirical comparison with existing similarity measures using different term weighting schemes shows that the new measure produces (i) better results in the binary BoW representation; and (ii) competitive and more consistent results in the term-frequency-based BoW representation.
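For orientation, the baseline that the abstract refers to (cosine similarity over BoW vectors, optionally with term weighting) can be sketched as follows. This is a minimal illustration, not the measure proposed in the paper, which the abstract does not specify. The function names (`bow`, `idf_weights`, `cosine`), the whitespace tokenizer, and the smoothed IDF formula are all assumptions chosen for the example.

```python
import math
from collections import Counter


def bow(text):
    """Term-frequency bag-of-words vector (hypothetical whitespace tokenizer)."""
    return Counter(text.lower().split())


def idf_weights(docs):
    """One common smoothed IDF variant (an assumption): log((1 + N) / (1 + df)) + 1."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def _weighted(vec, weights, binary):
    """Apply the binary/term-frequency choice and optional per-term weights."""
    out = {}
    for term, tf in vec.items():
        value = 1.0 if binary else float(tf)
        weight = 1.0 if weights is None else weights.get(term, 1.0)
        out[term] = value * weight
    return out


def cosine(a, b, weights=None, binary=False):
    """Cosine similarity of two BoW vectors; returns 0.0 if either is empty."""
    wa = _weighted(a, weights, binary)
    wb = _weighted(b, weights, binary)
    dot = sum(value * wb.get(term, 0.0) for term, value in wa.items())
    norm_a = math.sqrt(sum(value * value for value in wa.values()))
    norm_b = math.sqrt(sum(value * value for value in wb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    docs = [
        bow("the cat sat on the mat"),
        bow("the dog sat on the log"),
        bow("stock markets fell sharply"),
    ]
    idf = idf_weights(docs)
    print("binary BoW:", cosine(docs[0], docs[1], binary=True))
    print("term frequency:", cosine(docs[0], docs[1]))
    print("tf-idf:", cosine(docs[0], docs[1], weights=idf))
```

In this toy corpus the IDF weights shrink the contribution of terms shared by several documents, so the weighted score for the first two documents comes out lower than the unweighted term-frequency score. That dependence of the result on the weighting scheme is what the abstract is concerned with.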


research
11/22/2022

Method for Determining the Similarity of Text Documents for the Kazakh language, Taking Into Account Synonyms: Extension to TF-IDF

The task of determining the similarity of text documents has received co...
research
03/27/2013

Context-Dependent Similarity

Attribute weighting and differential weighting, two major mechanisms for...
research
07/17/2020

Scalable Methods for Calculating Term Co-Occurrence Frequencies

Search techniques make use of elementary information such as term freque...
research
08/25/2016

A Novel Term_Class Relevance Measure for Text Categorization

In this paper, we introduce a new measure called Term_Class relevance to...
research
02/26/2019

Improving a tf-idf weighted document vector embedding

We examine a number of methods to compute a dense vector embedding for a...
research
04/25/2023

A Novel Dual of Shannon Information and Weighting Scheme

Shannon Information theory has achieved great success in not only commun...
research
11/30/2018

Document Structure Measure for Hypernym discovery

Hypernym discovery is the problem of finding terms that have is-a relati...
