Classification and Clustering of arXiv Documents, Sections, and Abstracts, Comparing Encodings of Natural and Mathematical Language

05/22/2020
by Philipp Scharpf, et al.

In this paper, we show how selecting and combining encodings of natural and mathematical language affects the classification and clustering of documents with mathematical content. We demonstrate this using sets of documents, sections, and abstracts from the arXiv preprint server, labeled by subject class (mathematics, computer science, physics, etc.), to compare different encodings of text and formulae and to evaluate the performance and runtimes of selected classification and clustering algorithms. Our encodings achieve classification accuracies of up to 82.8% and cluster purities of up to 69.4% when the number of clusters equals the number of classes, and up to 99.9% when the number of clusters is unspecified. We observe a relatively low correlation between text and math similarity, which indicates the independence of text and formulae and motivates treating them as separate features of a document. The classification and clustering can be employed, e.g., for document search and recommendation. Furthermore, we show that the computer outperforms a human expert when classifying documents. Finally, we evaluate and discuss multi-label classification and formula semantification.
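
For readers unfamiliar with the cluster purity figure quoted above, the following is a minimal sketch of how purity is commonly computed: each cluster is credited with the size of its largest class, and the credits are summed and divided by the number of items. The function name `cluster_purity` and the toy data are assumptions made for illustration only; they are not taken from the paper, and the authors' own implementation may differ.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, class_labels):
    """Return the fraction of items that carry the majority label of their cluster.

    Hypothetical helper for illustration; not the paper's code.
    """
    if len(cluster_ids) != len(class_labels):
        raise ValueError("cluster_ids and class_labels must have the same length")
    if not cluster_ids:
        raise ValueError("at least one item is required")

    # Count how often each class label occurs inside each cluster.
    members = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, class_labels):
        members[cluster][label] += 1

    # Credit every cluster with the size of its largest class.
    majority_total = sum(max(counts.values()) for counts in members.values())
    return majority_total / len(cluster_ids)


# Toy example (made-up assignments, not arXiv data).
clusters = [0, 0, 0, 1, 1, 2]
labels = ["math", "math", "cs", "physics", "physics", "cs"]
purity = cluster_purity(clusters, labels)
print(f"purity = {purity:.3f}")
```

As a general property of the measure, purity can only stay the same or rise as clusters are split further, and it reaches 1.0 when every item forms its own cluster, so purities obtained with different numbers of clusters are not directly comparable.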

research
07/02/2020

A Novel Graph Based Clustering Approach to Document Topic Modeling

Clustering is the task of assigning a set of objects into groups so that...
research
09/02/2021

Towards Explaining STEM Document Classification using Mathematical Entity Linking

Document subject classification is essential for structuring (digital) l...
research
01/31/2010

Classifying the typefaces of the Gutenberg 42-line bible

We have measured the dissimilarities among several printed characters of...
research
11/02/2018

Comparison of Classification Algorithms Used Medical Documents Categorization

Volume of text based documents have been increasing day by day. Medical ...
research
06/24/2017

Semi-supervised Text Categorization Using Recursive K-means Clustering

In this paper, we present a semi-supervised learning algorithm for class...
research
01/18/2015

Mathematical Language Processing: Automatic Grading and Feedback for Open Response Mathematical Questions

While computer and communication technologies have provided effective me...