A Comparative Study on Data Representation to Categorize Text Documents

03/06/2022
by Dulani Meedeniya, et al.

Text documents play an important role in most organizations, and their constant growth widens the scope of document storage. As a result, there is a growing need for effective text retrieval and search capabilities. This paper presents two document preprocessing methods and compares them in order to find an appropriate data representation for text categorization. The first approach groups documents based on their titles, and the second groups them based on the document body. Both approaches apply the same clustering and classification techniques to the test data sets: clustering divides the documents into categories, and classification is used to validate the clustering results. The study shows that grouping text documents by title performs better than grouping them by document body.
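The cluster-then-validate procedure described above can be illustrated with a small sketch. The abstract does not name the specific algorithms, so everything here is an assumption made for illustration: term-frequency vectors as the representation, k-means for clustering, and a leave-one-out nearest-centroid classifier for validation. The toy titles and bodies are hypothetical and are not the paper's data.

```python
import math
from collections import Counter

def vectorize(texts):
    """Turn texts into L2-normalized bag-of-words term-frequency vectors."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    vectors = []
    for t in texts:
        counts = Counter(t.lower().split())
        vec = [counts[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors

def dist2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def kmeans(vectors, k, iterations=20):
    """Plain k-means; centroids start at the first k vectors (assumed, deterministic)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: dist2(v, centroids[c])) for v in vectors]
        for c in range(k):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean(members)
    return labels

def loo_agreement(vectors, labels):
    """Leave-one-out nearest-centroid classification, scored against cluster labels."""
    hits = 0
    for i, v in enumerate(vectors):
        groups = {}
        for j, (u, l) in enumerate(zip(vectors, labels)):
            if j != i:
                groups.setdefault(l, []).append(u)
        centroids = {l: mean(m) for l, m in groups.items()}
        predicted = min(centroids, key=lambda l: dist2(v, centroids[l]))
        hits += predicted == labels[i]
    return hits / len(vectors)

# Hypothetical toy corpus: each document has a title and a body.
titles = [
    "neural network training",
    "football match results",
    "deep neural network models",
    "football league season results",
]
bodies = [
    "we train a neural network on images and report results for the season of experiments",
    "the football match ended with two goals and the results changed the league table",
    "deep models use a neural network with many layers and training takes a long time",
    "the football league season ended and the final results were published for every match",
]

# Same clustering and validation steps, applied to each representation in turn.
title_vectors, body_vectors = vectorize(titles), vectorize(bodies)
title_labels = kmeans(title_vectors, k=2)
body_labels = kmeans(body_vectors, k=2)
title_score = loo_agreement(title_vectors, title_labels)
body_score = loo_agreement(body_vectors, body_labels)

print("title clusters:", title_labels, "agreement:", title_score)
print("body clusters:", body_labels, "agreement:", body_score)
```

Running both representations through identical steps keeps the comparison fair, since any difference in the agreement score then comes from the representation rather than from the algorithms. A four-document corpus is far too small to support any conclusion, so the scores printed here say nothing about the paper's findings.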
