What is Wrong with Topic Modeling? (and How to Fix it Using Search-based Software Engineering)

08/29/2016
by Amritanshu Agrawal, et al.

Context: Topic modeling finds human-readable structures in unstructured textual data. A widely used topic modeler is Latent Dirichlet Allocation (LDA). When run on different datasets, LDA suffers from "order effects", i.e., different topics are generated if the order of the training data is shuffled. Such order effects introduce a systematic error for any study. This error can lead to misleading results, specifically inaccurate topic descriptions and a reduction in the efficacy of text mining classification results.

Objective: To provide a method in which distributions generated by LDA are more stable and can be used for further analysis.

Method: We use LDADE, a search-based software engineering tool that tunes LDA's parameters using DE (Differential Evolution). LDADE is evaluated on data from a programmer information exchange site (Stack Overflow), title and abstract text of thousands of Software Engineering (SE) papers, and software defect reports from NASA. Results were collected across different implementations of LDA (Python+Scikit-Learn, Scala+Spark), across different platforms (Linux, Macintosh), and for different kinds of LDAs (VEM, or using Gibbs sampling). Results were scored via topic stability and text mining classification accuracy.

Results: In all treatments: (i) standard LDA exhibits very large topic instability; (ii) LDADE's tunings dramatically reduce cluster instability; (iii) LDADE also leads to improved performance for supervised as well as unsupervised learning.

Conclusion: Due to topic instability, using standard LDA with its "off-the-shelf" settings should now be deprecated. Also, in future, we should require SE papers that use LDA to test and (if needed) mitigate LDA topic instability. Finally, LDADE is a candidate technology for effectively and efficiently reducing that instability.
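To make the notion of "topic stability" concrete, here is a minimal sketch of one way to score how much topics change between runs of a topic modeler on shuffled data. It is an illustration only, not the scoring used by LDADE: the abstract does not define its stability measure, so the Jaccard-overlap metric, the function names (`topic_overlap`, `run_similarity`, `stability`), and the hand-written sample topics are all assumptions made for this example. Each "run" is a list of topics, and each topic is a list of its top words.

```python
from itertools import combinations
from statistics import median


def topic_overlap(topic_a, topic_b):
    """Jaccard similarity between the top-word sets of two topics."""
    a, b = set(topic_a), set(topic_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def run_similarity(run_a, run_b):
    """Match each topic in run_a to its most similar topic in run_b,
    then return the median of those best-match scores.
    Note: this is asymmetric (run_a is matched against run_b)."""
    return median(max(topic_overlap(t, u) for u in run_b) for t in run_a)


def stability(runs):
    """Median similarity over all pairs of runs.
    1.0 means every run produced the same topics; values near 0 mean
    the topics changed almost completely between runs."""
    return median(run_similarity(a, b) for a, b in combinations(runs, 2))


# Hypothetical top words from three runs that agree
# (topic order and word order differ, the content does not).
runs_stable = [
    [["file", "read", "write"], ["test", "unit", "mock"]],
    [["test", "unit", "mock"], ["file", "read", "write"]],
    [["file", "write", "read"], ["mock", "test", "unit"]],
]

# Hypothetical top words from two runs that disagree,
# as might happen when the training data is shuffled.
runs_unstable = [
    [["file", "read", "write"], ["test", "unit", "mock"]],
    [["file", "test", "error"], ["read", "mock", "thread"]],
]

print(stability(runs_stable))    # 1.0
print(stability(runs_unstable))  # 0.2
```

In practice the runs would come from fitting a topic modeler several times on differently shuffled copies of the same corpus. A parameter tuner such as DE could then use a score of this kind as the objective to maximize.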

research
11/12/2017

Latent Dirichlet Allocation (LDA) and Topic modeling: models, applications, a survey

Topic modeling is one of the most powerful techniques in text mining for...
research
04/27/2018

Can You Explain That, Better? Comprehensible Text Analytics for SE Applications

Text mining methods are used for a wide range of Software Engineering (S...
research
08/24/2018

Measuring LDA Topic Stability from Clusters of Replicated Runs

Background: Unstructured and textual data is increasing rapidly and Late...
research
08/11/2018

Familia: A Configurable Topic Modeling Framework for Industrial Text Engineering

In the last decade, a variety of topic models have been proposed for tex...
research
04/13/2018

Per-Corpus Configuration of Topic Modelling for GitHub and Stack Overflow Collections

To make sense of large amounts of textual data, topic modelling is frequ...
research
02/18/2022

A new LDA formulation with covariates

The Latent Dirichlet Allocation (LDA) model is a popular method for crea...
research
04/10/2018

Towards Training Probabilistic Topic Models on Neuromorphic Multi-chip Systems

Probabilistic topic models are popular unsupervised learning methods, in...
