Document Generation with Hierarchical Latent Tree Models
In most probabilistic topic models, a document is viewed as a collection of tokens and each token is a variable whose values are all the words in a vocabulary. One exception is hierarchical latent tree models (HLTMs), where a document is viewed as a binary vector over the vocabulary and each word is regarded as a binary variable. The use of word variables allows the detection and representation of patterns of word co-occurrences and co-occurrences of those patterns qualitatively using multiple levels of latent variables, and naturally leads to a method for hierarchical topic detection. In this paper, we assume that an HLTM has been learned from binary data and we extend it to take word frequencies into consideration. The idea is to replace each binary word variable with a real-valued variable that represents the relative frequency of the word in a document. A document generation process is proposed and an algorithm is given for estimating the model parameters by inverting the generation process. Empirical results show that our method significantly outperforms the commonly-used LDA-based methods for hierarchical topic detection, in terms of model quality and meaningfulness of topics and topic hierarchies.
READ FULL TEXT