Bag of biterms modeling for short texts

by Anh Phan Tuan, et al.

Analyzing texts from social media poses many challenges because such texts are short, massive in volume, and dynamic. Short texts do not provide enough context, causing traditional statistical models to fail. Furthermore, many applications face massive and dynamic short-text collections, which pose various computational challenges for current batch learning algorithms. This paper presents a novel framework, namely Bag of Biterms Modeling (BBM), for modeling massive, dynamic, and short text collections. BBM comprises two main ingredients: (1) the concept of Bag of Biterms (BoB) for representing documents, and (2) a simple way to help statistical models include BoB. Our framework can be easily deployed for a large class of probabilistic models, and we demonstrate its usefulness with two well-known models: Latent Dirichlet Allocation (LDA) and Hierarchical Dirichlet Process (HDP). By exploiting both terms (words) and biterms (pairs of words), BBM has two major advantages: (1) it increases the length of the documents and makes the context more coherent by emphasizing word connotation and co-occurrence via Bag of Biterms, and (2) it inherits inference and learning algorithms from the primitive models, making it straightforward to design online and streaming algorithms for short texts. Extensive experiments suggest that BBM outperforms several state-of-the-art models. We also point out that the BoB representation performs better than traditional representations (e.g., Bag of Words, tf-idf) even for normal texts.
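As a rough illustration of the Bag of Biterms idea, the sketch below turns one tokenized short document into counts of word pairs. It assumes a biterm is an unordered pair of tokens co-occurring in the same document; the function name `bag_of_biterms` and the toy document are hypothetical, and the abstract does not specify how the authors actually construct BoB or combine it with LDA or HDP.

```python
from collections import Counter
from itertools import combinations


def bag_of_biterms(tokens):
    """Count unordered token pairs co-occurring in one document.

    Assumption for illustration: every pair of token positions in the
    document forms a biterm, and word order inside a pair is ignored.
    """
    # Sort each pair so ("b", "a") and ("a", "b") count as the same biterm.
    pairs = (tuple(sorted(pair)) for pair in combinations(tokens, 2))
    return Counter(pairs)


# Hypothetical three-word document.
doc = ["apple", "iphone", "release"]
bob = bag_of_biterms(doc)
```

Under this assumption a document of n tokens yields n(n-1)/2 biterms, so the representation of a short text grows longer than its bag of words, which is one way to read the abstract's claim that BoB increases document length.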




