Dominant Set-based Active Learning for Text Classification and its Application to Online Social Media

by Toktam A. Oghaz, et al.

Recent advances in natural language processing (NLP) in online social media are evidently owed to large-scale datasets. However, labeling, storing, and processing a large number of textual data points, e.g., tweets, remains challenging. On top of that, in applications such as hate speech detection, labeling a sufficiently large dataset containing offensive content can be mentally and emotionally taxing for human annotators. Thus, NLP methods that can make the best use of significantly fewer labeled data points are of great interest. In this paper, we present a novel pool-based active learning method that can be used to train a model on a large unlabeled corpus with minimum annotation cost. To do so, we propose to find the dominant sets of local clusters in the feature space. These sets represent maximally cohesive structures in the data. The samples that do not belong to any of the dominant sets are then selected for labeling and used to train the model, as they represent the boundaries of the local clusters and are more challenging to classify. Our proposed method does not have any parameters to be tuned, making it dataset-independent, and it achieves approximately the same classification accuracy as training on the full dataset while using significantly fewer data points. Additionally, our method outperforms state-of-the-art active learning strategies. Furthermore, our proposed algorithm can incorporate conventional active learning scores, such as uncertainty-based scores, into its selection criteria. We show the effectiveness of our method on different datasets and with different neural network architectures.
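The selection idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes the local clusters are already given as `cluster_ids`, uses a Gaussian similarity with a median-distance bandwidth, and extracts one dominant set per cluster with the standard replicator dynamics. The function names, the bandwidth heuristic, and the numerical `cutoff` are hypothetical choices made only for this illustration.

```python
import numpy as np


def dominant_set_weights(A, max_iter=1000, tol=1e-9):
    """Run replicator dynamics on a non-negative, zero-diagonal similarity matrix.

    Starts from the uniform distribution; points whose weight decays to
    (near) zero are treated as lying outside the dominant set.
    """
    x = np.full(A.shape[0], 1.0 / A.shape[0])
    for _ in range(max_iter):
        Ax = A @ x
        total = x @ Ax
        if total <= 0:  # no similarity mass left; stop early
            break
        x_next = x * Ax / total
        if np.abs(x_next - x).sum() < tol:
            return x_next
        x = x_next
    return x


def select_outside_dominant_sets(features, cluster_ids, cutoff=1e-6):
    """Return indices of samples that fall outside their cluster's dominant set.

    Assumptions (not from the paper): clusters are precomputed, similarity is
    a Gaussian kernel with the median pairwise distance as bandwidth, and
    `cutoff` is a numerical threshold on the converged weights.
    """
    selected = []
    for c in np.unique(cluster_ids):
        idx = np.flatnonzero(cluster_ids == c)
        if idx.size < 2:  # a single point has no pairwise structure
            continue
        pts = features[idx]
        D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
        sigma = np.median(D[np.triu_indices(idx.size, k=1)])
        if sigma == 0:  # all points identical; any bandwidth works
            sigma = 1.0
        A = np.exp(-(D ** 2) / (2 * sigma ** 2))
        np.fill_diagonal(A, 0.0)  # dominant sets assume no self-similarity
        x = dominant_set_weights(A)
        selected.extend(idx[x <= cutoff].tolist())
    return sorted(selected)
```

In this sketch, the returned indices are the candidates that would be sent to annotators. Samples inside a dominant set are left unlabeled, on the assumption that they are cohesive and easier to classify.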




