A ground-truth dataset and classification model for detecting bots in GitHub issue and PR comments

10/07/2020
by Mehdi Golzadeh, et al.

Bots are frequently used in GitHub repositories to automate repetitive activities that are part of the distributed software development process. They communicate with human actors through comments. Although detecting their presence is important for many reasons, no large and representative ground-truth dataset is available, nor are there classification models to detect and validate bots on the basis of such a dataset. This paper proposes such a ground-truth dataset, based on a manual analysis, with high interrater agreement, of pull request and issue comments in 5,000 distinct GitHub accounts, of which 527 have been identified as bots. Using this dataset, we propose an automated classification model based on the random forest classifier. Its main features are the number of empty and non-empty comments of each account, the number of comment patterns, and the inequality between comments within comment patterns. We obtained a very high accuracy (weighted F1-score of 0.99) on the remaining test set containing 40 misclassified as humans. We integrated the classification model into an open-source command-line tool that allows practitioners to detect which accounts in a given GitHub repository actually correspond to bots.
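
To make the four features named in the abstract more concrete, here is a minimal Python sketch of how they might be computed for a single account. It is not the authors' implementation. Two assumptions are made for illustration only: a "comment pattern" is taken to be a group of comments that are identical after lowercasing and collapsing whitespace, and "inequality" is measured with the Gini coefficient over the pattern sizes. The function names `gini` and `account_features` are hypothetical.

```python
from collections import Counter


def gini(values):
    """Gini coefficient of non-negative values (0.0 means perfectly equal)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard rank-weighted formula over the ascending-sorted values.
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * weighted) / (n * total) - (n + 1) / n


def account_features(comments):
    """Compute illustrative per-account features from a list of comment strings."""
    empty = sum(1 for c in comments if not c.strip())
    non_empty = [c for c in comments if c.strip()]
    # ASSUMPTION: a pattern is a set of comments that match exactly after
    # lowercasing and whitespace normalisation; a fuzzier grouping is possible.
    patterns = Counter(" ".join(c.lower().split()) for c in non_empty)
    return {
        "empty_comments": empty,
        "non_empty_comments": len(non_empty),
        "comment_patterns": len(patterns),
        # ASSUMPTION: inequality is the Gini coefficient of pattern sizes.
        "pattern_inequality": gini(patterns.values()),
    }


if __name__ == "__main__":
    sample = ["Build passed", "build  passed", "", "Thanks!"]
    print(account_features(sample))
```

Feature dictionaries of this kind, one per account, could then be passed to a random forest implementation such as scikit-learn's `RandomForestClassifier`. An account that repeats a handful of templated comments yields few patterns relative to its comment count, which is the kind of signal such a classifier could exploit.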
