PHANTOM: Curating GitHub for engineered software projects using time-series clustering

04/25/2019
by Peter Pickerill, et al.

Context: Within the field of Mining Software Repositories, numerous methods are employed to filter datasets in order to avoid analysing low-quality projects. Unfortunately, the existing filtering methods have not kept pace with the growth of data sources such as GitHub, and researchers often rely on ad hoc techniques to curate datasets.

Objective: The objective of this study is to develop a method capable of filtering large quantities of software projects in a time-efficient way.

Method: This study follows the Design Science Research (DSR) methodology. The proposed method, PHANTOM, extracts five measures from Git logs. Each measure is transformed into a time-series, which is in turn represented as a feature vector for clustering with the k-means algorithm.

Results: Using the ground truth from a previous study, PHANTOM rediscovered that ground truth with up to 0.87 Precision or 0.94 Recall, and identified "well-engineered" projects with up to 0.87 Precision and 0.94 Recall on the validation dataset. PHANTOM downloaded and processed the metadata of 1,786,601 GitHub repositories in 21.5 days, over 33% faster than a similar study that used a computer cluster of 200 nodes.

Conclusions: It is possible to use an unsupervised approach to identify well-engineered projects. PHANTOM was shown to be competitive with existing supervised approaches while reducing the hardware requirements by two orders of magnitude.
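The pipeline described in the Method section (measure extracted from Git logs → fixed-length time-series → feature vector → k-means) can be sketched as follows. This is a minimal illustration, not PHANTOM's implementation: the weekly commit counts stand in for one of the five measures, the sample repositories are fabricated, and the resampling and hand-rolled Lloyd's k-means are assumptions chosen to keep the example self-contained.

```python
import numpy as np

def to_feature_vector(series, length=12):
    """Resample a commit-count time-series to a fixed length and scale to [0, 1],
    so repositories of different ages become comparable feature vectors."""
    series = np.asarray(series, dtype=float)
    idx = np.linspace(0, len(series) - 1, length)
    resampled = np.interp(idx, np.arange(len(series)), series)
    peak = resampled.max()
    return resampled / peak if peak > 0 else resampled

def kmeans(X, k=2, iters=50):
    """Minimal Lloyd's k-means with deterministic initialisation
    (first and last point); returns (labels, centroids)."""
    centroids = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Hypothetical weekly commit counts (PHANTOM derives five such measures from Git logs).
sustained = [[5, 6, 5, 7, 6, 5, 6, 7, 5, 6, 6, 5]] * 3  # steady activity
burst     = [[9, 4, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]] * 3  # initial burst, then abandoned
X = np.array([to_feature_vector(s) for s in sustained + burst])

labels, _ = kmeans(X, k=2)
```

With this toy data the two activity patterns fall into separate clusters; in the actual study the clusters are then labelled "engineered" or not using a small labelled sample.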
