Leveraging Machine Learning to Detect Data Curation Activities

04/30/2021
by Sara Lafia, et al.

This paper describes a machine learning approach for annotating and analyzing data curation work logs at ICPSR, a large social sciences data archive. The systems we studied track curation work and coordinate team decision-making at ICPSR. Repository staff use these systems to organize, prioritize, and document curation work done on datasets, making them promising resources for studying curation work and its impact on data reuse, especially in combination with data usage analytics. A key challenge, however, is classifying similar activities so that they can be measured and associated with impact metrics. This paper contributes: 1) a schema of data curation activities; 2) a computational model for identifying curation actions in work log descriptions; and 3) an analysis of frequent data curation activities at ICPSR over time. We first propose a schema of data curation actions to help us analyze the impact of curation work. We then use this schema to annotate a set of data curation logs, which contain records of data transformations and project management decisions completed by repository staff. Finally, we train a text classifier to detect curation actions in a large set of work logs and measure how frequently each occurs. Our approach supports the analysis of curation work documented in work log systems as an important step toward studying the relationship between research data curation and data reuse.
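The pipeline summarized above (annotate log entries with a schema, train a text classifier, then count predicted actions across a larger set of logs) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the two labels, the example log sentences, and the choice of a bag-of-words Naive Bayes model are hypothetical and are not taken from the paper, whose actual schema and classifier are not specified in this abstract.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a log description and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesLogClassifier:
    """Tiny multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.label_counts):
            total_words = sum(self.word_counts[label].values())
            # Log prior plus smoothed log likelihood of each known token.
            score = math.log(self.label_counts[label] / total_docs)
            for word in tokenize(text):
                if word not in self.vocab:
                    continue  # ignore tokens never seen during training
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def action_frequencies(model, log_entries):
    """Count how often each predicted action appears in unlabeled entries."""
    return Counter(model.predict(entry) for entry in log_entries)


# Hypothetical annotated work log entries (not real ICPSR data).
annotated = [
    ("recoded missing values in variable", "data_transformation"),
    ("converted file to SPSS format", "data_transformation"),
    ("renamed variables and fixed labels", "data_transformation"),
    ("emailed depositor about release schedule", "project_management"),
    ("assigned study to curator for review", "project_management"),
    ("scheduled meeting to discuss priority", "project_management"),
]
texts, labels = zip(*annotated)
model = NaiveBayesLogClassifier().fit(texts, labels)

# Hypothetical unlabeled entries standing in for the larger set of work logs.
unlabeled = [
    "recoded variables and fixed missing labels",
    "assigned curator to schedule review meeting",
    "converted file format",
]
frequencies = action_frequencies(model, unlabeled)
print(frequencies)
```

A real study would need far more annotated entries, a held-out evaluation set, and most likely a stronger model; the sketch only shows how predicted labels turn into per-action frequencies that could later be compared with usage metrics.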

research
11/03/2017

Discovering More Precise Process Models from Event Logs by Filtering Out Chaotic Activities

Process Discovery is concerned with the automatic generation of a proces...
research
05/04/2018

Assessing Data Usefulness for Failure Analysis in Anonymized System Logs

System logs are a valuable source of information for the analysis and un...
research
12/22/2021

Log severity level classification: an approach for systems in production

Context: Logs are often the primary source of information for system dev...
research
01/24/2020

Software Logging for Machine Learning

System logs perform a critical function in software-intensive systems as...
research
10/05/2021

Notarial timestamps savings in logs management via Merkle trees and Key Derivation Functions

Nowadays log files handling imposes to ISPs (intended in their widest sc...
research
07/16/2023

Mining Reviews in Open Source Code for Developers Trail: A Process Mining Approach

Audit trails are evidential indications of activities performers in any ...
research
06/16/2021

Making Sense of Learning Log Data

Research is constantly engaged in finding more productive and powerful w...