Linked Data Science Powered by Knowledge Graphs

by Mossad Helali, et al.

In recent years, we have witnessed a growing interest in data science, not only from academia but especially from companies investing in data science platforms to analyze large amounts of data. In this process, a myriad of data science artifacts, such as datasets and pipeline scripts, are created. Yet there has so far been no systematic attempt to holistically exploit the knowledge and experience implicitly contained in the specification of these pipelines, e.g., compatible datasets, cleansing steps, ML algorithms, and parameters. Instead, data scientists still spend a considerable amount of their time trying to recover relevant information and experience through colleagues, trial and error, and lengthy exploration. In this paper, we therefore propose a scalable system (KGLiDS) that employs machine learning to extract the semantics of data science pipelines and captures them in a knowledge graph, which can then be exploited to assist data scientists in various ways. This abstraction is the key to enabling Linked Data Science: it allows the essence of pipelines to be shared between platforms, companies, and institutions without revealing critical internal information, focusing instead on the semantics of what is being processed and how. Our comprehensive evaluation uses thousands of datasets and more than thirteen thousand pipeline scripts extracted from data discovery benchmarks and the Kaggle portal, and shows that KGLiDS significantly outperforms state-of-the-art systems on related tasks, such as dataset recommendation and pipeline classification.




The Art and Practice of Data Science Pipelines: A Comprehensive Study of Data Science Pipelines In Theory, In-The-Small, and In-The-Large

Increasingly larger number of software systems today are including data ...

Federated Data Science to Break Down Silos [Vision]

Similar to Open Data initiatives, data science as a community has launch...

Data Science with Vadalog: Bridging Machine Learning and Reasoning

Following the recent successful examples of large technology companies, ...

KGTK: A Toolkit for Large Knowledge Graph Manipulation and Analysis

Knowledge graphs (KGs) have become the preferred technology for represen...

Progressive Data Science: Potential and Challenges

Data science requires time-consuming iterative manual activities. In par...

JITA4DS: Disaggregated execution of Data Science Pipelines between the Edge and the Data Centre

This paper targets the execution of data science (DS) pipelines supporte...

The Big Three: A Methodology to Increase Data Science ROI by Answering the Questions Companies Care About

Companies may be achieving only a third of the value they could be getti...
