Towards Scalable Dataframe Systems

01/03/2020
by Devin Petersohn, et al.

Dataframes are a popular and convenient abstraction to represent, structure, clean, and analyze data during exploratory data analysis. Despite the success of dataframe libraries in R and Python (pandas), dataframes face performance issues even on moderately large datasets. In this vision paper, we take the first steps towards formally defining dataframes, characterizing their properties, and outlining a research agenda towards making dataframes more interactive at scale. We draw on tools and techniques from the database community, and describe ways they may be adapted to serve dataframe systems, as well as the new challenges therein. We also describe our current progress toward a scalable dataframe system, Modin, which is already up to 30 times faster than pandas in preliminary case studies, while enabling unmodified pandas code to run as-is. In its first 18 months, Modin is already used by over 60 downstream projects and has over 250 forks and 3,900 stars on GitHub, indicating the pressing need for pursuing this agenda.

