PolyFrame: A Retargetable Query-based Approach to Scaling DataFrames (Extended Version)

10/12/2020
by Phanwadee Sinthong, et al.

In the last few years, data science has grown rapidly as businesses have adopted statistical and machine learning techniques to inform their decision making and power their applications. Scaling data analysis, possibly including the application of custom machine learning models, to large volumes of data requires distributed frameworks, which can pose serious technical challenges for data analysts and reduce their productivity. AFrame, a Python data analytics library implemented as a layer on top of Apache AsterixDB, addresses these issues by integrating with the data scientists' development environment and transparently scaling out the evaluation of analytical operations through a Big Data management system. While AFrame can leverage data management facilities (e.g., indexes and query optimization) and lets users interact with very large volumes of data, its initial version generated only SQL++ queries and operated only against Apache AsterixDB. In this work, we describe a new design that retargets AFrame's incremental query formation to other query-based database systems, making it deployable against other data management systems with composable query languages.
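
To make the idea of retargetable, incremental query formation concrete, the following is a minimal sketch and not PolyFrame's actual implementation. It assumes each target language is described by a small table of query templates, and that every DataFrame-style operation wraps the query built so far as a subquery. The names `RULES`, `LazyFrame`, `from_source`, `filter`, and `head`, as well as the template strings themselves, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Hypothetical per-language templates (an assumption, not PolyFrame's real rules).
# Each operation wraps the previously built query as a subquery, which is why
# the target language must be composable.
RULES = {
    "sqlpp": {
        "scan": "SELECT VALUE t FROM {source} t",
        "filter": "SELECT VALUE t FROM ({subquery}) t WHERE t.{column} {op} {value}",
        "limit": "SELECT VALUE t FROM ({subquery}) t LIMIT {n}",
    },
    "sql": {
        "scan": "SELECT * FROM {source}",
        "filter": "SELECT * FROM ({subquery}) t WHERE t.{column} {op} {value}",
        "limit": "SELECT * FROM ({subquery}) t LIMIT {n}",
    },
}


@dataclass(frozen=True)
class LazyFrame:
    """Holds a query string; operations return a new frame and run nothing."""

    rules: dict
    query: str

    @classmethod
    def from_source(cls, source: str, language: str) -> "LazyFrame":
        rules = RULES[language]
        return cls(rules, rules["scan"].format(source=source))

    def _apply(self, operation: str, **params) -> "LazyFrame":
        # Look up the template for this operation and nest the current query.
        template = self.rules[operation]
        return LazyFrame(self.rules, template.format(subquery=self.query, **params))

    def filter(self, column: str, op: str, value) -> "LazyFrame":
        # repr() is a simplification for this sketch; it is not safe escaping.
        return self._apply("filter", column=column, op=op, value=repr(value))

    def head(self, n: int) -> "LazyFrame":
        return self._apply("limit", n=n)


if __name__ == "__main__":
    for language in ("sqlpp", "sql"):
        frame = LazyFrame.from_source("Users", language).filter("age", ">", 30).head(5)
        print(language, "->", frame.query)
```

In this sketch, supporting another backend means adding one more entry to `RULES`; the chain of DataFrame-style calls stays the same, and only the final query string would be sent to the database for evaluation.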

