Supercharging Distributed Computing Environments For High Performance Data Engineering

01/19/2023
by   Niranda Perera, et al.

The data engineering and data science community has embraced the idea of using Python/R dataframes for regular applications. Driven by the big data revolution and artificial intelligence, these applications are now essential for processing terabytes of data. Such workloads can easily exceed the capabilities of a single machine, and they also demand significant developer time and effort. It is therefore essential to design scalable dataframe solutions. There have been multiple attempts to tackle this problem, the most notable being the dataframe systems built on distributed computing environments such as Dask and Ray. Even though the distributed computing features of Dask and Ray look very promising, we perceive that Dask Dataframes and Ray Datasets still have room for optimization. In this paper, we present CylonFlow, an alternative distributed dataframe execution methodology that enables state-of-the-art performance and scalability on the same Dask/Ray infrastructure (thereby supercharging them!). To achieve this, we integrate Cylon, a high performance dataframe system originally based on an entirely different execution paradigm, into Dask and Ray. Our experiments show that on a pipeline of dataframe operators, CylonFlow achieves 30x more distributed performance than Dask Dataframes. Interestingly, it also enables superior sequential performance due to the native C++ execution of Cylon. We believe the success of Cylon and CylonFlow extends beyond the data engineering domain, and that they can be used to consolidate the high performance computing and distributed computing ecosystems.

