Reproducible and Portable Big Data Analytics in the Cloud

12/17/2021
by Xin Wang, et al.

Cloud computing has become a major approach to reproducing computational experiments because it supports on-demand provisioning of hardware and software resources. Yet two main difficulties remain in reproducing big data applications in the cloud. The first is how to automate the end-to-end execution of analytics, including environment provisioning, analytics pipeline description, pipeline execution, and resource termination. The second is that an application developed for one cloud is difficult to reproduce in another cloud, the so-called vendor lock-in problem. To tackle these problems, we leverage serverless computing and containerization techniques for automatic, scalable execution and reproducibility, and we use the adapter design pattern to enable application portability and reproducibility across different clouds. We propose and develop an open-source toolkit that supports 1) fully automated end-to-end execution and reproduction via a single command, 2) automatic data and configuration storage for each execution, 3) flexible client modes based on user preferences, 4) execution history query, and 5) simple reproduction of existing executions in the same environment or a different environment. We conducted extensive experiments on both AWS and Azure using three big data analytics applications that run on a virtual CPU/GPU cluster. The experiments show that our toolkit achieves good execution performance, scalability, and efficient reproducibility for cloud-based big data analytics.
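
To illustrate how the adapter design pattern can decouple an analytics pipeline from a specific cloud, the following is a minimal Python sketch, not the toolkit's actual code. Every name in it is hypothetical: `FakeVendorA` and `FakeVendorB` are in-memory stand-ins for two vendor SDKs with incompatible method names, `CloudAdapter` is an assumed common interface, and `execute` is an assumed driver covering provisioning, pipeline execution, and resource termination.

```python
from abc import ABC, abstractmethod


# In-memory stand-ins for two vendor SDKs (invented for illustration only).
class FakeVendorA:
    def create_stack(self, size):
        return f"a-stack-{size}"

    def submit(self, stack, job):
        return f"{job} done on {stack}"

    def delete_stack(self, stack):
        pass


class FakeVendorB:
    def deploy_group(self, count):
        return f"b-group-{count}"

    def start_job(self, group, name):
        return f"{name} done on {group}"

    def remove_group(self, group):
        pass


class CloudAdapter(ABC):
    """Hypothetical common interface that the pipeline driver targets."""

    @abstractmethod
    def provision(self, nodes): ...

    @abstractmethod
    def run(self, cluster, pipeline): ...

    @abstractmethod
    def terminate(self, cluster): ...


class VendorAAdapter(CloudAdapter):
    """Translates the common interface into FakeVendorA calls."""

    def __init__(self, sdk):
        self.sdk = sdk

    def provision(self, nodes):
        return self.sdk.create_stack(nodes)

    def run(self, cluster, pipeline):
        return self.sdk.submit(cluster, pipeline)

    def terminate(self, cluster):
        self.sdk.delete_stack(cluster)


class VendorBAdapter(CloudAdapter):
    """Translates the common interface into FakeVendorB calls."""

    def __init__(self, sdk):
        self.sdk = sdk

    def provision(self, nodes):
        return self.sdk.deploy_group(nodes)

    def run(self, cluster, pipeline):
        return self.sdk.start_job(cluster, pipeline)

    def terminate(self, cluster):
        self.sdk.remove_group(cluster)


def execute(adapter, pipeline, nodes):
    """Provision, run, and always terminate, using only the common interface."""
    cluster = adapter.provision(nodes)
    try:
        return adapter.run(cluster, pipeline)
    finally:
        adapter.terminate(cluster)
```

Under these assumptions, reproducing a run on a different cloud amounts to passing a different adapter to the same driver, for example `execute(VendorBAdapter(FakeVendorB()), "wordcount", 2)` instead of `execute(VendorAAdapter(FakeVendorA()), "wordcount", 2)`; the pipeline description itself does not change.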

research
11/05/2020

Video Big Data Analytics in the Cloud: Research Issues and Challenges

On the rise of distributed computing technologies, video big data analyt...
research
05/28/2023

Towards Confidential Computing: A Secure Cloud Architecture for Big Data Analytics and AI

Cloud computing provisions computer resources at a cost-effective way ba...
research
11/15/2022

Perona: Robust Infrastructure Fingerprinting for Resource-Efficient Big Data Analytics

Choosing a good resource configuration for big data analytics applicatio...
research
02/14/2019

OPENMENDEL: A Cooperative Programming Project for Statistical Genetics

Statistical methods for genomewide association studies (GWAS) continue t...
research
03/16/2018

Serverless Data Analytics with Flint

Serverless architectures organized around loosely-coupled function invoc...
research
05/14/2018

Fork and Join Queueing Networks with Heavy Tails: Scaling Dimension and Throughput Limit

Parallel and distributed computing systems are foundational to the succe...
research
01/02/2019

Approximate Computation for Big Data Analytics

Over the past a few years, research and development has made significant...