PlinyCompute: A Platform for High-Performance, Distributed, Data-Intensive Tool Development

11/15/2017
by Jia Zou, et al.

This paper describes PlinyCompute, a system for the development of high-performance, data-intensive, distributed computing tools and libraries. In the large, PlinyCompute presents the programmer with a very high-level, declarative interface, relying on automatic, relational-database-style optimization to figure out how to stage distributed computations. In the small, however, PlinyCompute presents the capable systems programmer with a persistent object data model and API (the "PC object model") and an associated memory management system that has been designed from the ground up for high-performance, distributed, data-intensive computing. This contrasts with most other Big Data systems, which are built on top of the Java Virtual Machine (JVM) and hence must at least partially cede performance-critical concerns, such as memory management (including layout and allocation/deallocation) and virtual method/function dispatch, to the JVM. This hybrid approach (declarative in the large, trusting the programmer's ability to use the PC object model efficiently in the small) results in a system that is ideal for the development of reusable, data-intensive tools and libraries. Through extensive benchmarking, we show that implementing complex object manipulation and non-trivial, library-style computations on top of PlinyCompute can result in a speedup of 2x to more than 50x compared to equivalent implementations on Spark.

research
11/12/2018

HPS: A C++11 High Performance Serialization Library

Data serialization is a common and crucial component in high performance...
research
10/28/2018

DynaSOAr: A Parallel Memory Allocator for Object-oriented Programming on GPUs with Efficient Memory Access

Object-oriented programming has long been regarded as too inefficient fo...
research
04/30/2018

Performance Evaluation of an Algorithm-based Asynchronous Checkpoint-Restart Fault Tolerant Application Using Mixed MPI/GPI-2

One of the hardest challenges of the current Big Data landscape is the l...
research
05/08/2020

High Performance Cluster Computing for MapReduce

MapReduce is a technique used to vastly improve distributed processing o...
research
02/16/2023

GEMMFIP: Unifying GEMM in BLIS

Matrix libraries often focus on achieving high performance for problems ...
research
07/16/2017

Performance Evaluation of Distributed Computing Environments with Hadoop and Spark Frameworks

Recently, due to rapid development of information and communication tech...
research
07/27/2021

HPTMT: Operator-Based Architecture for Scalable High-Performance Data-Intensive Frameworks

Data-intensive applications impact many domains, and their steadily incr...
