Freeing Compute Caches from Serialization and Garbage Collection in Managed Big Data Analytics

11/20/2021
by Iacovos G. Kolokasis, et al.

Managed analytics frameworks (e.g., Spark) cache intermediate results in memory (on-heap) or on storage devices (off-heap) to avoid costly recomputation, especially in graph processing. As datasets grow, on-heap caching requires more memory for long-lived objects, resulting in high garbage collection (GC) overhead. Off-heap caching, on the other hand, moves cached objects to the storage device, which reduces GC overhead but incurs the cost of serialization and deserialization (S/D). In this work, we propose TeraHeap, a novel approach for providing large analytics caches. TeraHeap uses two heaps within the JVM: (1) a garbage-collected heap for ordinary Spark objects and (2) a large heap memory-mapped over fast storage devices for cached objects. TeraHeap eliminates both S/D and GC over cached data without imposing any language restrictions. We implement TeraHeap in Oracle's Java runtime (OpenJDK-1.8). We use five popular, memory-intensive graph analytics workloads to understand S/D and GC overheads and to evaluate TeraHeap. TeraHeap improves total execution time compared to state-of-the-art Apache Spark configurations by up to 72% and [...] for [...] and non-volatile memory, respectively. Furthermore, TeraHeap requires 8x less DRAM capacity to provide performance comparable to or higher than native Spark. This paper opens up emerging memory and storage devices for practical use in scalable analytics caching.
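The contrast between S/D-based off-heap caching and memory-mapped access can be illustrated with a minimal sketch that uses only the standard Java library. This is not TeraHeap's implementation, which modifies the JVM itself so that ordinary objects live in the memory-mapped heap. The class name `CacheSketch` and both method names are hypothetical, and the mapped buffer here holds raw `long` values rather than Java objects.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration only; not part of Spark or TeraHeap.
class CacheSketch {

    // S/D path: the array is converted to a byte stream and rebuilt as a
    // fresh copy, which is the per-access cost that off-heap caching pays.
    static long[] roundTripSerialized(long[] data)
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(data);
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            return (long[]) in.readObject();
        }
    }

    // Memory-mapped path: values are written to and read from a file-backed
    // mapping in place, with no serialization step in between.
    static long sumViaMappedFile(long[] data) throws IOException {
        Path file = Files.createTempFile("cache-sketch", ".bin");
        file.toFile().deleteOnExit();
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // Mapping in READ_WRITE mode grows the file to the mapped size.
            MappedByteBuffer buffer = channel.map(
                    FileChannel.MapMode.READ_WRITE, 0,
                    (long) data.length * Long.BYTES);
            for (long value : data) {
                buffer.putLong(value);
            }
            buffer.flip(); // switch from writing to reading
            long sum = 0;
            while (buffer.hasRemaining()) {
                sum += buffer.getLong();
            }
            return sum;
        }
    }

    public static void main(String[] args) throws Exception {
        long[] data = {1L, 2L, 3L};
        System.out.println(roundTripSerialized(data).length);
        System.out.println(sumViaMappedFile(data));
    }
}
```

The mapped region lives outside the garbage-collected heap, so the collector never scans its contents. That is the property the abstract attributes to TeraHeap's second heap, although TeraHeap provides it for real objects rather than primitive values.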

research
09/04/2023

Towards Persistent Memory based Stateful Serverless Computing for Big Data Applications

The Function-as-a-service (FaaS) computing model has recently seen signi...
research
05/22/2018

Storage and Memory Characterization of Data Intensive Workloads for Bare Metal Cloud

As the cost-per-byte of storage systems dramatically decreases, SSDs are...
research
11/09/2022

Performance Characterization of AutoNUMA Memory Tiering on Graph Analytics

Non-Volatile Memory (NVM) can deliver higher density and lower cost per ...
research
04/28/2016

Architectural Impact on Performance of In-memory Data Analytics: Apache Spark Case Study

While cluster computing frameworks are continuously evolving to provide ...
research
12/23/2021

In-storage Processing of I/O Intensive Applications on Computational Storage Drives

Computational storage drives (CSD) are solid-state drives (SSD) empowere...
research
01/28/2020

InfiniCache: Exploiting Ephemeral Serverless Functions to Build a Cost-Effective Memory Cache

Internet-scale web applications are becoming increasingly storage-intens...
research
08/18/2018

Pangea: Monolithic Distributed Storage for Data Analytics

Storage and memory systems for modern data analytics are heavily layered...