S/C: Speeding up Data Materialization with Bounded Memory

03/17/2023
by   Zhaoheng Li, et al.
0

With data pipeline tools and the expressiveness of SQL, managing interdependent materialized views (MVs) are becoming increasingly easy. These MVs are updated repeatedly upon new data ingestion (e.g., daily), from which database admins can observe performance metrics (e.g., refresh time of each MV, size on disk) in a consistent way for different types of updates (full vs. incremental) and for different systems (single node, distributed, cloud-hosted). One missed opportunity is that existing data systems treat those MV updates as independent SQL statements without fully exploiting their dependency information and performance metrics. However, if we know that the result of a SQL statement will be consumed immediately after for subsequent operations, those subsequent operations do not have to wait until the early results are fully materialized on storage because the results are already readily available in memory. Of course, this may come at a cost because keeping results in memory (even temporarily) will reduce the amount of available memory; thus, our decision should be careful. In this paper, we introduce a new system, called S/C, which tackles this problem through efficient creation and update of a set of MVs with acyclic dependencies among them. S/C judiciously uses bounded memory to reduce end-to-end MV refresh time by short-circuiting expensive reads and writes; S/C's objective function accurately estimates time savings from keeping intermediate data in memory for particular periods. Our solution jointly optimizes an MV refresh order, what data to keep in memory, and when to release data from memory. At a high level, S/C still materializes all data exactly as defined in MV definitions; thus, it doesn't impact any service-level agreements. In our experiments with TPC-DS datasets (up to 1TB), we show S/C's optimization can speedup end-to-end runtime by 1.04x-5.08x with 1.6GB memory.

READ FULL TEXT
research
01/18/2023

Coverage of Course Topics in Learnersourced SQL Exercises

Learnersourcing is a common task in modern computing classrooms, where i...
research
01/19/2018

Towards a Theory of Data-Diff: Optimal Synthesis of Succinct Data Modification Scripts

This paper addresses the Data-Diff problem: given a dataset and a subseq...
research
05/29/2020

ValueNet: A Neural Text-to-SQL Architecture Incorporating Values

Building natural language interfaces for databases has been a long-stand...
research
12/28/2022

End-to-End Application Cloning for Distributed Cloud Microservices with Ditto

We present Ditto, an automated framework for cloning end-to-end cloud ap...
research
10/05/2021

Online Application Guidance for Heterogeneous Memory Systems

Many high end and next generation computing systems to incorporated alte...
research
08/11/2020

High-concurrency Custom-build Relational Database System's design and SQL parser design based on Turing-complete automata

Database system is an indispensable part of software projects. It plays ...
research
04/14/2023

HEAT: A Highly Efficient and Affordable Training System for Collaborative Filtering Based Recommendation on CPUs

Collaborative filtering (CF) has been proven to be one of the most effec...

Please sign up or login with your details

Forgot password? Click here to reset