Parallel External Sorting of ASCII Records Using Learned Models

05/08/2023
by Ani Kristo, et al.

External sorting is at the core of many operations in large-scale database systems, such as ordering and aggregation queries over large result sets, index building, sort-merge joins, duplicate removal, sharding, and record clustering. Unlike in-memory sorting, these algorithms must work together with the OS and the filesystem to use system resources efficiently and minimize disk I/O. In this paper, we describe ELSAR: a parallel external sorting algorithm based on a learned model of the data distribution. The algorithm uses the model to arrange the input records into mutually exclusive, monotonic, and equi-depth partitions that, once sorted, can simply be concatenated to form the output. This method completely eliminates the multi-way file merging that external sorting typically relies on. We present thorough benchmarks for uniform and skewed datasets on various storage media, measuring the sorting rates, size scalability, and energy efficiency of ELSAR and other sorting algorithms. We observed that ELSAR achieves up to 1.65x higher sorting rates than the next-best external sort (Nsort) on SSDs and 5.31x higher than the GNU coreutils sort utility on Intel Optane non-volatile memory. In addition, ELSAR surpasses the current winner of the SortBenchmark for the most energy-efficient external string sorting algorithm by a margin of 41%. These results reinforce the premise that novel learning-enhanced algorithms can provide remarkable performance benefits over traditional ones.
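
The partition-sort-concatenate idea described in the abstract can be illustrated with a minimal in-memory sketch. This is not ELSAR's implementation: it assumes an empirical CDF built from a random sample as a stand-in for the learned model, keeps everything in memory instead of on disk, and sorts partitions sequentially. The names `fit_cdf_model` and `partition_sort_concat` are hypothetical.

```python
import bisect
import random


def fit_cdf_model(sample_keys):
    """Return a function approximating the CDF of the keys.

    Assumption: an empirical CDF over a sorted sample stands in for
    the learned distribution model mentioned in the abstract.
    """
    sorted_sample = sorted(sample_keys)
    n = len(sorted_sample)

    def cdf(key):
        # Fraction of sampled keys <= key; non-decreasing in key.
        return bisect.bisect_right(sorted_sample, key) / n

    return cdf


def partition_sort_concat(records, num_partitions, sample_size=1000, seed=0):
    """Partition records by predicted CDF, sort each partition, concatenate."""
    if not records:
        return []
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))
    cdf = fit_cdf_model(sample)

    partitions = [[] for _ in range(num_partitions)]
    for rec in records:
        # Because the CDF is monotonic, a smaller key never lands in a
        # later partition, so partitions are disjoint and ordered.
        idx = min(int(cdf(rec) * num_partitions), num_partitions - 1)
        partitions[idx].append(rec)

    output = []
    for part in partitions:
        part.sort()          # partitions are independent and could be sorted in parallel
        output.extend(part)  # concatenation takes the place of a multi-way merge
    return output
```

When the sample reflects the true distribution, each partition receives roughly the same number of records (the equi-depth property), which is what would balance the work across parallel sorters in a real system.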


research
08/02/2018

Scalable String and Suffix Sorting: Algorithms, Techniques, and Tools

This dissertation focuses on two fundamental sorting problems: string so...
research
12/10/2021

FLiMS: a Fast Lightweight 2-way Merger for Sorting

In this paper, we present FLiMS, a highly-efficient and simple parallel ...
research
07/05/2021

Defeating duplicates: A re-design of the LearnedSort algorithm

LearnedSort is a novel sorting algorithm that, unlike traditional method...
research
10/01/2020

Sort-based grouping and aggregation

Database query processing requires algorithms for duplicate removal, gro...
research
12/08/2016

Sorting Data on Ultra-Large Scale with RADULS. New Incarnation of Radix Sort

The paper introduces RADULS, a new parallel sorter based on radix sort a...
research
09/17/2022

Robust and Efficient Sorting with Offset-Value Coding

Sorting and searching are large parts of database query processing, e.g....
research
09/17/2019

Leyenda: An Adaptive, Hybrid Sorting Algorithm for Large Scale Data with Limited Memory

Sorting is one of the fundamental tasks of modern data management sy...
