Parallel Quicksort without Pairwise Element Exchange
Standard implementations of 2-way, parallel, distributed memory Quicksort algorithms exchange partitioned data elements at each level of the recursion. This is not necessary: It suffices to exchange only the chosen pivots, while postponing element redistribution to the bottom of the recursion. This reduces the total volume of data exchanged from O(n p) to O(n), n being the total number of elements to be sorted and p a power-of-two number of processors, while preserving the flavor, characteristics and properties of a Quicksort implementation. We give a template implementation based on this observation, and compare against a standard, 2-way parallel Quicksort implementation as well as other recent Quicksort implementations. We show substantial, and considerably better absolute speed-up on a medium-large InfiniBand cluster.
READ FULL TEXT