On the Optimality of Scheduling Dependent MapReduce Tasks on Heterogeneous Machines
MapReduce is the most popular big-data computation framework, motivating many research topics. A MapReduce job consists of two successive phases, i.e. map phase and reduce phase. Each phase can be divided into multiple tasks. A reduce task can only start when all the map tasks finish processing. A job is successfully completed when all its map and reduce tasks are complete. The task of optimally scheduling the different tasks on different servers to minimize the weighted completion time is an open problem, and is the focus of this paper. In this paper, we give an approximation ratio with a competitive ratio 2(1+(m-1)/D)+1, where m is the number of servers and D> 1 is the task-skewness product. We implement the proposed algorithm on Hadoop framework, and compare with three baseline schedulers. Results show that our DMRS algorithm can outperform baseline schedulers by up to 82%.
READ FULL TEXT