Deterministic Data Distribution for Efficient Recovery in Erasure-Coded Storage Systems
Because individual commodity components are unreliable, failures are common in large-scale distributed storage systems. Erasure codes are widely deployed in practical storage systems to provide fault tolerance with low storage overhead. However, random data distribution (RDD), commonly used in erasure-coded storage systems, induces heavy cross-rack traffic, load imbalance, and random access, all of which adversely affect failure recovery. In this paper, we use orthogonal arrays to define a Deterministic Data Distribution (D^3) that distributes data and parity blocks uniformly among nodes, and we propose an efficient failure recovery approach based on D^3 that minimizes the cross-rack repair traffic for a single node failure. Thanks to the uniformity of D^3, the proposed recovery approach balances the repair traffic not only among nodes within a rack but also among racks. We implement D^3 over Reed-Solomon (RS) codes and Locally Repairable Codes (LRCs) in the Hadoop Distributed File System (HDFS) on a cluster of 28 machines. Compared with RDD, our experiments show that D^3 speeds up failure recovery by up to 2.49 times for RS codes and 1.38 times for LRCs. Moreover, D^3 supports front-end applications better than RDD in both normal and recovery states.
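To make the notion of an orthogonal array more concrete, the following is a minimal sketch of the general combinatorial object, not the D^3 construction itself, whose details are not given in this abstract. It builds a textbook strength-2 orthogonal array with q^2 rows over q symbols (q prime) and checks the defining property that every pair of columns contains each ordered symbol pair equally often. The function names `orthogonal_array` and `is_strength_two` are hypothetical, and reading a symbol as a node or rack index is only an assumption for illustration.

```python
from collections import Counter
from itertools import combinations, product


def is_prime(n):
    """Trial-division primality test, sufficient for small q."""
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))


def orthogonal_array(q, k):
    """Return q*q rows of length k over symbols 0..q-1.

    Standard construction over Z_q for prime q and k <= q:
    row (a, b) holds (a + b*c) mod q in column c.
    This is a textbook example, not the paper's layout.
    """
    if not is_prime(q) or not 2 <= k <= q:
        raise ValueError("requires prime q and 2 <= k <= q")
    return [[(a + b * c) % q for c in range(k)]
            for a, b in product(range(q), repeat=2)]


def is_strength_two(rows, q):
    """Check that every column pair holds each symbol pair equally often."""
    expected, remainder = divmod(len(rows), q * q)
    if remainder or expected == 0:
        return False
    for i, j in combinations(range(len(rows[0])), 2):
        counts = Counter((row[i], row[j]) for row in rows)
        if len(counts) != q * q or set(counts.values()) != {expected}:
            return False
    return True


if __name__ == "__main__":
    oa = orthogonal_array(5, 4)
    print(len(oa), is_strength_two(oa, 5))
```

This pairwise balance is the kind of uniformity a deterministic placement can exploit: no two symbols co-occur more often than any other two. How D^3 maps rows and symbols to stripes, nodes, and racks is specified in the full paper.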