Multi-level Forwarding and Scheduling Recovery Algorithm in Rapidly-changing Network for Erasure-coded Clusters

by Hai Zhou, et al.

A key design goal of erasure-coded clusters is to reduce repair time. Existing erasure-coded data repair schemes fall roughly into two categories: (1) schemes that accelerate repair in a homogeneous environment (e.g., PPR), and (2) schemes that construct the repair plan from link bandwidth in a heterogeneous environment (e.g., PPT). However, these solutions struggle to cope with the heterogeneous, rapidly changing networks found in erasure-coded clusters. To address this problem, a bandwidth-aware multi-level forwarding repair algorithm, called BMFRepair, is proposed. BMFRepair monitors network bandwidth in real time while data is being forwarded and selects idle nodes with high-bandwidth links to assist in forwarding. This reduces the time bottleneck caused by low-bandwidth links. Multi-node repair, moreover, becomes very complicated when bandwidth changes drastically. For this problem, a multi-node scheduling repair algorithm, called MSRepair, is proposed; it repairs multiple failed blocks in parallel by scheduling node resources. Both algorithms adapt flexibly to rapidly changing network environments and make full use of the bandwidth resources of idle nodes. Most importantly, the algorithms continuously adjust the repair plan as bandwidth changes in a fast, dynamic network. The algorithms have been evaluated both by simulations on Mininet and by real experiments on the Aliyun ECS cloud platform. Results show that, compared with the state-of-the-art repair schemes PPR and PPT, the algorithms significantly reduce repair time in rapidly changing networks.
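The abstract describes BMFRepair only at a high level. As a rough illustration of the underlying idea, routing a block around a slow link through idle, well-connected nodes, the sketch below treats forwarding as a hop-limited widest-path (maximum-bottleneck) search over a bandwidth snapshot. It is not the paper's BMFRepair algorithm: the bandwidth map `bw`, the helpers `best_forwarding_path` and `transfer_time`, the hop limit `max_hops`, and the assumption that pipelined forwarding runs at the rate of its slowest link are all hypothetical choices made for this example.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical bandwidth snapshot: bw[u][v] is the currently measured
# bandwidth (MB/s) of the directed link u -> v.
Bandwidth = Dict[str, Dict[str, float]]


def best_forwarding_path(
    bw: Bandwidth, src: str, dst: str, idle: Set[str], max_hops: int = 3
) -> Tuple[Optional[List[str]], float]:
    """Find the src -> dst path with the largest bottleneck bandwidth,
    using at most max_hops links and only idle nodes as relays.

    Assumes slice-level pipelining, so a multi-hop transfer runs at the
    rate of its slowest link (per-hop latency is ignored).
    """
    # frontier maps each node reached with exactly h hops to
    # (best bottleneck bandwidth so far, path taken).
    frontier: Dict[str, Tuple[float, List[str]]] = {src: (float("inf"), [src])}
    best_bw = 0.0
    best_path: Optional[List[str]] = None
    for _ in range(max_hops):
        nxt: Dict[str, Tuple[float, List[str]]] = {}
        for u, (cap, path) in frontier.items():
            for v, link_bw in bw.get(u, {}).items():
                if v in path:
                    continue  # skip cycles; they never widen the bottleneck
                new_cap = min(cap, link_bw)
                if v == dst:
                    # Strict '>' keeps the shorter path on ties.
                    if new_cap > best_bw:
                        best_bw, best_path = new_cap, path + [v]
                elif v in idle and new_cap > nxt.get(v, (0.0, []))[0]:
                    # Only idle nodes may act as forwarding relays.
                    nxt[v] = (new_cap, path + [v])
        frontier = nxt
    return best_path, best_bw


def transfer_time(block_mb: float, bottleneck: float) -> float:
    """Estimated seconds to move one block at the path's bottleneck rate."""
    return block_mb / bottleneck if bottleneck > 0 else float("inf")


# Example snapshot: the direct link A -> N is slow (5 MB/s), but idle
# nodes C and D offer faster detours; B is busy and cannot relay.
bw: Bandwidth = {
    "A": {"B": 10.0, "C": 80.0, "N": 5.0},
    "C": {"N": 60.0, "D": 100.0},
    "D": {"N": 90.0},
}
idle = {"C", "D"}

path, rate = best_forwarding_path(bw, "A", "N", idle)
print(path, rate, transfer_time(64.0, rate))
# ['A', 'C', 'D', 'N'] 80.0 0.8

# When monitoring reports a new bandwidth value, recompute the plan.
bw["C"]["D"] = 20.0
print(best_forwarding_path(bw, "A", "N", idle))
# (['A', 'C', 'N'], 60.0)
```

Re-running the search whenever monitoring reports new bandwidth values mirrors, in a simplified way, the abstract's point that the repair plan is continuously adjusted. A real repair scheme would also have to aggregate data from multiple helper nodes and account for per-hop latency and contention among concurrent repairs, which this single-flow sketch ignores.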




Repair Pipelining for Erasure-Coded Storage: Algorithms and Evaluation

We propose repair pipelining, a technique that speeds up the repair perf...

An Efficient Piggybacking Design Framework with Sub-packetization l≤ r for All-Node Repair

Piggybacking design has been widely applied in distributed storage syste...

Two Piggybacking Codes with Flexible Sub-Packetization to Achieve Lower Repair Bandwidth

As a special class of array codes, (n,k,m) piggybacking codes are MDS co...

Capacity of Distributed Storage Systems with Clusters and Separate Nodes

In distributed storage systems (DSSs), the optimal tradeoff between node...

Node repair on connected graphs, Part II

We continue our study of regenerating codes in distributed storage syste...

Storage-Repair Bandwidth Trade-off for Wireless Caching with Partial Failure and Broadcast Repair

Repair of multiple partially failed cache nodes is studied in a distribu...

Fast Biconnectivity Restoration in Multi-Robot Systems for Robust Communication Maintenance

Maintaining a robust communication network plays an important role in th...
