DeepFT: Fault-Tolerant Edge Computing using a Self-Supervised Deep Surrogate Model

by Shreshth Tuli, et al.

The emergence of latency-critical AI applications has been supported by the evolution of the edge computing paradigm. However, edge solutions are typically resource-constrained, posing reliability challenges due to heightened contention for compute and communication capacities and faulty application behavior under overload conditions. Although the large amount of generated log data can be mined for fault prediction, labeling this data for training is a manual process and thus a limiting factor for automation. As a result, many companies resort to unsupervised fault-tolerance models. Yet, failure models of this kind can lose accuracy when they need to adapt to non-stationary workloads and diverse host characteristics. To cope with this, we propose a novel modeling approach, called DeepFT, to proactively avoid system overloads and their adverse effects by optimizing task scheduling and migration decisions. DeepFT uses a deep surrogate model to accurately predict and diagnose faults in the system, and co-simulation-based self-supervised learning to dynamically adapt the model in volatile settings. It offers a highly scalable solution, as the model size scales by only 3 and 1 percent per unit increase in the number of active tasks and hosts, respectively. Extensive experimentation on a Raspberry Pi-based edge cluster with DeFog benchmarks shows that DeepFT can outperform state-of-the-art baseline methods in fault-detection and QoS metrics. Specifically, DeepFT gives the highest F1 scores for fault detection, reducing service deadline violations by up to 37% while also improving response time by up to 9




PreGAN: Preemptive Migration Prediction Network for Proactive Fault-Tolerant Edge Computing

Building a fault-tolerant edge system that can quickly react to node ove...

DRAGON: Decentralized Fault Tolerance in Edge Federations

Edge Federation is a new computing paradigm that seamlessly interconnect...

Intelligent Proactive Fault Tolerance at the Edge through Resource Usage Prediction

The proliferation of demanding applications and edge computing establish...

Self-healing Dilemmas in Distributed Systems: Fault-correction vs. Fault-tolerance

Large-scale decentralized systems of autonomous agents interacting via a...

5G Enabled Fault Detection and Diagnostics: How Do We Achieve Efficiency?

The 5th-generation wireless networks (5G) technologies and mobile edge c...

Oakestra white paper: An Orchestrator for Edge Computing

Edge computing seeks to enable applications with strict latency requirem...

GOSH: Task Scheduling Using Deep Surrogate Models in Fog Computing Environments

Recently, intelligent scheduling approaches using surrogate models have ...
