The emergence of latency-critical AI applications has been supported by the evolution of the edge computing paradigm. However, edge solutions are typically resource-constrained, posing reliability challenges due to heightened contention for compute and communication capacities and faulty application behavior in the presence of overload conditions. Although a large amount of generated log data can be mined for fault prediction, labeling this data for training is a manual process and thus a limiting factor for automation. Due to this, many companies resort to unsupervised fault-tolerance models. Yet, failure models of this kind can incur a loss of accuracy when they need to adapt to non-stationary workloads and diverse host characteristics. To cope with this, we propose a novel modeling approach, called DeepFT, to proactively avoid system overloads and their adverse effects by optimizing the task scheduling and migration decisions. DeepFT uses a deep surrogate model to accurately predict and diagnose faults in the system and co-simulation based self-supervised learning to dynamically adapt the model in volatile settings. It offers a highly scalable solution as the model size scales by only 3 and 1 percent per unit increase in the number of active tasks and hosts. Extensive experimentation on a Raspberry-Pi based edge cluster with DeFog benchmarks shows that DeepFT can outperform state-of-the-art baseline methods in fault-detection and QoS metrics. Specifically, DeepFT gives the highest F1 scores for fault-detection, reducing service deadline violations by up to 37% while also improving response time by up to 9%.
Clone repo.
git clone https://github.com/imperial-qore/DeepFT.git
cd DeepFT/
Install dependencies.
sudo apt -y update
python3 -m pip install --upgrade pip
python3 -m pip install matplotlib scikit-learn
python3 -m pip install -r requirements.txt
python3 -m pip install torch==1.7.1+cpu torchvision==0.8.2+cpu -f https://download.pytorch.org/whl/torch_stable.html
export PATH=$PATH:~/.local/bin
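Before running anything, it can help to confirm that the packages installed above are visible to `python3`. The following is a minimal sketch, not part of the repository: the `check_modules` helper and the `REQUIRED_MODULES` tuple are hypothetical names, and the list covers only the packages named explicitly in the commands above (the contents of `requirements.txt` are not assumed).

```python
import importlib.util

# Import names for the packages installed explicitly above.
# Assumption: scikit-learn is imported as "sklearn"; requirements.txt is not covered.
REQUIRED_MODULES = ("matplotlib", "sklearn", "torch", "torchvision")


def check_modules(names=REQUIRED_MODULES):
    """Return {module_name: bool}, True when the module can be located."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_modules().items():
        print(f"{name}: {'ok' if found else 'MISSING'}")
```

Any module reported as `MISSING` suggests that the corresponding `pip install` step did not complete for the interpreter being used.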
Change line 118 in `main.py` to use one of the implemented fault-tolerance techniques (`DeepFTRecovery`, `PCFTRecovery`, `DFTMRecovery`, `ECLBRecovery`, `AWGGRecovery`, or `TopoMADRecovery`), then run the code using the following command.
python3 main.py
Items | Contents |
---|---|
Pre-print | (coming soon) |
Contact | Shreshth Tuli (@shreshthtuli) |
Funding | Imperial President's scholarship |
Our work has been accepted at the IEEE Conference on Computer Communications (INFOCOM) 2023. Cite it using the BibTeX entry below.
@inproceedings{tuli2022deepft,
title={{DeepFT: Fault-Tolerant Edge Computing using a Self-Supervised Deep Surrogate Model}},
author={Tuli, Shreshth and Casale, Giuliano and Cherkasova, Ludmila and Jennings, Nicholas R},
booktitle={IEEE Conference on Computer Communications (INFOCOM)},
year={2023},
organization={IEEE}
}
BSD-3-Clause. Copyright (c) 2022, Shreshth Tuli. All rights reserved.
See License file for more details.