E2EWatch: An End-to-End Anomaly Diagnosis Framework for Production HPC Systems

Deployment code for the E2EWatch anomaly diagnosis framework.

author & maintainer = "Burak Aksar"

email = "baksar@bu.edu"

version = "1.0.0"

Install Virtual Environment:

1-) Create a local virtual environment in the repository folder

python3 -m venv ml_venv

2-) Activate venv

source ml_venv/bin/activate

3-) Install requirements

pip install -r requirements.txt

Running:

Run the Jupyter notebook from inside the venv, not from your system-wide Python installation:

./ml_venv/bin/jupyter notebook
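
To confirm that a notebook kernel or script is really running from ml_venv rather than the system interpreter, a quick check like the one below may help. It is a minimal sketch using only the standard library; the function name in_virtualenv is an illustrative choice and is not part of E2EWatch.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("inside virtual environment:", in_virtualenv())
```

Running this in a notebook cell should report an interpreter path under ml_venv/bin when the environment is set up as described above.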

Under the analysis folder you will find the scripts needed to replicate the unknown apps, unknown inputs, and default anomaly diagnosis experiments.

Use predict.py to train a model; afterwards, you can use the RuntimePredictor class in the runtime folder.

At a high level, E2EWatch requires the following components to provide diagnosis results at runtime in another production system:

Monitoring framework that can collect numeric telemetry data from compute nodes while applications are running. Although we only experiment with LDMS, E2EWatch can be adapted to other popular monitoring frameworks, such as Ganglia or Examon, by modifying the wrappers in the data collection phase.
Labeled data composed of anomalous and normal compute node telemetry data. Labeled data sets can be created using a suite of applications and synthetic anomalies. Another option is to use telemetry data labeled by users.
Backend web service that can provide telemetry data on the fly to the trained model. We use the existing Django web application deployed on the monitoring server. Other backend web services that can handle client requests and query data from the database can be used instead. If runtime diagnosis is not necessary, the pickled model can also be run after the application run has completed.
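
The offline path mentioned in the last component can be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's implementation: the "model" is a hypothetical stand-in (a pickled dict holding a single threshold), the metric name cpu_user and the telemetry values are made up, and the helper names train_toy_model, extract_mean, and diagnose are not part of E2EWatch. A real deployment would unpickle the trained classifier and feed it features extracted from LDMS data.

```python
import csv
import io
import pickle
from statistics import mean

# Illustrative telemetry for one compute node (made-up values, not LDMS output).
TELEMETRY = "timestamp,cpu_user\n0,10\n1,95\n2,99\n"


def train_toy_model(threshold: float) -> bytes:
    """Return a pickled stand-in 'model': a dict holding one metric threshold."""
    return pickle.dumps({"metric": "cpu_user", "threshold": threshold})


def extract_mean(csv_text: str, metric: str) -> float:
    """Reduce the node's time series for `metric` to its mean value."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return mean(float(row[metric]) for row in rows)


def diagnose(model_bytes: bytes, csv_text: str) -> str:
    """Load the pickled model and label a completed application run."""
    model = pickle.loads(model_bytes)  # only unpickle data you trust
    value = extract_mean(csv_text, model["metric"])
    return "anomalous" if value > model["threshold"] else "normal"


if __name__ == "__main__":
    print(diagnose(train_toy_model(threshold=50.0), TELEMETRY))
```

Keeping feature extraction separate from the model call means the same diagnose step could in principle be driven by a backend web service at runtime or by a stored CSV after the run, which mirrors the two options described above.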

Authors

E2EWatch: An End-to-End Anomaly Diagnosis Framework for Production HPC Systems

Authors: Burak Aksar (1), Benjamin Schwaller (2), Omar Aaziz (2), Vitus J. Leung (2), Jim Brandt (2), Manuel Egele (1), Ayse K. Coskun (1)

Affiliations: (1) Department of Electrical and Computer Engineering, Boston University (2) Sandia National Laboratories

This work has been partially funded by Sandia National Laboratories. Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under Contract DE-NA0003525.

License

This project is licensed under the BSD 3-Clause License; see the LICENSE file for details.
