Anomaly detection of system failures on HPC accelerated machines using Machine Learning Techniques


Project reference: 2004

Anomaly detection is one of the most pressing problems in managing HPC facilities. It touches on many technologies, including big data analysis, machine learning, virtual machine management and authentication protocols.

Our research group has already developed a tool that performs real-time and historical analysis of the most important observables, with the aim of maximizing the energy efficiency and maintainability of an HPC cluster.

The tool is currently being ported to CINECA’s new GPU-based petascale cluster. Once it is in production (probably by Q2 2020), we will have access to a new dataset. The project focuses on the data-driven automation of the new cluster, and big data analytics will be performed on this dataset.

Within the SoHPC programme, we aim to run preliminary tests of deep learning methodologies and databases to extract knowledge from node-level and system-level metrics and to correlate it with node and job failures.
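As a first illustration of correlating a node-level metric with failures, the sketch below computes the Pearson correlation between one metric and a binary failure flag. It is a simple statistical stand-in, not the deep learning approach the project targets, and the field names (`temp_c`, `failed`) and sample values are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation information
    return cov / sqrt(var_x * var_y)


# Hypothetical samples: one row per (node, time window).
# "failed" is 1 if the node failed during that window, else 0.
samples = [
    {"node": "node01", "temp_c": 55.0, "failed": 0},
    {"node": "node02", "temp_c": 58.0, "failed": 0},
    {"node": "node03", "temp_c": 61.0, "failed": 0},
    {"node": "node04", "temp_c": 83.0, "failed": 1},
    {"node": "node05", "temp_c": 86.0, "failed": 1},
    {"node": "node06", "temp_c": 57.0, "failed": 0},
]

temps = [s["temp_c"] for s in samples]
failed = [s["failed"] for s in samples]
r = pearson(temps, failed)
print(f"correlation between temperature and failure: {r:.2f}")
```

A strongly positive or negative value only suggests that a metric is worth feeding to a model; it does not by itself establish that the metric predicts failures.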

Project Mentor: Andrea Bartolini

Project Co-mentor: Andrea Borghesi

Site Co-ordinator: Massimiliano Guarrasi

Participants: Aisling Paterson, Stefan Popov, Nathan Byford

Learning Outcomes:
Increase the student’s skills in:

  • Big Data Analysis
  • Elasticsearch/Cassandra
  • Deep learning
  • TensorFlow
  • OpenStack VMs
  • Python
  • Blender
  • HPC schedulers (particularly Slurm)
  • Internet of Things (MQTT)
  • HPC infrastructures

Student Prerequisites (compulsory):

  • Python or C/C++ (Python preferred)
  • NumPy

Student Prerequisites (desirable):

  • Matplotlib
  • Pandas
  • Elasticsearch
  • TensorFlow
  • Cassandra
  • Spark
  • Blender
  • MQTT

Training Materials:


  • Week 1: Common Training session
  • Week 2: Introduction to CINECA systems, short tutorials on big data and TensorFlow, and detailed work planning.
  • Week 3: Problem analysis; delivery of the final work plan at the end of the week.
  • Week 4, 5: Production phase (log mining and training of the neural network).
  • Week 6, 7: Final stage of the production phase (depending on the results and the time available, the set of observables may be extended). Preparation of the final movie.
  • Week 8: Completion of the final movie. Writing of the final report.

Final Product Description
A complete big data system for processing heterogeneous information, together with a trained deep neural network that predicts node anomalies.

Adapting the Project: Increasing the Difficulty:
A simple tool for live anomaly detection will be prepared and installed on a virtual machine.
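To give a feel for what live detection involves, here is a minimal sketch of a streaming detector that flags a reading when it deviates strongly from a rolling window of recent values. It uses a plain z-score rule as a stand-in for the trained neural network; the class name, window size, threshold and power readings are all hypothetical, and in practice the readings might arrive over a message stream such as MQTT.

```python
from collections import deque
from math import sqrt


class RollingZScoreDetector:
    """Flags values that lie far from the mean of a recent window."""

    def __init__(self, window=20, threshold=3.0, min_samples=5):
        self.values = deque(maxlen=window)  # most recent readings only
        self.threshold = threshold          # z-score above which we flag
        self.min_samples = min_samples      # warm-up before any flagging

    def update(self, value):
        """Return True if `value` is anomalous, then add it to the window."""
        is_anomaly = False
        if len(self.values) >= self.min_samples:
            mean = sum(self.values) / len(self.values)
            var = sum((v - mean) ** 2 for v in self.values) / len(self.values)
            std = sqrt(var)
            if std > 0 and abs(value - mean) / std > self.threshold:
                is_anomaly = True
        # Note: anomalous values are also stored, which widens the window's
        # spread for a while; a real tool might choose to exclude them.
        self.values.append(value)
        return is_anomaly


# Hypothetical node power readings in watts, with one spike.
detector = RollingZScoreDetector(window=10, threshold=3.0)
readings = [200, 202, 199, 201, 200, 203, 198, 201, 350, 200]
flags = [detector.update(r) for r in readings]

for reading, flag in zip(readings, flags):
    if flag:
        print(f"anomaly: {reading} W")
```

The same `update` interface could wrap a trained model instead: the detector would keep a window of recent metrics and ask the model to score each new sample.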

Adapting the Project: Decreasing the Difficulty:
If necessary, the effort could be reduced by creating only a log mining application using Elasticsearch.
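A log mining query of this kind could look roughly like the sketch below, which builds an Elasticsearch query body that counts error-level log entries per node over a time range. The field names (`level`, `@timestamp`, `hostname.keyword`), the helper name and the dates are assumptions for illustration; the real index mapping on the CINECA systems may differ. The body is built as a plain dictionary, so no client library is needed to inspect it.

```python
import json


def build_error_count_query(start, end, level="error", top_n=10):
    """Build a query body counting log entries of one level per node.

    All field names are hypothetical and must match the actual index mapping.
    """
    return {
        "size": 0,  # return only aggregation results, not individual hits
        "query": {
            "bool": {
                "must": [{"match": {"level": level}}],
                "filter": [
                    {"range": {"@timestamp": {"gte": start, "lt": end}}}
                ],
            }
        },
        "aggs": {
            "errors_per_node": {
                # bucket matching entries by node, keeping the top_n busiest
                "terms": {"field": "hostname.keyword", "size": top_n}
            }
        },
    }


query = build_error_count_query(
    "2020-06-01T00:00:00Z", "2020-06-02T00:00:00Z"
)
print(json.dumps(query, indent=2))
```

The resulting dictionary can be sent as the body of a search request against the log index; nodes with unusually high error counts are natural candidates for closer inspection.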

The student will have access to our facility, our HPC systems, and the databases containing all the measurements, system logs and node status information. They will also be able to manage a dedicated virtual machine.

