Anomaly detection of system failures on HPC machines using Machine Learning Techniques

Project reference: 1906
Anomaly detection is one of the timeliest problems in managing HPC facilities. It touches on many technological issues, including big data analysis, machine learning, virtual machine management and authentication protocols.
Our research group has already developed a tool that performs real-time and historical analysis of the most important observables in order to maximize the energy efficiency of an HPC cluster.
We want to merge the data obtained from this tool with the system logs and search for observables that could anticipate system failures.
Within the framework of the SoHPC program, we aim to perform preliminary tests of deep learning methodologies and log-mining databases, in order to extract knowledge from node logs and correlate it with node failures.
The first step will be to implement a system that collects data from system logs using Elasticsearch. These data will then be used to train a neural network to find correlations between system faults and system observables.
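As a rough illustration of this first step, the sketch below shows one way log records could be selected and turned into labelled training examples. It is not the project's actual pipeline: the field names (`node`, `timestamp`, `message`), the keyword list, the 5-minute window and the 10-minute prediction horizon are all assumptions made for the example. The query body follows the Elasticsearch Query DSL (`bool`, `term`, `range`) but is only constructed here, not sent to a cluster.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical keywords used as features; the real observables are not
# specified in the project description.
KEYWORDS = ("error", "timeout", "ecc")


def build_query(node, start, end):
    """Build a Query DSL body selecting one node's logs in [start, end)."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"node": node}},
                    {"range": {"timestamp": {"gte": start.isoformat(),
                                             "lt": end.isoformat()}}},
                ]
            }
        }
    }


def featurize(records, window_start, window):
    """Count, per keyword, the log messages in the window that contain it."""
    window_end = window_start + window
    counts = Counter()
    for rec in records:
        if window_start <= rec["timestamp"] < window_end:
            text = rec["message"].lower()
            for kw in KEYWORDS:
                if kw in text:
                    counts[kw] += 1
    return [counts[kw] for kw in KEYWORDS]


def label(window_start, window, failures, horizon):
    """Return 1 if a failure falls within `horizon` after the window ends."""
    window_end = window_start + window
    return int(any(window_end <= f < window_end + horizon for f in failures))


def make_dataset(records, failures, start, n_windows,
                 window=timedelta(minutes=5), horizon=timedelta(minutes=10)):
    """Slice the log into consecutive windows of (features, label) pairs."""
    rows = []
    for i in range(n_windows):
        ws = start + i * window
        rows.append((featurize(records, ws, window),
                     label(ws, window, failures, horizon)))
    return rows


# Toy data standing in for records returned by a log search (assumed shape).
t0 = datetime(2019, 7, 1, 12, 0)
records = [
    {"node": "node01", "timestamp": t0 + timedelta(minutes=1),
     "message": "ECC error corrected"},
    {"node": "node01", "timestamp": t0 + timedelta(minutes=2),
     "message": "link timeout"},
    {"node": "node01", "timestamp": t0 + timedelta(minutes=7),
     "message": "job started"},
]
failures = [t0 + timedelta(minutes=8)]  # assumed node-failure timestamp

dataset = make_dataset(records, failures, t0, n_windows=2)
query = build_query("node01", t0, t0 + timedelta(minutes=5))
```

Each row of `dataset` pairs a feature vector with a 0/1 label, which is the kind of input a neural network classifier could be trained on. Keyword counts are only a placeholder feature; the observables from the monitoring tool could be appended to the same vectors.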
Project Mentor: Andrea Bartolini
Project Co-mentor: Andrea Borghesi
Site Co-ordinator: Massimiliano Guarrasi
Participant: Martin Molan
Learning Outcomes:
Student Prerequisites (compulsory):
- Python or C/C++ (Python is preferred)
Student Prerequisites (desirable):
- Elasticsearch
- TensorFlow
- Cassandra
- Spark
- Blender
- MQTT
Training Materials:
None.
Workplan:
- Week 1: Common Training session
- Week 2: Introduction to CINECA systems, short tutorials on Elasticsearch and TensorFlow, and detailed work planning.
- Week 3: Problem analysis and delivery of the final workplan at the end of the week.
- Week 4, 5: Production phase (log mining and training of the neural network).
- Week 6, 7: Final stage of the production phase (depending on the results and timeframe, the set of observables will be increased). Preparation of the final movie.
- Week 8: Finishing the final movie. Writing the final report.
Final Product Description:
A full big data system to process log information, as well as a trained deep neural network to predict node anomalies.
Adapting the Project: Increasing the Difficulty:
A simple tool to perform live anomaly detection will be prepared and installed on virtual machines.
Adapting the Project: Decreasing the Difficulty:
If necessary, we could reduce the effort by creating only a log mining application using Elasticsearch.
Resources:
The student will have access to our facility, our HPC systems, and the databases containing all the measurements, system logs and node status information. They will also be able to manage a dedicated virtual machine.