High Performance System Analytics

Project reference: 2217

We propose to analyse operational logs with machine learning methods. Building on our previous work, in which we characterised the operational logs of our cluster computer LISA [1], we propose a continuation that brings additional data sources into scope, such as EAR [2] and XALT [3]. We plan to evaluate an anomaly detection method on the newly created dataset. Because of the size and complexity of the data, training the machine learning model will require access to the Dutch supercomputer Snellius.
[1] https://cs.paperswithcode.com/paper/a-holistic-analysis-of-datacenter-operations
[2] https://github.com/lenovo/ear
[3] https://www.tacc.utexas.edu/research-development/tacc-projects/xalt

Project Mentor: Damian Podareanu

Project Co-mentor: Caspar van Leeuwen

Site Co-ordinator: Carlos Teijeiro Barjas

Learning Outcomes:
Students will learn how to tackle a real-life problem with a machine learning approach at scale. They will also learn about the challenges of curating data relevant to the given problem.

Student Prerequisites (compulsory):
The student must be familiar with Python and basic machine learning techniques.
The student must be comfortable with Linux-based systems and batch schedulers (SLURM).

Student Prerequisites (desirable):
Basic knowledge about deep neural networks is a plus.
Systems knowledge is a plus.

Training Materials:
(simple) VQVAE for anomaly detection: https://arxiv.org/abs/2012.06765


Week 1: training week
Week 2: literature review and getting up to date with previous efforts
Week 3: update plan and practical code setup
Weeks 4 & 5: adding new data to the old collection
Weeks 6 & 7: experiments with anomaly detection methods on the collected data
Week 8: prepare final report and wrap-up

This project can be done by 1 or 2 students. With 2 students, we will add both EAR and XALT; otherwise we will choose one of them in consultation with the student.
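
The data-merging step planned for weeks 4 and 5 can be pictured as a join on a shared job identifier. The sketch below is only an illustration, assuming that each source can be exported as per-job records keyed by a scheduler job ID. The function `merge_by_job_id` and the field names (`job_id`, `nodes`, `runtime_s`, `avg_power_w`) are hypothetical and do not reflect the actual LISA, EAR or XALT schemas.

```python
from typing import Dict, Iterable, List


def merge_by_job_id(
    base: Iterable[dict], extra: Iterable[dict], key: str = "job_id"
) -> List[dict]:
    """Left-join `extra` records onto `base` records using a shared key.

    Base records without a match are kept unchanged, so no jobs are lost.
    """
    # Index the additional source by key for constant-time lookup.
    extra_by_key: Dict[object, dict] = {rec[key]: rec for rec in extra}
    merged = []
    for rec in base:
        combined = dict(rec)  # copy so the input record is not mutated
        combined.update(extra_by_key.get(rec[key], {}))
        merged.append(combined)
    return merged


# Hypothetical records; real field names will differ.
scheduler_logs = [
    {"job_id": 101, "nodes": 2, "runtime_s": 3600},
    {"job_id": 102, "nodes": 1, "runtime_s": 120},
]
energy_logs = [{"job_id": 101, "avg_power_w": 410.5}]

merged = merge_by_job_id(scheduler_logs, energy_logs)
```

A left join is used here so that jobs missing from the new source stay in the dataset with their original fields. Whether such incomplete records should be kept or dropped is a curation decision for the project.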

Final Product Description:
We expect a proof of concept for identifying anomalies in our operational logs.
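
To give a sense of what a first baseline for such a proof of concept might look like, the sketch below flags outliers in a single numeric metric with a modified z-score based on the median absolute deviation (MAD). This is a simple statistical baseline, not the VQVAE method listed in the training materials. The function name `mad_anomalies`, the example runtimes and the choice of metric are assumptions made for illustration; the cutoff of 3.5 is a commonly used default.

```python
from statistics import median
from typing import List, Sequence


def mad_anomalies(values: Sequence[float], threshold: float = 3.5) -> List[int]:
    """Return the indices of values whose modified z-score exceeds `threshold`."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # At least half of the values equal the median; the score is undefined.
        return []
    # 0.6745 rescales the MAD so that scores are comparable to standard
    # z-scores when the data are normally distributed.
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Hypothetical job runtimes in seconds; the last job ends unusually early.
runtimes = [3600, 3580, 3620, 3590, 3610, 120]
flagged = mad_anomalies(runtimes)
```

A baseline of this kind could serve as a sanity check against which a learned model is compared, since it needs no training and is easy to interpret.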

Adapting the Project: Increasing the Difficulty:
We can evaluate a more complicated method like a hierarchical VQVAE.

Adapting the Project: Decreasing the Difficulty:
We could stop after the data curation and merger and forgo the anomaly detection.

Students will require access to Snellius (supercomputer) and LISA (cluster computer), both provided by SURF.

