High Performance System Analytics
Project reference: 2217
We propose to analyse operational logs with machine learning methods. Building on our previous work, in which we characterised the operational logs of our cluster computer LISA, we propose a continuation that brings additional data sources into scope, such as EAR and XALT. We plan to evaluate an anomaly detection method on the newly created dataset. Due to the size and complexity of the data, training the machine learning model will require access to the Dutch supercomputer Snellius.
Project Mentor: Damian Podareanu
Project Co-mentor: Caspar van Leeuwen
Site Co-ordinator: Carlos Teijeiro Barjas
Students will learn how to tackle a real-life problem with a machine learning approach at scale. They will also learn about the challenges of curating data relevant to the given problem.
Student Prerequisites (compulsory):
Students must be familiar with Python and basic machine learning techniques.
Students must be comfortable with Linux-based systems and batch schedulers (SLURM).
Student Prerequisites (desirable):
Basic knowledge about deep neural networks is a plus.
Systems knowledge is a plus.
- https://arxiv.org/abs/2107.11832 (original project paper)
- https://github.com/sara-nl/SURFace (code)
- https://zenodo.org/record/3878143#.Yg_Xb4zMJl8 (data)
- https://arxiv.org/abs/2012.06765 (simple VQVAE for anomaly detection)
Week 1: training week
Week 2: literature review and getting up to date with previous efforts
Week 3: update plan and practical code setup
Weeks 4 & 5: adding new data to the old collection
Weeks 6 & 7: experiments with anomaly detection methods on the collected data
Week 8: prepare final report and wrap-up
This project can be done by 1 or 2 students. In the case of 2 students, we will add both EAR and XALT; otherwise we will choose one of them, in consultation with the student.
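As a rough illustration of what adding a new data source to the existing collection might involve, the sketch below left-joins per-job records from an additional source onto existing job records. It is only a sketch under stated assumptions: the field names (`job_id`, `node`, `runtime_s`, `energy_j`), the helper `merge_job_records`, and the idea that both sources share a job identifier are hypothetical and are not taken from the actual LISA, EAR, or XALT schemas.

```python
def merge_job_records(base_rows, extra_rows, key="job_id", prefix="extra_"):
    """Left-join extra_rows onto base_rows using a shared job identifier.

    Fields from the extra source are prefixed to avoid name clashes.
    Jobs without a matching extra record are kept unchanged.
    """
    # Index the extra source by job ID, keeping the first record per job.
    by_job = {}
    for row in extra_rows:
        by_job.setdefault(row[key], row)

    merged = []
    for row in base_rows:
        out = dict(row)
        for name, value in by_job.get(row[key], {}).items():
            if name != key:
                out[prefix + name] = value
        merged.append(out)
    return merged


# Hypothetical example records (not real data).
base_rows = [
    {"job_id": 101, "node": "r1n1", "runtime_s": 3600},
    {"job_id": 102, "node": "r1n2", "runtime_s": 120},
]
energy_rows = [{"job_id": 101, "energy_j": 5.4e6}]

merged = merge_job_records(base_rows, energy_rows, prefix="ear_")
```

A left join keeps every job from the existing collection, so records missing from the new source show up as absent fields rather than silently dropped jobs, which matters when the merged data is later used for anomaly detection.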
Final Product Description:
We expect a proof of concept for the task of identifying anomalies in our operational logs.
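To make the idea of anomaly scoring concrete, the sketch below shows only the vector-quantisation step that a VQVAE builds on: each feature vector is matched to its nearest codebook entry, and the distance to that entry is used as a simple anomaly score. This is an assumption-laden simplification, not the project's method: in a real VQVAE the codebook is learned together with an encoder and decoder and applied to latent vectors, whereas here the codebook, the sample vectors, the threshold, and the helper names are all hypothetical.

```python
import math


def quantization_error(vector, codebook):
    """Return (index, distance) of the codebook entry nearest to vector."""
    best_index, best_dist = -1, math.inf
    for i, code in enumerate(codebook):
        d = math.dist(vector, code)  # Euclidean distance
        if d < best_dist:
            best_index, best_dist = i, d
    return best_index, best_dist


def flag_anomalies(vectors, codebook, threshold):
    """Flag vectors whose distance to the nearest code exceeds threshold."""
    return [quantization_error(v, codebook)[1] > threshold for v in vectors]


# Hypothetical fixed codebook and samples (a real codebook would be learned).
codebook = [(0.0, 0.0), (1.0, 1.0)]
samples = [(0.1, 0.0), (0.9, 1.0), (5.0, 5.0)]

flags = flag_anomalies(samples, codebook, threshold=0.5)
```

With these made-up values, only the last sample lies far from every codebook entry and is flagged. In practice the threshold would have to be chosen from the score distribution on data assumed to be normal.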
Adapting the Project: Increasing the Difficulty:
We can evaluate a more complicated method like a hierarchical VQVAE.
Adapting the Project: Decreasing the Difficulty:
We could stop after the data curation and merger and forgo the anomaly detection.
Students will require access to Snellius (supercomputer) and LISA (cluster computer), provided by SURF.