Graphical interface for real time monitoring, automatic event detection, and alert triggering in HPC parallel software
Project reference: 1502
Large HPC physics simulations can show local signs of failure or success long before a global signature is observed. If these signs are noticed early, valuable computing resources can be spared. We have created a system that monitors software in real time, looking for predefined signatures of problems. When one is detected, the software can quickly and automatically render the problem area and send an alert to the user, who can evaluate the seriousness of the issue and decide whether to cancel the computation early.
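As a rough illustration of the idea of predefined signatures, the following minimal Python sketch scans a stream of per-step metric values and collects an alert whenever a signature matches. It is not the existing system's implementation: the `Signature` and `Alert` types, the `scan` function, the `diverging_residual` rule and its threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Alert:
    step: int        # simulation step at which the signature fired
    signature: str   # name of the signature that matched
    value: float     # metric value that triggered it


@dataclass
class Signature:
    name: str
    predicate: Callable[[float], bool]  # True when the value looks problematic


def scan(samples: Iterable[Tuple[int, float]],
         signatures: List[Signature]) -> List[Alert]:
    """Check each (step, value) sample against every signature and collect alerts."""
    alerts = []
    for step, value in samples:
        for sig in signatures:
            if sig.predicate(value):
                alerts.append(Alert(step, sig.name, value))
    return alerts


# Hypothetical signature: a residual that is NaN or above 1e6 suggests divergence.
# (v != v is True only for NaN.)
diverging = Signature("diverging_residual", lambda v: v != v or v > 1e6)

# Made-up samples standing in for values reported by a running simulation.
samples = [(0, 1.0), (1, 0.5), (2, 3.0e7), (3, float("nan"))]
alerts = scan(samples, [diverging])
for a in alerts:
    print(f"step {a.step}: {a.signature} (value={a.value})")
```

In a real deployment the samples would arrive incrementally from the running job, and each alert would trigger the rendering and notification steps described above rather than a print statement.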
In this project we will create a web-accessible interface that allows users to manage their running jobs and any current alerts. The user will also be able to monitor the global metrics of the software to evaluate general performance, including 3D representations of the computational mesh and its domain decomposition. The system is currently implemented on top of Alya, a multiphysics simulation code that is part of the PRACE benchmark suite.
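To make the jobs-and-alerts view more concrete, here is a small Python sketch of how a backend might attach current alerts to each running job and serialise the result as JSON for a web interface. The record fields (`job_id`, `status`, `signature`, `step`), the sample data and the `build_dashboard_payload` function are assumptions made for illustration, not part of the existing system.

```python
import json
from collections import defaultdict

# Hypothetical records; in practice these would come from the monitoring system.
jobs = [
    {"job_id": "job-1", "status": "running"},
    {"job_id": "job-2", "status": "running"},
]
alerts = [
    {"job_id": "job-1", "signature": "diverging_residual", "step": 120},
    {"job_id": "job-1", "signature": "negative_density", "step": 131},
]


def build_dashboard_payload(jobs, alerts):
    """Attach each job's alerts to it (an empty list when the job has none)."""
    by_job = defaultdict(list)
    for alert in alerts:
        by_job[alert["job_id"]].append(
            {"signature": alert["signature"], "step": alert["step"]}
        )
    return [{**job, "alerts": by_job.get(job["job_id"], [])} for job in jobs]


payload = build_dashboard_payload(jobs, alerts)
print(json.dumps(payload, indent=2))
```

A payload of this shape could be pushed to the browser over a WebSocket or served on request; which transport fits best is left open here.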
Project mentor: Fernando Cucchietti
Site Co-ordinator: Maria Ribera Sancho
The student will learn data visualization techniques and technologies, especially applied to dashboard designs.
Student Prerequisites (compulsory)
Student Prerequisites (desirable)
Experience with D3.js, Websockets, or Node.js
- Week 1: Training week
- Week 2: Literature Review Preliminary Report (Plan writing)
- Weeks 3 – 7: Project Development
- Week 8: Final Report write-up
Final Product Description
The final product will allow real-time monitoring of production-scale HPC software and help prevent the unnecessary use of HPC resources.
Adapting the Project – Increasing the Difficulty
The project is at an appropriate cognitive level, taking into account the timeframe and the need to submit a final working product and two reports.
Adapting the Project – Decreasing the Difficulty
The interface will be designed in full, but some features may be left undeveloped to ensure a working product with limited features at the end of the project.
The student will need access to standard computing resources (laptop, internet connection) as well as an account on MareNostrum.
Barcelona Supercomputing Centre