Project reference: 1810
Energy efficiency is one of the most pressing problems in managing HPC facilities. It involves many technological issues, including web visualization, interaction with HPC systems and schedulers, big data analysis, virtual machine manipulation, and authentication protocols.
Our research group has prepared a tool that performs real-time and historical analysis of the most important observables in order to maximize the energy efficiency of an HPC cluster. In the framework of SoHPC 2017, we developed a web interface that shows the required observables (e.g. the number of running jobs, the average CPU temperature per job, peak temperatures and other architectural metrics) over the desired time interval.
The web page also includes some statistics about the jobs running on the selected system and a 3D view of the energy load and other observables of the GALILEO cluster.
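As an illustration of the kind of per-job observable described above, the following minimal sketch computes the average CPU temperature per job within a chosen time interval. The sample layout (timestamp, job identifier, temperature), the function name `average_temperature_per_job` and the data values are all hypothetical; the actual tool and its databases may be organized quite differently.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical samples: (unix_timestamp, job_id, cpu_temperature_celsius).
samples = [
    (100, "job_a", 52.0),
    (110, "job_a", 54.0),
    (120, "job_b", 61.0),
    (130, "job_b", 63.0),
    (200, "job_a", 70.0),  # falls outside the interval queried below
]


def average_temperature_per_job(samples, start, end):
    """Return {job_id: mean temperature} for samples with start <= t < end."""
    by_job = defaultdict(list)
    for timestamp, job_id, temperature in samples:
        if start <= timestamp < end:
            by_job[job_id].append(temperature)
    return {job_id: mean(values) for job_id, values in by_job.items()}


result = average_temperature_per_job(samples, start=100, end=150)
print(result)
```

The same grouping pattern would apply to other observables, such as power draw per node, by swapping the value that is aggregated.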
This year, in the framework of the SoHPC program, we aim to improve the capabilities of this tool and adapt it to the Tier0 Marconi cluster and the KNL architecture. We also aim to build a web interface that plots the energy efficiency observables of the D.AV.I.D.E. cluster, which is based on a POWER8 + Nvidia GPU architecture.
Moving to Tier0 systems will increase the amount of data produced, which will be hard for system administrators to process visually. Deep Learning and Artificial Intelligence can cope with this amount of data, predict and detect anomalies, and spot opportunities for application and system optimization. The student will learn how to combine Deep Learning tools and algorithms with the monitored data to identify observables that are useful for predicting cluster faults.
This tool could help HPC system administrators to:
- optimize the energy consumption and performance of their machines;
- avoid unexpected faults and anomalies of the machines.
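To make the idea of anomaly detection on monitored data concrete, here is a minimal sketch of a simple statistical baseline: a rolling z-score detector. This is not a Deep Learning method and is not the project's approach; it is only a stand-in that shows the shape of the task. The function name `find_anomalies`, the window size, the threshold and the power readings are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, threshold=3.0):
    """Return indices of values that deviate from the mean of the
    preceding `window` values by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # A perfectly flat history gives no scale to compare against; skip.
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical node power readings in watts, with one obvious spike.
power_watts = [200.0, 202.0, 199.0, 201.0, 200.0,
               203.0, 198.0, 350.0, 201.0, 199.0]
anomalous = find_anomalies(power_watts)
print(anomalous)
```

A learned model would replace the fixed mean-and-deviation rule with a predictor trained on historical data, but the input (a time series of observables) and the output (flagged time steps) would have the same form.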
Project Mentor: Dr. Andrea Bartolini
Project Co-mentor: Dr. Giuseppa Muscianisi
Site Co-ordinator: Dr. Massimiliano Guarrasi
Increase the student's skills in:
- Big Data Analysis
- Deep learning
- OpenStack VMs
- HPC schedulers (e.g. Slurm)
- Internet of Things (MQTT)
- HPC infrastructures
- Energy efficiency
Student Prerequisites (compulsory):
- Python or C/C++ (Python is preferred)
Student Prerequisites (desirable):
Workplan:
- Week 1: Common Training session;
- Week 2: Introduction to CINECA systems, small tutorials on parallel visualization and detailed work planning;
- Week 3: Problem analysis and delivery of the final work plan at the end of the week;
- Week 4, 5: Production phase (set-up of the web page);
- Week 6, 7: Final stage of the production phase (depending on the results and time frame, the set of observables will be extended). Preparation of the final movie;
- Week 8: Finishing the final movie. Writing the final report.
Final Product Description:
An interactive web page will be created. This page will show:
- a series of parameters related to the energy efficiency of an HPC system;
- some statistics about the jobs running in the selected system;
- a data analytics tool to search the databases of observables;
- an interactive 3D plot of the Marconi system.
Adapting the Project: Increasing the Difficulty:
The student will help us prepare a 3D rendering of the Marconi cluster via Blender4Web, in order to visualize the energy load directly on the cluster. The student will also work with Deep Neural Network algorithms to study predictive anomaly detection on the Marconi cluster.
Adapting the Project: Decreasing the Difficulty:
If necessary, we can reduce the effort by creating only a web page that shows the jobs running on the cluster and some other statistical information extracted from our DB.
The student will have access to our facility, our HPC systems, the two databases containing all energy parameters, and job information. They will also manage a dedicated virtual machine.