Job execution dashboard for HPC
During my Summer of HPC, I combined a low-level acquisition program with a database to create a beautiful dashboard that monitors jobs at runtime. In other words, it visualizes the execution of parallel programs in real time.
(A better view for saving energy in HPC using a low-level acquisition program)
Programs (called jobs) that run on supercomputers consume a lot of energy, which is a limiting factor for the scalability of these machines. These programs parallelize their computations across multiple “nodes” and across the multiple cores of each CPU, and the processes must synchronize during the computation to coordinate. However, such parallel applications often waste CPU power while the processes are synchronizing.
This is where COUNTDOWN comes in! It is a tool that reduces the CPU frequency (and therefore the energy consumption) during idle and synchronization time, while trying to affect performance as little as possible.
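To get a feel for why lowering the frequency during synchronization can pay off, here is a toy model in Python. It is only a back-of-the-envelope sketch and not how COUNTDOWN is implemented: the `Phase` class, the `energy_joules` function, and the power figures (100 W at full frequency, 60 W when throttled) are all made up for illustration, and the model assumes throttling does not change the run time.

```python
from dataclasses import dataclass


@dataclass
class Phase:
    kind: str       # "compute" or "sync" (hypothetical labels)
    seconds: float  # duration of the phase


def energy_joules(phases, p_high_w, p_low_w, throttle_sync):
    """Sum power * time over all phases.

    When throttle_sync is True, "sync" phases are billed at the lower
    power p_low_w; every other phase is billed at p_high_w.
    """
    total = 0.0
    for phase in phases:
        low = throttle_sync and phase.kind == "sync"
        total += phase.seconds * (p_low_w if low else p_high_w)
    return total


# Illustrative job: 14 s of computation and 6 s spent waiting in synchronization.
phases = [
    Phase("compute", 8.0),
    Phase("sync", 2.0),
    Phase("compute", 6.0),
    Phase("sync", 4.0),
]

baseline = energy_joules(phases, 100.0, 60.0, throttle_sync=False)
throttled = energy_joules(phases, 100.0, 60.0, throttle_sync=True)
saving = 1 - throttled / baseline

print(f"baseline: {baseline} J, throttled: {throttled} J, saving: {saving:.0%}")
```

With these invented numbers the job spends 30% of its time synchronizing, and throttling only those phases cuts the modeled energy by 12%. Real savings depend on the application and the hardware.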
To help users understand how their programs run with COUNTDOWN and how much energy they use, my goal is to design a dashboard that monitors programs using COUNTDOWN, and to extend COUNTDOWN so that it sends all the necessary information about the job.
Overview of existing solutions
MPI profilers (programs used to collect debugging data from MPI programs) already exist, from vendors such as IBM and Intel. However, they come at a cost:
- They lead to perturbations of the performance of jobs
- They report the data only at the end of the job and not during runtime
COUNTDOWN is an MPI profiler that tries to have as little impact as possible and can also reduce the energy consumption of a given job.
It is developed at CINECA (+link) and can produce time-series reports, which is especially useful for runtime profiling.
What is ExamonDB
Examon is a distributed and scalable monitoring infrastructure with a database.
By using COUNTDOWN with Examon, it is possible to send MPI profiling information to the Examon database during runtime.
Then it becomes only a matter of design to create a beautiful dashboard using Grafana, the open-source, state-of-the-art web application for interactive data visualization.
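As a rough illustration of what "sending profiling information during runtime" could look like, the sketch below builds one time-series sample tagged with a job ID and serializes it to JSON. The field names (`job_id`, `node`, `metric`, `value`, `timestamp`) and the helper functions are hypothetical; this is not the actual Examon schema or the COUNTDOWN output format, and the transport to the database is left out on purpose.

```python
import json
import time


def make_sample(job_id, node, metric, value, timestamp=None):
    """Build one time-series sample as a plain dictionary.

    All field names are assumptions made for this example.
    """
    return {
        "job_id": job_id,
        "node": node,
        "metric": metric,
        "value": float(value),
        # Use the current time unless the caller supplies a timestamp.
        "timestamp": time.time() if timestamp is None else float(timestamp),
    }


def encode(sample):
    """Serialize a sample to a JSON string with a stable key order."""
    return json.dumps(sample, sort_keys=True)


# One sample: time spent in MPI calls on a node, for an invented job ID.
sample = make_sample("12345", "node01", "mpi_time_s", 0.42, timestamp=1_600_000_000)
payload = encode(sample)
print(payload)
```

Tagging every sample with the job ID is what would later let a dashboard select only the data belonging to one job.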
The Grafana dashboard
The dashboard automatically sets the time range according to the specified Job ID!
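The idea behind this automatic time range can be sketched in a few lines of Python: look at the samples that carry the requested job ID and take the earliest and latest timestamps. This is only an illustration of the principle, not how the Grafana dashboard is configured; the `job_time_range` function, the sample layout, and the optional padding are assumptions.

```python
def job_time_range(samples, job_id, padding_s=0.0):
    """Return (start, end) covering every sample of the given job.

    samples is a list of dicts with hypothetical "job_id" and "timestamp"
    keys. Returns None when the job has no samples.
    """
    stamps = [s["timestamp"] for s in samples if s["job_id"] == job_id]
    if not stamps:
        return None
    return (min(stamps) - padding_s, max(stamps) + padding_s)


# Invented samples from two jobs, timestamps in seconds.
samples = [
    {"job_id": "A", "timestamp": 100.0},
    {"job_id": "B", "timestamp": 150.0},
    {"job_id": "A", "timestamp": 130.0},
    {"job_id": "A", "timestamp": 115.0},
]

print(job_time_range(samples, "A"))                 # range of job A only
print(job_time_range(samples, "A", padding_s=5.0))  # same range with a margin
print(job_time_range(samples, "C"))                 # unknown job
```

A small padding on each side can make the first and last data points easier to read on a plot.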
The design of the dashboard is split into 3 different parts:
- The Summary:
This is where all the general information about the job is shown, along with some global figures such as the execution time and the total energy used.
- The Timeseries:
This is the part of the dashboard with all the runtime information.
- The GPU:
If enabled, COUNTDOWN can also profile GPU data.
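The global figures shown in the Summary, such as the execution time and the total energy, can in principle be derived from the time series itself. The sketch below assumes the job is described by a list of (timestamp, power) samples and integrates the power with the trapezoidal rule. The `summarize` function and the sample values are hypothetical, and the real dashboard may compute these figures differently.

```python
def summarize(power_samples):
    """Compute execution time and energy from (timestamp_s, power_w) pairs.

    Energy is the area under the power curve, approximated with the
    trapezoidal rule between consecutive samples.
    """
    ordered = sorted(power_samples)
    if len(ordered) < 2:
        return {"execution_time_s": 0.0, "energy_j": 0.0}
    energy = 0.0
    for (t0, p0), (t1, p1) in zip(ordered, ordered[1:]):
        energy += (t1 - t0) * (p0 + p1) / 2
    return {
        "execution_time_s": ordered[-1][0] - ordered[0][0],
        "energy_j": energy,
    }


# Invented power trace: 100 W at first, then dropping to 60 W.
trace = [(0.0, 100.0), (10.0, 100.0), (20.0, 60.0), (30.0, 60.0)]
summary = summarize(trace)
print(summary)
```

For this invented trace the job runs for 30 s and uses 2400 J: 1000 J in the first interval, 800 J while the power drops, and 600 J in the last interval.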