Your Slurm job has been terminated

Wow, the Summer of HPC has gone by so fast. The little country of Luxembourg welcomed us only a few weeks ago and now it's time to go back. The last moments at the HPC group were full of things to deliver, and only a little time was left for the last commits on the repo. The presentation, final video, report and source-code documentation have filled our last days.

The code developed during the SoHPC has reached its final state and is now under testing before production. We are using the remaining time to test it on the Iris cluster and to harden the source code. Even though I developed and tested the plugin on the virtualized machine, the shift to the physical cluster is still a big step.

The code has been developed and tested through three stages:

  • In the first stage, we used the virtualized cluster to make sure the code would run safely on the physical one. This development work took most of the project time.
  • In the second stage, a prototype of the code was run on the Iris cluster without affecting the Slurm configuration. At this point the core data-retrieval functions were already able to collect information from all the GPUs on the mainboard (see the sketch after this list).
  • In the last phase, the plugin was fully assembled, installed on the Iris cluster and tested on compute nodes featuring dual Skylake CPUs and four Volta V100 GPUs.
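
To give an idea of what those data-retrieval functions do, here is a minimal sketch in Python using the pynvml bindings. The plugin itself hooks into Slurm and queries NVML directly, so the snippet below is only an illustration of the kind of per-GPU power query it relies on, not the actual plugin code.

```python
# Minimal sketch of a per-GPU power query via NVML (illustrative only;
# the real plugin integrates with Slurm rather than running standalone).
from pynvml import (
    nvmlInit, nvmlShutdown, nvmlDeviceGetCount,
    nvmlDeviceGetHandleByIndex, nvmlDeviceGetPowerUsage,
)

def sample_gpu_power():
    """Return the current power draw (in watts) of every GPU on the node."""
    nvmlInit()
    try:
        readings = []
        for i in range(nvmlDeviceGetCount()):
            handle = nvmlDeviceGetHandleByIndex(i)
            # nvmlDeviceGetPowerUsage reports milliwatts
            readings.append(nvmlDeviceGetPowerUsage(handle) / 1000.0)
        return readings
    finally:
        nvmlShutdown()

if __name__ == "__main__":
    print(sample_gpu_power())
```

Sampling a list of readings like this at a fixed interval is what makes it possible to plot the per-GPU power profile of a job over time.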

During the last days, Valentin and I have tested the plugin in depth and have run different tests to ensure that everything was working correctly. As an example, here below is one of the latest runs: a neural network was trained for 10 epochs on the MNIST dataset to analyse the power consumption in a parallel environment.

Job stages:
1) 0–5 s: loading the software environment (Lmod) and initializing the Singularity container;
2) 5–12 s: TensorFlow-GPU initialization and loading the data from disk to memory;
3) 12–36 s: TensorFlow training for 10 epochs;
4) 36 s: job spin-down.
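
For reference, a training script along the following lines reproduces the workload described above. It is a minimal Keras sketch, not the exact script we ran inside the container, and it assumes a standard TensorFlow-GPU installation:

```python
# Small MNIST training job: 10 epochs, comparable GPU load to the run above.
import tensorflow as tf

# Load MNIST and scale pixel values to [0, 1]
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# The 10 epochs correspond to the 12–36 s training window in the job stages
model.fit(x_train, y_train, epochs=10, validation_data=(x_test, y_test))
```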

GPU0 is the most stressed one, showing many spikes that could not have been captured at a lower sampling frequency. It must be underlined, however, that according to the NVIDIA documentation the power measurements are subject to error. Complete information about my project can be found in the report and in the open-source repo.

So the summer, unfortunately, is over. During this internship, I had the opportunity to learn a lot from experienced programmers and HPC professionals, deepen my knowledge, and discover new software and programming environments.

A nice view of Luxembourg City at sunset

Now it's time to celebrate and have a last beer with colleagues and all the people I met during this short experience. I would definitely recommend the Summer of HPC to my Computer Engineering mates. There are not many programmes that allow you to work in so many different places: the HPC centres are spread all over Europe, from west to east and from north to south, and it's a great occasion to experience a new working environment. So, apply for the next edition of the SoHPC, fun is guaranteed!

Computer Engineering graduate from the University of Padova. Passionate about coding since day 0. Spending the summer at the University of Luxembourg, developing a plugin to enhance Slurm's energy reporting capabilities.
