Energy Reporting in Slurm Jobs

Project reference: 1924
Energy consumption is one of the largest problems faced by modern supercomputing in the race to build Exaflop/s-capable systems. In the past decade, hardware design focus has shifted from obtaining the best possible performance to improving performance-per-watt, with each new hardware generation bringing better energy efficiency while delivering moderately increased performance. Hardware/software co-design of next-generation supercomputers is now seen as a necessary step towards the Exaflop/s milestone, with systems or modules purpose-built for specific applications. To enable co-design, the behavioral patterns of existing and under-development applications need to be monitored and understood, with key metrics reported both for human review and for profiling by automated systems.
This project will focus on developing a plugin for the Slurm Workload Manager, commonly used to schedule user jobs in HPC centers. The new plugin will generate reports containing energy usage, memory, I/O, and other metrics for the user jobs that request it. Slurm natively provides a generic interface for stackable plugins which can dynamically modify the job launch code in Slurm and interact with the job in the context of its prolog, epilog, or task launch [1].
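Such a plugin hooks into well-defined points of the job life cycle through Slurm's SPANK callback interface [1]. The skeleton below is a minimal sketch of that structure, not the project deliverable: the plugin name, log messages, and placeholder comments are illustrative assumptions. A real plugin would be built as a shared object and listed in `plugstack.conf`.

```c
/* Minimal SPANK plugin skeleton (illustrative; names are assumptions). */
#include <stdint.h>
#include <slurm/spank.h>

/* Registers the plugin with Slurm's plugin stack. */
SPANK_PLUGIN(energy_report, 1);

/* Called in the task context just before the user application starts. */
int slurm_spank_task_init(spank_t sp, int ac, char **av)
{
    uint32_t job_id = 0;
    spank_get_item(sp, S_JOB_ID, &job_id);
    slurm_info("energy_report: starting metric collection for job %u", job_id);
    /* e.g. record initial energy counters (RAPL/IPMI) here */
    return ESPANK_SUCCESS;
}

/* Called as the plugin stack is torn down at job end. */
int slurm_spank_exit(spank_t sp, int ac, char **av)
{
    /* e.g. read final counters, compute deltas, emit the job report */
    return ESPANK_SUCCESS;
}
```

Because the callbacks run inside Slurm, this fragment compiles only against the Slurm development headers and cannot be executed standalone.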
The plugin will extend Slurm's built-in facilities for metrics collection and reporting [2], with a special focus on interaction and integration with external energy consumption measurement frameworks and performance tools/interfaces such as LIKWID [3] and PAPI [4]. Energy usage metrics will be extracted at the node level using IPMI [5], and at a finer-grained level for Intel CPUs with RAPL [5] and for NVIDIA GPUs with NVML [6].
Metrics reporting will be generated for both human and automated systems consumption, with an emphasis on ease of use and readability.
[1] Slurm Plug-in Architecture for Node and job (K)control: https://slurm.schedmd.com/spank.html
[2] Slurm accounting data: https://slurm.schedmd.com/sacct.html
[3] Likwid Performance Tools: http://hpc.fau.de/research/tools/likwid
[4] Performance Application Programming Interface: http://icl.utk.edu/papi
[5] Slurm Energy Accounting Plugin API: https://slurm.schedmd.com/acct_gather_energy_plugins.html
[6] NVIDIA Management Library: https://developer.nvidia.com/nvidia-management-library-nvml

Macro shot of a computer chip silicon wafer. CPU and GPU chips are a supercomputer's top energy consumers.
Visual material license: free for use, no attribution required, source:
https://unsplash.com/photos/nIEHqGSymRU
Project Mentor: Valentin Plugaru
Project Co-mentor: Sebastien Varrette
Site Co-ordinator: Prof. Pascal Bouvry
Participant: Matteo Stringher
Learning Outcomes:
The student will work closely with the resource management system powering HPC systems, gaining an understanding of serial and parallel job execution, key performance indicators, metrics collection, and efficiency reporting. They will have access to historic job traces and will be able to build and test code in an HPC environment.
Student Prerequisites (compulsory):
- Background in software development and algorithms/data structures fundamentals.
- C/C++ programming experience.
Student Prerequisites (desirable):
- Advanced knowledge of C/C++.
- Experience with Slurm scheduler.
Training Materials:
- Energy accounting and control with SLURM resource and job management system: https://hal.inria.fr/hal-01237596/document
- A scheduler-level incentive mechanism for energy efficiency in HPC: https://hal.archives-ouvertes.fr/hal-01230295/document
Workplan:
- Week 1: Training week
- Week 2: Literature review and preliminary report (plan writing)
- Weeks 3–7: Project development
- Week 8: Final report write-up
Final Product Description:
The final product will be a Slurm metrics collection and reporting plugin, together with an implementation report including the output of test executions of real scientific codes run with the plugin activated.
High-quality work will be released as an open-source contribution to the community and possibly as a conference workshop article.
Adapting the Project: Increasing the Difficulty:
The developed plugin will interact with task (Slurm step) launch, wrapping around the (serial/parallel) user application for deep inspection and reporting of performance counters, and will write time-series data in HDF5 format.
Adapting the Project: Decreasing the Difficulty:
The developed plugin will use only metrics already collected by Slurm and will generate an end-of-job report in a plain-text format.
Resources:
The student will need access to standard computing resources (laptop able to run virtualization systems, internet connection) as well as an account on the Iris supercomputer of the University of Luxembourg.
Organisation:
University of Luxembourg