Energy Reporting in Slurm Jobs

Macro shot of a computer chip silicon wafer. CPU and GPU chips are among a supercomputer's top energy consumers. Visual material license: free for use, no attribution required, source: https://unsplash.com/photos/nIEHqGSymRU

Project reference: 1924

Energy consumption is one of the largest problems faced by modern supercomputing in the race to build Exaflop/s-capable systems. In the past decade, hardware design focus has shifted from obtaining the best possible performance to improving performance-per-watt, with each new hardware generation bringing better energy efficiency while delivering moderately increased performance. Hardware/software co-design of next-generation supercomputers is now seen as a necessary step to reach the Exaflop/s milestone, with systems or modules being purpose-built for specific applications. To enable co-design, the behavioral patterns of existing and under-development applications need to be monitored and understood, with key metrics reported both for human review and for profiling by automated systems.

This project will focus on developing a plugin for the Slurm Workload Manager, commonly used to schedule user jobs in HPC centers. The new plugin will generate reports containing energy usage, memory, I/O, and other metrics for the user jobs that request it. Slurm natively provides a generic interface for stackable plugins (SPANK) which may be used to dynamically modify the job launch code in Slurm and interact with the job in the context of its prolog, epilog, or task launch [1].
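As a rough illustration of how such a stackable plugin is wired into Slurm, SPANK plugins are listed in a plugstack.conf file read at job launch; the plugin file name, path, and argument below are hypothetical placeholders, not part of the project:

```
# /etc/slurm/plugstack.conf -- illustrative entry for a SPANK plugin.
# "energy_report.so" and its report_dir argument are made-up names.
# "optional" means job launch proceeds even if the plugin fails to load.
optional /usr/lib64/slurm/energy_report.so report_dir=/var/log/slurm/energy
```

The plugin itself would implement SPANK callbacks (e.g. for task initialization and exit) to hook metric collection around the job's lifecycle.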

The plugin will extend Slurm's built-in facilities for metrics collection and reporting [2], with a special focus on interaction and integration with external energy consumption measurement frameworks and performance tools/interfaces such as LIKWID [3] and PAPI [4]. Energy usage metrics will be extracted at the node level using IPMI [5], and at a finer granularity for Intel CPUs with RAPL [5] and for NVIDIA GPUs using NVML [6].
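To make the RAPL path concrete: on Linux, RAPL exposes cumulative energy counters in microjoules through the powercap sysfs interface, and the counter wraps around at a domain-specific maximum, so per-job energy must be computed as a wrap-aware delta between two readings. A minimal sketch, assuming the standard powercap sysfs layout; the helper names are ours, not Slurm's:

```python
def energy_delta_uj(start_uj, end_uj, max_range_uj):
    """Energy consumed between two cumulative RAPL readings (microjoules),
    accounting for at most one counter wraparound."""
    if end_uj >= start_uj:
        return end_uj - start_uj
    # Counter wrapped: remaining range before the wrap, plus the new count.
    return (max_range_uj - start_uj) + end_uj

def read_rapl_package_uj(package=0):
    """Read the cumulative package energy counter from the powercap sysfs
    interface (Intel CPUs). Returns None where RAPL is unavailable."""
    base = f"/sys/class/powercap/intel-rapl:{package}"
    try:
        with open(f"{base}/energy_uj") as f:
            return int(f.read())
    except OSError:
        return None
```

Sampling this counter at job start and end (or periodically) gives the per-node CPU package energy that the plugin would attribute to the job.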

Metrics reporting will be generated for both human and automated systems consumption, with an emphasis on ease of use and readability.
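One way to serve both audiences is to render the same collected metrics twice: as a readable text table for users, and as JSON for automated consumers. The field names and layout below are illustrative assumptions, not the project's actual output format:

```python
import json

def format_report(job_id, metrics):
    """Render job metrics as a human-readable text block and as a JSON
    string for automated consumers. `metrics` maps name -> (value, unit)."""
    lines = [f"Energy/usage report for job {job_id}", "-" * 40]
    for name, (value, unit) in metrics.items():
        lines.append(f"{name:<20} {value:>12} {unit}")
    text = "\n".join(lines)
    machine = json.dumps(
        {"job_id": job_id,
         "metrics": {k: {"value": v, "unit": u}
                     for k, (v, u) in metrics.items()}},
        indent=2)
    return text, machine

# Example usage with made-up numbers:
text, machine = format_report(
    1924, {"energy": (5321.0, "J"),
           "max_rss": (2048, "MiB"),
           "io_read": (1.2, "GiB")})
```

Keeping the two renderings derived from one data structure avoids the human and machine views drifting apart.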

[1] Slurm Plug-in Architecture for Node and job (K)control: https://slurm.schedmd.com/spank.html

[2] Slurm accounting data: https://slurm.schedmd.com/sacct.html

[3] Likwid Performance Tools: http://hpc.fau.de/research/tools/likwid

[4] Performance Application Programming Interface: http://icl.utk.edu/papi

[5] Slurm Energy Accounting Plugin API: https://slurm.schedmd.com/acct_gather_energy_plugins.html

[6] NVIDIA Management Library: https://developer.nvidia.com/nvidia-management-library-nvml


Project Mentor: Valentin Plugaru

Project Co-mentor: Sebastien Varrette

Site Co-ordinator: Prof. Pascal Bouvry

Participant: Matteo Stringher

Learning Outcomes:
The student will interact deeply with the resource management system powering HPC systems, and will come to understand serial and parallel job execution, key performance indicators, metrics collection, and efficiency reporting. The student will have access to historic job traces and will be able to build and test code in an HPC environment.

Student Prerequisites (compulsory):

  • Background in software development and algorithms/data structures fundamentals.
  • C/C++ programming experience.

Student Prerequisites (desirable):

  • Advanced knowledge of C/C++.
  • Experience with Slurm scheduler.

Training Materials:

Workplan:

  • Week 1: Training week
  • Week 2: Literature review and preliminary report (plan writing)
  • Weeks 3–7: Project development
  • Week 8: Final report write-up

Final Product Description:
The final product will be a Slurm metrics collection and reporting plugin, together with an implementation report including the output of test executions of real scientific codes run with the plugin activated.

High-quality work will be released as an open-source contribution to the community and possibly published as a conference workshop article.

Adapting the Project: Increasing the Difficulty:
The developed plugin will interact with task (Slurm step) launch and wrap around the (serial or parallel) user application for deep inspection and reporting of performance counters. The plugin will also write time-series data in HDF5 format.
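The time-series variant implies sampling counters at a fixed interval over the step's lifetime and flushing the series at job end. A stdlib-only sketch of such a sampling buffer follows; the class and field names are ours, and in a real plugin the buffer would be flushed to HDF5 (for instance via a library such as h5py) rather than kept in memory:

```python
import time

class MetricSampler:
    """Collect (timestamp, metrics) rows at a fixed interval.

    In the real plugin these rows would be written out as an HDF5
    time series at job end; here they are simply accumulated in memory.
    """
    def __init__(self, read_metrics, interval_s=1.0):
        self.read_metrics = read_metrics  # callable returning a dict of counters
        self.interval_s = interval_s
        self.samples = []

    def sample_once(self):
        self.samples.append((time.time(), self.read_metrics()))

    def run(self, duration_s):
        """Sample repeatedly until duration_s seconds have elapsed."""
        end = time.time() + duration_s
        while time.time() < end:
            self.sample_once()
            time.sleep(self.interval_s)
```

In practice `read_metrics` would wrap the RAPL/NVML/IPMI readers, and the interval would be configurable per job.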

Adapting the Project: Decreasing the Difficulty:
The developed plugin will use only the existing metrics collected by Slurm and generate an end-of-job report in text format.

Resources:
The student will need access to standard computing resources (laptop able to run virtualization systems, internet connection) as well as an account on the Iris supercomputer of the University of Luxembourg.

Organisation:
University of Luxembourg
