The PRACE summer school is advancing fast! In this post, I'll be sharing some of the experience I have acquired this summer. In my case, I've been asked to develop a plugin for Slurm, the most widely adopted job scheduler in HPC centres. My project aims at enhancing Slurm's energy reporting capabilities.
Slurm allows external interaction with its functions in two different ways: the Slurm plugin API or SPANK. With the plugin API, Slurm provides a different interface for each type of task the plugin is meant for (e.g. accounting storage/energy, authentication); to develop with this approach, the Slurm documentation suggests going through the source code and the plugins already implemented. SPANK, which stands for Slurm Plug-in Architecture for Node and job (K)control, provides an easier approach, with no need to touch the source code: the spank.h header lists the available function signatures, and once implemented, these functions are called by Slurm at runtime.
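To give a rough idea of the shape of such a plugin, here is a minimal sketch (the plugin name and log messages are made up; the macro and callbacks shown are the standard hooks declared in spank.h, and building it requires the Slurm development headers):

```c
/* Minimal SPANK plugin sketch (hypothetical name "energy_report").
 * Build against the Slurm dev headers, e.g.:
 *   gcc -shared -fPIC -o energy_report.so energy_report.c
 */
#include <slurm/spank.h>

/* Every SPANK plugin must declare its name and version with this macro. */
SPANK_PLUGIN(energy_report, 1);

/* Called when the plugin is loaded. */
int slurm_spank_init(spank_t sp, int ac, char **av)
{
    slurm_info("energy_report: plugin loaded");
    return ESPANK_SUCCESS;
}

/* Called on the compute node when a task exits: a natural place
 * to collect and report per-task energy data. */
int slurm_spank_task_exit(spank_t sp, int ac, char **av)
{
    slurm_info("energy_report: task exited");
    return ESPANK_SUCCESS;
}
```

Any hook you leave unimplemented is simply skipped by Slurm, so you only write the callbacks your plugin actually needs.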
To get started, we need an operational Slurm system and, above all, root permissions on it. Even if you have both available, while building a plugin it is quite easy to drain one of the nodes due to bugs in the code (unless you're a ninja developer). For that reason, working in a local environment is much better during the development phase. This is why I've been using (and I recommend) the Puppet module developed by the HPC research group at the University of Luxembourg. In the pure spirit of DevOps, it lets you easily build a cluster on your own machine and test the plugin in a safe environment, leveraging VirtualBox/VMware to provision a virtualized HPC infrastructure. Details are available in the open-source repo. This approach gives you the freedom to run 'vagrant destroy' and start over with a fresh cluster in less than an hour if things go bad.
If you want to tune your local cluster configuration and be faster in accessing the different virtualized nodes where the plugin is compiled and tested, I would suggest using my configuration for tmuxp. Loading the YAML file provided there opens a tmux session with two windows: the first has four panes, one for each entity of the cluster, connected through SSH — access (front-end node), slurm-master (management node) and node-[1,2] (computation nodes) — while the second window shows a local terminal. Initializing this tiny HPC cluster on my laptop takes up to 40 minutes the first time you launch it, and about 2 minutes afterwards.
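As a rough sketch of what such a tmuxp configuration looks like (the session name and SSH host aliases below are assumptions based on my setup, not necessarily the ones in my repo):

```yaml
session_name: slurm-dev
windows:
  - window_name: cluster
    layout: tiled
    panes:
      - ssh access        # front-end node
      - ssh slurm-master  # management node
      - ssh node-1        # computation node
      - ssh node-2        # computation node
  - window_name: local
    panes:
      - clear             # plain local terminal
```

You would then bring the whole layout up with 'tmuxp load' on that file.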
For this project, we chose to implement the plugin using the SPANK interface. Once you have fine-tuned your cluster, you can start writing your code and compile it as a shared library. Details on how to install a plugin are provided in the Slurm documentation, but be sure to repeat the process on each node where the plugin will run.
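As a sketch of the kind of steps involved (file names and paths here are illustrative; check your own Slurm installation for the actual locations):

```shell
# Compile the plugin as a shared library (requires the Slurm dev headers):
gcc -shared -fPIC -o energy_report.so energy_report.c

# Copy it to the same path on every node that will run it, then register it
# in the SPANK configuration file, typically /etc/slurm/plugstack.conf,
# with a line such as:
#   required /usr/lib64/slurm/energy_report.so
```

Marking the plugin as 'required' rather than 'optional' in plugstack.conf makes Slurm refuse to run jobs if the plugin fails to load, which is useful for catching installation mistakes early.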
Complete instructions on how to interact with Slurm and build your SPANK plugin are provided in the documentation. I hope this introduction has given you an easy way to get your own Slurm plugin started. After a few days of testing and getting to know the environment, you can start working on your idea!
Computer engineering graduate from the University of Padova. Passionate about coding since day 0. Spending the summer at the University of Luxembourg, developing a plugin to enhance Slurm's energy reporting capabilities.