Time series monitoring of HPC job queues

Project reference: 2018

The goal of the project is to create a monitoring system that captures real-time information about the number of jobs running on HPC clusters, while integrating the work with DevOps and Continuous Integration practices and tools. Information on the status of the job queues will be collected, processed and stored as time series, then summarized and made available to users and administrators as graphs.

The system will specifically monitor job queues for the HPC scheduler SLURM, and will be built around Prometheus for data collection and time series storage, and Grafana for visualization. It will require the development of a Prometheus exporter, with Go as the preferred language, although Python is also an option.

The code will be hosted on SURFsara’s GitLab server and will include automatic tests as well as code for tools such as Ansible for automatic deployment and configuration. The use of Continuous Integration practices and tools to automate the development, testing and release processes of the code will be encouraged.

The distribution of the resulting code will depend on the language of choice: a statically compiled binary for Go or a pip package for Python. Deployment using containers such as Docker or Singularity is also contemplated. These deliverables will be generated automatically by CI pipelines.

The student will be assisted in the installation and configuration of the Prometheus and Grafana services, their integration with SURFsara’s CI tools and services, and the deployment and CI setup of the code written as part of the project.

Basic project goals, meme format.

Project Mentor: Juan Luis Font Calvo

Project Co-mentor: Nikolaos Parasyris

Site Co-ordinator: Carlos Teijeiro Barjas

Participants: Cathal Corbett, Joemah Magenya

Learning Outcomes:
The student will learn about the development of monitoring software for HPC environments, supporting it with CI/CD techniques and tools.

By the end of the project, the student will have been involved in all parts of the monitoring process, from the acquisition of the metrics through their transmission and storage to their visualization.

Student Prerequisites (compulsory):

Programming and software development notions
Familiarity with GNU/Linux systems and CLI
Basic Shell/bash scripting

Student Prerequisites (desirable):

Background in engineering or computer sciences
Knowledge of Python or Go programming languages
Monitoring, preferably Prometheus and/or Grafana
Familiar with VCS tools: git, GitLab, GitHub, …
Familiar with Continuous Integration concepts and tools

Training Materials:

Go resources: https://www.golang-book.com/
Prometheus documentation: https://prometheus.io/docs/
GitLab CI/CD: https://docs.gitlab.com/ee/ci/
Prometheus exporters: https://prometheus.io/docs/instrumenting/writing_exporters

Workplan:

Weeks 1-2: training, getting accounts and setting up the development environment, analysis of project requirements
Weeks 3-7: development of the Prometheus exporter, tests and CI pipeline; configuration of an associated Grafana dashboard
Week 8: project wrap-up, documentation, report writing and submission

Final Product Description:
The expected result is a monitoring system (Prometheus + Grafana) for HPC job schedulers.

All the software written during the project will be supported by CI/CD techniques and tools (automatic testing, pipelines, automatic deployment, …).

Adapting the Project: Increasing the Difficulty:
The difficulty of the project could be increased by applying more sophisticated statistical analysis to identify trends in the job queues, as well as by adding support for other HPC schedulers such as TORQUE.
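As a rough idea of the simplest form such trend analysis could take, the sketch below fits a least-squares line through a window of queue-length samples and reports its slope. The `Sample` type, the `slope` function and the sample values are hypothetical and only serve this illustration; more sophisticated methods would be needed to handle seasonality or outliers.

```go
package main

import "fmt"

// Sample is one observation of the queue length at a given time.
type Sample struct {
	T float64 // seconds since the start of the window
	V float64 // number of pending jobs
}

// slope returns the least-squares slope of V over T, in jobs per second.
// ok is false when there are fewer than two distinct timestamps.
func slope(s []Sample) (m float64, ok bool) {
	if len(s) < 2 {
		return 0, false
	}
	n := float64(len(s))
	var sumT, sumV float64
	for _, p := range s {
		sumT += p.T
		sumV += p.V
	}
	meanT, meanV := sumT/n, sumV/n

	var num, den float64
	for _, p := range s {
		dt := p.T - meanT
		num += dt * (p.V - meanV)
		den += dt * dt
	}
	if den == 0 {
		return 0, false // all samples share the same timestamp
	}
	return num / den, true
}

func main() {
	// Made-up samples taken one minute apart.
	samples := []Sample{{0, 10}, {60, 12}, {120, 14}, {180, 16}}
	if m, ok := slope(samples); ok {
		fmt.Printf("queue trend: %.2f jobs per minute\n", m*60)
	}
}
```

A positive slope suggests the queue is growing over the window, a negative one that it is draining.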

Adapting the Project: Decreasing the Difficulty:
The difficulty can be decreased by reducing the complexity of the CI configuration (simpler pipelines and automatic testing), as well as opting for a programming language with which the student is more familiar and proficient.

Resources:

computer for development running GNU/Linux or MacOS
account on SURFsara GitLab server
access to SURFsara HPC/HTC clusters (DDP team)

The above resources can be provided by the Internal Services department (laptop and user account) and the DDP team (access to HPC/HTC clusters).

Organisation:
SURFsara B.V.
