Time series monitoring of HPC job queues
Project reference: 2018
The goal of the project is to create a monitoring system that captures real-time information about the number of jobs running on HPC clusters, integrating the work with DevOps and Continuous Integration practices and tools. Information about the status of the job queues will be collected, processed and stored as time series, then summarized and made available to users and administrators in the form of graphs.
The system will specifically monitor job queues for the HPC scheduler SLURM, and will be built around Prometheus for data collection and time series storage, and Grafana for visualization. It will require the development of a Prometheus exporter, with Go as the preferred language, although Python is also an option.
The code will be hosted on SURFsara’s GitLab server and will include automatic tests, as well as code for tools such as Ansible for automatic deployment and configuration. The use of Continuous Integration practices and tools to automate the development, testing and release processes of the code will be encouraged.
The distribution of the resulting code will depend on the language of choice: a statically compiled binary for Go or a PIP package for Python. Deployment using containers such as Docker or Singularity is also contemplated. These deliverables will be generated automatically by CI pipelines.
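For the Go route, a multi-stage Dockerfile sketch along the following lines (image tags, paths and the port are illustrative, not prescribed by the project) would produce a small image containing just the statically linked binary:

```dockerfile
# Build stage: compile a statically linked binary
FROM golang:1.11 AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /slurm_exporter .

# Final stage: minimal runtime image holding only the binary
FROM scratch
COPY --from=build /slurm_exporter /slurm_exporter
EXPOSE 9100
ENTRYPOINT ["/slurm_exporter"]
```

Note that `scratch` contains nothing but the binary; if the exporter shells out to SLURM client tools such as `squeue`, a base image providing those tools would be needed instead.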
The student will be assisted with the installation and configuration of the Prometheus and Grafana services, the integration with SURFsara’s CI tools and services, and the deployment and CI setup of the code written as part of the project.
Project Mentor: Juan Luis Font Calvo
Project Co-mentor: Nikolaos Parasyris
Site Co-ordinator: Carlos Teijeiro Barjas
The student will learn about the development of monitoring software for HPC environments, supported by CI/CD techniques and tools.
By the end of the project, the student will have been involved in every part of the monitoring process, from the acquisition of metrics to their transmission, storage and visualization.
Student Prerequisites (compulsory):
- Programming and software development notions
- Familiarity with GNU/Linux systems and CLI
- Basic Shell/bash scripting
Student Prerequisites (desirable):
- Background in engineering or computer sciences
- Knowledge of Python or Go programming languages
- Experience with monitoring tools, preferably Prometheus and/or Grafana
- Familiarity with VCS tools: git, GitLab, GitHub, …
- Familiarity with Continuous Integration concepts and tools
Resources:
- Go resources: https://www.golang-book.com/
- Prometheus documentation: https://prometheus.io/docs/
- GitLab CI/CD: https://docs.gitlab.com/ee/ci/
- Prometheus exporters: https://prometheus.io/docs/instrumenting/writing_exporters
Timeline:
- Weeks 1-2: training, getting accounts, setting up the development environment, and analysis of project requirements
- Weeks 3-7: development of the Prometheus exporter, tests and CI pipeline; configuration of an associated Grafana dashboard
- Week 8: project wrap-up, documentation, and report writing and submission
Final Product Description:
The expected result is a working monitoring system (Prometheus + Grafana) for HPC job schedulers.
All the software written during the project will be supported by CI/CD techniques and tools (automatic testing, pipelines, automatic deployment, …).
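As a sketch of what such a pipeline could look like for the Go variant (stage names, image tags and artifact paths are assumptions, not part of the project specification), a `.gitlab-ci.yml` along these lines would run the tests and build the release binary on every push:

```yaml
stages:
  - test
  - build

test:
  stage: test
  image: golang:1.11
  script:
    - go vet ./...
    - go test ./...

build:
  stage: build
  image: golang:1.11
  script:
    - CGO_ENABLED=0 go build -o slurm_exporter .
  artifacts:
    paths:
      - slurm_exporter
```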
Adapting the Project: Increasing the Difficulty:
The difficulty of the project could be increased by applying more sophisticated statistical analysis to identify trends in the job queues, as well as by including support for other HPC schedulers such as TORQUE.
Adapting the Project: Decreasing the Difficulty:
The difficulty can be decreased by reducing the complexity of the CI configuration (simpler pipelines and automatic testing), as well as opting for a programming language with which the student is more familiar and proficient.
Resources needed:
- Computer for development running GNU/Linux or macOS
- Account on SURFsara’s GitLab server
- Access to SURFsara HPC/HTC clusters (DDP team)
The above requirements can be provided by the Internal Services department (laptop and user account) and the DDP team (access to the HPC/HTC clusters).