Machine Learning for the rescheduling of SLURM jobs

Project reference: 2015

Most cluster users are not familiar with the ins and outs of the systems they work on, and such familiarity is not really necessary.

As a result, many will enqueue jobs with the maximum available run time rather than trying to provide good estimates of their jobs’ execution time. In the best case this gives the job a low priority, and its start time keeps slipping into the future.

In the worst case, resources may sit idle because priority constraints prevent the scheduler from launching shorter jobs.

Project Objectives: Use simple machine learning frameworks (e.g. TensorFlow) to estimate the execution time of a user’s jobs based on:

The user’s history
Job name
Job being part of an array.
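
As a rough illustration of the three inputs listed above, the following is a minimal sketch of a history-based baseline, not the TensorFlow model the project proposes. It keeps past run times per (user, job name, array flag) and returns a high quantile of them, scaled by a safety factor and capped at the requested limit. The class name RuntimeBaseline, its parameters, and the fallback to the user's overall history are all assumptions made for this example.

```python
import math
from collections import defaultdict


def nearest_rank_quantile(values, q):
    """Return the nearest-rank q-quantile (0 < q <= 1) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


class RuntimeBaseline:
    """Hypothetical baseline: estimate run time from a user's past jobs."""

    def __init__(self, quantile=0.95, safety_factor=1.2):
        self.quantile = quantile
        self.safety_factor = safety_factor
        self._history = defaultdict(list)

    def record(self, user, job_name, is_array, elapsed_s):
        """Store one finished job's elapsed time in seconds."""
        self._history[(user, job_name, is_array)].append(elapsed_s)
        # Also keep a per-user history as a fallback for unseen job names.
        self._history[(user, None, None)].append(elapsed_s)

    def estimate(self, user, job_name, is_array, requested_s):
        """Return an estimated limit in seconds, never above the request."""
        for key in ((user, job_name, is_array), (user, None, None)):
            runs = self._history.get(key)
            if runs:
                guess = nearest_rank_quantile(runs, self.quantile)
                return min(requested_s, math.ceil(guess * self.safety_factor))
        # No history at all: keep whatever the user asked for.
        return requested_s
```

A learned model would replace the quantile lookup inside estimate, while the cap at the requested limit would still apply.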

At first it will be sufficient to provide tighter bounds on the execution time of array jobs. As a next step, one would try to update the run-time bounds for the enqueued remainder of an array job. Finally, should the above prove fruitful, the approach will be extended to all of the user’s jobs, and an additional SLURM flag or a wrapper script should be implemented to allow a user to opt in to such automatic run-time updates. The approach will be tested on real-life problems and HPC loads.
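
The second step, tightening the limit of array tasks that are still queued, could be sketched as follows. This is an illustration under stated assumptions: the helper names and the safety factor are hypothetical, the bound is simply the longest completed task scaled up, and the code only builds the scontrol update command rather than running it. Regular users can lower a job's time limit but not raise it, so the new value is capped at the current limit.

```python
import math


def format_slurm_time(seconds):
    """Format seconds as D-HH:MM:SS, rounded up to a whole minute."""
    minutes = max(1, math.ceil(seconds / 60))
    days, rem = divmod(minutes, 24 * 60)
    hours, mins = divmod(rem, 60)
    return f"{days}-{hours:02d}:{mins:02d}:00"


def tightened_limit(completed_s, current_limit_s, safety_factor=1.5):
    """Bound for pending tasks from completed tasks' run times (seconds)."""
    if not completed_s:
        # Nothing has finished yet, so there is no evidence to tighten on.
        return current_limit_s
    bound = math.ceil(max(completed_s) * safety_factor)
    # Never exceed the existing limit: users may only decrease it.
    return min(current_limit_s, bound)


def scontrol_update_command(array_job_id, task_id, limit_s):
    """Build (but do not run) the command that updates one array task."""
    return [
        "scontrol",
        "update",
        f"JobId={array_job_id}_{task_id}",
        f"TimeLimit={format_slurm_time(limit_s)}",
    ]
```

The returned list can be passed to subprocess.run once the opt-in mechanism described above has confirmed that the user wants automatic updates.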

Project Mentor: Vassil Alexandrov

Project Co-mentor: Anton Lebedev

Site Co-ordinator: Luke Mason

Participants: Francesca Schiavello, Ömer Faruk Karadaş

Learning Outcomes:
The student will first learn a variety of advanced ML approaches and how to apply them in conjunction with workload managers such as SLURM. The developed programs will be applied to synthetic and real HPC workloads for clusters of varying size.

Student Prerequisites (compulsory):
Good knowledge of Python.

Student Prerequisites (desirable):
TensorFlow or other ML frameworks with Python interfaces.
Basics of SLURM usage.

Training Materials:
These can be tailored to the student once selected, but the SLURM user’s manual will be an integral part of the materials.

Workplan:

Week 1: Training week
Week 2: Literature review and preliminary report (plan writing)
Weeks 3–7: Project development and evaluation
Week 8: Final report write-up

Final Product Description:
The final product will be an enhanced SLURM workload manager tested on real-life problems and HPC loads.

Adapting the Project: Increasing the Difficulty:
The project is at an appropriate cognitive level, taking into account the timeframe and the need to submit a final working product and two reports.
Given sufficient progress, the students’ input on the development objectives may be included.

Adapting the Project: Decreasing the Difficulty:
The topic will be researched and the final product will be designed in full, but some features may be left undeveloped to ensure a working product with a limited feature set at the end of the project.

Resources:
The student will need access to a relevant HPC machine to test the approach, which can be GPU-based and/or multicore-based, as well as standard computing resources (laptop, internet connection).

Organisation:
Hartree Centre – STFC
