Building Resilient Machine Learning Applications (From HPC to Edge)

Project reference: 2104
Machine learning applications will dominate edge and mobile computing in the future. While most training takes place on HPC clusters and many models are deployed in the cloud, some of these applications still have to run on mobile and edge devices due to security and other software requirements.
To run these applications on low-energy hardware such as edge devices, the models are compressed significantly to reduce both their energy and memory footprint, using techniques such as pruning and quantization. While most of these applications tolerate such low-energy environments, each application requires a certain level of resilience. Achieving it requires a careful trade-off between model size and accuracy.
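As a rough illustration of the two compression techniques named above, the following pure-Python sketch prunes the smallest-magnitude weights and then applies symmetric 8-bit quantization to a toy weight list. The function names, the 50% sparsity level, and the weight values are assumptions made for this example; they are not taken from the project, which would more likely use the pruning and quantization tools of a framework such as PyTorch or TensorFlow.

```python
# Illustrative sketch only: magnitude pruning and symmetric 8-bit quantization
# on a plain list of weights. All names and values are hypothetical.

def prune_by_magnitude(weights, sparsity):
    """Zero out the given fraction of weights with the smallest magnitude."""
    n_prune = int(len(weights) * sparsity)
    # Indices sorted by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [round(w / scale) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the quantized integers."""
    return [v * scale for v in q]


weights = [0.82, -0.05, 0.31, -0.67, 0.02, 0.12, -0.90, 0.44]
pruned = prune_by_magnitude(weights, sparsity=0.5)
q, scale = quantize_int8(pruned)
restored = dequantize(q, scale)

# The rounding error per weight is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(pruned, restored))
print("pruned:   ", pruned)
print("quantized:", q)
print("max error:", max_error, "step:", scale)
```

Raising the sparsity or lowering the bit width shrinks the model further but increases the approximation error, which is the size versus accuracy trade-off described above.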
In this work, we shall train and deploy a resilient mobile application. To do this, we shall first familiarize ourselves with training machine learning models on HPC systems. We shall rely on MPI for data-parallel training to accelerate the training process.
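The idea behind data-parallel training can be sketched without a cluster. In the single-process simulation below, each simulated worker computes a gradient on its own shard of the data, and the gradients are averaged before every update. In a real MPI job that averaging step would be a collective operation across ranks (for example an allreduce, as offered by mpi4py), which is an assumption here since the project text does not name a specific library. The linear model, learning rate, and worker count are likewise hypothetical choices for illustration.

```python
# Illustrative sketch only: simulates data-parallel gradient descent in one
# process. The list of shards stands in for MPI ranks; names are hypothetical.

def local_gradient(w, b, shard):
    """Mean-squared-error gradient of y = w*x + b on one worker's data shard."""
    n = len(shard)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in shard) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in shard) / n
    return grad_w, grad_b


def allreduce_mean(grads):
    """Stand-in for an MPI allreduce: average each component over workers."""
    n = len(grads)
    return tuple(sum(g[i] for g in grads) / n for i in range(len(grads[0])))


def train(data, n_workers=4, lr=0.05, steps=2000):
    # Split the data into equal-sized shards, one per simulated worker.
    shards = [data[r::n_workers] for r in range(n_workers)]
    w, b = 0.0, 0.0  # every worker starts from the same parameters
    for _ in range(steps):
        grads = [local_gradient(w, b, s) for s in shards]
        grad_w, grad_b = allreduce_mean(grads)
        # All workers apply the same averaged gradient, so they stay in sync.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Synthetic data generated from y = 3x + 1.
data = [(i / 8, 3 * (i / 8) + 1) for i in range(16)]
w, b = train(data)
print("learned w, b:", w, b)
```

Because the shards have equal size, the averaged gradient equals the full-batch gradient, so the simulated workers reach the same parameters as single-node training while each one processes only a quarter of the data per step.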
After training, we shall build a small Android application and convert the trained model for use in it. We shall then follow best practices to optimise the mobile application for both energy efficiency and resilience.
Project Mentor: Leonardo Bautista Gomez
Project Co-mentor: Albert Njoroge Kahira
Site Co-ordinator: Maria-Ribera Sancho and Carolina Olmopenate
Participants: Mehmet Enes Erciyes, Jakub Raczyński
Learning Outcomes:
The students will learn the foundations of Deep Learning and training Deep Learning Models in HPC clusters. They will also learn how to create Machine Learning Mobile Applications.
Student Prerequisites (compulsory):
Proficient with Python
Proficient with Linux
Student Prerequisites (desirable):
Familiarity with TensorFlow or PyTorch is an added advantage.
Experience with, or familiarity in, building Android applications.
Training Materials:
https://mpitutorial.com/tutorials/
https://pytorch.org/tutorials/
https://arxiv.org/pdf/2012.00825.pdf
Workplan:
1st Week: Training Week
2nd Week: Getting familiar with MareNostrum
3rd Week: Fundamentals of Deep Learning (CNN training and inference)
4th Week: Distributed Machine Learning Training
5th Week: Pruning trained models
6th Week: Quantisation of trained models
7th Week: Android Application for Deep Learning
8th Week: Final Report and wrap up
Final Product Description:
1) A tool for training Deep Learning applications in HPC systems
2) A mobile DL application that is resilient to errors
Adapting the Project: Increasing the Difficulty:
A web app can be built and hosted in the cloud to supplement the mobile application.
Adapting the Project: Decreasing the Difficulty:
We can remove the mobile application part and focus solely on training Machine Learning models in HPC clusters.
Resources:
Students will have access to the MareNostrum supercomputer and, specifically, to its GPU-based cluster called Power9.
Python will be the primary programming language, and all other software required for the project will be installed on MareNostrum.
*Online only in any case
Organisation:
BSC – Barcelona Supercomputing Center