Building Resilient Machine Learning Applications (From HPC to Edge)

Project reference: 2104

Machine learning applications are expected to dominate edge and mobile computing in the future. While most training takes place on HPC clusters and many models are deployed in the cloud, some applications still have to run on mobile and edge devices because of security and other software requirements.

To run on low-energy hardware such as edge devices, these models are compressed significantly to reduce both their energy and memory footprint, using techniques such as pruning and quantization. While most of these applications tolerate such low-energy environments, the level of resilience required depends on the application. Achieving it requires a careful trade-off between model size and accuracy.
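As a rough illustration of the two compression techniques named above, the following sketch applies unstructured magnitude pruning and symmetric 8-bit quantization to a plain list of weights. The function names, the example weights, and the 50% sparsity level are assumptions chosen for illustration; a real project would most likely use the utilities of a framework such as PyTorch or TensorFlow rather than hand-written code.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights (unstructured pruning)."""
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def quantize_int8(weights):
    """Symmetric linear quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map quantized integers back to approximate float values."""
    return [v * scale for v in q]


# Hypothetical weights, used only to show the effect of each step.
weights = [0.8, -0.05, 0.3, -1.27, 0.02, 0.6]
pruned = magnitude_prune(weights, sparsity=0.5)
q, scale = quantize_int8(pruned)
restored = dequantize(q, scale)

# Rounding error introduced by quantization; bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(pruned, restored))
```

Pruning reduces the number of non-zero parameters, while quantization reduces the number of bits stored per parameter. Both lose information, which is why the size of the model has to be weighed against its accuracy.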

In this work, we shall train and deploy a resilient mobile application. To do this, we shall first familiarize ourselves with training machine learning models on HPC systems. We shall rely on MPI for data-parallel training to accelerate the training process.
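The idea behind data-parallel training can be sketched without a cluster. In the example below, each "worker" stands in for an MPI rank: it holds a copy of the model, computes a gradient on its own shard of the data, and the gradients are averaged so that every copy applies the same update. The averaging function is a single-process stand-in for a collective such as MPI_Allreduce; the toy model (y = w * x), the data, and all function names are assumptions made for illustration, not the project's actual code.

```python
def local_gradient(w, shard):
    """Gradient of the mean squared error of the model y = w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def allreduce_mean(values):
    """Stand-in for an MPI all-reduce (sum) followed by division by the worker count."""
    return sum(values) / len(values)


def train_data_parallel(data, n_workers, lr=0.05, steps=200):
    # Each simulated worker (an MPI rank in a real run) gets an equal slice of the data.
    shards = [data[r::n_workers] for r in range(n_workers)]
    w = 0.0  # every worker starts from the same model replica
    for _ in range(steps):
        # In a real run these gradients are computed concurrently, one per rank.
        grads = [local_gradient(w, shard) for shard in shards]
        # Averaging keeps all replicas identical after the update.
        w -= lr * allreduce_mean(grads)
    return w


# Toy dataset generated from y = 3 * x, so training should recover w close to 3.
data = [(x, 3.0 * x) for x in [1.0, 2.0, 3.0, 4.0]]
w_parallel = train_data_parallel(data, n_workers=2)
w_single = train_data_parallel(data, n_workers=1)
```

With equally sized shards, the averaged gradient equals the gradient over the full batch, so two workers reach the same result as one. The speed-up in a real run comes from computing the per-shard gradients at the same time on different nodes.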

After training, we shall build a small Android application and convert the trained model for deployment in it. We shall then follow best practices to optimise the mobile application for both energy efficiency and resilience.

Project Mentor: Leonardo Bautista Gomez

Project Co-mentor: Albert Njoroge Kahira

Site Co-ordinator: Maria-Ribera Sancho and Carolina Olmopenate

Participants: Mehmet Enes Erciyes, Jakub Raczyński

Learning Outcomes:
The students will learn the foundations of deep learning and how to train deep learning models on HPC clusters. They will also learn how to create machine learning mobile applications.

Student Prerequisites (compulsory):
Proficient with Python
Proficient with Linux

Student Prerequisites (desirable):
Familiarity with TensorFlow or PyTorch is an added advantage.
Experience or familiarity with building Android applications.

Training Materials:
https://mpitutorial.com/tutorials/
https://pytorch.org/tutorials/
https://arxiv.org/pdf/2012.00825.pdf

Workplan:
1st Week: Training Week
2nd Week: Getting familiar with MareNostrum
3rd Week: Fundamentals of Deep Learning (CNN training and inference)
4th Week: Distributed Machine Learning Training
5th Week: Pruning trained models
6th Week: Quantisation of trained models
7th Week: Android Application for Deep Learning
8th Week: Final Report and wrap up

Final Product Description:
1) A tool for training Deep Learning applications in HPC systems
2) A mobile DL application that is resilient to errors

Adapting the Project: Increasing the Difficulty:
A web app can be built and hosted in the cloud to supplement the mobile application.

Adapting the Project: Decreasing the Difficulty:
We can remove the mobile application part and focus solely on training Machine Learning models in HPC clusters.

Resources:
Students will have access to the MareNostrum supercomputer and, specifically, to its GPU-based cluster called Power9.
Python will be the primary programming language, and all other software required for the project will be installed on MareNostrum.

*Online only in any case

Organisation:
BSC – Barcelona Supercomputing Center
