Performance analysis of Distributed and Scalable Deep Learning

Brain-shaped representation of an artificial neural network, showing interlinked neurons depicting distributed computing elements. Visual material license: free for use, no attribution required, source:

Project reference: 1925

With renewed global interest in Artificial Intelligence (AI) methods, the past decade has seen a myriad of new programming models and tools that enable better and faster Machine Learning (ML). ML is, in a nutshell, “the science of getting computers to act without being explicitly programmed” and is permeating an ever-expanding range of human activities. Deep Learning (DL) is an ML technique used in cars with self-driving features such as the Tesla Autopilot and Mercedes Drive Pilot. DL is at the core of state-of-the-art voice recognition systems, which enable easy control over, e.g., Internet-of-Things (IoT) smart home appliances, such as the Google Home and Amazon Echo assistants. Online recommendation engines that suggest which products, movies, etc. you may be interested in are also based on DL.

DL uses multi-layered (deep) artificial neural networks (ANNs) that automatically learn features and representations from (raw) data, and can deliver higher predictive accuracy when trained on larger datasets. The ecosystem of DL frameworks is evolving quickly, as are the DL architectures that have been shown to perform well on specialized tasks. For image classification, several recent architectures (models) are in common use.

Training these models is a computationally intensive task, especially on well-known large datasets such as ImageNet and OpenImages v4. To accelerate the training process, many high-profile DL frameworks use GPU accelerators, and distributed training is also becoming common.

This project will focus on evaluating and comparing the following currently available reference frameworks:

  • TensorFlow, an open-source software library from Google for numerical computation using data flow graphs, which is close to the way the Deep Learning book describes neural networks. TensorFlow was originally developed by researchers and engineers working on the Google Brain Team within Google’s Machine Intelligence research organization.
  • PyTorch, a Python version of Torch, which currently implements distributed learning through the torch.distributed library, with support for Gloo, NCCL, and MPI.
  • MXNet, an Apache library with a large set of language interfaces (C++, Python, Julia, Matlab, JavaScript, Go, R, Scala, Perl) and a focus on speed and scalability.
  • Keras, a high-level neural networks API, written in Python and capable of running on top of TensorFlow.
  • Horovod, a distributed training framework for TensorFlow, Keras, and PyTorch from Uber Engineering.
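To illustrate what the distributed back ends mentioned in the list imply in practice, here is a minimal sketch of how a training process launched by Slurm might work out its rank, the total number of workers, and which back end to request. The function name `distributed_settings` and its dictionary layout are hypothetical; `SLURM_PROCID`, `SLURM_NTASKS` and `SLURM_LOCALID` are environment variables that Slurm sets for each task. The rule "NCCL on GPUs, Gloo on CPUs" is a common convention assumed here, not a project requirement.

```python
import os


def distributed_settings(env=None, use_gpu=True):
    """Derive rank, world size and back end name from Slurm task variables.

    `env` is any mapping of environment variables; it defaults to os.environ.
    Missing variables fall back to a single-process configuration.
    """
    env = os.environ if env is None else env
    return {
        "rank": int(env.get("SLURM_PROCID", 0)),        # global task index
        "world_size": int(env.get("SLURM_NTASKS", 1)),  # total number of tasks
        "local_rank": int(env.get("SLURM_LOCALID", 0)), # task index on this node
        # Assumption: NCCL for NVIDIA GPUs, Gloo as the CPU fallback.
        "backend": "nccl" if use_gpu else "gloo",
    }


# Example with a hand-written environment instead of a real Slurm job.
settings = distributed_settings(
    {"SLURM_PROCID": "3", "SLURM_NTASKS": "8", "SLURM_LOCALID": "1"},
    use_gpu=True,
)
print(settings)
```

In a PyTorch script, values like these could then be passed to `torch.distributed.init_process_group` as its `backend`, `rank` and `world_size` arguments; the other frameworks have their own initialisation calls.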

The purpose of the project is to compare the efficiency of different implementations across several hardware architectures (CPU and GPU), taking into account a distributed configuration. Special care will need to be taken to perform the experiments in close-to-ideal conditions so that the results are not influenced by outside factors (other users’ I/O activity or GPU utilization).
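One common way to quantify such a comparison is to measure training throughput at several worker counts and derive speedup and parallel efficiency from it. The sketch below assumes throughput is reported in images per second and uses the smallest measured worker count as the baseline; the function name `scaling_report` and the sample numbers are made up for illustration and are not results from this project.

```python
def scaling_report(throughput):
    """Compute speedup and parallel efficiency per worker count.

    `throughput` maps a worker count (e.g. number of GPUs) to the measured
    throughput in images/second. The smallest worker count is the baseline.
    Returns {workers: (speedup, efficiency)}, rounded to two decimals.
    """
    base_workers = min(throughput)
    base_rate = throughput[base_workers]
    report = {}
    for workers in sorted(throughput):
        speedup = throughput[workers] / base_rate
        # Ideal speedup equals the factor by which the worker count grew.
        efficiency = speedup / (workers / base_workers)
        report[workers] = (round(speedup, 2), round(efficiency, 2))
    return report


# Hypothetical measurements, for illustration only.
measured = {1: 100.0, 2: 190.0, 4: 360.0, 8: 640.0}
report = scaling_report(measured)
for workers, (speedup, efficiency) in report.items():
    print(f"{workers} workers: speedup {speedup}, efficiency {efficiency}")
```

An efficiency close to 1.0 means throughput grows almost linearly with the number of workers; a falling value points to communication or I/O overheads.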

As part of the project, a wrapper will need to be designed to facilitate the integration and usage of these Deep Learning frameworks through the Slurm Workload Manager used on HPC facilities featuring hybrid architectures with NVIDIA V100 GPU accelerators.
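As a rough idea of what the core of such a wrapper could look like, the sketch below generates a Slurm batch script for a given job. It is only an assumption about one possible design: the function name `make_sbatch_script`, its parameters, the default time limit and the `train.py` command are hypothetical, while the `#SBATCH` options used (`--job-name`, `--nodes`, `--ntasks-per-node`, `--gres`, `--time`, `--exclusive`) are standard `sbatch` options. Requesting nodes exclusively is one way to limit interference from other users' jobs.

```python
def make_sbatch_script(job_name, nodes, gpus_per_node, command,
                       time_limit="01:00:00"):
    """Return the text of a Slurm batch script that runs `command`.

    One task is requested per GPU, which matches the usual
    one-process-per-GPU layout of data-parallel training.
    """
    lines = [
        "#!/bin/bash -l",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks-per-node={gpus_per_node}",
        f"#SBATCH --gres=gpu:{gpus_per_node}",
        f"#SBATCH --time={time_limit}",
        "#SBATCH --exclusive",  # do not share the nodes with other jobs
        "",
        f"srun {command}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical job: 2 nodes with 4 GPUs each, running a placeholder script.
script = make_sbatch_script("dl-bench", nodes=2, gpus_per_node=4,
                            command="python train.py")
print(script)
```

The generated text can be written to a file and submitted with `sbatch`; a fuller wrapper would also need to load the environment (modules, virtual environment) required by each framework.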


Project Mentor: Sebastien Varrette, PhD

Project Co-mentor: Valentin Plugaru

Site Co-ordinator: Prof. Pascal Bouvry

Participant: Sean Mahon

Learning Outcomes:
The student will learn to design and analyse the scalability of parallel and distributed Deep Learning methods using a set of reference frameworks.

The student will learn how to implement these methods on modern computer architectures with the latest NVIDIA V100 GPU accelerators.

Student Prerequisites (compulsory):
Knowledge of parallel algorithm design, CUDA, and Python.

Student Prerequisites (desirable):
Advanced knowledge of Python. Basic knowledge of Deep Learning frameworks such as Keras, and more generally of HPC tools. Experience with a Slurm-based scheduler.

Training Materials:


  • Week 1: Training week
  • Week 2: Literature Review, Preliminary Report (Plan writing)
  • Weeks 3–7: Project Development
  • Week 8: Final Report write-up

Final Product Description:
The final product will be a performance analysis report as well as an implementation wrapper facilitating the integration and usage of the different frameworks on top of Slurm-based HPC facilities featuring hybrid architectures with NVIDIA V100 GPU accelerators. Ideally, we would like to publish the results in a paper at a conference or workshop.

Adapting the Project: Increasing the Difficulty:
Student contribution to the new Deep500 effort for HPC Deep learning benchmarking.

Adapting the Project: Decreasing the Difficulty:
We could restrict the number of frameworks analysed, since the project covers a large, representative set of Deep Learning frameworks.

The student will need access to a machine with NVIDIA V100 GPU accelerators, standard computing resources (laptop, internet connection) and, if needed, an account on the Iris supercomputer of the University of Luxembourg.

University of Luxembourg
