Hybrid-parallel Convolutional Neural Network training
Project reference: 1820
There is a lot of research on scaling out convolutional neural networks to a large number of compute nodes, as the computational requirements of training complex networks on large-scale datasets become prohibitive. However, most if not all of this work employs data-parallel training techniques, where a batch of image samples is evenly split across the workers. Each worker then independently processes its share of the batch in the forward propagation stage, and gradients are communicated among workers during the backward propagation pass. Although this technique has proved quite successful, it has its drawbacks. One of the most important is that scaling out to a very large number of nodes implies increasing the batch size, which makes the SGD optimization more difficult. The second drawback is that data-parallel training works only if the model fits in the memory of a single worker. Model parallelism removes this restriction, but it involves much more communication, particularly in the forward propagation pass. As part of our work as an Intel Parallel Computing Center, we have done considerable research on deep neural network training, and particularly on data-parallel scaling, presented here: https://arxiv.org/abs/1711.04291.
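The data-parallel scheme described above can be illustrated with a minimal, single-process sketch. It is not the Caffe/MLSL implementation: the toy one-parameter model, the function names and the batch values are all assumptions made for illustration, and the averaging step only stands in for the gradient exchange (an allreduce) that real workers would perform.

```python
def local_gradient(w, samples):
    """Mean gradient of the loss 0.5 * (w * x - y)**2 over one shard."""
    return sum((w * x - y) * x for x, y in samples) / len(samples)


def split_evenly(batch, num_workers):
    """Split a batch into equally sized contiguous shards, one per worker."""
    if len(batch) % num_workers != 0:
        raise ValueError("batch size must be divisible by the number of workers")
    shard = len(batch) // num_workers
    return [batch[i * shard:(i + 1) * shard] for i in range(num_workers)]


def data_parallel_gradient(w, batch, num_workers):
    """Each (simulated) worker computes a gradient on its own shard;
    averaging the results stands in for the allreduce among workers."""
    shards = split_evenly(batch, num_workers)
    grads = [local_gradient(w, shard) for shard in shards]
    return sum(grads) / num_workers


def global_batch_size(per_worker_batch, num_workers):
    """With a fixed per-worker batch, the global batch grows with the worker count."""
    return per_worker_batch * num_workers


# Hypothetical toy data: y = 2x + 1 sampled at x = 0..7.
batch = [(float(x), 2.0 * x + 1.0) for x in range(8)]
g_full = local_gradient(0.5, batch)            # single worker, whole batch
g_dp = data_parallel_gradient(0.5, batch, 4)   # four workers, two samples each
```

With equal shards, the averaged gradient matches the full-batch gradient, which is why adding workers at a fixed per-worker batch is equivalent to training with a larger batch. `global_batch_size` makes that growth explicit.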
For our data-parallel research, we used Intel's fork of the Caffe framework in combination with the Intel Machine Learning Scaling Library (MLSL), and managed to scale the training of a single neural network up to 1536 Knights Landing compute nodes. We propose to make use of the same software (and hardware) infrastructure, as it allows for model parallelism as well. We envision the hybrid model as follows: in a multi-socket system, or in a Knights Landing system configured with sub-NUMA clustering, each separate NUMA domain will work on training part of the model, thus employing model parallelism within the compute node. Across compute nodes, we will apply our data-parallel approach, as the interconnect (InfiniBand, OPA) usually does not support the communication requirements of model parallelism. This has the potential to lower the total batch size, while also increasing the throughput achieved per node.
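One way to picture the hybrid layout is as a mapping from worker ranks to (node, NUMA domain) pairs. The sketch below assumes one worker per NUMA domain and ranks numbered node by node. It does not use the MLSL API, and all names in it are hypothetical. It also assumes that one model replica (a whole node in the hybrid case, a single domain in the pure data-parallel case) processes the same per-replica batch.

```python
from collections import namedtuple

Placement = namedtuple("Placement", ["node", "domain"])


def placement(rank, domains_per_node):
    """Map a global rank to its compute node and NUMA domain."""
    return Placement(node=rank // domains_per_node, domain=rank % domains_per_node)


def model_parallel_group(rank, domains_per_node):
    """Ranks on the same node: together they hold one copy of the model."""
    first = (rank // domains_per_node) * domains_per_node
    return list(range(first, first + domains_per_node))


def data_parallel_group(rank, num_nodes, domains_per_node):
    """Ranks holding the same model partition on every node: they exchange
    gradients for that partition across the interconnect."""
    domain = rank % domains_per_node
    return [node * domains_per_node + domain for node in range(num_nodes)]


def global_batch(per_replica_batch, num_nodes, domains_per_node, hybrid):
    """Global batch = per-replica batch times the number of model replicas.
    Hybrid: one replica per node. Pure data-parallel: one replica per domain."""
    replicas = num_nodes if hybrid else num_nodes * domains_per_node
    return per_replica_batch * replicas


# Hypothetical cluster: 4 nodes with 2 NUMA domains each, 8 ranks in total.
where = placement(5, 2)
intra_node = model_parallel_group(5, 2)
inter_node = data_parallel_group(5, 4, 2)
pure_dp_batch = global_batch(32, 4, 2, hybrid=False)
hybrid_batch = global_batch(32, 4, 2, hybrid=True)
```

Under these assumptions the hybrid layout has one model replica per node rather than one per NUMA domain, so the global batch shrinks by a factor equal to the number of domains per node. In an MPI setting, the two groups would typically correspond to two communicators obtained by splitting the world communicator.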
The student is expected to make use of the functionality already present in MLSL for model parallelism, and to evaluate several schemes of the hybrid approach (model-parallel within node, data-parallel across nodes). Profiling will be necessary in order to maximize the intra-node bandwidth. The techniques will be tested on both Intel Skylake clusters and Intel Knights Landing clusters.
Project Mentor: Valeriu Codreanu
Project Co-mentor: Damian Podareanu
Site Co-ordinator: Zheng Meyer-Zhao
The student will learn how to perform large-scale neural network training, how to balance the trade-offs between data-parallel and model-parallel training, and how to profile and optimize code running on state-of-the-art hardware platforms such as the Intel Skylake and Knights Landing architectures.
Student Prerequisites (compulsory):
- Basic Knowledge of Machine Learning, particularly Convolutional Neural Networks
- Knowledge of C++/MPI
Student Prerequisites (desirable):
- Some skill in developing mixed code such as MPI/OpenMP will be an advantage.
- Some general materials on C/C++ and Python from the web should be read to be up to date prior to arrival.
- Machine-learning wise, a brief read through this book would be great: http://www.deeplearningbook.org/
- Week 1: Training week
- Week 2: Literature Review Preliminary Report (Plan writing)
- Week 3 – 7: Project Development
- Week 8: Final Report write-up
Adapting the Project: Increasing the Difficulty:
In order to increase the project difficulty, one can think of adapting this hybrid parallelism approach to clusters of multi-GPU servers.
Adapting the Project: Decreasing the Difficulty
The topic will be researched and the final product will be designed in full, but some of the features may not be developed, to ensure a working product with limited features at the end of the project (e.g. excluding the Knights Landing architecture).
The student will need access to a cluster with Intel Skylake and Intel Knights Landing systems (provided by us), standard computing resources (laptop) as well as an account on the Cartesius supercomputer (provided by us).