How to know if your Deep Learning Algorithm actually learned anything

Hello from Luxembourg. If you read my previous post, you may be aware that this is where I’ll be spending my summer as part of the PRACE Summer of HPC programme. The official title for the project I’ll be doing here (at the University of Luxembourg, to be precise) is “Performance analysis of Distributed and Scalable Deep Learning,” and I’ll spend this blog post trying to explain what that actually involves. I’m not going to start by explaining what Deep Learning is, though: plenty of people who know far more about it than I do, such as the author of this article, have already done an excellent job of that, and I don’t want this post to be crazily long.
The Scalable and Distributed part, however, is slightly shorter to explain. Many common Deep Learning models contain hundreds of thousands, if not millions, of parameters which somehow have to be ‘learned’, so it is becoming more common to spread this task over a number of processors. A simple and widely used example of this is splitting the ‘batch’ of data points needed for each training step across several processors, then averaging the relevant results over all processors to work out the change needed in each trainable parameter. That last part can cause problems, however: as in every other area of HPC, synchronization is expensive and should be avoided if possible. To make things more complicated, a lot of deep learning calculations are very well suited to running on GPUs rather than CPUs, which may add further layers of communication between different devices to the problem.
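To make that pattern concrete, here is a toy sketch of the data-parallel idea just described. It uses NumPy and mpi4py rather than any real deep learning framework (my choice purely for illustration): each process computes the gradient for its own shard of the batch, the gradients are averaged with an allreduce, and every process applies the same update.

```python
# Toy illustration of data-parallel training: each MPI rank computes the
# gradient for its own shard of the batch, the gradients are averaged with an
# allreduce, and every rank applies the same update. (NumPy + mpi4py stand in
# for a real deep learning framework, purely to show the pattern.)
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

rng = np.random.default_rng(seed=rank)   # each rank draws its own data shard
w = np.zeros(10)                         # model parameters, identical on all ranks
lr = 0.1

for step in range(100):
    # Local shard of the global batch: 32 data points per rank.
    X = rng.standard_normal((32, 10))
    y = X @ np.arange(10.0) + 0.1 * rng.standard_normal(32)

    # Local gradient of the mean squared error on this shard.
    grad_local = 2.0 * X.T @ (X @ w - y) / len(y)

    # Average the gradients over all ranks -- this is the synchronization
    # point that becomes expensive as the number of processors grows.
    grad_global = np.empty_like(grad_local)
    comm.Allreduce(grad_local, grad_global, op=MPI.SUM)
    grad_global /= size

    # Every rank applies the same update, so the parameters stay in sync.
    w -= lr * grad_global

if rank == 0:
    print("final weights:", np.round(w, 2))
```

Run under MPI (e.g. `mpirun -np 4 python data_parallel_sgd.py`), the costly part is that Allreduce: at every step each process stops and exchanges a gradient the size of the whole model, which is exactly the synchronization overhead mentioned above.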
Despite being a fairly computationally expensive task, Deep Learning seems to be quite a popular way of solving problems these days. As a result, several organisations have come up with and published their own, allegedly highly optimized, frameworks to allow programmers to use these techniques with minimal effort. Examples include TensorFlow, Keras, MXNet, PyTorch and Horovod (see logos above). As these libraries all use their own clever tricks to make code run faster, it would be nice to have a way of working out which is the most suitable for your needs, especially if those needs involve lots of time on an expensive supercomputer.

This takes us to the surprisingly complicated world of Deep Learning Benchmarking. It’s not entirely obvious how to evaluate how efficiently a deep learning program is working. If you want your code to run on a large number of processors, the sensible thing to do is to tweak some hyperparameters so that the vast numbers of parameters mentioned in the previous paragraph don’t have to be sent between devices very often. However, while this can make the average walltime per data point scale very well, there’s no guarantee the rate at which the model ‘learns’ will improve as quickly. As a result, multiple organisations have come up with a variety of ways of benchmarking these algorithms. These include MLPerf, which publishes “time-to-accuracy” results for some very specific tasks on a variety of types of hardware; DeepBench, which evaluates the accuracy and speed of common basic functions used in Deep Learning across many frameworks; and Deep500, which can give a wide range of metrics for a variety of tasks, including user-defined ones, for PyTorch and TensorFlow. There are even plans to expand some of these to cover how efficiently the programs in question use resources like CPU/GPU power, memory and energy.
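To show why throughput and learning speed can disagree, here is a rough, framework-agnostic sketch of the two kinds of number such a benchmark might record: raw throughput (data points per second) and an MLPerf-style “time to accuracy”. The helpers train_one_epoch and evaluate_accuracy are hypothetical placeholders for whichever framework is actually being measured.

```python
# Sketch of the two metrics discussed above: raw throughput (samples per
# second) versus "time to accuracy" (wall time until a target validation
# accuracy is reached). train_one_epoch() and evaluate_accuracy() are
# hypothetical placeholders for the framework being benchmarked.
import time

def benchmark(model, data, target_accuracy=0.9, max_epochs=50):
    start = time.perf_counter()
    samples_seen = 0

    for epoch in range(max_epochs):
        samples_seen += train_one_epoch(model, data.train)     # hypothetical
        accuracy = evaluate_accuracy(model, data.validation)   # hypothetical
        elapsed = time.perf_counter() - start

        # Throughput can look great even when accuracy is barely moving,
        # which is why the two numbers have to be reported together.
        print(f"epoch {epoch}: {samples_seen / elapsed:.0f} samples/s, "
              f"accuracy {accuracy:.3f}")

        if accuracy >= target_accuracy:
            return elapsed      # the MLPerf-style "time to accuracy"

    return None                 # target never reached within the budget
```

A configuration that maximises the first number (say, enormous batches chosen to avoid communication) can easily do worse on the second, which is exactly why samples per second alone isn’t a trustworthy benchmark.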
My project for the summer is to set up a program which will allow me to run experiments comparing the efficiency of different frameworks on the University’s Iris cluster (see picture), and to help cluster users choose the most appropriate setup for their task. Ideally, the final product will allow you to submit a basic single-device version of a model in one of a few reference frameworks, which would then be trained in a distributed manner for a short time, with some metrics of the kind described above being output at the end. So far, I’m in the final stages of getting a basic version up and running which can profile runs of image classification problems in TensorFlow, distributed using Horovod. The next few weeks, assuming nothing goes horribly wrong, will be spent adding support for more frameworks, collecting more complicated metrics, running experiments on sample networks and datasets, and making everything a bit more user friendly. Hopefully, the final version will help users ensure that an appropriate framework and architecture is being used, and identify any performance bottlenecks, before submitting massive individual jobs.
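For a flavour of what the “distributed for a short time” step involves, below is a minimal sketch (not the actual tool) of how a plain single-device Keras model can be handed to Horovod. The model and dataset are stand-ins, a tiny dense network on MNIST, chosen only because they ship with TensorFlow.

```python
# Minimal sketch: take a plain single-device Keras model and train it across
# several processes with Horovod. The model and dataset are stand-ins.
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Pin each process to one GPU, if any GPUs are available.
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

# A stand-in single-device model: a tiny image classifier.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Scale the learning rate with the number of workers and wrap the optimizer
# so that gradients are averaged across all processes at each step.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(optimizer=opt, loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# In a real run each worker would read its own shard of the data; here every
# worker simply loads the full MNIST training set.
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train / 255.0

model.fit(
    x_train, y_train,
    batch_size=64,
    epochs=3,
    # Make sure every worker starts from the same initial weights.
    callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
    # Only one worker needs to print progress.
    verbose=2 if hvd.rank() == 0 else 0,
)
```

Launched with `horovodrun -np 4 python train.py` (or the cluster’s own MPI launcher), each process trains while Horovod averages the gradients behind the scenes; this is roughly the setup my profiling runs wrap around.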

I’m doing this work from a town called Belval, near the border with France. While the university campus where I work is the main place of interest here, it’s only a short train ride from Luxembourg City, with various towns in nearby France, Germany and Belgium looking like good potential day trips. From my visits so far, the most notable feature of the city centre is the relatively large number of steep hills and even cliffs (one of which requires a lift to get down). This makes walking around the place a bit slow, but at least it means I’m getting some exercise. The one low point of my time here was when the heat got a bit excessive in the last week. Irish people are not meant to have to endure anything over 35°C, and I’m not unhappy that I probably won’t have to again this summer. However, there was a great sense of camaraderie in the office, where the various fans and other elaborate damage-limitation mechanisms people had set up were struggling to cope. I imagine it would have been a lot less tolerable if the people around me weren’t so friendly, helpful and welcoming.