Trying to fit huge model on a small device

[Figure: Simplified schema of a shallow neural network]

Hello again! I am halfway through my Summer of HPC adventure and wanted to update you on what I’ve been doing during the last few weeks. If you read my introductory post (if you didn’t, I encourage you to do so), you already know I am working on a project focused on deploying machine learning applications on edge devices.

What are edge devices?

There are multiple different (but similar) definitions of an edge device, but the one I am using is that it is a device physically located close to the end user. It can be a laptop, a tablet, a smartphone, a single-board computer like an Arduino, or even a simple sensor like a hygrometer that is capable of saving and sending measured data. In our project, we aim to deploy a machine learning application on an average smartphone.

What are deep neural networks?

The machine learning model we chose to deploy is a deep neural network (DNN). A neural network is a model loosely inspired by the human brain. It consists of multiple layers of nodes (called neurons) which are connected with each other. The adjective “deep” means that the network consists of multiple layers (usually five or more). Each neuron has weights, which are trainable parameters representing the knowledge that the neuron gained from the data during the training process. The picture above represents a simplified schema of a shallow neural network; a schema of a DNN wouldn’t look much different, it would just have more hidden layers.
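To make the layers-of-neurons idea concrete, here is a minimal sketch of a forward pass through a shallow network like the one in the schema above. The layer sizes, the random weights and the ReLU activation are illustrative assumptions, not the project’s actual model:

```python
import numpy as np

# Hypothetical shapes: 4 input features, one hidden layer of 8 neurons, 3 outputs.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)   # hidden-layer weights and biases
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)   # output-layer weights and biases

def forward(x):
    h = np.maximum(0, x @ W1 + b1)  # each hidden neuron: weighted sum + ReLU
    return h @ W2 + b2              # linear output layer

x = rng.normal(size=4)              # one input sample
print(forward(x).shape)             # (3,)
```

A deep network would simply chain more such weight matrices between the input and the output.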

Deep neural networks are nowadays the most popular and usually the best-performing machine learning models. They aren’t ideal, though. Some of their drawbacks are lack of interpretability, very long training time, long inference time, and the often huge size of the files containing a pre-trained network. The latter two are the biggest problems when it comes to deploying neural networks on edge devices.

Why do we want to deploy machine learning apps on edge devices?

Theoretically, we could leave pre-trained models on HPC clusters or cloud servers, but in practice that is often not a good choice. For example, if we want to pass sensitive data to a model deployed in the cloud, we have to send it over the internet, which always raises security concerns and the risk of a sensitive data leak. Moreover, storing the model on the edge device lets us use it even when no internet connection is available. And even when a connection is available, local inference avoids the latency that is an inherent part of communicating with a remote server.

So, what can we do to deploy DNN on the edge?

There are multiple techniques for compressing and optimizing DNNs for inference. While during the first three weeks I focused on the basics of training DNNs and an introduction to GPU and distributed training, lately I have been investigating one of the most popular ways to both compress a model and make its inference faster: pruning. In its simplest form, model pruning sets a specified fraction of the DNN weights with the smallest absolute value to 0. This makes sense: when we pass data through a DNN, the data is multiplied by the neuron weights to produce the output, and multiplying a number by a weight that is almost 0 yields a result that is almost 0. By setting such weights to exactly 0, we lose little information and don’t change the model much. On the other hand, a model that contains a lot of zeros is easier to compress, and its inference can be optimized, since multiplications by 0 always yield 0 and can be skipped. The picture below illustrates the idea: it shows the model from the picture above after roughly 50% of the neurons’ weights were set to 0.
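As a sketch of the technique just described (not the actual code used in the project), magnitude pruning can be implemented in a few lines of NumPy; the function name and the weight shapes below are hypothetical choices for illustration:

```python
import numpy as np

def prune_by_magnitude(weights, sparsity):
    """Set the given fraction of smallest-|w| entries to zero (magnitude pruning)."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)                 # number of weights to zero out
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]  # k-th smallest absolute value
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0     # zero everything at or below it
    return pruned

rng = np.random.default_rng(42)
W = rng.normal(size=(100, 100))   # a stand-in weight matrix
Wp = prune_by_magnitude(W, 0.5)
print(np.mean(Wp == 0))           # 0.5 — half the weights are now zero
```

Real frameworks typically prune gradually over several training steps and fine-tune in between, rather than zeroing everything in one shot, so the network can recover the lost accuracy.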

Below, you can see some charts from the pruning experiments I conducted. First of all, I have to mention that I experimented with a shallow model (fewer than 10 layers) and a small dataset, as I couldn’t afford to train a deep model on proper data: it would have taken a lot more time, and I wouldn’t have managed to do much besides it. That’s why training a single model took only around 11 minutes and the unpruned model had a size of around 10 MB. The plots below visualize how the compressed model size, the model accuracy and the training time change with the final model sparsity, where final sparsity is the fraction of the model’s weights set to 0; e.g. a sparsity of 0.75 means that 75% of the weights were set to 0.
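To see why a sparse model compresses better, here is a quick sketch (an assumed setup, not the experiment itself) that gzips a random weight matrix before and after zeroing the 75% of its entries with the smallest absolute value:

```python
import gzip
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(500, 500)).astype(np.float32)  # stand-in weight matrix

def gz_size(arr):
    """Size in bytes of the gzip-compressed raw weight buffer."""
    return len(gzip.compress(arr.tobytes()))

dense_size = gz_size(W)

# Zero out the 75% of weights with the smallest absolute value (sparsity 0.75).
flat = np.abs(W).ravel()
k = int(0.75 * flat.size)
threshold = np.partition(flat, k - 1)[k - 1]
Wp = W.copy()
Wp[np.abs(Wp) <= threshold] = 0.0
sparse_size = gz_size(Wp)

print(dense_size, sparse_size)  # the pruned matrix compresses far better
```

Random floats have high entropy and barely compress, while long runs of zero bytes compress extremely well, which is exactly the effect the model-size chart shows.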

I think that’s everything I wanted to write in this post! It became pretty long anyway (I tried to be as concise as possible, but as you can see, a lot is happening), so I hope you didn’t get bored and reached this point. If you did, thank you for your time and stay tuned: in the upcoming weeks I will train a deep model on a big dataset using a Power9 cluster and experiment with other methods of model compression and optimization!
