How best to distribute work? Learning GPU programming via Lasagna
It’s been just over a month of working on my project and I’ve just about got my head around the concepts of GPU programming using CUDA. In this post, I will try to explain some concepts of my work using everyone’s favorite analogy – FOOD!!! Let’s get into it then.
CPU vs GPU: What’s the difference?
Imagine that you are the head chef in a 5-star Italian restaurant, which serves the best Lasagna on the planet. Of course, the Lasagna is going to be the most ordered dish in your restaurant. And due to the soaring popularity of your restaurant, you begin to receive a crazy number of orders (say around 1000 Lasagnas) every hour. You could sweat it out and make each Lasagna, one by one, baking each of them to perfection. However, at the end of one week, you’re absolutely stressed out and can no longer handle this volume of orders every day. Now, this is where your manager comes and helps you out, telling you that you can recruit 100 junior chefs to help you get those orders out faster. The catch, however, is that the junior chefs are not ‘that’ well trained to make decisions on their own. You can, however, give them instructions and they will follow them to a T (sometimes even faster than you!).
This is the difference between a CPU and a GPU. You are similar to a CPU. You can operate independently using your own logic and get work done, but only one order at a time. The group of 100 junior chefs you’ve recruited works like a GPU. There’s not much independent logic, and everyone in the group does the same work that they’ve been instructed to do. But given the right instructions and facilities, you can get a whole lot more work done using the group. Now, let’s see how the actual work can be distributed among our new recruits.
Making Lasagna vs Moving Plasma Particles
The entire Lasagna recipe can be roughly broken down into three important stages, as can be seen above (preparing the ingredients, baking, and adding the finishing touches). Let’s try to distribute the work in each of the three stages.
Let’s assume that the first stage involves cutting and sautéing the vegetables and meat, and preparing the sauce. Before you can give instructions to your army of chefs, you need to assemble all the chefs together to explain exactly what needs to be done. You also need to distribute the ingredients equally among the 100 chefs. Once the junior chefs have the ingredients and the instructions, they can each go to their individual workstations and get to work!!!
Is this the best way to go ahead? We need to see whether the time taken for you to work alone is less or more than the time taken to assemble the chefs, distribute the ingredients and then have the chefs finish. We can safely assume that distributing the cooking among 100 chefs would easily take less time than one person working on 100 dishes, despite the extra time required for assembling the chefs and distributing the ingredients. So yes, you can breathe easy now. But wait, your manager tells you that the number of orders is doubling every day!!!! Now, the amount of ingredients you need to distribute becomes humongous, and this alone takes a lot more time than before.
One of the reasons you have been made head chef is that you can come up with solutions to such problems, and EUREKA!!! You have come up with a very simple solution to it all!!! Why not just give each of the chefs the list of ingredients and ask them to get the ingredients on their own? That way, you remove the distribution step completely and they can start cooking immediately. YOU GENIUS!!!!!!
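To put rough numbers on this trade-off, here is a toy timing model in Python. It is only a sketch: the function names and all the timings (in minutes) are made up for illustration and are not measurements from a real kitchen or a real GPU. Handing out ingredients is modelled as a cost that grows with the number of orders, and setting it to zero mimics the chefs fetching their own ingredients.

```python
import math

def solo_time(n_orders, t_cook):
    """Head chef cooks every order alone, one after another."""
    return n_orders * t_cook

def team_time(n_orders, t_cook, n_chefs, t_briefing, t_handout_per_order):
    """Brief the team once, hand out ingredients, then cook in rounds."""
    rounds = math.ceil(n_orders / n_chefs)  # each chef cooks one order per round
    return t_briefing + n_orders * t_handout_per_order + rounds * t_cook

# Hypothetical numbers: 100 chefs, 1 minute per dish, 5 minute briefing.
for n_orders in (1000, 2000, 4000):
    alone = solo_time(n_orders, 1.0)
    handed_out = team_time(n_orders, 1.0, 100, 5.0, 0.25)   # head chef hands out
    self_service = team_time(n_orders, 1.0, 100, 5.0, 0.0)  # chefs fetch their own
    print(n_orders, alone, handed_out, self_service)
```

With these made-up numbers the team always beats the solo chef, but the hand-out term makes up most of the team’s time and keeps growing with the orders. Dropping it leaves only the briefing and the cooking.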
How does this compare with the particle-in-cell (PIC) code I’m working on? When we need the GPU to do some work, we first have to call the special code (instructions) we’ve written for the GPU and then transfer, from the CPU, all the data that each ‘worker’ in the GPU will work with (distribute ingredients). These codes run with a huge number of particles (more than 1,000,000,000 particles!!!!) and you can imagine that transferring this data takes a relatively long time. To solve this problem, we simply create these particles on the GPU itself (similar to how each junior chef gets their own ingredients).
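The idea of creating the particles where they are used can be sketched in a few lines. This is plain Python standing in for the GPU, not our actual PIC code: the function name, the seed scheme and the one-dimensional (position, velocity) particles are all assumptions made for illustration. In a real CUDA code, each GPU thread would do something similar with a device-side random number generator such as cuRAND.

```python
import random

def make_local_particles(worker_id, n_local, base_seed=2024, box_length=1.0):
    """Create this worker's particles locally from a few small parameters."""
    # Give every worker its own reproducible random stream.
    rng = random.Random(base_seed * 1_000_003 + worker_id)
    # Each particle is a (position, velocity) pair in one dimension.
    return [(rng.uniform(0.0, box_length), rng.gauss(0.0, 1.0))
            for _ in range(n_local)]

# Only (worker_id, n_local) travel to each worker, never the particle list.
particles = [make_local_particles(w, 4) for w in range(3)]
print(len(particles), len(particles[0]))
```

The point is that each worker receives only a recipe (an ID and a count) instead of a big bag of ingredients, so the amount of data sent stays tiny no matter how many particles are created.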
Now the lasagna needs to be baked. However, the problem is that the restaurant is still using the one oven which you have in your workstation. So each of the half-finished lasagnas needs to be brought to you first. Luckily, the oven you have is absolutely high-end and efficient, and you are still able to manage the large number of orders!!! PHEW!!! But afraid that the number of orders might increase further, you approach the manager, asking if the restaurant can afford individual ovens for each of the junior chefs’ workstations. She says that it could be done, but the ovens wouldn’t be as good as the current one. You’re now stuck in a dilemma! Is it worth investing in 100 individual slower ovens, or do you continue with the extremely fast oven you already use? You also understand that it depends on the number of orders you receive. If that keeps increasing, it may be a good idea to invest in the new ovens. If the orders stay the same or decrease, you could continue working as before without further investment.
Again, let’s compare this with our PIC code. The code for this step currently uses a very serial (it can be done only in a certain sequence) but extremely efficient and fast algorithm to solve for the fields. However, you can try to change the algorithm completely, in such a way that each ‘worker’ in the GPU can work in parallel. It might not be the most efficient algorithm, but it could be faster than the initial version if the number of particles is large enough. On the other hand, you’ll need to give extra instructions to the GPU, which also takes some more time.
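The oven dilemma boils down to a break-even question, which can be sketched as follows. This is a hypothetical cost model, not a measurement of our solver: I assume the serial algorithm costs c_serial per item, the parallel one costs more per item (c_parallel) but is shared among n_workers, and the extra GPU instructions add a fixed overhead. All names and numbers are invented for illustration.

```python
import math

def break_even_size(c_serial, c_parallel, n_workers, overhead):
    """Smallest problem size n at which the parallel version is strictly faster.

    Serial time:   c_serial * n
    Parallel time: overhead + c_parallel * n / n_workers
    Returns None if the parallel version never wins.
    """
    gain_per_item = c_serial - c_parallel / n_workers
    if gain_per_item <= 0:
        return None  # the slower ovens never pay off
    return math.floor(overhead / gain_per_item) + 1

# Hypothetical: the parallel algorithm is 4x less efficient per item,
# runs on 8 workers, and costs 1000 time units of extra setup.
print(break_even_size(1.0, 4.0, 8, 1000.0))   # 2001
print(break_even_size(1.0, 16.0, 8, 1000.0))  # None
```

Below the break-even size, the one fast oven wins. Above it, the many slower ovens do, which is exactly why the answer depends on how many orders keep coming in.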
Now that the lasagna is ready, you need to make it a 5-star dish. And since you’re the head chef, only you can add the finishing touches and check that everything is right with each lasagna. Hence, if each of your junior chefs has his or her own dish, you’ll need all of them to assemble so that you can go through the dishes one by one. Unfortunately, this can only be done by you, so this step cannot be avoided.
Similarly, in a PIC code, all the calculated results in the GPU need to be transferred to the CPU so that the required data can be extracted and presented to the user in a cool, visual manner. Unfortunately, this can be done only on the CPU for now.
Throughout the last month, my teammates Victor and Paddy and I have tried out all these different methods of work distribution between the CPU and GPU and checked which ones give us the best results. (If you have more novel ideas for similar work distribution, please mention them in the comments below!!!) We’ve managed to get some exciting results, which we will present at the end of this month. Stay tuned for this!!!! All this talk about food has got me craving some authentic Italian food. I’m off to satisfy my hunger, along with a cold beer to beat the summer. Till then, Ciao Adios!!!!