My task: tasking on GPUs

Is your graphics card able to run N-body simulations in a smart way? A complex tree algorithm, a sophisticated tasking system: is all that really a task for a GPU? No, some will say, a graphics card can only do basic linear algebra operations. Well, maybe the hardware is capable of much more…

It is now time to give you some insight into my project. The main goal is to make progress towards the implementation of a smart algorithm to solve the N-body problem. To get the main idea behind this algorithm (without going into all the dirty details), you can check the video I made for PRACE there.

To put it in a nutshell, the Fast Multipole Method (FMM) is an algorithm to compute long-range forces within a set of particles. It makes it possible to solve the N-body problem numerically with linear complexity, where it would otherwise be quadratic (when computing the interactions between each pair of particles). Doubling the number of particles then only doubles the computation time instead of quadrupling it. However, the FMM is hard to parallelize because of data dependencies. Tasking, meaning splitting the work into tasks and putting them into a queue, helps a lot to keep all threads busy. A working tasking framework for the FMM has been implemented on Central Processing Units (CPUs) by D. Haensel (2018). Will such a tasking framework run efficiently on General Purpose Graphics Processing Units (GPGPUs, or more simply GPUs)?
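To see where the quadratic cost comes from, here is a minimal sketch (my own illustration, not code from the FMM project) of the direct pairwise summation that the FMM avoids: every particle interacts with every other particle, so doubling the number of particles quadruples the work.

```
#include <cmath>
#include <vector>

// One particle: position, mass, and the accumulated force acting on it.
struct Particle {
    double x = 0, y = 0, z = 0, mass = 0;
    double fx = 0, fy = 0, fz = 0;
};

// Direct summation: every particle interacts with every other one,
// hence N * (N - 1) interactions, quadratic in the number of particles.
void direct_forces(std::vector<Particle>& p) {
    for (std::size_t i = 0; i < p.size(); ++i) {
        for (std::size_t j = 0; j < p.size(); ++j) {
            if (i == j) continue;
            double dx = p[j].x - p[i].x;
            double dy = p[j].y - p[i].y;
            double dz = p[j].z - p[i].z;
            double r2 = dx * dx + dy * dy + dz * dz;
            double inv_r3 = 1.0 / (r2 * std::sqrt(r2));   // 1 / r^3
            p[i].fx += p[j].mass * dx * inv_r3;            // gravitational constant omitted
            p[i].fy += p[j].mass * dy * inv_r3;
            p[i].fz += p[j].mass * dz * inv_r3;
        }
    }
}
```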

The answer is not obvious at all, because CPUs and GPUs have very different architectures. My goal this summer was to shed some light on that topic.

A smooth start

Let me first introduce the problem that we will use to test the tasking framework. It is a simplified version of one of the operators of the FMM, and we named it the accumulation tree.

The principle is very simple: the content of each cell is added to its parent. So one task is exactly “adding the content of a cell to its parent”. You can already see that dependencies will appear, since a node needs the results from all its children before its own task can be launched (a rough sketch of such a task is given below). Imagine we have 10 processing units (GPU or CPU threads); the computation of the accumulation will then proceed as follows.
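To make this concrete, here is a minimal sketch of what one such task and its dependency bookkeeping could look like. This is my own illustration with made-up names, not the actual framework code; and in a real multi-threaded run both the addition and the counter update would need to be atomic or otherwise synchronised.

```
// Each cell knows its parent and how many children still have to report to it.
struct Cell {
    double value = 0.0;
    Cell* parent = nullptr;
    int pending_children = 0;   // dependencies left before this cell's own task may run
};

// One accumulation task: add the cell's content to its parent.
// Returns the parent if its task just became ready, nullptr otherwise.
Cell* run_accumulation_task(Cell& cell) {
    if (cell.parent == nullptr) return nullptr;   // the root reports to no one
    cell.parent->value += cell.value;             // "add the content of a cell to its parent"
    if (--cell.parent->pending_children == 0)
        return cell.parent;                       // last child done: the parent's task is ready
    return nullptr;
}
```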

Initialisation

All leaves are initialized with 1. All tasks that are ready, that is, all tasks from leaves, are pushed into the queue.

Round 1

Ten blue tasks are done. Two new green tasks are ready, so they are pushed into the queue.

Round 2

The last six remaining blue tasks are done, as well as two green tasks. The last two green tasks become ready and are pushed into the queue. Here we can see how tasking maximizes thread usage, since every task that is ready can be executed.

Round 3

The last two green tasks are executed. We get the correct result, hip hip hooray!
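Putting the pieces together, here is a toy, single-threaded re-enactment of these rounds (my own sketch, not the real framework): a queue of ready tasks, and at most ten of them executed per round. The tree has 16 leaves grouped into 4 inner cells, which matches the numbers of blue and green tasks above, so the program reproduces exactly the three rounds.

```
#include <cstdio>
#include <deque>
#include <vector>

struct Cell {
    double value = 0.0;
    Cell* parent = nullptr;
    int pending_children = 0;   // children that still have to report
};

int main() {
    const int units = 10;       // number of processing units (threads)
    const int leaves = 16;

    // Build a small tree: 16 leaves -> 4 inner cells -> 1 root.
    std::vector<Cell> inner(4), leaf(leaves);
    Cell root;
    root.pending_children = 4;
    for (int i = 0; i < 4; ++i) {
        inner[i].parent = &root;
        inner[i].pending_children = 4;
    }
    for (int i = 0; i < leaves; ++i) {
        leaf[i].value = 1.0;            // all leaves initialised with 1
        leaf[i].parent = &inner[i / 4];
    }

    // Initialisation: all leaf tasks are ready.
    std::deque<Cell*> ready;
    for (auto& l : leaf) ready.push_back(&l);

    int round = 0;
    while (!ready.empty()) {
        ++round;
        int executed = 0;
        std::vector<Cell*> newly_ready;
        while (!ready.empty() && executed < units) {
            Cell* c = ready.front(); ready.pop_front();
            ++executed;
            c->parent->value += c->value;             // the task itself
            if (--c->parent->pending_children == 0 && c->parent->parent != nullptr)
                newly_ready.push_back(c->parent);     // the parent's task becomes ready
        }
        for (Cell* c : newly_ready) ready.push_back(c);
        std::printf("Round %d: executed %d tasks\n", round, executed);
    }
    std::printf("Root value: %.0f (expected %d)\n", root.value, leaves);
    return 0;
}
```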

And on GPUs?

Such a tasking system works well on a CPU. Why can’t we just copy-paste the code, translate some instructions and add a few annotations to make it work on a GPU?

Because many assumptions we rely on for CPUs no longer hold on GPUs. The biggest one is thread independence. You can compare the CPU to a barbarian army: only a few strong soldiers, each of them able to act individually.

credits: Wikimedia CC-BY-SA

A graphics card, however, is more like the Roman army, with a lot of men divided into units. All soldiers within the same unit are bound to do exactly the same thing (but on different targets).
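To show what “being bound to do exactly the same thing” means in practice, here is a tiny CUDA example (my own illustration, nothing from the project): the 32 threads of a warp execute instructions in lockstep, so when they disagree on a branch, the two sides of the branch run one after the other with the non-participating threads masked off, rather than truly independently.

```
#include <cstdio>

// All 32 threads of a warp execute the same instruction stream.
// The two branches below are serialised within a warp, not run concurrently.
__global__ void lockstep_demo(int* out)
{
    int lane = threadIdx.x % 32;    // position of this thread inside its warp
    if (lane < 16) {
        out[threadIdx.x] = 1;       // first half of the warp...
    } else {
        out[threadIdx.x] = 2;       // ...then the second half
    }
}

int main()
{
    int* out;
    cudaMallocManaged(&out, 32 * sizeof(int));
    lockstep_demo<<<1, 32>>>(out);
    cudaDeviceSynchronize();
    for (int i = 0; i < 32; ++i) std::printf("%d ", out[i]);
    std::printf("\n");
    cudaFree(out);
    return 0;
}
```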

credits: https://patricklarkinthrillers.files.wordpress.com

Even if I’m sure you are looking forward to knowing whether it is possible to get this powerful army to implement a tasking system, you will have to wait for my next blog post. Be well in the meantime!
