Time to speed things up!

Today I will tell you how to speed up a program running on a GPU. Do you remember the accumulation tree example from my last post? I was provided with a working version of it, with a global queue that stores tasks ready to be launched.

The protagonists of our story are the following. The threads are our Roman soldiers: they run on floating-point units (FPUs) and do the computation. They are grouped in units of 32 threads called warps. Threads within the same warp execute the same instruction at the same time, but on different data. Warps are themselves grouped into blocks. Threads within the same block can communicate quickly through shared memory and can synchronise together. However, no inter-block communication or synchronisation is allowed except through the global memory, which is much slower.
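To make the vocabulary concrete, here is a minimal CUDA sketch (not taken from the project) showing where a thread sits in this hierarchy; the kernel name, the block size of 256 and the data it touches are made up for illustration.

    // Minimal illustration of the thread hierarchy (hypothetical kernel, block of 256 threads).
    __global__ void hierarchy_demo(const float *in, float *out, int n)
    {
        __shared__ float tile[256];                  // shared memory: fast, visible to the whole block

        int tid  = threadIdx.x;                      // index of the thread within its block
        int lane = tid % 32;                         // position of the thread within its warp
        int gid  = blockIdx.x * blockDim.x + tid;    // global index over all blocks

        if (gid < n)
            tile[tid] = in[gid];                     // threads of a warp execute this load together
        __syncthreads();                             // block-wide barrier: no such barrier exists across blocks

        if (gid < n && lane == 0)                    // lane 0 is the natural "warp master" candidate
            out[gid] = tile[tid];
    }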

The tree is another player. It lives in global memory since all threads should be able to access it. Finally, there is the queue, which also lives in global memory. It is implemented as a linked list, meaning that its size adapts to the number of tasks it contains. To avoid race conditions, accesses to the tree and the queue are protected by a mutual exclusion system (mutex). We rely on a specific implementation of a mutex for GPUs, which allows only one thread at a time to access the data it protects. To avoid deadlocks (a situation where two threads wait forever for an event that will never occur), only one thread per warp tries to enter the mutex. We call it the warp master, and in this early version, only the warp masters can work, as follows:

  1. Fetch a task from the queue
  2. Synchronisation of the block
  3. Execute the task
  4. Synchronisation of the block
  5. Push new tasks (if any)

Step 5 is performed only if the task executed at step 3 resolves a dependency and thus creates a new task.
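For illustration, here is a rough CUDA sketch of such a warp-master loop. The Task, Queue and Tree types and the pop/execute/push helpers are hypothetical stand-ins for the project's real code, and the mutex uses the classic atomicCAS spin-lock pattern.

    // Hypothetical spin-lock mutex: only one thread at a time may hold it.
    struct Mutex { int flag; };                                  // 0 = free, 1 = taken

    __device__ void lock(Mutex *m)   { while (atomicCAS(&m->flag, 0, 1) != 0) { } }
    __device__ void unlock(Mutex *m) { atomicExch(&m->flag, 0); }

    // Placeholders for the real data structures and task logic (assumed to exist elsewhere).
    struct Task  { int node; };
    struct Tree;
    struct Queue;
    __device__ bool pop(Queue *q, Task *out);                        // fetch a ready task
    __device__ bool execute(Tree *t, const Task *task, Task *child); // true if a new task is created
    __device__ void push(Queue *q, Task t);                          // make a new task available

    __global__ void worker(Queue *queue, Tree *tree, Mutex *queue_mutex, Mutex *tree_mutex)
    {
        bool warp_master = (threadIdx.x % 32 == 0);              // one candidate per warp avoids deadlocks
        Task task, child;
        bool has_task = false, spawned = false;

        if (warp_master) {                                       // 1. fetch a task from the queue
            lock(queue_mutex);
            has_task = pop(queue, &task);
            unlock(queue_mutex);
        }
        __syncthreads();                                         // 2. synchronisation of the block

        if (warp_master && has_task) {                           // 3. execute the task
            lock(tree_mutex);
            spawned = execute(tree, &task, &child);
            unlock(tree_mutex);
        }
        __syncthreads();                                         // 4. synchronisation of the block

        if (warp_master && spawned) {                            // 5. push the new task, if any
            lock(queue_mutex);
            push(queue, child);
            unlock(queue_mutex);
        }
    }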

Cutting allocations

The first improvement was to replace the queue implementation with my own, based on a fixed-size queue. The reason is that memory allocation is very expensive on a GPU, and with an adaptive-size queue you allocate and free memory every time a thread pushes or pops a task.
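A fixed-size queue can be as simple as a circular buffer allocated once in global memory. The sketch below (reusing the hypothetical Task type from above, with an arbitrary capacity) is only meant to show the idea; both helpers assume the caller holds the queue's mutex.

    #define QUEUE_CAPACITY 1024                       // illustrative capacity, chosen arbitrarily

    // Fixed-size circular buffer: allocated once, never resized, no per-task allocation.
    struct FixedQueue {
        Task buffer[QUEUE_CAPACITY];
        int  head;                                    // index of the next task to pop
        int  tail;                                    // index of the next free slot
    };

    __device__ bool fq_push(FixedQueue *q, Task t)    // caller holds the queue's mutex
    {
        if (q->tail - q->head == QUEUE_CAPACITY)      // full
            return false;
        q->buffer[q->tail % QUEUE_CAPACITY] = t;
        q->tail++;
        return true;
    }

    __device__ bool fq_pop(FixedQueue *q, Task *out)  // caller holds the queue's mutex
    {
        if (q->head == q->tail)                       // empty
            return false;
        *out = q->buffer[q->head % QUEUE_CAPACITY];
        q->head++;
        return true;
    }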

Reestablishing private property

The second idea was to reduce contention on the global queue, since working threads are constantly trying to access it.
I added a small private queue for each block, which can be stored either in the cache or in shared memory for fast access.
The threads within a block use the block's private queue first, and fall back to the global queue when the private one is full (push) or empty (pop).
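A sketch of that push/pop logic with fallback could look as follows. The private queue shown here lives in shared memory and is touched by one thread of the block at a time, while the global queue is the fixed-size one from the previous sketch, still protected by its mutex; names, capacities and overflow handling are assumptions.

    #define PRIVATE_CAPACITY 64                       // illustrative, small enough for shared memory

    // Per-block private queue, typically declared __shared__ inside the kernel.
    struct PrivateQueue {
        Task buffer[PRIVATE_CAPACITY];
        int  head, tail;
    };

    // Pop from the block's private queue first, fall back to the global queue when empty.
    __device__ bool pop_task(PrivateQueue *priv, FixedQueue *glob, Mutex *glob_mutex, Task *out)
    {
        if (priv->head != priv->tail) {               // private queue not empty
            *out = priv->buffer[priv->head % PRIVATE_CAPACITY];
            priv->head++;
            return true;
        }
        lock(glob_mutex);                             // fallback: contended global queue
        bool ok = fq_pop(glob, out);
        unlock(glob_mutex);
        return ok;
    }

    // Push to the private queue first, spill to the global queue when it is full.
    __device__ void push_task(PrivateQueue *priv, FixedQueue *glob, Mutex *glob_mutex, Task t)
    {
        if (priv->tail - priv->head < PRIVATE_CAPACITY) {
            priv->buffer[priv->tail % PRIVATE_CAPACITY] = t;
            priv->tail++;
            return;
        }
        lock(glob_mutex);                             // fallback: contended global queue
        fq_push(glob, t);                             // (handling of a full global queue omitted)
        unlock(glob_mutex);
    }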

Solving the unemployment crisis

So far, only a few threads (the warp masters) are working. It's time to put an end to that. First, since threads within a block are synchronised, their accesses to the private queue happen at the same time and are therefore serialised by the mutex. I decided that only one thread per block, the block master, would access the queue, and be in charge of fetching the work (step 1) and solving the dependencies (step 5) for all the threads.
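Here is a hypothetical sketch of that block-master pattern: thread 0 fetches up to one task per thread, every thread executes its own task, and thread 0 pushes the newly created tasks back. Queue access, block size and helper names are assumptions reusing the earlier sketches, and the protection of the tree is omitted for brevity.

    __global__ void block_master_worker(FixedQueue *queue, Mutex *queue_mutex, Tree *tree)
    {
        __shared__ Task tasks[256];                   // one slot per thread (block of 256 assumed)
        __shared__ Task children[256];
        __shared__ int  n_tasks, n_children;

        bool block_master = (threadIdx.x == 0);

        if (block_master) {                           // step 1: a single thread touches the queue
            n_tasks = 0;
            n_children = 0;
            lock(queue_mutex);
            Task t;
            while (n_tasks < blockDim.x && fq_pop(queue, &t))
                tasks[n_tasks++] = t;
            unlock(queue_mutex);
        }
        __syncthreads();                              // step 2: everyone sees the fetched tasks

        Task child;
        bool spawned = false;
        if (threadIdx.x < n_tasks)                    // step 3: every thread executes its own task
            spawned = execute(tree, &tasks[threadIdx.x], &child);

        if (spawned)                                  // collect the new tasks in shared memory
            children[atomicAdd(&n_children, 1)] = child;
        __syncthreads();                              // step 4: all children are collected

        if (block_master) {                           // step 5: a single thread pushes them back
            lock(queue_mutex);
            for (int i = 0; i < n_children; ++i)
                fq_push(queue, children[i]);
            unlock(queue_mutex);
        }
    }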

Breaking down the wall

Now that all threads are working, they all wait to enter the mutex protecting the tree, even when they are trying to access different parts of it. So I removed the mutex and made sure that all operations on the tree are atomic. It is a bit as if there were a mutex on each node of the tree, making it possible for several threads to access the tree concurrently.
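The actual tree operations are not detailed in the post, so the following is only a plausible sketch of the idea: a contribution is folded into the parent node and its dependency counter is resolved with atomics, so no global lock is needed. The node layout is entirely made up.

    // Hypothetical node layout: the real tree structure is not described in the post.
    struct Node {
        float value;        // partial accumulation
        int   pending;      // number of children not yet accumulated
        int   parent;       // index of the parent node, -1 for the root
    };

    // Atomically fold a child's contribution into its parent.
    // Several threads may touch the tree, and even the same node, concurrently.
    // Returns true when this call resolved the parent's last dependency,
    // i.e. a new task (accumulating the parent itself) should be pushed.
    __device__ bool accumulate(Node *nodes, int node, float contribution)
    {
        int parent = nodes[node].parent;
        if (parent < 0)
            return false;                                    // root: nothing above it

        atomicAdd(&nodes[parent].value, contribution);       // add our share to the parent
        int before = atomicSub(&nodes[parent].pending, 1);   // one dependency resolved
        return before == 1;                                  // we were the last missing child
    }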

Fighting inequality

Removing the mutex from the tree resulted in a huge gain in time, so I tried to also get rid of it for the shared queue accesses. First, I split the queue so that there is one shared queue per block.
Because the queues are fixed in size, the push and pop operations are independent. A block master can pop only from its own queues (private and shared), but can push to its own private queue, its own shared queue, and also to other blocks' shared queues.
Since each shared queue therefore has a single consumer, we do not need mutex protection for the popping operation, as sketched below.
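To illustrate why popping needs no lock, here is a sketch of such a per-block shared queue, assuming pushes coming from several blocks are still serialised (here with the mutex) while the single consumer, the owning block's master, pops without any protection. The structure and names are again hypothetical.

    #define SHARED_CAPACITY 1024                      // illustrative capacity

    // One such queue per block, kept in global memory.
    struct BlockQueue {
        Task  buffer[SHARED_CAPACITY];
        int   head;                                   // written only by the owning block's master
        int   tail;                                   // written by any block master, under the push mutex
        Mutex push_mutex;                             // pushes may come from several blocks
    };

    // Push: any block master may target any block's shared queue.
    __device__ bool bq_push(BlockQueue *q, Task t)
    {
        bool ok = false;
        lock(&q->push_mutex);
        if (q->tail - q->head < SHARED_CAPACITY) {    // not full (a concurrent pop only frees space)
            q->buffer[q->tail % SHARED_CAPACITY] = t;
            __threadfence();                          // make the payload visible before the index
            q->tail++;
            ok = true;
        }
        unlock(&q->push_mutex);
        return ok;
    }

    // Pop: only the owning block's master ever calls this, so no mutex is needed.
    __device__ bool bq_pop(BlockQueue *q, Task *out)
    {
        if (q->head == q->tail)                       // empty (tail can only grow in the meantime)
            return false;
        *out = q->buffer[q->head % SHARED_CAPACITY];
        q->head++;
        return true;
    }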

Last but not least, the work must be equally shared among the blocks (this is called load balancing). I provided a function that tells a calling thread in which shared queue a newly created task should be pushed.
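The post does not say how the target queue is chosen, so here is only one possible policy: a simple round-robin over the blocks' shared queues using a global counter. The real load-balancing function may well be smarter.

    // Hypothetical load-balancing helper: spread newly created tasks evenly
    // over the per-block shared queues with a global round-robin counter.
    __device__ int pick_target_queue(int *round_robin, int num_blocks)
    {
        return atomicAdd(round_robin, 1) % num_blocks;
    }

    // Usage sketch: push a freshly created task to the chosen block's shared queue.
    // bq_push(&shared_queues[pick_target_queue(counter, gridDim.x)], new_task);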

How much is the speedup?

Quite high, actually. The optimized version I wrote runs 450 times faster than the version I was provided. For an octree of depth 5, the execution time is reduced from 700 ms to 1.5 ms, making it usable in real applications.
