Concluding my experience with final thoughts and video

Hi everyone!
Welcome to my fourth and last blog post about my work on SoHPC 2020. Today, I will explain one last optimisation and then I will share the project’s video presentation with you. So let’s get started!

Optimised CUDA C version

In my last blog post, I presented a CUDA C program that launches a single cooperative kernel (a function executed on the GPU) for all iterations, to avoid the overhead of launching multiple kernels on the GPU. To achieve that, I needed to use the CUDA runtime launch API, which provides grid-wide synchronisation across every GPU thread.

#include <cooperative_groups.h>
// launching a cooperative kernel; inside it, all threads of the grid can
// synchronise with cooperative_groups::this_grid().sync()
cudaLaunchCooperativeKernel((void *)kernel, blocks, threads_per_block, args);

However, I found out that the API limits the number of GPU blocks that can be launched, and consequently the total number of threads. That means it cannot launch one thread for each element of the matrix (at least for large matrices), which would be the ideal situation. So when the available threads are fewer than the elements of the array, the only solution is to assign multiple elements to each thread. This, of course, increases the work each thread has to do, and for large scale factors it causes a performance drop.

To further investigate this, I developed another CUDA C version, which does not use the above API and launches multiple smaller kernels per iteration.

// standard way to launch a kernel in CUDA C: one launch per call,
// so the program issues several of these per iteration
kernel<<<blocks, threads_per_block>>>(args);

After that, I used a profiler to see where this GPU program spends its time, and noticed that launching the kernels accounts for only a small portion of the total runtime.

For a large problem (scale factor = 192), the program spends merely 0.18% of the time to launch the kernels.

Eventually, it turned out that the overhead of launching multiple kernels instead of one is minor. Additionally, this new CUDA C code places no limit on the number of GPU threads, meaning that we can launch one thread for each element, which explains why it achieves the best performance.

Final performance graph, updated with the new version, C CUDA - separate kernels

Video and Conclusion

My partner, Alex, and I have prepared a video presentation in which we describe our progress over the entire summer. Feel free to share your thoughts on it in the comments.

Video presentation of the project

PRACE Summer of HPC 2020 was full of beautiful moments, but I think one of the most memorable ones is when our mentor told us at the end that our performance results exceeded his expectations and that he was really satisfied with our work.

All in all, I am glad I participated in a programme that offered me creativity, knowledge and excitement and I am sure this will be an unforgettable experience for the years to come.
