Mmh! Now THIS is a Tasty Kernel!

Like all entropy-generating processes, the Summer of HPC must come to an end. However, you won’t get rid of me that easily, since there’s still one more blog post left to be done! This time I’ll tell you more about the HIP framework, show some benchmark results of my code and look back at how the Summer of HPC programme went.

Figure 0: Me, when I notice a delicious-looking CUDA GPU kernel, ripe for translation to HIP.
HIP, or Heterogeneous-Compute Interface for Portability, enables writing general-purpose GPU code for both AMD’s and Nvidia’s compute architectures at once. This is beneficial for many reasons. Firstly, you’re not bound to one graphics card manufacturer with your code, so you may freely select the one which suits your needs the most. Secondly, your program can reach many more users if they can run it on whatever GPU they have. And thirdly, managing two different code bases is time-consuming and tedious, and as the common saying in high-performance computing goes, your time is worth more than the CPU time.
HIP code is a layer of abstraction on top of CUDA or HCC code, the manufacturer-specific interfaces for writing GPGPU code. This design gives the framework a few notable properties. The HIP compiler, hipcc, converts the source code into CUDA or HCC code at compile time and then compiles it with the respective vendor compiler. As a consequence, the framework can in general only use the features which both interfaces have in common, so it is not quite as powerful as either of them separately. However, it is possible to incorporate hardware-specific features into the code fairly easily, at the cost of more careful code management.
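To give a concrete idea of what this looks like in practice, here is a minimal, self-contained sketch of a HIP vector addition (not code from my project). The syntax follows a recent HIP release; the launch macro and built-in variable names have shifted a bit between HIP versions, and error checking is left out for brevity.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

// Plain vector addition: each thread handles one element.
__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

int main() {
    const int n = 1 << 20;
    std::vector<float> ha(n, 1.0f), hb(n, 2.0f), hc(n);

    float *da, *db, *dc;
    hipMalloc((void**)&da, n * sizeof(float));
    hipMalloc((void**)&db, n * sizeof(float));
    hipMalloc((void**)&dc, n * sizeof(float));

    hipMemcpy(da, ha.data(), n * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(db, hb.data(), n * sizeof(float), hipMemcpyHostToDevice);

    // hipLaunchKernelGGL stands in for CUDA's <<<grid, block>>> launch syntax.
    const int block = 256;
    const int grid  = (n + block - 1) / block;
    hipLaunchKernelGGL(vec_add, dim3(grid), dim3(block), 0, 0, da, db, dc, n);

    hipMemcpy(hc.data(), dc, n * sizeof(float), hipMemcpyDeviceToHost);
    printf("c[0] = %f\n", hc[0]);  // expect 3.0

    hipFree(da); hipFree(db); hipFree(dc);
    return 0;
}
```

The same source file goes through hipcc regardless of whether the target is an Nvidia or an AMD card, which is exactly the point of the whole exercise.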
There’s also a script included in the HIP framework which converts code in the opposite direction: from CUDA C/C++ to HIP C++. This is a very neat program, because lots of existing GPU code is written in CUDA, which is considered the current “lingua franca” of GPGPU. The script makes it possible to port code with near-zero effort, at least in theory. I tried it out in my project, but with meagre results. The HIP framework is currently under heavy development, so hopefully this is something that will improve in the future.
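To give a flavour of what the conversion amounts to, here is a hand-written before/after fragment, not actual output from the script or from my project; the names (my_kernel, d_x, h_x, grid, block, bytes, n) are placeholders, and the exact output depends on the HIP version.

```cpp
// CUDA original (fragment):
cudaMalloc(&d_x, bytes);
cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
my_kernel<<<grid, block>>>(d_x, n);
cudaDeviceSynchronize();
cudaFree(d_x);

// After translation to HIP (fragment):
hipMalloc(&d_x, bytes);
hipMemcpy(d_x, h_x, bytes, hipMemcpyHostToDevice);
hipLaunchKernelGGL(my_kernel, dim3(grid), dim3(block), 0, 0, d_x, n);
hipDeviceSynchronize();
hipFree(d_x);
```

For straightforward runtime API calls it really is little more than renaming; the tricky cases are libraries and the more exotic CUDA features, which is where my thin results came from.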
No fancy translation scripts are needed, though. The HIP framework’s design and syntax are very natural to someone coming from CUDA, so if you’re familiar with it, you might as well write the whole code yourself. If you’re not familiar with GPGPU programming, you can follow the CUDA tutorials and simply translate the commands to HIP. It’s that easy. To illustrate this, I created a comic (figure 1) of the two toughest guys of the Pulp Fiction movie, but instead of being hired killers, they’re HPC experts.

Figure 1: A discrete-time model simulating human communication (also known as: a comic). The simulation in question concerns a hypothetical scenario in which the characters Vincent Vega and Jules Winnfield from Quentin Tarantino’s motion picture Pulp Fiction have made different career choices.
In high-performance computing, timing is everything, so I did extensive benchmarking of my program. The results are shown in figure 2. The top graph compares the runtimes of the program on Nvidia’s Tesla K40m and AMD’s Radeon R9 Fury graphics cards with single and double precision floating point numbers. The behaviour is quite expected: the single precision calculations are faster than the double precision ones. There’s a curious doubling of the runtime on the Tesla at around p = 32. This comes from the fact that 32 is the maximum number of threads running in Single-Instruction-Multiple-Data fashion on CUDA hardware. Such a group of threads is called a warp, and its effects could be taken into account with different sorts of implementations.
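The warp size is a hardware property, and it differs between the vendors: Nvidia cards use 32-thread warps, while AMD’s GCN cards (such as the R9 Fury) execute 64-thread wavefronts, so tuning for one does not automatically carry over to the other. As a small sketch of how you can check this at runtime through the HIP API:

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

// Print the SIMD width (warp/wavefront size) reported by the first GPU.
int main() {
    hipDeviceProp_t prop;
    hipGetDeviceProperties(&prop, 0);  // properties of device 0
    printf("%s: warp/wavefront size = %d\n", prop.name, prop.warpSize);
    return 0;
}
```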
The bottom graph shows a different take on the problem: a comparison between the usual approach and an iterative approach on the Tesla card. In the usual approach, the operator matrices are stored in global GPU memory and each thread reads them from there. In the iterative approach, the operator matrices are computed at runtime by each thread. This results in more computation but less reading from memory. Surprisingly, the iterative approach with single precision floating point numbers was faster than the other approaches when the number of multipoles was large.
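To make the trade-off concrete, here is a deliberately simplified sketch with made-up kernel names and a toy coefficient formula; it is not my actual multipole code, only an illustration of the two strategies: read a precomputed table from global memory, or spend a few extra floating point operations to rebuild the values in registers.

```cpp
#include <hip/hip_runtime.h>

// Variant 1: coefficients are precomputed and read from global memory.
// Each thread does one multiply but also one extra global-memory load.
__global__ void apply_precomputed(const float* coeff, const float* x,
                                  float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = coeff[i] * x[i];
}

// Variant 2: the same coefficient is recomputed by every thread at runtime.
// More arithmetic per thread, but no extra global-memory traffic.
__global__ void apply_recomputed(const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float c = 1.0f / (1.0f + (float)i);  // stand-in for a real operator entry
        y[i] = c * x[i];
    }
}
```

Which variant wins depends on whether the kernel is limited by memory bandwidth or by arithmetic throughput, which is why the answer in my benchmarks changed with the problem size.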

Figure 2: Runtime graphs for my GPU program. The computation was done for 4096 multipole expansions in parallel.
During this study programme, I met lots of awesome people from all around the world, researchers and students alike. Not only did I meet the people affiliated with the Summer of HPC, but also many scientists and Master’s or Ph.D. students working at the Jülich Supercomputing Centre and, of course, the wonderful guest students of the JSC Guest Student Programme. My skill repertoire has also expanded tremendously with, for example, the version control system git, the LaTeX-based graphics package TikZ, the parallel computing libraries MPI and OpenMP, creative writing and video directing. All in all, this summer has been an extremely rewarding experience: I made very good friends, got to travel around Europe and learned a lot. A totally unforgettable experience!

Figure 3: Horse-riding supercomputing barbarians from Jülich ransacking and plundering the hapless local town of Aachen. Picture taken by awesomely barbaric Julius.
After all this, you could ask me which brand I recommend between Nvidia and AMD, or which GPU I would prefer, that is, which GPU would be mine.
My GPU will be the one which says Bad Motherlover. That’s it, that’s my Bad Motherlover.