Good-bye or Taskify!

Project reference: 1918
Today’s supercomputing hardware provides a tremendous number of floating-point operations (FLOPs). While CPUs are designed to minimize the latency of a stream of individual operations, GPUs try to maximize the throughput. However, GPU FLOPs can only be harvested easily if the algorithm exhibits lots of independent data parallelism. Hierarchical algorithms like the Fast Multipole Method (FMM) cannot easily utilize all available FLOPs on GPUs because of their inherent data dependencies and the little independent data parallelism they expose.
Is it possible to circumvent these problems?
In this project, we turn our efforts towards a fully taskified FMM for GPUs. Depending on your interests, we will pursue different goals. First, the already available, taskified CPU version of the FMM can be adapted to support basic tasking on the GPU and speed up the execution. Second, a special hierarchical extension of our GPU tasking queue can be implemented and tested to achieve even higher performance.
The challenge in both assignments is to execute our small compute kernels within the tasking framework without large overheads. This also ensures portability between different generations and designs of modern GPUs.
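To make the idea of GPU tasking more concrete, here is a minimal sketch of one common pattern: a persistent CUDA kernel whose thread blocks repeatedly pull task indices from a global counter and run a small compute routine per task. This is an illustration only, not the project's actual tasking framework; all names (worker, run_task, g_next_task) are made up for the example.

#include <cstdio>
#include <cuda_runtime.h>

__device__ int g_next_task;                     // global task counter (hypothetical)

// Tiny stand-in for a small FMM compute kernel (e.g. one near-field interaction).
__device__ void run_task(int task_id, float* out)
{
    out[task_id] = 2.0f * task_id;              // dummy work
}

__global__ void worker(int num_tasks, float* out)
{
    __shared__ int task;                        // one task per block at a time
    while (true) {
        if (threadIdx.x == 0)
            task = atomicAdd(&g_next_task, 1);  // grab the next task index
        __syncthreads();
        if (task >= num_tasks) return;          // queue drained, block exits
        if (threadIdx.x == 0)
            run_task(task, out);                // execute the small kernel as a task
        __syncthreads();                        // wait before grabbing the next task
    }
}

int main()
{
    const int num_tasks = 1024;
    float* d_out = nullptr;
    cudaMalloc(&d_out, num_tasks * sizeof(float));
    int zero = 0;
    cudaMemcpyToSymbol(g_next_task, &zero, sizeof(int));
    worker<<<8, 128>>>(num_tasks, d_out);       // a few persistent blocks drain the queue
    cudaDeviceSynchronize();
    cudaFree(d_out);
    std::printf("all %d tasks processed\n", num_tasks);
    return 0;
}

A real framework would of course add proper task descriptors, multiple queues and per-kernel dispatch; the point of the sketch is only that small kernels can be scheduled on the device without one host-side launch per task.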
What is the fast multipole method? The FMM is a Coulomb solver: it computes the long-range forces needed in molecular dynamics codes such as GROMACS. A straightforward all-pairs approach is limited to small particle numbers N because of its O(N^2) scaling. Fast summation methods such as PME, multigrid or the FMM reduce the algorithmic complexity to O(N log N) or even O(N). However, each fast summation method comes with auxiliary parameters, data structures and memory requirements that need to be provided. The layout and implementation of such algorithms on modern hardware strongly depend on the features of the underlying architecture.
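For reference, the straightforward O(N^2) approach looks like the following minimal C++ sketch (illustrative only): every particle interacts with every other particle, so the pair loop grows quadratically with N, which is exactly the cost the FMM avoids.

#include <vector>
#include <cmath>

struct Particle { double x, y, z, q; };

// Direct Coulomb summation: phi_i = sum over j != i of q_j / |r_i - r_j|.
std::vector<double> direct_potentials(const std::vector<Particle>& p)
{
    const std::size_t n = p.size();
    std::vector<double> phi(n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            if (i == j) continue;
            const double dx = p[i].x - p[j].x;
            const double dy = p[i].y - p[j].y;
            const double dz = p[i].z - p[j].z;
            phi[i] += p[j].q / std::sqrt(dx * dx + dy * dy + dz * dz);
        }
    }
    return phi;  // N^2 pair evaluations -- infeasible for large N
}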

The assumed workplace of a 2019 PRACE student at JSC.
Project Mentor: Andreas Beckmann
Project Co-mentor: Ivo Kabadshow
Site Co-ordinator: Ivo Kabadshow
Participant: Noë Brucy-Ciaramella
Learning Outcomes:
The student will familiarize himself/herself with current state-of-the-art GPUs (Nvidia V100/AMD Vega 64). He/she will learn how the GPU/accelerator works at a low level and use this knowledge to utilize and extend a tasking framework for GPUs in a modern C++ code-base. He/she will use state-of-the-art benchmarking/profiling tools to test and improve the performance of the tasking framework and of its compute kernels, which are time-critical in the application.
Student Prerequisites (compulsory):
- At least 5 years of programming experience in C++
- Basic understanding of template metaprogramming
- “Extra-mile” mentality
Student Prerequisites (desirable):
- CUDA or general GPU knowledge desirable, but not required
- C++ template metaprogramming
- Interest in C++11/14/17 features
- Interest in low-level performance optimizations
- Ideally a student of computer science or mathematics, but not required
- Basic knowledge of benchmarking, numerical methods
- Mild coffee addiction
- Basic knowledge of git, LaTeX, TikZ
Training Materials:
Just send an email … the training material strongly depends on your personal level of knowledge. We can provide early access to the GPU cluster as well as technical reports from former students on the topic. If you feel unsure about the requirements but like the project, send an email to the mentor and ask for a small programming exercise.
Workplan:
Week – Work package
- Week 1: Training and introduction to FMMs and GPU hardware
- Week 2: Benchmarking of current tasking variants on the CPU
- Week 3: Adding basic queues to the tasking framework
- Week 4: Extending the basic queues to support different compute kernels
- Week 5: Adding hierarchical queues to the tasking framework
- Week 6: Performance tuning of the GPU tasking
- Week 7: Optimization, benchmarking and documentation (see the kernel-timing sketch after this list)
- Week 8: Generation of final performance results, preparation of plots/figures, submission of results
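As a small illustration of the kernel-level benchmarking used in the later work packages, the following sketch times a single CUDA kernel with CUDA events. The kernel (dummy_kernel) and problem size are placeholders, not part of the project's code; the real measurements additionally rely on dedicated profiling tools.

#include <cstdio>
#include <cuda_runtime.h>

// Stand-in compute kernel; in the project the kernels are the FMM operators.
__global__ void dummy_kernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = 2.0f * data[i] + 1.0f;
}

int main()
{
    const int n = 1 << 20;
    float* d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    dummy_kernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);                 // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);     // elapsed GPU time in milliseconds
    std::printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}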
Final Product Description:
The final result will be a taskified FMM code using CUDA or SYCL to support GPUs. The benchmarking results, especially the gain in performance, can easily be illustrated in appropriate figures, as is routinely done by PRACE and HPC vendors. Such plots could be used by PRACE.
Adapting the Project: Increasing the Difficulty:
The tasking framework uses different compute kernels. For example, it may or may not be required to provide support for a certain FMM operator. A particularly able student may also apply the GPU tasking to multiple compute kernels. Depending on the level of knowledge, a larger number of access/storage strategies can be ported or extended, or the performance optimization within CUDA/SYCL can be intensified.
Adapting the Project: Decreasing the Difficulty:
As explained above, a student who finds the task of adapting and optimizing the tasking for all compute kernels too challenging can restrict himself/herself to a simpler model or to partial tasking.
Resources:
The student will have his/her own desk in an air-conditioned open-plan office (12 desks in total) or in a separate office (2-3 desks in total), will get access to (and computation time on) the required HPC resources for the project, and will have a fully equipped workstation for the duration of the programme. A range of performance and benchmarking tools is available on site and can be used within the project. No further resources are required. Hint: we have in-house experts on all advanced topics, e.g. C++11/14/17 and CUDA, so the student will be supported when battling with ‘bleeding-edge’ technology.
Organisation:
Jülich Supercomputing Centre