Got your ducks in a row? GPU performance will show!
Project reference: 2012
Today’s supercomputing hardware provides a tremendous number of floating-point operations (FLOPs). While CPUs are designed to minimize the latency of a stream of individual operations, GPUs try to maximize throughput. However, GPU FLOPs can only be harvested easily if the algorithm exhibits plenty of independent data parallelism. Hierarchical algorithms like the Fast Multipole Method (FMM) inhibit the utilization of all available FLOPs on GPUs because of their inherent data dependencies and limited independent data parallelism.
Is it possible to circumvent these problems?
In this project we turn our efforts towards a fully taskified FMM for GPUs. Depending on your scientific background we will pursue different goals. First, the already available GPU tasking framework needs to be coupled with the full FMM “shift” and “translation” operators to enable a larger range of accuracies. Second, a special reformulation of the mathematical FMM operators can be implemented and tested to increase the efficiency of such hierarchical methods for GPUs even further.
The challenge of both assignments is to execute tiny to medium-size compute kernels without large overheads within the tasking framework. This also ensures portability between different generations/designs of modern GPUs.
What is the fast multipole method?
The FMM is a Coulomb solver that makes it possible to compute the long-range forces arising in molecular dynamics, plasma physics or astrophysics. A straightforward approach is limited to small particle numbers N because of its O(N^2) scaling. Fast summation methods such as PME, multigrid or the FMM are capable of reducing the algorithmic complexity to O(N log N) or even O(N). However, each fast summation method has auxiliary parameters, data structures and memory requirements which need to be provided. The layout and implementation of such algorithms on modern hardware strongly depend on the available features of the underlying architecture.
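To make the O(N^2) scaling of the straightforward approach concrete, here is a minimal C++ sketch of a direct Coulomb energy summation. It is an illustration only: the `Particle` struct and the function name `coulomb_energy_direct` are hypothetical and not taken from the project's code base, and units are assumed such that the Coulomb prefactor is 1.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical particle type for illustration only.
struct Particle {
    double x, y, z;  // position
    double q;        // charge
};

// Direct summation of the Coulomb energy E = sum_{i<j} q_i * q_j / r_ij
// (assumed units: prefactor 1). The double loop visits N(N-1)/2 pairs,
// which is where the O(N^2) cost comes from.
double coulomb_energy_direct(const std::vector<Particle>& p) {
    double energy = 0.0;
    for (std::size_t i = 0; i < p.size(); ++i) {
        for (std::size_t j = i + 1; j < p.size(); ++j) {
            const double dx = p[i].x - p[j].x;
            const double dy = p[i].y - p[j].y;
            const double dz = p[i].z - p[j].z;
            const double r = std::sqrt(dx * dx + dy * dy + dz * dz);
            assert(r > 0.0 && "particles must not coincide");
            energy += p[i].q * p[j].q / r;
        }
    }
    return energy;
}
```

Doubling the number of particles roughly quadruples the number of pair evaluations in this sketch, which is the cost that fast summation methods such as the FMM avoid.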
Project Mentor: Ivo Kabadshow
Project Co-mentor: Laura Morgenstern
Site Co-ordinator: Ivo Kabadshow
The student will become familiar with current state-of-the-art GPUs (Nvidia P100/V100). He/she will learn how the GPU/accelerator functions on a low level and use this knowledge to utilize/extend a tasking framework for GPUs in a modern C++ code base. He/she will use state-of-the-art benchmarking/profiling tools to test and improve the performance of the tasking framework and of its compute kernels, which are time-critical in the application. Special emphasis will be placed on the efficient design of data structures and access patterns. Different mathematical representations of the “same thing” lead to substantial differences in performance on different hardware.
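As a small illustration of why data structure design and access patterns matter, the following C++ sketch stores the same particle data in two layouts: an array of structures (AoS) and a structure of arrays (SoA). All type and function names here are hypothetical and not part of the project's tasking framework; which layout performs better in practice depends on the kernel and the hardware and has to be measured.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array of structures (AoS): all fields of one particle are adjacent in memory.
struct ParticleAoS {
    double x, y, z, q;
};

// Structure of arrays (SoA): each field is stored in its own contiguous array.
struct ParticlesSoA {
    std::vector<double> x, y, z, q;
};

// Convert an AoS container into the SoA layout.
ParticlesSoA to_soa(const std::vector<ParticleAoS>& aos) {
    ParticlesSoA soa;
    soa.x.reserve(aos.size());
    soa.y.reserve(aos.size());
    soa.z.reserve(aos.size());
    soa.q.reserve(aos.size());
    for (const ParticleAoS& p : aos) {
        soa.x.push_back(p.x);
        soa.y.push_back(p.y);
        soa.z.push_back(p.z);
        soa.q.push_back(p.q);
    }
    return soa;
}

// Sum of all charges. In the SoA layout this reads one contiguous array
// and does not touch the position data at all.
double total_charge(const ParticlesSoA& p) {
    assert(p.q.size() == p.x.size());
    double sum = 0.0;
    for (double q : p.q) {
        sum += q;
    }
    return sum;
}
```

On GPUs, a layout in which consecutive threads read consecutive memory addresses allows the hardware to coalesce those reads into fewer memory transactions, which is one reason the SoA form is often considered for kernels that touch only some of the fields.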
Student Prerequisites (compulsory):
- At least 5 years of programming experience in C++
- Basic understanding of template metaprogramming
- “Extra-mile” mentality
Student Prerequisites (desirable):
- CUDA or general GPU knowledge (not required)
- C++ template metaprogramming
- Interest in C++11/14/17 features
- Interest in low-level performance optimizations
- Ideally a student of computer science or mathematics, but not required
- Basic knowledge of benchmarking and numerical methods
- Mild coffee addiction
- Basic knowledge of git, LaTeX, TikZ
Just send an email … training material strongly depends on your personal level of knowledge. We can provide early access to the GPU cluster as well as technical reports from former students on the topic. If you feel unsure about the requirements, but do like the project, send an email to the mentor and ask for a small programming exercise.
Week – Work package
- Training and introduction to FMMs and GPU hardware
- Benchmarking of current FMM operators on the CPU
- Adding basic FMM operators to the GPU tasking framework
- Extending FMM operators to support multiple compute kernels
- Adding reformulated (more compact) FMM operators
- Performance tuning of the GPU code
- Optimization and benchmarking, documentation
- Generation of final performance results. Preparation of plots/figures. Submission of results.
Final Product Description
The final result will be a taskified FMM code that uses CUDA to support GPUs. The benchmarking results, especially the gain in performance, can be easily illustrated in appropriate figures, as is routinely done by PRACE and HPC vendors. Such plots could be used by PRACE.
Adapting the Project: Increasing the Difficulty:
The tasking framework uses different compute kernels. For example, it may or may not be required to provide support for a certain FMM operator. A particularly able student may also apply the GPU tasking to multiple compute kernels. Depending on the student's level of knowledge, a larger number of access/storage strategies can be implemented, or performance optimization within CUDA can be intensified.
Adapting the Project: Decreasing the Difficulty:
As explained above, a student who finds the task of adapting/optimizing the FMM operators to all compute kernels too challenging could restrict himself/herself to a simpler model or a partial set of FMM operators.
The student will have his/her own desk in an air-conditioned open-plan office (12 desks in total) or in a separate office (2-3 desks in total), will get access to (and computation time on) the HPC resources required for the project, and will have his/her own workplace with a fully equipped workstation for the duration of the program. A range of performance and benchmarking tools is available on site and can be used within the project. No further resources are required. Hint: We have in-house experts on all advanced topics, e.g. C++11/14/17 and CUDA. Hence, the student will be supported when battling with ‘bleeding-edge’ technology.
Jülich Supercomputing Centre