A Fast Multipole Toolbox for GPU Clusters

Project reference: 1516
Modern multi- and many-core hardware architectures provide a huge number of floating-point operations (FLOPs). To harvest these FLOPs efficiently, a lot of parallelism in the examined algorithm needs to be uncovered and exploited. Performance boosts are well hidden in low-level hardware features of the CPUs or GPUs. The great diversity and short life cycle of today's HPC hardware no longer allow for hand-written, well-optimized assembly kernels. This raises the question of whether we can exploit the hidden performance from high-level languages with greater abstraction possibilities, like C++.
In this project we focus on the computation of long-range forces for molecular dynamics within GROMACS. A straightforward approach, which evaluates every pair of particles directly, is limited to small particle numbers N due to its O(N^2) scaling. Fast summation methods like PME, multigrid or fast multipole methods are capable of reducing the algorithmic complexity to O(N log N) or even O(N). However, each fast summation method has auxiliary parameters, data structures and memory requirements which need to be supplied. The layout and implementation of such algorithms on multi-core hardware strongly depend on the features provided by the underlying architecture.
In this project we turn our efforts towards a performance-portable fast multipole method (FMM) for GPUs. Depending on the interest of the student, we can pursue different goals. First, the already available CUDA version of the FMM can be extended and tuned to support more advanced algorithmic and language (CUDA) features. Second, the C++ AMP version of the FMM can be extended to support multiple GPUs and automatic off-loading to the GPU.
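To make the O(N^2) cost of the straightforward approach concrete, the following is a minimal sketch of a direct pairwise Coulomb energy summation. It is not GROMACS or FMM code: the `Particle` type and the function name `direct_coulomb_energy` are hypothetical, reduced units are assumed (the electrostatic prefactor is set to 1), and open boundaries are assumed (no periodic images).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical particle type, for illustration only.
struct Particle {
    double x, y, z;  // position
    double q;        // charge
};

// Direct summation of the Coulomb energy in reduced units.
// The double loop visits each of the N(N-1)/2 pairs exactly once,
// which is the O(N^2) scaling that fast summation methods avoid.
double direct_coulomb_energy(const std::vector<Particle>& p) {
    double energy = 0.0;
    for (std::size_t i = 0; i < p.size(); ++i) {
        for (std::size_t j = i + 1; j < p.size(); ++j) {
            const double dx = p[i].x - p[j].x;
            const double dy = p[i].y - p[j].y;
            const double dz = p[i].z - p[j].z;
            const double r2 = dx * dx + dy * dy + dz * dz;
            assert(r2 > 0.0 && "particles must not coincide");
            energy += p[i].q * p[j].q / std::sqrt(r2);
        }
    }
    return energy;
}
```

Doubling N roughly quadruples the number of pair evaluations in this loop, which is why the direct approach is only practical for small systems.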
Project mentor: Andreas Beckmann
Site Co-ordinator: Ivo Kabadshow
Learning Outcomes
The student will become familiar with current state-of-the-art GPUs and GPU clusters. They will learn how the GPU functions at a low level and use this knowledge to optimize software. They will use state-of-the-art benchmarking tools to achieve optimal performance for the kernels found to be relevant in the application.
Student Prerequisites (compulsory)
At least 5 years of programming experience in C++
Student Prerequisites (desirable)
CUDA, C++ AMP or general GPU knowledge (desirable, but not required); basic knowledge of C++ templates
Training Materials
Just send an email … we can provide early access to the GPU cluster and set up small exercises, as well as provide technical reports from former students on the topic. We will adapt to your interests and level of knowledge.
Workplan
Week – Work package (can be adjusted)
- Training and introduction
- Benchmarking of kernels on the GPU
- Porting of new kernels (near field) to the GPU
- Porting of new kernels (far field) to the GPU
- Implementing Multi-GPU support
- Optimization and benchmarking, documentation
- Optimization and benchmarking, documentation
- Generation of final performance results. Preparation of plots/figures. Submission of results.
Final Product Description
The final product will be optimized kernel routines for the FMM. The benchmarking results, especially the gain in performance, can easily be illustrated in appropriate figures, as is routinely done by PRACE and HPC vendors. Such plots could be used by PRACE.
Adapting the Project – Increasing the Difficulty
Different kernels require different levels of understanding of the hardware and of optimization strategies. For example, some kernels require optimized memory access patterns to improve cache utilization, while others do not. A particularly able student may work on such a kernel. Depending on the student's level of knowledge, a larger number of kernels can be ported or the performance optimization can be intensified.
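One common way to improve memory access patterns is to switch from an array-of-structures to a structure-of-arrays layout, so that a kernel reading a single field touches contiguous memory. The sketch below illustrates the idea in plain C++; the type and function names (`ParticleAoS`, `ParticlesSoA`, `to_soa`, `total_charge`) are hypothetical and not taken from the FMM code base, and whether this layout helps a given kernel has to be measured.

```cpp
#include <cstddef>
#include <vector>

// Array-of-structures: the fields of one particle are adjacent in memory.
struct ParticleAoS {
    double x, y, z, q;
};

// Structure-of-arrays: each field is stored contiguously across particles.
struct ParticlesSoA {
    std::vector<double> x, y, z, q;
};

// Convert an AoS particle list into the SoA layout.
ParticlesSoA to_soa(const std::vector<ParticleAoS>& in) {
    ParticlesSoA out;
    out.x.reserve(in.size());
    out.y.reserve(in.size());
    out.z.reserve(in.size());
    out.q.reserve(in.size());
    for (const ParticleAoS& p : in) {
        out.x.push_back(p.x);
        out.y.push_back(p.y);
        out.z.push_back(p.z);
        out.q.push_back(p.q);
    }
    return out;
}

// Reads only the charge array with unit stride; in the AoS layout the
// same loop would step over the unused x, y, z fields of every particle.
double total_charge(const ParticlesSoA& p) {
    double sum = 0.0;
    for (double q : p.q) {
        sum += q;
    }
    return sum;
}
```

On GPUs the same layout choice typically determines whether neighbouring threads read neighbouring addresses, so it is often one of the first things to check when a kernel is memory bound.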
Adapting the Project – Decreasing the Difficulty
As explained above, a student who finds the task of optimizing a complex kernel too challenging could restrict themselves to fewer or simpler kernels.
Resources
The student will have their own desk with a fully equipped workstation for the duration of the programme, either in an open-plan office (12 desks in total) or in a separate office (2-3 desks in total), and will get access to (and computation time on) the HPC hardware required for the project. A range of performance and benchmarking tools is available on site and can be used within the project. No further resources are required.
Organization
Jülich Supercomputing Centre, Forschungszentrum Jülich GmbH