Shape up or ship out – You decide!
Project reference: 1618
Modern multi- and many-core hardware architectures provide a huge number of floating-point operations per second (FLOPS). However, CPU and GPU FLOPS cannot be harvested in the same manner. While the CPU is designed to minimize the latency of a stream of individual operations, the GPU tries to maximize throughput. At this stage, the code developer has to decide whether to expose more parallelism in the algorithm to support the GPU execution path or to optimize the code further for the CPU.
The problem becomes especially interesting when the compute-intensive kernels are already executed on the GPU. Now, the decision to ‘ship out’ code paths to the GPU has to be made for the kernel driver routines.
However, the great diversity and short life cycle of today's HPC hardware no longer allow for hand-written, well-optimized assembly code. This raises the question of whether we can extract more performance from high-level languages with greater abstraction possibilities, like C++.
In this project we focus on the computation of long-range forces for molecular dynamics within GROMACS. A straightforward approach is limited to small particle numbers N due to its O(N^2) scaling. Fast summation methods like PME, multigrid or fast multipole methods are capable of reducing the algorithmic complexity to O(N log N) or even O(N). However, each fast summation method has auxiliary parameters, data structures and memory requirements which need to be supplied. The layout and implementation of such algorithms on multi-core hardware strongly depend on the features provided by the underlying architecture.
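To make the O(N^2) scaling of the straightforward approach concrete, here is a minimal sketch of a direct pairwise Coulomb summation. The `Particle` struct and the function name `direct_coulomb_energy` are hypothetical and chosen for illustration only; they are not taken from GROMACS or the FMM code base, and the physical prefactor is set to 1.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical particle type for illustration; not the GROMACS or FMM layout.
struct Particle {
    double x, y, z;  // position
    double q;        // charge
};

// Direct pairwise Coulomb energy in reduced units (prefactor set to 1).
// Every one of the N*(N-1)/2 pairs is visited once, hence O(N^2) work.
double direct_coulomb_energy(const std::vector<Particle>& p) {
    double energy = 0.0;
    for (std::size_t i = 0; i < p.size(); ++i) {
        for (std::size_t j = i + 1; j < p.size(); ++j) {
            const double dx = p[i].x - p[j].x;
            const double dy = p[i].y - p[j].y;
            const double dz = p[i].z - p[j].z;
            const double r = std::sqrt(dx * dx + dy * dy + dz * dz);
            assert(r > 0.0);  // coinciding particles are not handled in this sketch
            energy += p[i].q * p[j].q / r;
        }
    }
    return energy;
}
```

Doubling N roughly quadruples the number of pair evaluations in the inner loop, which is the cost that fast summation methods avoid.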
In this project we turn our efforts towards a performance-portable fast multipole method (FMM). Depending on the interest of the student, we will pursue different goals. First, the already available CPU/GPU version of the FMM can be extended to allow automatic SIMD/SIMT-ization for the construction and efficient use of multiple sparse/dense octree data structures. Second, a special kernel for more advanced algorithmic computations can be ported to the GPU using CUDA.
The challenge of both assignments is to embed/extend the code in a performance-portable way. This keeps the adaptation effort to a minimum when changing from one HPC platform to another.
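As an illustration of how particles can be assigned to octree boxes, the following sketch maps a position in the unit cube to a box index using a Morton (Z-order) key. This is one common indexing scheme and only an assumption here; the existing FMM may index its boxes differently. All function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Interleave the low `depth` bits of ix, iy, iz into one Morton (Z-order) key.
// Bit b of ix lands at position 3b, of iy at 3b+1, of iz at 3b+2.
std::uint64_t morton_key(std::uint32_t ix, std::uint32_t iy, std::uint32_t iz,
                         unsigned depth) {
    assert(depth <= 21);  // 3 * 21 = 63 bits fit into a 64-bit key
    std::uint64_t key = 0;
    for (unsigned b = 0; b < depth; ++b) {
        key |= (static_cast<std::uint64_t>((ix >> b) & 1u) << (3 * b))
             | (static_cast<std::uint64_t>((iy >> b) & 1u) << (3 * b + 1))
             | (static_cast<std::uint64_t>((iz >> b) & 1u) << (3 * b + 2));
    }
    return key;
}

// Map a coordinate in [0, 1) to an integer box coordinate at the given depth.
std::uint32_t box_coord(double x, unsigned depth) {
    assert(x >= 0.0 && x < 1.0);
    assert(depth <= 21);
    return static_cast<std::uint32_t>(x * static_cast<double>(1u << depth));
}

// Index of the box at `depth` that contains the point (x, y, z) in the unit cube.
std::uint64_t leaf_box_index(double x, double y, double z, unsigned depth) {
    return morton_key(box_coord(x, depth), box_coord(y, depth),
                      box_coord(z, depth), depth);
}
```

With this key layout, the parent of a box on the next coarser level is obtained by shifting the key right by three bits, and the eight children of a box occupy consecutive keys.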
Project Mentor: Andreas Beckmann
Site Co-ordinator: Ivo Kabadshow
Student: Johannes Pekkilä
The student will become familiar with current state-of-the-art GPUs (e.g. K40/K80) and GPU clusters (Jureca). He/she will learn how the GPU functions at a low level and use this knowledge to optimize scientific software for CPUs and GPUs in a unified code base. He/she will use state-of-the-art benchmarking tools to achieve optimal performance for the kernels and kernel drivers which are time-critical in the application.
Student Prerequisites (compulsory):
- At least 5 years of programming experience in C++
- Basic understanding of template metaprogramming
- “Extra-mile” mentality
Student Prerequisites (desirable):
- CUDA/GPU knowledge
- Interest in C++11/14 features
- Ideally a student of computer science or mathematics, but not required
- Basic knowledge of benchmarking and numerical methods
- Basic knowledge of git, LaTeX
Just send an email … training material strongly depends on your personal level of knowledge. We can provide early access to the GPU cluster as well as technical reports from former students on the topic. If you are unsure whether you fulfill the requirements but like the project, send an email to the mentor and ask for a small programming exercise.
Week 1 Training and introduction to FMMs
Week 2 Benchmarking of octree variants on the CPU
Week 3 Initial port of one or two octree variants to the GPU
Week 4 Unifying the octree data structure for the CPU/GPU
Week 5 Implementing Multi-GPU support
Week 6 Optimization and benchmarking, documentation
Week 7 Optimization and benchmarking, documentation
Week 8 Generation of final performance results. Preparation of plots/figures. Submission of results.
Final Product Description:
The end product will be an extended FMM with multi-octree support on the GPU. The benchmarking results, especially the gain in performance, can be easily illustrated in appropriate figures, as is routinely done by PRACE and HPC vendors. Such plots could be used by PRACE.
Adapting the Project: Increasing the Difficulty:
The octree data structure (kernel driver) is used differently in many places in the code. For example, a sparse implementation of the data structure may or may not be required. A particularly able student may also port multiple implementations. Depending on the student's knowledge level, a larger number of access/storage strategies can be ported or performance optimization can be intensified.
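The difference between a dense and a sparse storage strategy can be sketched as follows. This is a simplified illustration under assumed names (`DenseLevel`, `SparseLevel`, `deposit_charge`), not the storage layout of the actual FMM code: both containers hold one value per box of a single octree level and expose the same interface, so a driver routine can be written once as a template.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Dense storage: one slot for each of the 8^depth boxes, occupied or not.
template <typename T>
class DenseLevel {
public:
    explicit DenseLevel(unsigned depth) : data_(box_count(depth)) {}
    T& at(std::uint64_t key) { return data_.at(key); }  // bounds-checked access
    std::size_t allocated() const { return data_.size(); }
private:
    static std::size_t box_count(unsigned depth) {
        assert(depth <= 10);  // keep the sketch's memory footprint bounded
        return std::size_t(1) << (3 * depth);
    }
    std::vector<T> data_;
};

// Sparse storage: a slot is created only when a box is first touched.
template <typename T>
class SparseLevel {
public:
    explicit SparseLevel(unsigned /*depth*/) {}
    T& at(std::uint64_t key) { return data_[key]; }  // value-initialized on first access
    std::size_t allocated() const { return data_.size(); }
private:
    std::unordered_map<std::uint64_t, T> data_;
};

// A driver written once against the common interface of both containers.
template <typename Level>
void deposit_charge(Level& level, std::uint64_t key, double q) {
    level.at(key) += q;
}
```

At depth 2, the dense variant always allocates 64 slots, while the sparse variant allocates only as many slots as there are occupied boxes. Which of the two is faster depends on the particle distribution and the target hardware, which is why benchmarking several variants is part of the workplan.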
The student will have a desk in an air-conditioned open-plan office (12 desks in total) or in a separate office (2-3 desks in total), will get access to (and computation time on) the HPC hardware required for the project, and will have a fully equipped workstation for the duration of the program. A range of performance and benchmarking tools are available on site and can be used within the project. No further resources are required. Hint: We have in-house experts on all advanced topics, e.g. C++11/14 and CUDA. Hence, the student will be supported when battling with ‘bleeding-edge’ technology.