Tiny, tiny tasks! Huge impact?
Project reference: 2121
Today’s supercomputing hardware provides a tremendous number of floating-point operations per second (FLOPS). However, most of these FLOPS can only be harvested easily if the algorithm exhibits a high degree of parallelism. Additionally, efficient use of resources strongly depends on the underlying tasking framework and its mechanisms for distributing work to the cores.
In this project we turn our efforts towards our tasking framework “Eventify”. We will investigate the performance and bottlenecks of our taskified Fast Multipole Method (FMM) as well as some simpler microbenchmarks.
Depending on your scientific background we will pursue different goals.
First, Eventify can be used to implement a set of microbenchmarks and compare them against other parallelization approaches and paradigms such as OpenMP or HPX.
Second, Eventify can be benchmarked in a full FMM run and compared against those same approaches and paradigms.
The challenge of both assignments is to execute and schedule tiny to medium-size compute kernels (tasks) without incurring large overheads inside the tasking frameworks. We will exploit algorithmic knowledge to speed things up and circumvent synchronization bottlenecks.
What is the fast multipole method? The FMM is a Coulomb solver that computes the long-range forces arising in molecular dynamics, plasma physics or astrophysics. A straightforward all-pairs approach is limited to small particle numbers N due to its O(N^2) scaling. Fast summation methods such as PME, multigrid or the FMM reduce the algorithmic complexity to O(N log N) or even O(N). However, each fast summation method comes with auxiliary parameters, data structures and memory requirements that must be set up and tuned. The layout and implementation of such algorithms on modern hardware strongly depend on the features of the underlying architecture.
Project Mentor: Ivo Kabadshow
Project Co-mentor: Mateusz Zych
Site Co-ordinator: Ivo Kabadshow
The student will become familiar with current state-of-the-art HPC hardware. They will learn how parallelization should be performed at a low level and use this knowledge to utilize, benchmark and extend our tasking framework in a modern C++ code base. They will use state-of-the-art benchmarking and profiling tools to test and improve the performance of the tasking framework and its compute kernels, which are time-critical in the application.
Special emphasis will be placed on the different approaches each tasking framework provides. The student will learn how additional knowledge of the algorithmic workflow and its dependencies can be exploited to improve parallel performance.
Student Prerequisites (compulsory):
- At least 5 years of programming experience in C++
- Basic understanding of template metaprogramming
- “Extra-mile” mentality
Student Prerequisites (desirable):
- C++ template metaprogramming
- Interest in C++11/14/17 features
- Interest in low-level performance optimizations
- Ideally student of computer science, mathematics, but not required
- Basic knowledge on benchmarking, numerical methods
- Mild coffee addiction
- Basic knowledge of git, LaTeX, TikZ
Just send an email … training material strongly depends on your personal level of knowledge. We can provide early access to the HPC cluster as well as technical reports from former students on the topic. If you feel unsure about the requirements, but do like the project, send an email to the mentor and ask for a small programming exercise.
Week – Work package
- Training and introduction to Eventify and HPC hardware
- Development of a small microbenchmark in OpenMP and HPX
- Development of a small microbenchmark in Eventify
- Comparison of the different tasking frameworks
- Extending parts of the FMM with HPX and/or OpenMP
- Benchmarking of the HPX/OpenMP FMM
- Optimization and benchmarking, documentation
- Generation of final performance results, preparation of plots/figures, and submission of results
Final Product Description:
The final result will be a solid understanding of today’s parallelization possibilities in HPC. The benchmarking results, especially the performance gains, can easily be illustrated in appropriate figures, as is routinely done by PRACE and HPC vendors. Such plots could be used by PRACE.
Adapting the Project: Increasing the Difficulty:
The tasking framework can express different levels of parallelism. A particularly able student may also benchmark more parts of the algorithm or implement more complex algorithms.
Adapting the Project: Decreasing the Difficulty:
As explained above, a student who finds adapting/optimizing the tasking framework too challenging could instead restrict the work to a simpler model or a partial set of the FMM.
The student will get access to (and computation time on) the HPC resources required for the project. A range of performance and benchmarking tools is available on site and can be used within the project. No further resources are required. Hint: we have in-house experts on all advanced topics, e.g. C++11/14/17, so the student will be supported when battling ‘bleeding-edge’ technology.
JSC-Jülich Supercomputing Centre