
Project reference: 2215
Today’s supercomputing hardware provides a tremendous number of floating-point operations per second (FLOPS). Hierarchical algorithms like the Fast Multipole Method (FMM) can achieve very good utilization of the FLOPS available on a single node. However, modern supercomputers consist of a large number of these powerful nodes, and explicit communication is required to drive the computation and push the scaling further.
Unfortunately, such communication is not trivial: we cannot duplicate all data in the memory of every node, and we do not want to wait for communication any longer than absolutely necessary.
In this project we want to look into the costs of communicating via message passing (MPI). Our target application will be the Fast Multipole Method (FMM) and during the project we will scale the application to a large number of compute nodes.
Depending on your interest and prior knowledge, we will pursue different goals. First, we want to look at the communication pattern of a simple Coulomb solver and implement a global communication scheme via MPI that scales better than a naive point-to-point or global communication step. Second, the communication of the full FMM will be considered. Since the algorithm consists of five different passes, different communication algorithms are required and can be benchmarked and tested.
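To illustrate the first goal, here is a minimal sketch of one possible global exchange scheme: a ring-shift (systolic) pattern in which each rank keeps its local particles and circulates a travelling buffer, so that all pairwise interactions are covered without gathering every particle on every node. This is only a sketch under the assumption of equal particle counts per rank; the Particle layout and the computePartialForces helper are hypothetical placeholders, not part of the project code.

// Sketch: ring-shift (systolic) exchange for an all-pairs Coulomb solver.
// Assumes equal particle counts per rank; Particle and computePartialForces
// are hypothetical placeholders.
#include <mpi.h>
#include <vector>

struct Particle { double x, y, z, q; };   // assumed layout

// Hypothetical helper: accumulate the contribution of 'sources' on 'targets'.
// Body omitted here; a direct pairwise kernel would do.
void computePartialForces(const std::vector<Particle>& targets,
                          const std::vector<Particle>& sources,
                          std::vector<double>& forces);

void ringCoulomb(const std::vector<Particle>& local,
                 std::vector<double>& forces, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    std::vector<Particle> travelling = local;        // buffer that moves around the ring
    computePartialForces(local, travelling, forces); // local-local interactions

    const int next = (rank + 1) % size;
    const int prev = (rank + size - 1) % size;

    for (int step = 1; step < size; ++step) {
        // Shift the travelling buffer one rank further along the ring,
        // then interact the local particles with the newly arrived ones.
        MPI_Sendrecv_replace(travelling.data(),
                             static_cast<int>(travelling.size() * sizeof(Particle)),
                             MPI_BYTE, next, 0, prev, 0, comm, MPI_STATUS_IGNORE);
        computePartialForces(local, travelling, forces);
    }
}

Compared with an MPI_Allgather of all particles, only one buffer is in flight per rank at any time, and each communication step can in principle be overlapped with the computation on the previously received buffer.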
The challenge of both assignments is to scale to a large number of nodes without introducing visible overheads from the communication layer.
What is MPI? MPI stands for Message Passing Interface and is the de facto standard for communication between nodes in HPC. It provides hundreds of functions, including collective communication primitives like broadcast, gather, and scatter, point-to-point primitives like send and receive, and one-sided primitives like put and get. Thanks to its generic interface, users can build their own communication algorithms on top of MPI and thereby optimize the flow of data.
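As a small, self-contained illustration of the two flavours mentioned above, the sketch below sends one integer from rank 0 to rank 1 with a point-to-point pair and then broadcasts it to every rank with a collective. This is a minimal example, not project code.

// Minimal MPI example: point-to-point send/receive plus a collective broadcast.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Point-to-point: rank 0 sends one value to rank 1.
    int token = 42;
    if (size > 1) {
        if (rank == 0)      MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1) MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                                     MPI_STATUS_IGNORE);
    }

    // Collective: rank 0 broadcasts the same value to every rank at once.
    MPI_Bcast(&token, 1, MPI_INT, 0, MPI_COMM_WORLD);
    std::printf("rank %d has token %d\n", rank, token);

    MPI_Finalize();
    return 0;
}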
What is the fast multipole method? The FMM is a Coulomb solver that computes long-range forces for molecular dynamics codes such as GROMACS. A straightforward approach is limited to small particle numbers N due to its O(N^2) scaling. Fast summation methods such as PME, multigrid, or the FMM are capable of reducing the algorithmic complexity to O(N log N) or even O(N). However, each fast summation method comes with auxiliary parameters, data structures, and memory requirements that need to be provided and tuned. The layout and implementation of such algorithms on modern hardware strongly depend on the available features of the underlying architecture.
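For reference, the O(N^2) baseline that the FMM replaces is the naive direct summation sketched below: every particle interacts with every other particle. The Particle layout and the unit-free potential are assumptions made only for illustration.

// Naive O(N^2) direct summation of Coulomb potentials -- the baseline that
// fast methods like the FMM reduce to O(N log N) or O(N).
#include <cmath>
#include <vector>

struct Particle { double x, y, z, q; };   // assumed layout

std::vector<double> directPotentials(const std::vector<Particle>& p)
{
    std::vector<double> phi(p.size(), 0.0);
    for (std::size_t i = 0; i < p.size(); ++i) {
        for (std::size_t j = 0; j < p.size(); ++j) {
            if (i == j) continue;
            const double dx = p[i].x - p[j].x;
            const double dy = p[i].y - p[j].y;
            const double dz = p[i].z - p[j].z;
            phi[i] += p[j].q / std::sqrt(dx*dx + dy*dy + dz*dz);
        }
    }
    return phi;   // each entry costs N-1 interactions, hence O(N^2) overall
}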

Inside the brain of the 2022 PRACE SoHPC student after two weeks of parallel thinking.
Project Mentor: Ivo Kabadshow
Project Co-mentor: Andreas Beckmann
Site Co-ordinator: Ivo Kabadshow
Learning Outcomes:
The students will learn that node-to-node communication can be a bottleneck for many algorithms in HPC. Additionally, the students will see how proper communication algorithms can avoid this bottleneck and improve scaling to a large number of nodes.
Student Prerequisites (compulsory):
- At least 5 years of programming experience in C++
- Basic understanding of message passing
- “Extra-mile” mentality
Student Prerequisites (desirable):
- Experience with MPI or general node-to-node communication is desirable, but not required
- C++ template metaprogramming
- Interest in C++11/14/17 features
- Interest in performance optimizations
- Ideally student of computer science, mathematics, but not required
- Basic knowledge of benchmarking and numerical methods
- Mild coffee addiction
- Basic knowledge of git, LaTeX, TikZ
Training Materials:
Just send an email … training material strongly depends on your personal level of knowledge. We can provide early access to the cluster as well as technical reports from former students on the topic. If you feel unsure about the requirements, but do like the project, send an email to the mentor and ask for a small programming exercise.
Workplan:
Week – Work package
- Week 1: Training and introduction to FMMs and message passing
- Week 2: Trivial implementation of a parallel Coulomb solver
- Week 3: Advanced implementation of a parallel Coulomb solver
- Week 4: First scaling benchmarks and optimizations
- Week 5: Trivial implementation of the parallel FMM
- Week 6: Advanced implementation of the parallel FMM
- Week 7: Optimization and benchmarking, documentation
- Week 8: Generation of final performance results; preparation of plots/figures; submission of results
Final Product Description:
The final result will be a node-parallel FMM code via MPI that supports scaling to a large number of nodes. The benchmarking results, especially the gain in performance, can be easily illustrated in appropriate figures, as is routinely done by PRACE and HPC vendors; such plots could be used by PRACE.
Adapting the Project: Increasing the Difficulty:
The FMM consists of multiple passes with different levels of parallelization difficulty. A particularly able student may also implement different strategies and patterns. Depending on the knowledge level, a more detailed dive into MPI (one-sided communication, group communicators, etc.) can take place to tune the performance further.
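As a taste of the one-sided MPI mentioned above, the minimal sketch below has every rank expose a window over a local buffer, and rank 0 write a value directly into rank 1's window with MPI_Put, synchronized by fences. It is an illustration only, not part of the project code.

// Minimal one-sided MPI sketch: window creation, MPI_Put, fence synchronization.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double buffer = 0.0;                     // memory exposed to remote ranks
    MPI_Win win;
    MPI_Win_create(&buffer, sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                   // open an access epoch
    if (rank == 0 && size > 1) {
        const double value = 3.14;
        // Write 'value' into rank 1's window at displacement 0.
        MPI_Put(&value, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
    }
    MPI_Win_fence(0, win);                   // close the epoch; data is now visible

    if (rank == 1) std::printf("rank 1 received %f via MPI_Put\n", buffer);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}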
Adapting the Project: Decreasing the Difficulty:
A student who finds the task of optimizing a complex kernel too challenging can restrict themselves to simple or toy kernels in order to still gain a valuable learning experience. Alternatively, if the student finds a particular method too complex for the time available, a less involved algorithm can be selected.
Resources:
As explained above, a student who finds the task of adapting/optimizing the MPI communication layer for all passes too challenging could very well restrict themselves to a simpler model or a subset of FMM passes.
Organisation:
JSC-Jülich Supercomputing Centre