Improved performance with hybrid programming
Project reference: 2023
Nowadays, most High-Performance Computing (HPC) systems are clusters built out of shared-memory nodes. The nodes themselves consist of several CPUs, each providing a number of cores on which the code doing the actual computational work is executed. However, current and future systems show a trend towards more cores per CPU, combined with less memory and less communication bandwidth available per core.
A major topic in HPC is to provide programming tools that enable the programmer to write parallel codes capable of scaling to hundreds of thousands, or even millions, of cores. The de facto standard here is the Message Passing Interface (MPI).
MPI can be used on its own, as pure MPI. However, this might not be the best choice on state-of-the-art HPC systems because of the memory overhead involved. Part of this overhead comes from MPI itself and grows with the number of MPI processes. Even more importantly, with pure MPI memory might be wasted on data that is replicated within a CPU or within a node, because every process keeps its own copy.
A better way is to combine MPI for the distributed-memory parallelization across the node interconnect with a shared-memory parallelization inside the nodes, such as OpenMP or the MPI-3.0 shared-memory model. This combination is usually referred to as hybrid programming, or MPI+X.
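As a minimal sketch of the shared-memory part (the "X") of MPI+X, the C function below performs one Jacobi sweep of a 5-point stencil and parallelizes the row loop with OpenMP. The function name, the row-major square grid and the one-cell boundary layer are assumptions made for illustration and are not taken from the project code. In an MPI+OpenMP code, each MPI process would typically call such a routine on its own subdomain after the halo exchange.

```c
#include <assert.h>
#include <stddef.h>

/* One Jacobi sweep on an n x n grid stored row-major (hypothetical layout).
 * The outermost layer of cells is read but never written, so it can hold
 * boundary values or halo data received from neighbouring processes.
 * Returns the largest absolute change of any interior point. */
static double jacobi_sweep(size_t n, const double *u, double *u_new)
{
    assert(n >= 3 && u != NULL && u_new != NULL && u != u_new);
    double max_diff = 0.0;

    /* All threads share u and u_new; each thread updates a block of rows.
     * If the compiler has no OpenMP support enabled, the pragma is ignored
     * and the loop simply runs serially. */
    #pragma omp parallel for reduction(max : max_diff)
    for (size_t i = 1; i < n - 1; ++i) {
        for (size_t j = 1; j < n - 1; ++j) {
            /* New value: average of the four direct neighbours. */
            double v = 0.25 * (u[(i - 1) * n + j] + u[(i + 1) * n + j] +
                               u[i * n + j - 1] + u[i * n + j + 1]);
            double old = u[i * n + j];
            double d = v > old ? v - old : old - v;
            if (d > max_diff) {
                max_diff = d;
            }
            u_new[i * n + j] = v;
        }
    }
    return max_diff;
}
```

Because the threads of one process work on a single shared copy of the grid, no halo cells are needed between them; halos remain only at the borders between MPI processes.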
In this SoHPC project we will explore different options for optimizing both the memory consumption and the communication time of a prototypical memory-bound scientific code. We will work with a Jacobi solver, a simple stencil code with halo communication, in 2 and 3 dimensions. A prototype of this code in 2 dimensions already exists for pure MPI and MPI+OpenMP. It will be the task of the SoHPC student to add an MPI+MPI-3.0-shared-memory version and to extend the code to 3 dimensions.
Especially when it comes to applying pure MPI, virtual Cartesian topologies provide a way to reduce the amount of halo data that has to be communicated between the MPI processes and, at least in principle, also a way to reorder the MPI processes to optimize their placement with respect to the hardware topology. Currently none of the MPI libraries offers such a reordering, but the principle is known and can be applied. Starting from a simple Roofline performance model for our code, careful performance measurements will serve to analyse the strengths and weaknesses of the various approaches.
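Two of the estimates mentioned above can be sketched as plain arithmetic in C. The first function counts the halo cells one process receives per iteration for a given process grid, which shows why a 2-dimensional Cartesian layout communicates less than a decomposition into strips. The second evaluates the basic Roofline bound. All function names and parameters are hypothetical, and the model assumes one-cell-wide halos, a 5-point stencil (no corner cells), equally sized subdomains, and a process that has neighbours on both sides in every split direction.

```c
#include <assert.h>
#include <stddef.h>

/* Halo cells received per iteration by one process when an nx x ny grid
 * is split into px x py equal subdomains (assumptions: see text above). */
static size_t halo_cells_2d(size_t nx, size_t ny, size_t px, size_t py)
{
    assert(px > 0 && py > 0 && nx % px == 0 && ny % py == 0);
    size_t lx = nx / px;  /* local extent in x */
    size_t ly = ny / py;  /* local extent in y */
    size_t cells = 0;
    if (px > 1) {
        cells += 2 * ly;  /* halo columns from the two neighbours in x */
    }
    if (py > 1) {
        cells += 2 * lx;  /* halo rows from the two neighbours in y */
    }
    return cells;
}

/* Basic Roofline bound: attainable performance (GFlop/s) is the smaller
 * of the peak performance and memory bandwidth times arithmetic
 * intensity (flops per byte of memory traffic). */
static double roofline_gflops(double peak_gflops, double bandwidth_gbs,
                              double flops_per_byte)
{
    assert(peak_gflops >= 0.0 && bandwidth_gbs >= 0.0 && flops_per_byte >= 0.0);
    double memory_bound = bandwidth_gbs * flops_per_byte;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}
```

Under these assumptions, distributing a 1024 x 1024 grid over 16 processes as 16 x 1 strips gives `halo_cells_2d(1024, 1024, 16, 1)` = 2048 halo cells per process, while a 4 x 4 layout gives `halo_cells_2d(1024, 1024, 4, 4)` = 1024. For the Roofline bound, a double-precision 5-point Jacobi update performs 4 flops; if one assumes ideal cache reuse, so that only one load and one store (16 bytes) reach main memory per update, the intensity is 0.25 flop/byte. With a hypothetical bandwidth of 100 GB/s this yields a bound of 25 GFlop/s, far below typical peak performance, which is what "memory-bound" means here. Real traffic is usually higher, for example because of write-allocate transfers, so measured values are needed to refine the model.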
Project Mentor: Claudia Blaas-Schenner
Project Co-mentor: Irene Reichl
Site Co-ordinator: Claudia Blaas-Schenner
After working on this project you will be able to identify the key limitations of a pure MPI programming model on modern HPC clusters and to apply several concepts of MPI+X hybrid programming to optimize both the memory consumption and the communication time of parallel codes. You will also be able to use profiling tools to evaluate the performance of your codes.
Student Prerequisites (compulsory):
We welcome a basic scientific mindset, curiosity, a keen interest in challenging technical innovations and an appreciation of outside-the-box thinking. The student should be able to work on the Linux command line, have basic knowledge of programming in either C or Fortran, and know at least the basic principles of parallel programming with MPI.
Student Prerequisites (desirable):
Good programming knowledge of either C or Fortran. Experience in programming with MPI.
The material of our course ‘Introduction to Hybrid Programming in HPC’ http://tiny.cc/MPIX-VSC is a good resource.
- Week 1: setting up, getting familiar with our ideas about MPI+X
- Weeks 2-3: literature study, first runs, workplan (deliverable)
- Weeks 4-6: intense coding and performance analysis
- Weeks 7-8: producing the video (deliverable), writing the final report (deliverable)
Final Product Description:
By the end of this SoHPC project you will have developed a prototype of a scientific code implementing various forms of MPI+X hybrid programming; this will likely be one or more code files. In addition, you will produce a set of performance measurements of these codes that will most likely form the basis of the final SoHPC project report.
Adapting the Project: Increasing the Difficulty:
Either by programming and evaluating the same MPI+X options for a prototypical compute-bound code or by extending the original code towards MPI+CUDA to make use of the GPUs.
Adapting the Project: Decreasing the Difficulty:
By omitting the extension to 3 dimensions.
No major resources needed. You should bring your own laptop. We will provide access and computing time on VSC-3 (0.60 PFlop/s) and VSC-4 (2.7 PFlop/s), two Intel-based cluster systems that we host and manage, together with all necessary software products and licences.
VSC Research Center