Project reference: 1619
Simulations of Lattice Quantum Chromodynamics (the theory of quarks and gluons) are used to study the properties of strongly interacting matter. They can, for example, be used to calculate properties of the quark-gluon plasma, a phase of matter that existed during the first few microseconds after the Big Bang (at temperatures above a trillion degrees Celsius). Such simulations take up a large fraction of the available supercomputing resources worldwide.
These simulations require the repeated solution of a particular linear system encoding a partial differential equation. The matrix involved is extremely sparse, i.e. the number of non-zero elements grows only linearly with the matrix dimension (and so does the number of floating-point operations per matrix-vector product). Such a system is most effectively solved by means of a multi-grid method (MG), which combines a generic iterative solver (e.g. a Krylov method such as FGMRES, or multiplicative Schwarz) with another solver on a coarsened grid (lattice). Here, the generic method effectively reduces the error of the approximate solution on the subspace of medium to large eigenvalues, whereas the coarse grid method is effective on the space of low eigenvalues of the matrix (which is well approximated on the coarser grid).
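The interplay of smoother and coarse-grid correction described above can be sketched on a toy problem. The example below is an assumption-laden illustration, not the Lattice QCD solver: it uses the 1D Laplacian with zero boundary values in place of the lattice operator, weighted Jacobi sweeps in place of a Krylov or Schwarz smoother, and a recursive coarse-grid solve. All function names (`apply_laplacian`, `smooth`, `v_cycle`, ...) are hypothetical, and the grid size is assumed to be of the form 2^k - 1.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// y = A x for the 1D stencil (-1, 2, -1) / h^2 with zero boundary values.
// Cost is linear in the vector length, as for any sparse stencil operator.
Vec apply_laplacian(const Vec& x, double h) {
    const std::size_t n = x.size();
    Vec y(n);
    for (std::size_t i = 0; i < n; ++i) {
        const double left  = (i > 0)     ? x[i - 1] : 0.0;
        const double right = (i + 1 < n) ? x[i + 1] : 0.0;
        y[i] = (2.0 * x[i] - left - right) / (h * h);
    }
    return y;
}

// r = b - A x
Vec residual(const Vec& b, const Vec& x, double h) {
    Vec r = apply_laplacian(x, h);
    for (std::size_t i = 0; i < r.size(); ++i) r[i] = b[i] - r[i];
    return r;
}

double norm2(const Vec& v) {
    double s = 0.0;
    for (double e : v) s += e * e;
    return std::sqrt(s);
}

// Smoother (stand-in for the "generic" solver): weighted Jacobi sweeps,
// which damp the error components belonging to large eigenvalues.
void smooth(Vec& x, const Vec& b, double h, int sweeps) {
    const double omega = 2.0 / 3.0;
    const double diag = 2.0 / (h * h);
    for (int s = 0; s < sweeps; ++s) {
        const Vec r = residual(b, x, h);
        for (std::size_t i = 0; i < x.size(); ++i) x[i] += omega * r[i] / diag;
    }
}

// Full-weighting restriction from 2*nc + 1 fine points to nc coarse points.
Vec restrict_fw(const Vec& r) {
    const std::size_t nc = (r.size() - 1) / 2;
    Vec rc(nc);
    for (std::size_t j = 0; j < nc; ++j)
        rc[j] = 0.25 * r[2 * j] + 0.5 * r[2 * j + 1] + 0.25 * r[2 * j + 2];
    return rc;
}

// Linear interpolation of the coarse correction ec, added onto the fine x.
void add_prolongated(Vec& x, const Vec& ec) {
    const std::size_t nc = ec.size();
    assert(x.size() == 2 * nc + 1);
    for (std::size_t k = 0; k <= nc; ++k) {
        const double left  = (k > 0)  ? ec[k - 1] : 0.0;
        const double right = (k < nc) ? ec[k]     : 0.0;
        x[2 * k] += 0.5 * (left + right);
        if (k < nc) x[2 * k + 1] += ec[k];
    }
}

// One V-cycle: smooth, correct the low modes on the coarser grid, smooth again.
// Assumes x.size() == 2^k - 1 so that every level has an odd number of points.
void v_cycle(Vec& x, const Vec& b, double h) {
    const std::size_t n = x.size();
    assert(b.size() == n && n % 2 == 1);
    if (n == 1) {  // coarsest grid: solve the 1x1 system exactly
        x[0] = b[0] * h * h / 2.0;
        return;
    }
    smooth(x, b, h, 2);
    const Vec rc = restrict_fw(residual(b, x, h));
    Vec ec(rc.size(), 0.0);
    v_cycle(ec, rc, 2.0 * h);
    add_prolongated(x, ec);
    smooth(x, b, h, 2);
}
```

A typical use would be to start from `x = 0` on a grid with 63 interior points and `h = 1.0 / 64.0`, call `v_cycle(x, b, h)` repeatedly, and monitor `norm2(residual(b, x, h))` until it falls below a chosen tolerance. In a production solver the cycle would more commonly serve as a preconditioner inside a Krylov method rather than as a stand-alone iteration.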
The combination of different formulations of Lattice QCD (discretizations such as Wilson or staggered fermions), different flavors of MG methods, and different target architectures opens a large space of possibilities for the student to explore in search of the best match between methods and machines.
Depending on personal preference, the student will either tune and scale the most critical parts of a specific method, or search the space of algorithms for the one best suited to a specific architecture.
In the former case, the student can select among different target architectures available at the institute, including Intel Xeon Phi, Haswell (AVX2), and GPUs (OpenPOWER). To that end, he/she will benchmark the method and identify the relevant kernels. He/she will analyse the performance of the kernels, identify performance bottlenecks, and develop strategies to resolve them, if possible taking similarities between the target architectures (such as SIMD vectors) into account. He/she will optimize the kernels and document the steps taken in the optimization as well as the performance results achieved.
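As an illustration of what benchmarking a kernel can look like in its simplest form, the following sketch times a streaming `axpy` kernel (y = a*x + y) and converts the wall-clock time into achieved GFLOP/s and GB/s. This is a hypothetical stand-in, not one of the project's kernels: the names `KernelStats`, `axpy` and `benchmark_axpy` are invented for this example, and the traffic estimate assumes 24 bytes per element (read x, read y, write y) with no cache reuse.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

struct KernelStats {
    double seconds;       // wall-clock time for all repetitions
    double gflops;        // achieved floating-point rate
    double gbytes_per_s;  // achieved memory traffic (assumes no cache reuse)
    double checksum;      // y[0] after the run, used to validate the result
};

// Streaming kernel: y <- a * x + y (2 flops and 24 bytes per element).
void axpy(double a, const std::vector<double>& x, std::vector<double>& y) {
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    for (std::size_t i = 0; i < n; ++i) y[i] += a * x[i];
}

KernelStats benchmark_axpy(std::size_t n, int repetitions) {
    assert(n > 0 && repetitions > 0);
    std::vector<double> x(n, 1.0), y(n, 0.0);

    axpy(0.0, x, y);  // warm-up pass: touches all data, leaves y at zero

    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repetitions; ++r) axpy(1.0, x, y);
    const auto stop = std::chrono::steady_clock::now();

    const double seconds = std::chrono::duration<double>(stop - start).count();
    const double work = static_cast<double>(n) * repetitions;
    // Reading back y[0] keeps the computed result observable.
    return {seconds, 2.0 * work / seconds * 1e-9, 24.0 * work / seconds * 1e-9, y[0]};
}
```

Calling `benchmark_axpy(1 << 24, 20)` and comparing `gbytes_per_s` with the memory bandwidth of the machine gives a first indication of whether such a kernel is memory-bound. Dedicated profiling and benchmarking tools give far more detailed information than this kind of hand-written timer.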
In the latter case, the student will, after familiarizing himself/herself with the architectures, explore different methods, either by implementing them or by using existing implementations. He/she will explore how the algorithmic properties match the hardware capabilities. He/she will test the achieved total performance and study bottlenecks, e.g. using profiling tools. He/she will then test the method at different scales and document the findings.
In any case, the student is embedded in an extended infrastructure of hardware, computing, and benchmarking experts at the institute.
Project Mentor: Dr. Stefan Krieg
Site Co-ordinator: Ivo Kabadshow
Student: Peter Labus
The student will familiarize himself/herself with important new HPC architectures: Intel Xeon Phi, Haswell, and OpenPOWER. He/she will learn how the hardware functions at a low level and use this knowledge to optimize software. He/she will use state-of-the-art benchmarking tools to achieve optimal performance for the kernels found to be relevant in the application.
Student Prerequisites (compulsory):
- Programming experience in C/C++
Student Prerequisites (desirable):
- Knowledge of computer architectures
- Basic knowledge of numerical methods
- Basic knowledge of benchmarking
- Computer science, mathematics, or physics background
https://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-1-optimization, https://software.intel.com/en-us/articles/practical-intel-avx-optimization-on-2nd-generation-intel-core-processors, https://developer.nvidia.com/cuda-zone, http://www.openacc.org/content/education
Week 1: Training and introduction
Week 2: Introduction to architectures
Week 3: Introductory problems
Week 4: Introduction to methods
Week 5: Optimization and benchmarking, documentation
Week 6: Optimization and benchmarking, documentation
Week 7: Optimization and benchmarking, documentation
Week 8: Generation of final performance results. Preparation of plots/figures. Submission of results.
Final Product Description:
The end products will be a student educated in the basics of HPC, together with optimized kernel routines and/or optimized methods. The results can readily be illustrated in appropriate figures, as is routinely done by PRACE and HPC vendors, and such plots could be used by PRACE.
Adapting the Project: Increasing the Difficulty:
A) Different kernels require different levels of understanding of the hardware and of optimization strategies. For example, a kernel may or may not require optimized memory access patterns to improve cache utilization. A particularly able student may work on a kernel that does.
B) Methods differ greatly in terms of complexity. A particularly able student may choose to work on more advanced algorithms.
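To make the remark on memory access patterns in A) concrete, the sketch below contrasts two layouts for a field with several numbers per lattice site. It is purely illustrative: the choice of 24 doubles per site (12 complex numbers) and the function names are assumptions made for this example. A kernel that reads a single component strides through memory in the site-major layout, using only 8 of the 64 bytes on a typical cache line, whereas the component-major layout gives contiguous, SIMD-friendly access.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative assumption: 24 doubles (12 complex numbers) per lattice site.
constexpr std::size_t kComponents = 24;

// Site-major layout ("array of structures"): data[site * kComponents + c].
// Reading one component jumps kComponents doubles between useful values.
double sum_component_aos(const std::vector<double>& data,
                         std::size_t sites, std::size_t c) {
    assert(data.size() == sites * kComponents && c < kComponents);
    double sum = 0.0;
    for (std::size_t s = 0; s < sites; ++s) sum += data[s * kComponents + c];
    return sum;
}

// Component-major layout ("structure of arrays"): data[c * sites + site].
// The same kernel now reads consecutive memory locations.
double sum_component_soa(const std::vector<double>& data,
                         std::size_t sites, std::size_t c) {
    assert(data.size() == sites * kComponents && c < kComponents);
    double sum = 0.0;
    for (std::size_t s = 0; s < sites; ++s) sum += data[c * sites + s];
    return sum;
}

// Convert a site-major field into the component-major layout.
std::vector<double> aos_to_soa(const std::vector<double>& aos, std::size_t sites) {
    assert(aos.size() == sites * kComponents);
    std::vector<double> soa(aos.size());
    for (std::size_t s = 0; s < sites; ++s)
        for (std::size_t c = 0; c < kComponents; ++c)
            soa[c * sites + s] = aos[s * kComponents + c];
    return soa;
}
```

Both functions return the same value for the same logical field; only the order in which memory is traversed differs. Which layout is preferable in practice depends on the kernel, since operations that need all components of a site at once may favour the site-major form or a blocked hybrid of the two.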
The student will have his/her own desk in an open-plan office (12 desks in total) or in a separate office (2-3 desks in total), will get access to (and computation time on) the HPC hardware required for the project, and will have a fully equipped workstation for the duration of the program. A range of performance and benchmarking tools is available on site and can be used within the project. No further resources are required.