Making Quarks phli further

Project reference: 1515
Simulations of Lattice Quantum Chromodynamics (Lattice QCD, the discretized form of the theory of quarks and gluons) are used to study the properties of strongly interacting matter. They can, for example, be used to calculate properties of the quark-gluon plasma, a phase of matter that existed during the first few microseconds after the Big Bang (at temperatures higher than a trillion degrees Celsius). Such simulations take up a large fraction of the available supercomputing resources worldwide.
These simulations require the repeated solution of a particular linear system that encodes a partial differential equation. The matrix involved is extremely sparse, i.e. the number of non-zero elements grows only linearly with the matrix dimension (as does the number of flops needed to apply it). Such a system is most effectively solved by a multi-grid method, which combines a generic solver (a Krylov method such as FGMRES, or the multiplicative Schwarz method) with another solver on a coarsened grid (lattice). The generic method effectively reduces the error of the approximate solution on the subspace of medium to large eigenvalues, whereas the coarse-grid method is effective on the subspace of small eigenvalues of the matrix, which is well approximated on the coarser grid.
The student will be involved in tuning the most critical parts of this method for the target architectures Intel Xeon Phi and Haswell (AVX2), as used in the DEEP-ER and Juropa II supercomputers. To that end, he/she will benchmark the method and identify the relevant kernels. He/she will analyze the performance of these kernels, identify performance bottlenecks, and develop strategies to remove them, taking the similarities between the target architectures (long SIMD vectors) into account. He/she will optimize the kernels and document both the optimization steps taken and the performance results achieved. Throughout, the student will be embedded in an extended infrastructure of hardware, computing, and benchmarking experts at the institute.
Project mentor: Stefan Krieg
Site Co-ordinator: Ivo Kabadshow
Learning Outcomes
The student will familiarize himself/herself with two important new HPC architectures, Intel Xeon Phi and Haswell. He/she will learn how the CPU functions at a low level and use this knowledge to optimize software. He/she will use state-of-the-art benchmarking tools to achieve optimal performance for the kernels found to be relevant in the application.
Student Prerequisites (compulsory)
Programming experience in C/C++
Student Prerequisites (desirable)
Knowledge of computer architectures; basic knowledge of numerical methods and of benchmarking; a background in computer science, mathematics, or physics
Training Materials
Workplan
Week – Work package
- Training and introduction
- Benchmarking of kernels and introduction to architectures
- Introduction to optimization tools
- Optimization and benchmarking
- Optimization and benchmarking, documentation
- Optimization and benchmarking, documentation
- Optimization and benchmarking, documentation
- Generation of final performance results. Preparation of plots/figures. Submission of results.
Final Product Description
The end product will be a set of optimized kernel routines. The benchmarking results, especially the gain in performance, can easily be illustrated in appropriate figures, as is routinely done by PRACE and HPC vendors. Such plots could be used by PRACE.
Adapting the Project – Increasing the Difficulty
Different kernels require different levels of understanding of the hardware and of optimization strategies. For example, some kernels require optimized memory access patterns to improve cache utilization, while others do not. A particularly able student may work on a kernel of the former kind. Depending on the student's level of knowledge, a larger number of kernels can be ported or the performance optimization can be intensified.
Adapting the Project – Decreasing the Difficulty
As explained above, a student who finds the task of optimizing a complex kernel too challenging could restrict himself/herself to kernels with simple memory access patterns.
Resources
The student will have his/her own desk in an open-plan office (12 desks in total) or in a separate office (2-3 desks in total), with a fully equipped workstation for the duration of the programme. He/she will get access to, and computation time on, the HPC hardware required for the project. A range of performance and benchmarking tools is available on site and can be used within the project. No further resources are required.
Organization
Jülich Supercomputing Centre, Forschungszentrum Jülich GmbH