Precision-based differential checkpointing for HPC applications

Project reference: 2103
High Performance Computing (HPC) clusters have evolved into complex systems with a tremendous number of components. An HPC cluster is much more than the mere sum of its parts. A fast CPU is nothing without fast memory. Without intelligent network topologies and high bandwidth, data cannot be communicated efficiently between processes running on different nodes. The performance of one component depends on its interoperability with the others. Besides, the hardware architecture is only one piece of the whole. To exploit the synergies of the system, perfectly aligned drivers and countless software packages are necessary.
As supercomputers grow ever larger, the number of components grows with them. The largest clusters comprise hundreds of thousands of nodes, that is, hundreds of thousands of CPUs, GPUs, local storage devices, network cards, etc. With that, the likelihood of failures becomes a pressing issue. The mean time between failures (MTBF) of a cluster at exascale is expected to be on the order of hours. As computing hours are expensive, it becomes prohibitive to run applications at that scale without any protection against failures. A common technique to protect an application is checkpoint-and-restart. In essence, this means writing snapshots of the application state to disk and, upon a failure, recovering the state from the latest snapshot and continuing execution. This sounds fairly easy. However, IO is typically the bottleneck of HPC applications, and creative solutions are necessary to deliver good performance.
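The checkpoint-and-restart cycle described above can be sketched in a few lines. This is a minimal single-process illustration in Python, not how FTI or any other HPC checkpoint library is implemented. The function names, the use of `pickle`, and the toy "solver" are all assumptions made for the example.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temporary file, then atomically rename it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # The rename is atomic, so a crash mid-write never corrupts the old snapshot.
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the last snapshot, or the default state if none exists."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, interval, fail_at=None):
    """Toy iterative computation that checkpoints every `interval` steps.

    `fail_at` is a hypothetical hook that simulates a node failure.
    """
    state = load_checkpoint(path, {"step": 0, "value": 0})
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["value"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["value"]
```

Calling `run` again with the same path after a failure resumes from the last snapshot instead of from step 0. Only the work done since that snapshot is lost, which is why the checkpoint interval is a trade-off between IO cost and recomputation.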
There are techniques that do not depend on IO, for instance, replication, redundancy, and preemptive process migration. Such methods, however, are often application-specific or too expensive in terms of resources. In practice, checkpoint-and-restart is often the only viable solution. Modern checkpoint libraries leverage all available levels of the storage hierarchy in sophisticated checkpoint schemes. Some advanced techniques are checkpoint encoding, checkpoint staging, partner-node checkpointing, differential checkpointing, and incremental checkpointing. In some cases, checkpoint creation can be overlapped with the application execution, for instance, when part of the computation takes place on accelerators.
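To make one of the techniques listed above concrete, here is a minimal sketch of hash-based differential checkpointing: the protected data is split into fixed-size blocks, and only blocks whose hash changed since the previous checkpoint are written. The block size, the choice of SHA-256, and all function names are assumptions for illustration and do not reflect the interface of any particular library.

```python
import hashlib

BLOCK_SIZE = 4096  # bytes per block; an illustrative choice


def block_hashes(data, block_size=BLOCK_SIZE):
    """Hash every fixed-size block of a bytes-like object."""
    return [hashlib.sha256(data[i:i + block_size]).digest()
            for i in range(0, len(data), block_size)]


def diff_checkpoint(data, previous_hashes, block_size=BLOCK_SIZE):
    """Return (dirty_blocks, new_hashes).

    dirty_blocks maps a block index to the bytes that must be written.
    """
    new_hashes = block_hashes(data, block_size)
    dirty = {}
    for idx, digest in enumerate(new_hashes):
        if idx >= len(previous_hashes) or previous_hashes[idx] != digest:
            start = idx * block_size
            dirty[idx] = bytes(data[start:start + block_size])
    return dirty, new_hashes


def apply_diff(base, dirty, block_size=BLOCK_SIZE):
    """Patch a previous full checkpoint (a bytearray) with the dirty blocks.

    Assumes the protected buffer did not change size between checkpoints.
    """
    for idx, block in dirty.items():
        start = idx * block_size
        base[start:start + len(block)] = block
    return base
```

The first call, with an empty hash list, marks every block dirty and therefore behaves like a full checkpoint. Later calls write only what changed, which reduces the IO volume when the application updates a small fraction of its data between checkpoints.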
We would like to provide insight into fault tolerance techniques, including all of those mentioned above, and to supervise the development of a novel differential checkpointing method based on precision boundaries. Our objective is to implement the developed mechanism in the multi-level checkpoint library FTI (Fault Tolerance Interface), which is maintained at the BSC. FTI has already been integrated into numerous applications, such as ALYA, LULESH, GYSELA5D, Melissa-DA, HACC, and many others. The mechanism can be tested with those applications on several clusters, for instance, MareNostrum4 and CTE-Power. The clusters are equipped with SSD and NVMe local storage devices. The student will learn to submit applications at scale and will be able to test their implementation on the clusters, leveraging the available storage technologies.
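The text does not define "precision boundaries" in detail. One possible reading, used here purely as an assumption, is that a block of floating-point data counts as unchanged as long as every element stays within a user-chosen tolerance of the value that was last checkpointed. The sketch below illustrates that reading. The function name, the absolute-error criterion, and the block layout are hypothetical and are not taken from FTI.

```python
def precision_diff(current, reference, tolerance, block_len):
    """Return {block index: values} for blocks that drifted beyond the tolerance.

    `reference` holds the values as last checkpointed. It is updated in place
    for every block that is written, so the error of a restored state stays
    bounded by `tolerance` per element. Both lists must have the same length.
    """
    dirty = {}
    for start in range(0, len(current), block_len):
        cur = current[start:start + block_len]
        ref = reference[start:start + block_len]
        if any(abs(c - r) > tolerance for c, r in zip(cur, ref)):
            dirty[start // block_len] = list(cur)
            reference[start:start + block_len] = cur
    return dirty
```

Comparing against the last checkpointed values, rather than against the previous iteration, is a deliberate choice in this sketch: many small changes that each fall below the tolerance would otherwise accumulate unnoticed. A relative-error criterion would be an equally plausible alternative.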
Project Mentor: Leonardo Bautista Gomez
Project Co-mentor: Kai Keller
Site Co-ordinator: Maria-Ribera Sancho and Carolina Olmopenate
Participants: Kevser İLDEŞ, Athanasios Kastoras
Learning Outcomes:
- Become familiar with cluster scheduling
- Improve parallel programming skills
- Gain insight into resiliency in HPC
Student Prerequisites (compulsory):
The student should be familiar with Linux environments and should have basic knowledge of C.
Student Prerequisites (desirable):
Knowledge of shell programming, Python, Fortran and C++ would be a plus.
Training Materials:
https://fault-tolerance-interface.readthedocs.io
Workplan:
Week 1) Learn how to access the clusters and how to submit jobs.
Week 2) Complete the FTI tutorial and get familiar with the library.
Week 3) Implement the differential checkpointing.
Week 4) Implement the precision calculations.
Week 5) Integrate the precision calculations into the differential checkpointing.
Week 6) Run experiments with HPC applications from different fields.
Week 7) Run larger experiments with more complex applications.
Week 8) Wrap up and write a report on the work done.
Final Product Description:
- A novel differential checkpointing implementation in FTI
- New insights into the scope for precision-based differential checkpointing
Adapting the Project: Increasing the Difficulty:
FTI provides checkpoint encoding using the Reed-Solomon algorithm. The encoding has not yet been ported to GPU. Porting it would be an interesting extension to consider in the project.
Adapting the Project: Decreasing the Difficulty:
We can focus solely on experiments that investigate how sensitive certain kinds of applications are to data precision.
Resources:
The students will need their own laptops and an appropriate IDE for software development in C, Fortran, Python and C++.
*Online only in any case
Organisation:
BSC – Barcelona Supercomputing Center