SoHPC 2021 – Kevser İLDEŞ
Hello everyone, I’m Kevser İLDEŞ, a 22-year-old fresh graduate. I am a computer engineer from Turkey, and I received my bachelor’s degree from Marmara University in Istanbul. I learned about SoHPC through an e-mail from my professor and was a bit doubtful at the beginning. But thanks to her encouragement, I decided to try my luck and, to my surprise, became one of the 2021 participants. I should add that I almost lost this great opportunity: all of the acceptance e-mails unfortunately ended up in my spam folder, and I only saw the last one, which reached my inbox by chance. I want to thank the coordinator of the program for not giving up on me through all of those e-mails.
I was selected for project 2103, Precision based differential checkpointing for HPC applications, which was my first choice. Now, let me tell you a bit about the project.
Reliability is a big issue, especially in supercomputers. Supercomputers grow day by day, and so does their number of components. High-performance computing (HPC) systems are built from highly reliable components, but as the number of components increases, so does the overall failure rate of the machine, and the likelihood of failure becomes a serious issue. The mean time between failures (MTBF) is expected to lie within hours, and since computing hours are expensive, it becomes prohibitive to run applications at that scale without any protection against failures. This is why fault tolerance (FT) is a well-known issue in HPC.
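To make the scaling argument concrete, here is a rough back-of-the-envelope sketch. It assumes that components fail independently with a constant failure rate and that the system fails as soon as any one component fails, so the failure rates simply add up. The function name and the numbers are my own illustrative choices, not measurements from a real machine.

```python
HOURS_PER_YEAR = 8760  # 365 days * 24 hours


def system_mtbf_hours(component_mtbf_hours: float, n_components: int) -> float:
    """MTBF of a system that fails when any single component fails.

    Assumption: independent components with constant failure rates,
    so the system failure rate is n_components times the component rate.
    """
    if n_components <= 0:
        raise ValueError("n_components must be positive")
    return component_mtbf_hours / n_components


if __name__ == "__main__":
    # Hypothetical values: each component fails once every 25 years on
    # average, and the machine has 100,000 such components.
    mtbf = system_mtbf_hours(25 * HOURS_PER_YEAR, 100_000)
    print(f"System MTBF: {mtbf:.2f} hours")
```

Even with very reliable parts, the machine as a whole would be interrupted every few hours under these assumptions.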
There are several techniques to protect an application, and checkpoint-and-restart (CR) is a common one. It means taking snapshots of the application at specific times, that is, saving its state to stable storage, frequently a parallel file system (PFS), and, in case of a failure, restarting the execution from the last recovery point. The CR technique is relatively inexpensive, but checkpointing a large application may lead to an I/O bottleneck, since the I/O bandwidth of supercomputers does not increase at the same speed as their computational capabilities.
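The idea can be sketched in a few lines of Python. This is only a toy, single-process illustration and not how FTI or any real HPC library does it: the function names, the file layout, and the "application" (a running sum) are all made up for the example, and a local file stands in for the stable storage.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write the state atomically: temp file first, then rename over the old one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # make sure the data reached the disk
    os.replace(tmp_path, path)  # a crash never leaves a half-written checkpoint


def load_checkpoint(path):
    """Return the last saved state, or None if there is no checkpoint yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, n_steps, checkpoint_every):
    """Toy application: sums 0..n_steps-1, resuming from a checkpoint if one exists."""
    state = load_checkpoint(path)
    if state is None:
        state = {"step": 0, "total": 0}
    while state["step"] < n_steps:
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        ckpt = os.path.join(d, "ckpt.pkl")
        run(ckpt, 7, 5)          # pretend the job dies after step 7
        print(run(ckpt, 10, 5))  # the restart resumes from the step-5 checkpoint
```

The work done between the last checkpoint and the failure is lost and has to be recomputed, which is why the checkpoint interval is a trade-off between I/O cost and lost work.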
There are other techniques that do not depend on I/O. However, as they are often application-specific or too expensive in terms of resources, checkpoint-and-restart is often the only viable solution. Modern checkpoint libraries leverage all available levels of the storage hierarchy in sophisticated checkpointing schemes. Checkpoint encoding, checkpoint staging, and differential and incremental checkpointing are some of the advanced techniques.
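As a rough illustration of differential checkpointing, the sketch below splits the data into fixed-size blocks and rewrites only the blocks that changed since the last checkpoint. The tolerance parameter is my own reading of what "precision based" could mean (ignore changes smaller than a chosen precision); it is an assumption for the example, not the actual design of the project or of FTI. A plain Python list stands in for the checkpoint on stable storage.

```python
def changed_blocks(current, last_saved, block_size, tolerance=0.0):
    """Indices of blocks in which some value moved by more than `tolerance`."""
    if len(current) != len(last_saved):
        raise ValueError("current and last_saved must have the same length")
    dirty = []
    for start in range(0, len(current), block_size):
        new = current[start:start + block_size]
        old = last_saved[start:start + block_size]
        if any(abs(a - b) > tolerance for a, b in zip(new, old)):
            dirty.append(start // block_size)
    return dirty


def differential_checkpoint(current, last_saved, block_size, tolerance=0.0):
    """Copy only the dirty blocks into `last_saved` and return their indices."""
    dirty = changed_blocks(current, last_saved, block_size, tolerance)
    for block in dirty:
        start = block * block_size
        last_saved[start:start + block_size] = current[start:start + block_size]
    return dirty


if __name__ == "__main__":
    data = [0.0] * 8
    saved = list(data)   # stand-in for the previous checkpoint
    data[0] = 1e-9       # tiny change, below the chosen precision
    data[5] = 1.0        # significant change
    written = differential_checkpoint(data, saved, block_size=4, tolerance=1e-6)
    print("blocks written:", written)
```

With a tolerance of zero this reduces to ordinary differential checkpointing, where any change at all marks a block as dirty. A larger tolerance writes less data, at the price of a restored state that may differ from the true one by up to that tolerance.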
The objective of the project is to implement the developed mechanism in the multi-level checkpoint library Fault Tolerance Interface (FTI), which is maintained at the BSC. For more information about the project and FTI, follow this link.
Training Week – First Impression
The first week of the program was the training week, which gave us an introduction to the program. We also got insights into Python, OpenMP, parallel programming with MPI, and many other topics. In addition, the hands-on labs made us practice, and the instructors answered each question carefully and patiently, which helped everyone get to know these technologies.
I met my mentors a couple of weeks before the training week and had a short talk with them on its first day. I was a bit nervous, but after talking with them I relaxed a lot. They are all helpful and warmhearted. I hope we will do great work together. Let’s see how this adventure goes.
I wish everyone a great time. I will be around during this internship, so don’t hesitate to contact me about anything.