Implementation of Lossy Compression to Optimize Checkpointing
Welcome back to my third and final blog post for the Summer of HPC. This is the final week of SoHPC, and between the presentations and the report writing, I decided to write my final blog post to explain my work in more depth. In the previous blog post, I explained what fault tolerance is and why it is essential. I also introduced the FTI library and mentioned lossy compression as a technique for limiting the I/O bottleneck. In this blog post, I will explain the implementation of lossy compression in FTI more extensively, present the results of the experiments so far, and share my expectations for future work. If you haven't already, I suggest reading my previous blog post so that you can understand the basic concepts of the FTI library in which the implementation was made.
To execute FTI, one must specify certain parameters in the configuration file (configuration example). To use the lossy compression feature, three extra parameters must be specified: compression_enabled, cpc_block_size, and cpc_tolerance. The first is simply a boolean value which, when set to 1, allows the checkpoints to be compressed. The second specifies the maximum memory size that can be allocated for the compression and is used to protect memory-restricted systems. It is recommended that the block size not be significantly smaller than the actual data size, because a block that is too small interferes with the compression and may decrease the compression ratio. The last parameter defines the absolute error tolerance for the compression. The actual error tolerance is given by 10^-t, where t is the integer cpc_tolerance specified in the configuration file. The user must choose a tolerance suited to their application: high enough to maximize the compression ratio, but low enough that the application's results are not altered.
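Based on the parameters described above, the compression-related part of an FTI configuration file might look like the sketch below. Note that the section name and the units of cpc_block_size are my assumptions for illustration; only the three parameter names and their meanings come from this post.

```ini
; Lossy-compression settings (section name and units are illustrative)
[basic]
compression_enabled = 1    ; 1 = compress checkpoints, 0 = write them uncompressed
cpc_block_size      = 1024 ; upper bound on memory allocated for compression
cpc_tolerance       = 4    ; absolute error tolerance of 10^-4
```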
As a principle, lossy compression targets large-scale applications. Although the performance of lossy compression differs across data sets, in applications with small memory footprints, using lossy compression may even slow down checkpointing, due to the time needed to perform the compression and the possibility of hard-to-compress data. Therefore, to test the feature we had to employ a highly scalable application. For our experiments, we ran LULESH on BSC's Marenostrum cluster, with a size of 615.85 megabytes per process, for 512 processes across 16 nodes, using different values of cpc_tolerance. The experiment can be considered successful, since a measurable decrease in the time to write the checkpoint was observed. As you can see in figures 1 and 2, the checkpoint write times for any tolerance value between zero and eight are around half the write time of the uncompressed checkpoint. We can also see that the tolerance does not really affect the write time, since across the different tolerances we do not observe large differences in the checkpoint files. This may change in other experiments with data of different characteristics or size.
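To make concrete what an absolute error tolerance of 10^-t means for the checkpointed data, here is a small Python sketch of an error-bounded quantizer. This is not FTI's actual compression algorithm; it is a hypothetical illustration of the guarantee the tolerance parameter expresses: every reconstructed value differs from the original by at most 10^-t.

```python
import math

def quantize(values, t):
    """Toy absolute-error-bounded quantizer (illustrative, NOT FTI's compressor).

    Rounds each value to a grid with spacing 2 * tol, so the reconstruction
    error per value is at most tol = 10**-t. Mapping many nearby values onto
    the same grid point is what makes the data easier to compress afterwards.
    """
    tol = 10.0 ** -t
    step = 2.0 * tol
    return [round(v / step) * step for v in values]

# Check the error bound on some smooth sample data for several tolerances.
data = [math.sin(0.01 * i) for i in range(1000)]
for t in (2, 4, 8):
    approx = quantize(data, t)
    max_err = max(abs(a - b) for a, b in zip(data, approx))
    assert max_err <= 10.0 ** -t + 1e-12, (t, max_err)
```

A coarser tolerance (small t) collapses more values onto the same grid points, which tends to raise the compression ratio but perturbs the data more; picking t is exactly the trade-off described above.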
For the closing meeting of SoHPC, my partner Kevser and I created a five-minute video presentation explaining our project.
That summarizes our work in SoHPC; I hope you found it interesting. As this experience comes to an end, I would like to express my gratitude to the organizers and project supervisors of SoHPC for the great work they are doing. During the summer I learned a whole lot of new things, gained experience working as a researcher, and collaborated with many inspiring people. Finally, I wholeheartedly encourage every student interested in HPC, and computing in general, to apply to SoHPC, because it is a very positive and unforgettable experience, which I am grateful to have had during my studies.