Summary of HPC
SoHPC started about two months ago… If you told me that, I would think you were lying: how can so many things happen in just two months? It feels like a year has passed since the start: we dealt with so much that I can hardly remember the first tasks. Anyway, it’s now time to summarize what has been done and how this project could be continued.
I already talked about persistent memory in my last post. I promised to bring you some numbers, but that turned out to be harder than I thought: the cluster we worked on caches data in memory to reduce the impact of disk operations. We had to bypass that caching to measure the speed-ups correctly, and this took a little longer than expected. We got the results we were looking for, and my teammate has already written a good post about them (check it out here). To cut a long story short, persistent memory showed a big improvement over the uncached disk.
Our project was complete, but we could go further: the next item on our TODO list was transactional checkpointing. The term seemed a little odd to me: how can checkpoints be transactional? For those of you who don’t know it, transactionality is usually a property of databases. It is based on the ACID properties, an acronym for Atomicity, Consistency, Isolation, Durability. These place strong limits on the effects of any operation on a database: only complete and correct changes affect the actual data.
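To make the atomicity part concrete, here is a minimal sketch using Python’s built-in `sqlite3` module. The `accounts` table, the names in it and the `transfer` helper are made up for illustration and have nothing to do with our project. The point is only that a change which fails halfway leaves no trace in the data.

```python
import sqlite3

# In-memory database with a rule that balances can never go negative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one all-or-nothing change."""
    try:
        # `with conn` commits if the block succeeds and rolls back if it raises.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint failed: the credit to dst is undone as well.
        return False


def balances(conn):
    """Return the current balances as a dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

If a transfer would push the source account below zero, the second update fails and the first one is rolled back with it, so neither account changes.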
This brings us back to the question above: how can a concept so closely tied to another field apply to checkpoints? The answer comes from associating data changes with tasks: only the tasks that complete correctly may influence the others. So we started analyzing how processes can influence each other, and we came to an important conclusion: since there is no shared memory in Charm++, the only way two tasks can influence each other is via message passing.
This conclusion showed us the way to pursue transactional checkpointing: a task shouldn’t be able to communicate with the others until the end of its execution, when its correctness can be decided. All outgoing messages should be stored and sent only at the end. We managed to postpone the sending of the messages, but then we faced a bigger problem: transactionality forces us to send either all the messages or none. To be sure of sending all the messages, and only those, we had to rely on checkpointing, in particular asynchronous checkpointing, a feature not supported by Charm++. Introducing asynchronous checkpointing in such a complex framework proved to be non-trivial: we tried for a long time but got no notable results. This was a bit frustrating: we were so close to the goal, yet so far away.
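The idea of postponing messages can be sketched in a few lines of Python. This is not Charm++ code and not our actual implementation: `TransactionalOutbox`, `run_task` and the `(destination, payload)` message format are hypothetical names chosen just to show the principle.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Message = Tuple[str, str]  # (destination, payload), a hypothetical format


@dataclass
class TransactionalOutbox:
    """Holds outgoing messages until the task is known to be correct."""

    deliver: Callable[[str, str], None]  # the real send function
    _pending: List[Message] = field(default_factory=list)

    def send(self, destination: str, payload: str) -> None:
        # Nothing leaves the task yet; the message is only recorded.
        self._pending.append((destination, payload))

    def commit(self) -> int:
        # The task finished correctly: flush messages in their original order.
        sent = 0
        for destination, payload in self._pending:
            self.deliver(destination, payload)
            sent += 1
        self._pending.clear()
        return sent

    def abort(self) -> int:
        # The task failed: drop everything, so no other task is influenced.
        dropped = len(self._pending)
        self._pending.clear()
        return dropped


def run_task(task, outbox: TransactionalOutbox) -> bool:
    """Run a task and release its messages only if it completes."""
    try:
        task(outbox)
    except Exception:
        outbox.abort()
        return False
    outbox.commit()
    return True
```

Note what the sketch leaves out: if the process dies in the middle of `commit`, some messages have been delivered and others have not. Closing that gap is exactly where checkpointing comes in, and it is the part we could not complete.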
Now that everything is finished, I miss all those never-ending calls where Petar and I would wade through thousands of lines of code just to change little things. I liked how we worked, and I don’t think either of us could have achieved the same result alone. We managed to bring out the best in each other and complete a project that would have been impossible for many other groups. Additional thanks go to Oliver, who was always ready to resolve our doubts and issues: being able to discuss and clarify all the problems we encountered helped us a lot.
So this is the end of my last blog post. I’m going to leave you with the video presentation we prepared: it’s short but thorough, and it contains all the data obtained from the testing phases. If you have any questions, feel free to contact me on LinkedIn: I’m always happy to answer questions about HPC.