Charm++ Fault Tolerance with Persistent Memory

Messages are sent between processes. When a message arrives on a process, it invokes a method on a chare. Chares are checkpointed regularly to disk, but if the disk is replaced by persistent memory, it might be possible to maintain a live copy of each chare's state without major loss of performance.

Project reference: 2005

Charm++ is an open-source parallel programming framework in C++, supported by an adaptive runtime system. It uses objects called chares, which hold data and the methods that act on that data. These methods can be invoked remotely by sending messages to the chare. A message may also contain data, which allows data dependencies to be met. Once a method has been invoked remotely, the adaptive runtime scheduler schedules the chare to run that method. This allows programs to be broken down into a sequence of asynchronous parallel tasks.
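The message-driven model described above can be illustrated with a minimal sketch in plain C++. This is not the Charm++ API: the names `Message`, `Accumulator`, `Scheduler` and `demo_total` are hypothetical, and everything runs in a single process. It only shows the idea that a sender enqueues a message and a scheduler later invokes a method on the target object.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Plain C++ illustration only; none of these names are Charm++ API.
struct Message {
    std::size_t target;  // index of the destination chare
    int payload;         // data carried by the message
};

// A stand-in for a chare: private state plus a method that acts on it.
class Accumulator {
public:
    void add(int value) { total_ += value; }
    int total() const { return total_; }
private:
    int total_ = 0;
};

// A toy single-process scheduler: send() only enqueues a message,
// and run() later delivers each message by invoking the method.
class Scheduler {
public:
    explicit Scheduler(std::size_t count) : chares_(count) {}

    void send(std::size_t target, int payload) {
        queue_.push(Message{target, payload});
    }

    void run() {
        while (!queue_.empty()) {
            Message m = queue_.front();
            queue_.pop();
            chares_[m.target].add(m.payload);  // method invoked on arrival
        }
    }

    int total(std::size_t index) const { return chares_[index].total(); }

private:
    std::vector<Accumulator> chares_;
    std::queue<Message> queue_;
};

// Sends three messages to two chares and returns the total held by chare 0.
int demo_total() {
    Scheduler scheduler(2);
    scheduler.send(0, 5);
    scheduler.send(1, 7);
    scheduler.send(0, 10);
    scheduler.run();
    return scheduler.total(0);
}
```

Calling `demo_total()` delivers the payloads 5 and 10 to chare 0 and 7 to chare 1. In real Charm++ the chares would be spread across processes and the runtime would decide when and where each method runs.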

One feature of Charm++ is fault tolerance: if a node goes down, chares can be rescheduled to allow the program to continue. This requires regular checkpointing to disk, which can lead to poor performance. Persistent memory promises performance closer to RAM, but with the persistence of writing to disk. With hardware like that, it may even be feasible to make a Charm++ application completely transactional, so that almost no progress is lost when restarting from a checkpoint.
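As a rough illustration of checkpointing into a memory-mapped region, the sketch below uses only POSIX `mmap` and `msync` on an ordinary file. It does not use Charm++ or any persistent memory library, and `ChareState`, `map_checkpoint`, `checkpoint`, `unmap_checkpoint` and `demo_restart` are hypothetical names. On real persistent memory exposed through a DAX filesystem, one would probably swap in PMDK's libpmem calls such as `pmem_map_file` and `pmem_persist`, but that substitution is an assumption and is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical chare state; it must be trivially copyable for this sketch.
struct ChareState {
    long iteration;
    double values[4];
};

// Map `path` as a shared, writable region sized for one ChareState.
// Returns nullptr on failure.
ChareState* map_checkpoint(const char* path) {
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0) return nullptr;
    if (ftruncate(fd, sizeof(ChareState)) != 0) {
        close(fd);
        return nullptr;
    }
    void* region = mmap(nullptr, sizeof(ChareState),
                        PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  // the mapping remains valid after the descriptor is closed
    return region == MAP_FAILED ? nullptr : static_cast<ChareState*>(region);
}

// Copy the live state into the mapping and flush it to the backing store.
bool checkpoint(ChareState* region, const ChareState& live) {
    std::memcpy(region, &live, sizeof(ChareState));
    return msync(region, sizeof(ChareState), MS_SYNC) == 0;
}

void unmap_checkpoint(ChareState* region) {
    munmap(region, sizeof(ChareState));
}

// Write a checkpoint, drop the mapping (as if the process had gone away),
// map the file again and return the recovered iteration, or -1 on error.
long demo_restart(const char* path) {
    ChareState live{42, {1.0, 2.0, 3.0, 4.0}};
    ChareState* region = map_checkpoint(path);
    if (region == nullptr) return -1;
    bool flushed = checkpoint(region, live);
    unmap_checkpoint(region);
    if (!flushed) return -1;

    region = map_checkpoint(path);
    if (region == nullptr) return -1;
    long recovered = region->iteration;
    unmap_checkpoint(region);
    unlink(path);  // remove the demo file
    return recovered;
}
```

A call such as `demo_restart("/tmp/chare_ckpt_demo.bin")` writes one checkpoint and reads it back through a fresh mapping. The point of the design is that the checkpoint is an ordinary memory copy followed by a flush, rather than a sequence of file writes.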

In this project you will try to rewrite Charm++’s fault tolerance to take advantage of Intel Optane DC persistent memory. The required hardware will be provided by EPCC, courtesy of the NEXTGenIO project [http://www.nextgenio.eu/].


Project Mentor: Dr. Oliver Thomson Brown

Project Co-mentor: Dr. Nick Brown

Site Co-ordinator: Juan Herrera

Learning Outcomes:
The student will have learnt about the importance of asynchronous parallelism and fault tolerance to the future of HPC. They will have learnt how to program distributed task-based parallel code using Charm++, and how to program Intel Optane persistent memory.

Student Prerequisites (compulsory):

  • Experience with UNIX-based operating systems.
  • Strong programming skills.
  • Experience with object-oriented programming.

Student Prerequisites (desirable):

  • Experience with C/C++.
  • Experience with message-passing parallelism.
  • Experience with task-based parallelism.

Training Materials:

Workplan:

  • Weeks 1-3: Student familiarises themselves with Charm++ and Intel Optane persistent memory, and develops a detailed plan.
  • Weeks 3-6: Student does technical work to achieve their planned goal – replacing Charm++ checkpointing to disk with persistent memory, and/or making it ACID compliant.
  • Weeks 7-8: Student writes up project.

Final Product Description:
Student will have produced a small example code using Charm++ which checkpoints to Intel Optane DC persistent memory.

Adapting the Project: Increasing the Difficulty:
Student can try to make Charm++ transactional and ACID compliant using persistent memory.
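One possible way to approach the atomicity part of this goal is to keep two checkpoint slots and only treat a slot as committed once its check word has been written. The sketch below is plain C++ operating on ordinary memory, with hypothetical names (`Slot`, `Region`, `commit`, `newest`, `demo_torn_write`) and a deliberately simple check word. It is not how Charm++ works, and on real persistent memory each store would additionally need an explicit flush and ordering step, which is omitted here.

```cpp
#include <cassert>
#include <cstdint>

// One checkpoint slot. A slot counts as committed only if its check word
// matches its sequence number and payload.
struct Slot {
    std::uint64_t sequence;  // 0 means "never written"
    long payload;            // stand-in for serialised chare state
    std::uint64_t check;     // written last, marks the slot as complete
};

// Two slots: one holds the last good checkpoint while the other is rewritten.
struct Region {
    Slot slots[2];
};

// Simple illustrative check word (not a real checksum algorithm).
std::uint64_t checksum(std::uint64_t sequence, long payload) {
    return sequence ^ static_cast<std::uint64_t>(payload) ^ 0x9E3779B97F4A7C15ull;
}

bool valid(const Slot& slot) {
    return slot.sequence != 0 &&
           slot.check == checksum(slot.sequence, slot.payload);
}

// Index of the valid slot with the highest sequence number, or -1 if none.
int newest(const Region& region) {
    int best = -1;
    for (int i = 0; i < 2; ++i) {
        if (valid(region.slots[i]) &&
            (best < 0 || region.slots[i].sequence > region.slots[best].sequence)) {
            best = i;
        }
    }
    return best;
}

// Write a new checkpoint into the slot that is NOT the newest valid one.
void commit(Region& region, long payload) {
    int current = newest(region);
    int target = (current == 0) ? 1 : 0;
    std::uint64_t sequence =
        (current < 0) ? 1 : region.slots[current].sequence + 1;
    Slot& slot = region.slots[target];
    slot.check = 0;  // invalidate the slot before touching its contents
    slot.payload = payload;
    slot.sequence = sequence;
    slot.check = checksum(sequence, payload);  // commit point
}

// Two complete commits, then a simulated crash partway through a third.
// Returns the payload that recovery would see.
long demo_torn_write() {
    Region region{};
    commit(region, 100);
    commit(region, 200);
    int current = newest(region);
    Slot& torn = region.slots[current == 0 ? 1 : 0];
    torn.check = 0;      // same steps as commit() ...
    torn.payload = 300;
    torn.sequence = 3;   // ... but the check word is never written
    return region.slots[newest(region)].payload;
}
```

In `demo_torn_write()` the interrupted third checkpoint is ignored on recovery because its check word is missing, so the second checkpoint is still the one returned. Full ACID compliance would need considerably more than this, including isolation between concurrently executing chares.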

Adapting the Project: Decreasing the Difficulty:
Student can experiment with the performance benefits of persistent memory versus writing to disk, and write simple test programs in Charm++.

Resources:

Organisation:
EPCC

