Checkpoint-restart is commonly used to provide resilience to fail-stop faults (e.g., node failures) in HPC applications. However, mean time to failure shortens as system size increases, and checkpoint-restart does not scale because it becomes impossible to checkpoint the entire system memory between failures.
Alternative models such as MPI User-Level Fault Mitigation (Bland et al., 2012) and Resilient X10 (Cunningham et al., 2014) have not addressed one-sided communication, which creates particular challenges for maintaining correctness and progress in the presence of process failures.
This work could start from a baseline of either MPI-3 (Gerstenberger, Besta, and Hoefler, 2014) or GASNet and define control flow, update semantics, and recovery operations for resilient operation in the presence of arbitrary process failures.
Bland et al. (2012). An evaluation of user-level failure mitigation support in MPI.
Cunningham et al. (2014). Resilient X10: efficient failure-aware programming.
Gerstenberger, Besta, and Hoefler (2014). Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided.