[mpiwg-ft] FTWG Forum Update

Christian Engelmann engelmannc at computer.org
Fri Mar 7 09:54:15 CST 2014

Hi all!

Wesley, thanks for the update.

Martin and Todd, there has been some work in this area of checkpoint/restart where the MPI runtime stays alive. I would like to point out Wesley’s own work (http://www.netlib.org/utk/people/JackDongarra/PAPERS/CoF-europar2012.pdf), as well as Frank Mueller’s work (http://www.christian-engelmann.info/publications/wang07job.pdf). Also, Mattan Erez is looking at a similar approach for his containment domain work.

The original idea of the MPI Fault Tolerance Working Group was to develop a proposal that allows for a multitude of solutions. I see this new “proposal” as an extension of the existing proposal that uses a subset of its features, while requiring that additional system-level checkpoint/restart features (e.g., long jump and MPI state roll-back) be part of the MPI standard.



Christian Engelmann, Ph.D.

System Software Team Task Lead / R&D Staff Scientist
Computer Science Research Group
Computer Science and Mathematics Division
Oak Ridge National Laboratory

Mail: P.O. Box 2008, Oak Ridge, TN 37831-6173, USA
Phone: +1 (865) 574-3132 / Fax: +1 (865) 576-5491
e-Mail: engelmannc at ornl.gov / Home: www.christian-engelmann.info
