[mpiwg-ft] FTWG Forum Update
engelmannc at computer.org
Fri Mar 7 09:54:15 CST 2014
Wesley, thanks for the update.
Martin and Todd, there has been some work in this area of checkpoint/restart where the MPI runtime stays alive. I would like to point out Wesley’s own work (http://www.netlib.org/utk/people/JackDongarra/PAPERS/CoF-europar2012.pdf), as well as Frank Mueller’s work (http://www.christian-engelmann.info/publications/wang07job.pdf). Also, Mattan Erez is looking at a similar approach for his containment domain work.
The original idea of the MPI Fault Tolerance Working Group was to develop a proposal that allows for a multitude of solutions. I see this new “proposal” as an extension of the existing proposal that uses a subset of its features and requires that additional system-level checkpoint/restart features (e.g., long jump and MPI state roll-back) be part of the MPI standard.
Christian Engelmann, Ph.D.
System Software Team Task Lead / R&D Staff Scientist
Computer Science Research Group
Computer Science and Mathematics Division
Oak Ridge National Laboratory
Mail: P.O. Box 2008, Oak Ridge, TN 37831-6173, USA
Phone: +1 (865) 574-3132 / Fax: +1 (865) 576-5491
e-Mail: engelmannc at ornl.gov / Home: www.christian-engelmann.info