[Mpi3-ft] Run-Through Stabilization Users Guide

Mon Jan 31 10:05:59 CST 2011

Josh, the document looks very nice. I'll pass it around LLNL.

Some comments:

There should be some discussion in the intro about how communication relates to failures (i.e. you can't talk to failed processes but you can talk to non-failed processes).

MPI_COMM_VALIDATE is supposed to be used to examine the list of failed ranks in detail after a call to MPI_COMM_VALIDATE_ALL. If a failure occurs between the start of MPI_COMM_VALIDATE_ALL and the call to MPI_COMM_VALIDATE, the list of failed ranks may be longer than the count returned by MPI_COMM_VALIDATE_ALL. There should be a way to remove this inconsistency by letting the application carry some kind of connection from MPI_COMM_VALIDATE_ALL to MPI_COMM_VALIDATE, so that it knows which failures have been recognized globally and which only locally.
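
To make the "connection" idea concrete, here is a toy Python model. It is not MPI and not the proposal's interface: ToyComm, ValidationToken, validate_all and validate are hypothetical names invented for illustration, and the failure detector is simulated by hand. It only shows how a token from the collective call could let the local call separate globally recognized failures from ones seen only locally.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ValidationToken:
    """Hypothetical handle returned by the collective validate call."""
    epoch: int
    agreed_failed: frozenset


@dataclass
class ToyComm:
    """Toy stand-in for a communicator; NOT the proposal's interface."""
    size: int
    known_failed: set = field(default_factory=set)  # failures this rank has detected
    epoch: int = 0

    def fail(self, rank):
        # Simulate the local failure detector noticing a dead rank.
        self.known_failed.add(rank)

    def validate_all(self):
        # Stand-in for the collective call: snapshot the agreed failure set.
        self.epoch += 1
        return ValidationToken(self.epoch, frozenset(self.known_failed))

    def validate(self, token):
        # Stand-in for the local call: split failures by recognition scope.
        globally = sorted(token.agreed_failed)
        locally_only = sorted(self.known_failed - token.agreed_failed)
        return globally, locally_only


comm = ToyComm(size=8)
comm.fail(3)
token = comm.validate_all()   # rank 3 is agreed on collectively
comm.fail(5)                  # new failure after the collective call
globally, locally_only = comm.validate(token)
print(globally, locally_only)  # prints: [3] [5]
```

With the token, a failure that arrives after the collective call shows up in the local list without being mistaken for a globally agreed one.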

Does collective failure recognition imply local recognition?

Would it be better to explicitly have separate states for failures that have been recognized locally and collectively? If nothing else, you should change the language to clearly specify whether local or collective recognition is required. For example, some might get confused and think that getting an error from a collective constitutes recognition or that calling MPI_Comm_validate can make it possible to use collectives after a failure. These constraints are specified earlier in the text but it's good to repeat the concept throughout.
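
As a rough illustration of what separate states might look like, here is a small Python sketch. The Recognition enum and collectives_usable are hypothetical names, and the rule encoded (collectives are usable again only once every failure is collectively recognized) is my reading of the constraints, not text from the proposal. It deliberately leaves open whether collective recognition implies local recognition.

```python
from enum import Enum


class Recognition(Enum):
    UNRECOGNIZED = "unrecognized"  # failure seen only as an error return
    LOCAL = "local"                # recognized by this rank only
    COLLECTIVE = "collective"      # recognized by all ranks in the communicator


def collectives_usable(failed):
    """Assumed rule: every known failure must be collectively recognized."""
    return all(state is Recognition.COLLECTIVE for state in failed.values())


# Rank 3 fails; the application only sees an error from a collective.
failed = {3: Recognition.UNRECOGNIZED}
after_error = collectives_usable(failed)

# Local validation alone does not re-enable collectives.
failed[3] = Recognition.LOCAL
after_local = collectives_usable(failed)

# Only collective validation does.
failed[3] = Recognition.COLLECTIVE
after_collective = collectives_usable(failed)

print(after_error, after_local, after_collective)  # prints: False False True
```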

In the broadcast section, it would be easier to follow if both examples used the same bcast algorithm. You don't really explain how the algorithm works, so the examples will be easier to understand if you don't switch between them.

For the Gather example it would be useful to have a picture.

Greg Bronevetsky
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky at llnl.gov
http://greg.bronevetsky.com 

> -----Original Message-----
> From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org]
> On Behalf Of Joshua Hursey
> Sent: Friday, January 28, 2011 11:44 AM
> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
> Subject: [Mpi3-ft] Run-Through Stabilization Users Guide
> 
> As mentioned on the teleconf, I created a Run-Through Stabilization Users
> Guide wiki page.
>   https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/run_through_users_guide
> 
> The purpose of this page is to introduce application developers to the high
> level philosophy of the run-through stabilization proposal, and some of the
> most commonly used interfaces and semantics. I included some example code
> snippets to help new users start programming against the emerging
> prototypes.
> 
> The page is not meant to be comprehensive, but just enough to get an
> application developer or new group member started with the ideas contained
> in the proposal. This is an alternative to the existing proposal which is
> structured to help the group form it into a formal proposal for the MPI
> forum, which can be confusing for those not steeped in the MPI standard
> structure and language.
> 
> Let me know what you think, and if you have any suggestions on how to make
> the page more useful.
> 
> -- Josh
> 
> ------------------------------------
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey