[Mpi3-ft] Run-Through Stabilization Users Guide

Joshua Hursey jjhursey at open-mpi.org
Tue Feb 1 08:41:22 CST 2011

On Jan 31, 2011, at 11:05 AM, Bronevetsky, Greg wrote:

> Josh, the document looks very nice. I'll pass it around LLNL.
> Some comments:
> There should be some discussion in the intro about how communication relates to failures (i.e. you can't talk to failed processes but you can talk to non-failed processes).

Good idea. I'll add some language for that in the first section.

> MPI_COMM_VALIDATE is supposed to be used to examine the list of failed ranks in detail after a call to MPI_COMM_VALIDATE_ALL. If there is a failure between the start of MPI_COMM_VALIDATE_ALL and the time MPI_COMM_VALIDATE is called, this list of failed ranks may be larger than the number returned by MPI_COMM_VALIDATE_ALL. There should be a way to remove this inconsistency by allowing the application to carry some kind of connection from MPI_COMM_VALIDATE_ALL to MPI_COMM_VALIDATE to make sure that the application knows which failures have been globally recognized and which only locally.

That is a common way to use the MPI_COMM_VALIDATE function. But if we tether the two functions, do we mislead the application by reporting fewer than the full set of known failures? The user could instead supply an 'incount' larger than what was reported by MPI_COMM_VALIDATE_ALL, then compare the 'outcount' of MPI_COMM_VALIDATE against the MPI_COMM_VALIDATE_ALL count to determine whether they agree. If they do not, then further action may need to be taken.
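To make the incount/outcount comparison concrete, here is a rough sketch in plain C. It does not call MPI at all: mock_comm, mock_rank_info, mock_comm_validate, and local_matches_global are hypothetical stand-ins invented for illustration, and the argument order of the mock is only an assumption, not the signature from the proposal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the proposal's MPI_Rank_info; only the rank is modeled. */
typedef struct { int rank; } mock_rank_info;

/* Mock communicator state: the failures this process currently knows about. */
typedef struct {
    const int *failed_ranks;
    int        num_failed;
} mock_comm;

/* Mock of an MPI_COMM_VALIDATE-like local query (assumed shape): copies up
 * to incount known failures into rank_infos and reports how many it wrote. */
static void mock_comm_validate(const mock_comm *comm, int incount,
                               int *outcount, mock_rank_info *rank_infos)
{
    assert(comm != NULL && outcount != NULL && incount >= 0);
    int n = comm->num_failed < incount ? comm->num_failed : incount;
    for (int i = 0; i < n; i++)
        rank_infos[i].rank = comm->failed_ranks[i];
    *outcount = n;
}

/* global_count is the number previously returned by the collective
 * validation. Asking for a few more entries than that lets the caller
 * notice failures that have only been detected locally since then. */
static bool local_matches_global(const mock_comm *comm, int global_count)
{
    enum { SLACK = 4, MAX_INFOS = 64 };
    mock_rank_info infos[MAX_INFOS];
    int incount = global_count + SLACK;
    int outcount = 0;

    assert(incount <= MAX_INFOS);
    mock_comm_validate(comm, incount, &outcount, infos);
    return outcount == global_count;
}
```

When local_matches_global returns false, the application knows that its local list contains failures that were not part of the collective agreement, which is the "further action" case.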

The only reason you would need to call MPI_COMM_VALIDATE is if you care about which individual ranks failed. If you only care about creating recovery blocks, then you only need the count from MPI_COMM_VALIDATE_ALL, and don't have to worry about allocating sufficient buffer space. If only a subset of the ranks is critical, then you can use MPI_COMM_VALIDATE_RANK to inspect just those ranks.
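For the critical-subset case, something along these lines is what I have in mind. This is a plain C mock, not the proposed interface: mock_rank_state, mock_comm_validate_rank, and count_failed_critical are hypothetical names, and the per-rank query is modeled as a lookup in a local table of failed ranks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { MOCK_RANK_OK, MOCK_RANK_FAILED } mock_rank_state;

/* Mock of a per-rank query in the spirit of MPI_COMM_VALIDATE_RANK
 * (assumed shape): failed[r] is true when rank r is known to have failed. */
static mock_rank_state mock_comm_validate_rank(const bool *failed,
                                               int comm_size, int rank)
{
    assert(failed != NULL && rank >= 0 && rank < comm_size);
    return failed[rank] ? MOCK_RANK_FAILED : MOCK_RANK_OK;
}

/* Inspect only the ranks the application depends on; no rank_infos
 * buffer sized for the whole communicator is needed. */
static int count_failed_critical(const bool *failed, int comm_size,
                                 const int *critical, int num_critical)
{
    int n = 0;
    for (int i = 0; i < num_critical; i++) {
        if (mock_comm_validate_rank(failed, comm_size, critical[i])
                == MOCK_RANK_FAILED) {
            n++;
        }
    }
    return n;
}
```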

I do see the point though. I wonder if an additional, combination function would be useful for this case. Something like:
  MPI_COMM_VALIDATE_ALL_FULL_REPORT(comm, incount, outcount, totalcount, rank_infos)
    comm:       communicator
    incount:    size of rank_infos array
    outcount:   number of rank_info entries filled
    totalcount: total number of failures known
    rank_infos: array of MPI_Rank_info types

This would allow the user to determine if they are getting a report of all the failures (outcount == totalcount), or just a subset because they did not supply a sufficiently large buffer (outcount < totalcount). This does force the user to allocate the rank_infos buffer before it may be needed, but if the type of consistency that you cite is needed then maybe this is not a problem.
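A rough sketch of how an application might use the outcount/totalcount pair follows. Nothing here is real MPI: mock_rank_info, mock_validate_all_full_report, and collect_failures are hypothetical names, the communicator is replaced by a plain array of known failed ranks, and the mock is a local function rather than a collective.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the proposal's MPI_Rank_info. */
typedef struct { int rank; } mock_rank_info;

/* Mock of the suggested combination call: fills at most incount entries
 * and reports both how many were filled and how many failures are known. */
static void mock_validate_all_full_report(const int *known_failed, int num_known,
                                          int incount, int *outcount,
                                          int *totalcount,
                                          mock_rank_info *rank_infos)
{
    assert(outcount != NULL && totalcount != NULL && incount >= 0);
    int n = num_known < incount ? num_known : incount;
    for (int i = 0; i < n; i++)
        rank_infos[i].rank = known_failed[i];
    *outcount = n;
    *totalcount = num_known;
}

/* Start with a guessed buffer size; if the report was truncated
 * (outcount < totalcount), retry with a buffer of totalcount entries.
 * Returns a heap array the caller must free, or NULL if malloc fails. */
static mock_rank_info *collect_failures(const int *known_failed, int num_known,
                                        int first_guess, int *count)
{
    int incount = first_guess > 0 ? first_guess : 1;
    for (;;) {
        mock_rank_info *infos = malloc((size_t)incount * sizeof *infos);
        if (infos == NULL) {
            *count = 0;
            return NULL;
        }
        int outcount = 0, totalcount = 0;
        mock_validate_all_full_report(known_failed, num_known, incount,
                                      &outcount, &totalcount, infos);
        if (outcount == totalcount) {   /* complete report */
            *count = outcount;
            return infos;
        }
        free(infos);
        incount = totalcount;           /* truncated: grow and retry */
    }
}
```

In a real implementation the retry would be a second collective call and could observe additional failures, so the loop may need more than two passes there; in the mock the failure set is fixed, so it finishes after at most one retry.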

> Does collective failure recognition imply local recognition?

Yes. I added a line about this under the validation section for clarity.

> Would it be better to explicitly have separate states for failures that have been recognized locally and collectively?

I don't think so. I think this starts to become confusing to the user, and muddles the semantics a bit. If global/collective recognition does not imply local recognition, then do we require that the user both locally and globally recognize a failure before they can create a new communicator? What if the communicator has failures that are recognized globally, but not locally? In that case collectives will succeed, but only some point-to-point operations will. This seems to add more work for the application without a clear use case for when it would be required.

> If nothing else, you should change the language to clearly specify whether local or collective recognition is required. For example, some might get confused and think that getting an error from a collective constitutes recognition or that calling MPI_Comm_validate can make it possible to use collectives after a failure. These constraints are specified earlier in the text but it's good to repeat the concept throughout.

Good point. I'll clarify the text in the collective section.

> In the broadcast example it would be easier if both examples used the same bcast algorithm. You don't really explain how the algorithm works, so it'll be easier to understand if you don't switch them.

I can see that. I think explaining the way that different implementations may cause things to go wonky might be useful. I cleaned up the language a bit to describe the two algorithms and why we discuss them here. I would be fine with dropping one of the bcast illustrations if you all think it is still too confusing.

> For the Gather example it would be useful to have a picture.

I'll see what I can do.

I updated the wiki page with the notes above, minus the Gather image which will come later.

Thanks for the feedback.

-- Josh

> Greg Bronevetsky
> Lawrence Livermore National Lab
> (925) 424-5756
> bronevetsky at llnl.gov
> http://greg.bronevetsky.com 
>> -----Original Message-----
>> From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Joshua Hursey
>> Sent: Friday, January 28, 2011 11:44 AM
>> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
>> Subject: [Mpi3-ft] Run-Through Stabilization Users Guide
>> As mentioned on the teleconf, I created a Run-Through Stabilization Users
>> Guide wiki page.
>>  https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/run_through_users_guide
>> The purpose of this page is to introduce application developers to the high
>> level philosophy of the run-through stabilization proposal, and some of the
>> most commonly used interfaces and semantics. I included some example code
>> snippets to help new users start programming against the emerging
>> prototypes.
>> The page is not meant to be comprehensive, but just enough to get an
>> application developer or new group member started with the ideas contained
>> in the proposal. This is an alternative to the existing proposal which is
>> structured to help the group form it into a formal proposal for the MPI
>> forum, which can be confusing for those not steeped in the MPI standard
>> structure and language.
>> Let me know what you think, and if you have any suggestions on how to make
>> the page more useful.
>> -- Josh
>> ------------------------------------
>> Joshua Hursey
>> Postdoctoral Research Associate
>> Oak Ridge National Laboratory
>> http://users.nccs.gov/~jjhursey
>> _______________________________________________
>> mpi3-ft mailing list
>> mpi3-ft at lists.mpi-forum.org
>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft

Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
