[Mpi3-ft] Updates to ft chapter

Josh Hursey jjhursey at open-mpi.org
Thu Sep 15 12:27:24 CDT 2011

I made some minor-ish edits. I attached the diff to this email for
review. Feel free to commit it if you think it is good to go.

Some larger items that I did not want to change/adjust before
discussing with the group:

 Does the Advice to Implementors buy us anything? Should it be reworded?

 Do we need the definitions of 'error' and 'failure'? We don't rely on
these definitions in the text beyond their previously implied
definitions in the standard. If we do not need them, then it might be
good to drop them to reduce complexity.

 It was suggested that we try to clarify this paragraph with the
Rationale. Any suggestions?

 Should we go ahead and pull the Advice to implementors regarding the
return value of mpiexec into a separate ticket? Or should we keep it
in the document and pull it if we get pushback? (I think on the call
we decided the latter, but I forget now).

In the second Rationale paragraph, I moved the first sentence to
17.7.1, but I think we can drop the rest of the rationale. I do not
know if it is terribly instructive.

I updated the rationale to account for the MPI_Reduce numerical
stability recommendation:
Rationale. The MPI_COMM_VALIDATE and MPI_ICOMM_VALIDATE operations
provide the MPI implementation an opportunity to restructure
collective communication patterns before the communicator is used by
the alive processes. This may allow for improved collective performance
after process failure. It should be noted that such optimizations might
change the consistency recommendation for MPI_REDUCE in the advice to
implementors in Section ??. It is strongly recommended that the
consistency recommendation hold for MPI_REDUCE between consecutive
collective activations of a communicator using a collective validation
operation (e.g., MPI_COMM_VALIDATE). (End of rationale.)
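
To illustrate why restructuring a collective can interact with the
MPI_REDUCE consistency recommendation: floating-point addition is not
associative, so the same contributions combined in a different order
can give a different result. The sketch below is plain C with no MPI
calls. The function names reduce_chain and reduce_tree and the input
values are hypothetical, chosen only to make the effect visible; the
two shapes are not taken from any particular MPI implementation.

```c
/* Hypothetical illustration: two summation orders over the same four
   contributions, standing in for two different reduction patterns. */

/* Chain order: ((v[0] + v[1]) + v[2]) + v[3]. */
double reduce_chain(const double v[4]) {
    double acc = v[0];
    for (int i = 1; i < 4; i++) {
        acc += v[i];
    }
    return acc;
}

/* Pairwise (binary tree) order: (v[0] + v[1]) + (v[2] + v[3]). */
double reduce_tree(const double v[4]) {
    double left = v[0] + v[1];
    double right = v[2] + v[3];
    return left + right;
}
```

With v = {1e30, 1.0, -1e30, 1.0}, reduce_chain(v) gives 1.0 and
reduce_tree(v) gives 0.0, because each 1.0 added directly to a value of
magnitude 1e30 is lost to rounding. If a validation operation changes
the pattern from one shape to the other, results before and after the
validation may differ, which is why the recommendation is scoped to
the interval between consecutive validations.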

Note that I moved the Advice to users regarding libraries to here, per
the teleconf.

 Added back the Advice to users regarding the 'sync-barrier-sync'
semantic for MPI_File_validate.


On Wed, Sep 14, 2011 at 1:58 PM, Darius Buntinas <buntinas at mcs.anl.gov> wrote:
> I've made some changes we discussed on the phone this morning.  You can find the latest pdf here (or at the bottom of the "Modified run-through stabilization" page on the wiki):
> https://svn.mpi-forum.org/trac/mpi-forum-web/attachment/wiki/ft/run_through_stabilization_2/ft.pdf
> Here's a summary of the changes:
>  * Changed MPI_ERR_RANK_FAIL_STOP to MPI_ERR_PROC_FAIL_STOP (because a "rank" doesn't fail, a "process" does)
>  * Fixed up usage of rank vs process in the chapter.
>  * Removed the MPI_COMM_COLLECTIVES_ENABLED function because it returns a local version of a global state, which is meaningless for applications.
>  * Moved the definition of MPI_COMM_VALIDATE et al. earlier in the section, and added a new subsection.
> Please look over my changes, especially how I rearranged the collectives section for the definition of MPI_COMM_VALIDATE, and let me know if they look OK.
> Thanks!
> -d

Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mpi-ft-from-r907.diff
Type: application/octet-stream
Size: 12575 bytes
Desc: not available
URL: <http://lists.mpi-forum.org/pipermail/mpiwg-ft/attachments/20110915/3d1b847c/attachment-0001.obj>