[mpiwg-ft] 2013-12 MPI Forum Wrapup

Jim Dinan james.dinan at gmail.com
Fri Dec 13 14:28:50 CST 2013


Hi Aurelien,

Just to make sure there isn't confusion because of where comments were
inlined:

> * Do we need a new error code in place of MPI_ERR_PENDING?
>     * Right now, if there is an ANY_SOURCE receive in a request array
>       passed to e.g. MPI_Waitall, you need to scan the list of requests to see if
>       they are all MPI_ERR_PENDING in order to determine if a process failure may
>       have occurred.

IIRC, this is an item that we are supposed to look into.  There was
concern that one has to deduce that a failure caused
MPI_{Wait,Test}{some,any,all} to return by scanning the statuses to see if
all are MPI_ERR_PENDING.  Something more definite, like setting the status
errors to a new value (e.g. MPI_ERR_PENDING_FAILURE), would be clearer, but
there is concern on the FTWG's side about fitting this cleanly into the
standard.
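
To illustrate the scan in question, here is a minimal sketch in C, not taken
from the proposal.  It assumes the caller has already copied the MPI_ERROR
field of each status into a plain int array, and the helper name
all_statuses_pending is made up for this example.  The pending code is passed
in as a parameter so that the sketch does not depend on any particular MPI
header.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of MPI or of the FT proposal.
 * errors[i] is assumed to hold the MPI_ERROR field of the i-th status
 * filled in by e.g. MPI_Waitall; pending_code is assumed to be the value
 * of MPI_ERR_PENDING in the MPI library being used.
 * Returns 1 if n > 0 and every entry equals pending_code, otherwise 0. */
int all_statuses_pending(const int *errors, size_t n, int pending_code)
{
    assert(errors != NULL || n == 0);

    if (n == 0)
        return 0;       /* nothing to inspect, so no hint of a failure */

    for (size_t i = 0; i < n; i++) {
        if (errors[i] != pending_code)
            return 0;   /* this request did not report "pending" */
    }
    return 1;           /* every status reports "pending" */
}
```

With a real MPI library one would presumably call this after MPI_Waitall
reports an error, passing MPI_ERR_PENDING as pending_code.  A return value of
1 only suggests that a process failure may have occurred, which is the
ambiguity a dedicated error code would remove.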

 ~Jim.


On Fri, Dec 13, 2013 at 2:40 PM, Aurélien Bouteiller
<bouteill at icl.utk.edu> wrote:

>
> On Dec 13, 2013, at 14:29, Jim Dinan <james.dinan at gmail.com> wrote:
>
> > Hi All,
> >
> > +1, this is the most successful FT presentation that we've had so far,
> > and we got lots of positive feedback.  Here are a few notes I took during the
> > FT plenary, primarily items that were discussed or questions that were
> > asked:
> >
> Jim, thanks for this summary of the key followup items.
>
> > * Do we need a new error code in place of MPI_ERR_PENDING?
> >     * Right now, if there is an ANY_SOURCE receive in a request array
> passed to e.g. MPI_Waitall, you need to scan the list of requests to see if
> they are all MPI_ERR_PENDING in order to determine if a process failure may
> have occurred.
> >
> > * MPI_Comm_shrink should specify the process ordering (may already be
> covered in the spec)
> >     * Should there be a key argument to MPI_Comm_shrink to allow the
> user to specify ordering? (Martin)
> >
> These two discussion items were already resolved in the room: the spec
> already defines rank ordering in shrink (same as split with well-defined
> parameters); a key argument would require the ranks to agree on something
> meaningful first (agree, then shrink), so there is no performance advantage
> compared to shrink followed by split.
>
> > * Can we query whether a communicator has been revoked?  Perhaps through
> a communicator attribute? (Jim)
> >
> > * Discussed getting the failed group uniformly at all processes
> >     * Two protocols, shrink and agree
> >     * Agree is faster when there are no intervening failures; otherwise
> >       shrink is faster
> >     * Might be worthwhile to add a function in the future to achieve
> >       this (Jim)
> >
> Additionally, it has been noted that there may be some performance
> advantages to standardizing the “MPIX_comm_replace” example. Another
> long-term item.
>
> > * Need to verify sane interaction between endpoints, init/finalize, and
> FT proposals (Martin)
> >
> >  ~Jim.
> >
> >
> > On Fri, Dec 13, 2013 at 12:01 PM, Wesley Bland <wbland at mcs.anl.gov>
> wrote:
> > Now that the forum meeting has finished, I wanted to send a wrap-up
> email about how things went for those who couldn’t be there and to continue
> the discussion with those who were.
> >
> > We had a productive meeting on Monday and Tuesday within the working
> group where we discussed some of the concerns raised by some of our outside
> collaborators. I won’t go into all of the details as those were captured in
> the wiki page (
> https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/2013-12-10), but the
> general result was that we seem to have addressed all of the concerns that
> have been raised so far. The biggest challenge is to continue to have
> discussions with people and evangelize the proposal.
> >
> > On Thursday, we had a plenary time with the full forum where we
> presented the latest version of the proposal. There were minor changes
> since the last time this was presented to the forum, but there was one more
> important change to MPI_COMM_AGREE that provides new functionality. Again,
> I won’t go into detail about how it works as that was covered in the talk
> and text, but it does solve one of the use cases raised by Rich that some
> users want to be able to continue using a communicator without revoking and
> reordering ranks. Now it is possible to use MPI_COMM_AGREE as a
> transactional style function to periodically agree on the remaining
> processes. The slides for the talks given by Aurelien and me should be
> posted on the web site soon (
> http://meetings.mpi-forum.org/secretary/2013/12/slides.php).
> >
> > The reaction from the forum was quite positive. There were plenty of
> questions, but from what we could tell, it seems like most attendees were
> largely receptive to the current version. The major contributing factors to
> this that we heard from the people we talked to at the end of the plenary
> were that they like the ability to “turn off” FT for systems where it is
> not needed (smaller scale, reliable hardware, etc.) and we also provided
> more concrete examples of how to use the proposal. There had been concern
> about the performance impact of this proposal on systems where it was not
> needed, but the ability to compile it out should make that better. Many
> people said they still need to take this back to their users now that they
> have a better understanding of what’s going on in the proposal. We’ll
> hopefully hear back from them before March if there are concerns on their
> end. I don’t think there were any major issues with any of the technical
> content of the presentation.
> >
> > Our current plan is still to bring this for a reading at the next
> meeting in March in San Jose and pursue votes at the following two
> meetings. One of the most requested things to show at that meeting is to
> have performance numbers, so we will try to have something ready by then.
> These will be easier if we have some application partners that we can use
> to generate these numbers so if you have some “real” apps that you can run
> with ULFM (even if it’s failure-free runs), that would be very helpful. The
> other thing we can all do is to talk to our collaborators and see if there
> are any concerns that they didn’t raise during the full meeting that might
> hinder passing the proposal.
> >
> > Thanks for all of your work!
> > Wesley
> >
> > _______________________________________________
> > mpiwg-ft mailing list
> > mpiwg-ft at lists.mpi-forum.org
> > http://lists.mpi-forum.org/mailman/listinfo.cgi/mpiwg-ft
> >
>
> --
> * Dr. Aurélien Bouteiller
> * Researcher at Innovative Computing Laboratory
> * University of Tennessee
> * 1122 Volunteer Boulevard, suite 309b
> * Knoxville, TN 37996
> * 865 974 9375
>