[mpiwg-ft] 2013-12 MPI Forum Wrapup

Fri Dec 13 13:29:53 CST 2013

Hi All,

+1, this is the most successful FT presentation that we've had so far, and
we got lots of positive feedback.  Here are a few notes I took during the FT
plenary, primarily items that were discussed or questions that were asked:

* Do we need a new error code in place of MPI_ERR_PENDING?
    * Right now, if there is an ANY_SOURCE receive in a request array
passed to e.g. MPI_Waitall, you need to scan the list of requests to see if
they are all MPI_ERR_PENDING in order to determine if a process failure may
have occurred.

* MPI_Comm_shrink should specify the process ordering (may already be
covered in the spec)
    * Should there be a key argument to MPI_Comm_shrink to allow the user
to specify ordering? (Martin)

* Can we query whether a communicator has been revoked?  Perhaps through a
communicator attribute? (Jim)

* Discussed getting the failed group uniformly at all processes
    * Two protocols, shrink and agree
    * Agree is faster when there are no intervening failures, otherwise
shrink is faster
    * Might be worthwhile to add a function in the future to achieve this
(Jim)

* Need to verify sane interaction between endpoints, init/finalize, and FT
proposals (Martin)
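
To make the first item concrete, here is a minimal sketch of the scan a user
has to write today. It is deliberately free of mpi.h so it can be compiled
anywhere: the SKETCH_* codes and the all_requests_pending() helper are
hypothetical stand-ins, not part of MPI or of the FT proposal. In a real
program the codes would come from the MPI_ERROR field of each entry in the
status array filled in by MPI_Waitall.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in codes for this sketch only; these are NOT the
 * values defined in mpi.h. SKETCH_ERR_PENDING plays the role of
 * MPI_ERR_PENDING. */
enum { SKETCH_SUCCESS = 0, SKETCH_ERR_PENDING = 1, SKETCH_ERR_OTHER = 2 };

/* Scan the per-request error codes and report whether every one of them
 * is "pending". Per the note above, that pattern is what the caller
 * currently has to look for to decide that a process failure may have
 * occurred. An empty array is reported as false. */
bool all_requests_pending(const int *codes, size_t n)
{
    assert(codes != NULL || n == 0);

    if (n == 0) {
        return false;
    }
    for (size_t i = 0; i < n; i++) {
        if (codes[i] != SKETCH_ERR_PENDING) {
            return false; /* at least one request completed or failed otherwise */
        }
    }
    return true;
}
```

The cost is a linear pass over the request array after every such wait,
which is the motivation for asking whether a dedicated error code would
be better.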

 ~Jim.

On Fri, Dec 13, 2013 at 12:01 PM, Wesley Bland <wbland at mcs.anl.gov> wrote:

> Now that the forum meeting has finished, I wanted to send a wrap-up email
> about how things went for those who couldn’t be there and to continue the
> discussion with those who were.
>
> We had a productive meeting on Monday and Tuesday within the working group
> where we discussed some of the concerns raised by some of our outside
> collaborators. I won’t go into all of the details as those were captured in
> the wiki page (
> https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/2013-12-10), but the
> general result was that we seem to have addressed all of the concerns that
> have been raised so far. The biggest challenge is to continue to have
> discussions with people and evangelize the proposal.
>
> On Thursday, we had a plenary time with the full forum where we presented
> the latest version of the proposal. There were minor changes since the last
> time this was presented to the forum, but there was one more important
> change to MPI_COMM_AGREE that provides new functionality. Again, I won’t go
> into detail about how it works as that was covered in the talk and text,
> but it does solve one of the use cases raised by Rich that some users want
> to be able to continue using a communicator without revoking and reordering
> ranks. Now it is possible to use MPI_COMM_AGREE as a transactional style
> function to periodically agree on the remaining processes. The slides for
> the talks given by Aurelien and me should be posted on the web site soon (
> http://meetings.mpi-forum.org/secretary/2013/12/slides.php).
>
> The reaction from the forum was quite positive. There were plenty of
> questions, but from what we could tell, it seems like most attendees were
> largely receptive to the current version. The major contributing factors to
> this that we heard from the people we talked to at the end of the plenary
> were that they like the ability to “turn off” FT for systems where it is
> not needed (smaller scale, reliable hardware, etc.) and that we provided
> more concrete examples of how to use the proposal. There had been concern
> about the performance impact of this proposal on systems where it was not
> needed, but the ability to compile it out should make that better. Many
> people said they still need to take this back to their users now that they
> have a better understanding of what’s going on in the proposal. We’ll
> hopefully hear back from them before March if there are concerns on their
> end. I don’t think there were any major issues with any of the technical
> content of the presentation.
>
> Our current plan is still to bring this for a reading at the next meeting
> in March in San Jose and pursue votes at the following two meetings. One of
> the most requested things to show at that meeting is to have performance
> numbers, so we will try to have something ready by then. These will be
> easier if we have some application partners that we can use to generate
> these numbers, so if you have some “real” apps that you can run with ULFM
> (even if it’s failure-free runs), that would be very helpful. The other
> thing we can all do is to talk to our collaborators and see if there are
> any concerns that they didn’t raise during the full meeting that might
> hinder passing the proposal.
>
> Thanks for all of your work!
> Wesley
>