[Mpi3-ft] [Mpi3-rma] Fault Tolerance & RMA Discussion

George Bosilca bosilca at eecs.utk.edu
Wed Feb 22 23:29:10 CST 2012


Based on the discussion and opinions expressed during this call, the new FT proposal has been amended to include RMA operations. The FT model for windows is similar to the point-to-point model: break nothing by default, and give the user the tools to break and fix things when needed. Thus, in group-based operations, only ranks directly involved in RMA operations with failed processes will be notified. A special function (MPI_Win_invalidate) is available to mark the window as unusable for RMA operations; otherwise (with few restrictions), further RMA operations that do not involve the failed processes should proceed as they would in a failure-free execution.
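
For illustration only, the window model described above can be sketched as a toy Python class. This is not an MPI binding and not text from the proposal: the names ToyWindow, put, fail, invalidate, ProcessFailed and WindowInvalidated are hypothetical, and the sketch assumes that an operation is "notified" by raising an error at the origin.

```python
class ProcessFailed(Exception):
    """Raised when an operation directly involves a failed rank (hypothetical)."""


class WindowInvalidated(Exception):
    """Raised for any operation on an invalidated window (hypothetical)."""


class ToyWindow:
    """Toy stand-in for an RMA window; models the semantics only."""

    def __init__(self, ranks):
        self.ranks = set(ranks)
        self.failed = set()
        self.invalidated = False

    def fail(self, rank):
        # Record a process failure; nothing else is broken by default.
        self.failed.add(rank)

    def invalidate(self):
        # Models the effect attributed to MPI_Win_invalidate in the email:
        # the window becomes unusable for further RMA operations.
        self.invalidated = True

    def put(self, origin, target):
        # An invalidated window rejects every operation.
        if self.invalidated:
            raise WindowInvalidated("window was invalidated")
        # Only operations that directly involve a failed rank are notified.
        if origin in self.failed or target in self.failed:
            raise ProcessFailed(f"rank {origin} -> rank {target}")
        # Operations among surviving ranks proceed as if failure-free.
        return "ok"


if __name__ == "__main__":
    win = ToyWindow(range(4))
    win.fail(2)
    print(win.put(0, 1))      # surviving ranks are unaffected
    try:
        win.put(0, 2)         # directly involves the failed rank
    except ProcessFailed as err:
        print("notified:", err)
    win.invalidate()          # explicit, user-driven "break"
    try:
        win.put(0, 1)
    except WindowInvalidated as err:
        print("rejected:", err)
```

The point of the sketch is the ordering of the two checks: a failure alone affects only the operations that touch the failed rank, while invalidation is a separate, explicit step that affects everyone.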

The current version of the FT proposal is attached to this email and can also be accessed on the working group wiki at https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/User_Level_Failure_Mitigation.

We expect to have a final version of the proposal by Monday noon EST, as requested by the MPI Forum, so that it can have a first reading at the next meeting. We look forward to your comments.

  george.



On Feb 7, 2012, at 16:57 , Josh Hursey wrote:

> Attached are my notes from the RMA meeting this morning.
> 
> Thanks,
> Josh
> 
> 
> On Mon, Feb 6, 2012 at 2:45 PM, Josh Hursey <jjhursey at open-mpi.org> wrote:
> We are going to meet from 10-11 am (Eastern) on Feb. 7 to continue our conversation. We will use the same call-in information as before.
> 
> Thanks,
> Josh
> 
> 
> On Thu, Feb 2, 2012 at 3:00 PM, Josh Hursey <jjhursey at open-mpi.org> wrote:
> We made some really good progress on today's call. Attached are some notes that I took from the call.
> 
> At the end of the call there were a couple of items that we wanted to understand in finer detail. As a result, we are going to try to set up another teleconf.
> 
> Below is a doodle poll to pick a date/time:
>    http://www.doodle.com/kzmiknie8yz4wxkc
> 
> If you are interested in attending this teleconf, please fill out the poll by 2 pm Eastern on Monday, Feb. 6.
> 
> Thanks,
> Josh
> 
> 
> On Thu, Feb 2, 2012 at 10:01 AM, Josh Hursey <jjhursey at open-mpi.org> wrote:
> Just a reminder that we are meeting today at Noon Eastern to discuss RMA in the context of the fault tolerance proposal.
> 
> The Run-Through Stabilization proposal can be found attached to the ticket:
>   https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/276
>   https://svn.mpi-forum.org/trac/mpi-forum-web/attachment/ticket/276/FTWG-Process-FT-Draft-2011-12-20.pdf
> 
> We will be focusing on section 17.11 of that document. Note that this section does not currently explicitly account for the new RMA proposal, but we would like to remedy that for the next reading.
> 
> Thanks,
> Josh
> 
> On Wed, Jan 25, 2012 at 3:15 PM, Josh Hursey <jjhursey at open-mpi.org> wrote:
> There was no one date/time that worked for everyone, but I chose a time that worked for most of the respondents. We will meet Thursday, Feb. 2 from 12-1 pm EST/New York to discuss this topic.
> 
> We can use the following teleconf information:
>   US Toll Free number: 877-801-8130
>   Toll number: 1-203-692-8690
>   Access Code: 1044056
> 
> Thanks,
> Josh
> 
> 
> On Mon, Jan 23, 2012 at 4:33 PM, Josh Hursey <jjhursey at open-mpi.org> wrote:
> (Cross posted to both the RMA and FT MPI-3 listservs)
> 
> During the FT plenary session at the Jan. MPI Forum meeting it was recommended that some of the members of the FT group and the RMA group have a meeting to hash out the precise details of the FT semantics for the RMA chapter. So I would like to facilitate such a discussion, preferably in the next week (so we have time to fine-tune things before the next forum meeting).
> 
> In general, we are trying to answer the question "How should RMA operations behave when a process failure occurs?" The feeling seemed to be that the current approach is ok (invalidating the window, forcing recreation/validation), but the statement that the memory exposed in the window is 'undefined' seemed excessive. The suggestion was to change the wording to something like "Only the memory associated with a window that was targeted by an operation that modified it is undefined after process failure in the group associated with the window." This led to a considerable amount of debate in the meeting, so it was suggested that we take the discussion offline.
> 
> Below is a link to a doodle poll to find a good time for a teleconf. If you are interested in participating in this discussion, please fill this poll out by 2 PM Eastern on Wed. Jan 25 so we can set the date/time.
>    http://www.doodle.com/vd33va5h8iankega
> 
> Thanks,
> Josh
> 
> -- 
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey

-------------- next part --------------
A non-text attachment was scrubbed...
Name: mpi3ft.pdf
Type: application/pdf
Size: 156070 bytes
Desc: not available
URL: <http://lists.mpi-forum.org/pipermail/mpiwg-ft/attachments/20120223/32eeba7d/attachment-0001.pdf>