[Mpi3-ft] radical idea?

Graham, Richard L. rlgraham at ornl.gov
Thu Jul 21 14:10:57 CDT 2011


Comments in line.

-----Original Message-----
From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Darius Buntinas
Sent: Thursday, July 21, 2011 2:43 PM
To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
Subject: Re: [Mpi3-ft] radical idea?


Another issue someone brought up was that communicators are immutable, and our proposal would change that by adding process-state info.  Of course, communicators aren't exactly immutable currently (e.g., we can set error handlers), but I guess the process-state info we're proposing is expected to be much more dynamic than error handlers (so maybe we were making communicators "too mutable"?).

[rich] We are getting into a world of computing that is more and more mutable - the question is how to respond to such changes (process failure, in this case) in a way that makes it convenient for the app to continue and minimizes lost work.  There is NOT going to be a single way to do this for all apps - we need to provide a reasonable range of options.

I'll expand a little on what Dave said about thread safety.  The idea was that calling MPI_Comm_validate[_all] changed some implicit state associated with the communicator, i.e., the "snapshot" of failed processes, so one thread could call a query function while another was calling validate.  Moreover, the "new" flag of validate is only meaningful if you know the last time it was called, which you may not (easily) know in a multithreaded program.

An alternative that was suggested, and that my previous email showed, was to return a handle to the "snapshot".  You could then have multiple such "snapshots", and threads would be able to query them individually.  This removes the mutability concerns.
[rich] I believe this is an issue with atomicity, no different from any other atomicity issue.  If one does things in a way that is not thread safe, you get what you ask for.  A library can protect such accesses, but can't prevent (and actually does not want to prevent) users from making calls that are not thread safe.
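
For illustration, per-thread snapshot usage might look something like this (a rough sketch only; these calls are proposals from later in this thread, and the C bindings and the mask constant are guesses):

    /* Each thread takes and queries its own snapshot, so no other
     * thread's validate call can invalidate its view.
     * (Fragment; malloc/free need <stdlib.h>.) */
    MPI_PROC_STATE snap;
    int nfailed;

    MPI_Comm_get_state(comm, &snap);                          /* proposed */
    MPI_Get_proc_state_size(snap, MPI_PROC_FAILED, &nfailed);
    if (nfailed > 0) {
        int *ranks = malloc(nfailed * sizeof(int));
        MPI_Get_proc_state_list(snap, MPI_PROC_FAILED, ranks);
        /* ... react to the failures recorded in this snapshot ... */
        free(ranks);
    }
    MPI_Proc_state_free(&snap);                               /* proposed */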

However, we can't use these individual "snapshots" for setting processes to NULL; otherwise we'd have to pass the communicator, rank, and "snapshot" to, e.g., MPI_Send in order for the library to know whether to treat the process as NULL.  So the "snapshot" only keeps process UP/DOWN state.  To address the PROC_NULL semantics, I suggested the MPI_COMM_NULLIFY function, which sets PROC_NULL semantics for a process in a communicator.  This can be used independently of a "snapshot" (and of the rest of the ft proposal, if it makes any sense to do so), so you can technically set a live process to NULL.  This functionality may be (is) more controversial, so I figure we can put it into a separate proposal.  Also, I think it may be easier to explain the concepts separately.
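
A sketch of the intended usage (hypothetical; MPI_COMM_NULLIFY is only a proposal, and the C binding is a guess):

    /* The app opts in to PROC_NULL semantics for a failed rank. */
    MPI_Comm_nullify(comm, failed_rank);                      /* proposed */

    /* Subsequent sends to failed_rank behave as if dest were
     * MPI_PROC_NULL: they return MPI_SUCCESS without communicating.
     * Note that MPI_Send sees only (comm, dest) - there is nowhere
     * to pass a snapshot handle, which is why the NULL state must
     * live on the communicator itself. */
    MPI_Send(buf, count, MPI_INT, failed_rank, tag, comm);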

We were running out of time towards the end, so we weren't able to spend as much time getting feedback on collectives, but there was some resistance to having ft collectives, and questions about whether it is useful to have a reduce with an empty value (vs. creating a new communicator and doing collectives on that).  So I don't know whether people just didn't get it and need to think about it more, or whether they are fundamentally opposed to ft collectives.

[rich] I will come back to what I said before: for some cases the ft collectives make sense, and for others they do not.  It really depends on what the app is trying to do.  This is one reason we have been advocating all along for two different sets of collectives - one that guarantees all will get the "same" answer (if all are supposed to), and a set that does not, but lets the user use the validate call to check the state of the procs.  This was actually a request from the apps in the very early days of the FT discussions, I guess 3+ years ago, when we met at the Marriott O'Hare...

Rich

-d

On Jul 21, 2011, at 12:57 PM, Solt, David George wrote:

> I think that the main objection was not to the MPI_PROC_NULL semantics themselves, but the fact that querying and setting the state of a communicator is not thread safe.  One thread can ask how many ranks are in a given state on comm A, and another thread can then change the state of the comm by setting a rank to MPI_PROC_NULL, invalidating the results of the first thread's query.  If we remove the ability to explicitly change the state of a rank, then notification can be handled in a more thread-safe way.
> 
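> To make the race concrete (a sketch only; the calls are from the current proposal and the signatures are guessed):
> 
>     /* Thread A queries the comm-wide failure state: */
>     MPI_Comm_validate(commA, &num_failed);      /* proposed call */
> 
>     /* Thread B, concurrently, mutates that same state: */
>     MPI_Comm_nullify(commA, some_rank);         /* proposed call */
> 
>     /* Thread A now acts on num_failed, which may no longer match
>      * the communicator's actual state. */
> 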
> Our argument was that separate threads should not be working with the same comm at the same time, as this doesn't work well currently (e.g., two threads trying to call collectives on the same comm at the same time).
> 
> In general, I think that if there were a way to do what we want without introducing 40-some state querying/setting functions, we would get less pushback.  Getting rid of the MPI_PROC_NULL state opens up several options for such a simplification.
> 
> [[ personal opinion: I think the MPI_PROC_NULL semantics are a nice touch, though it is a guess how much they would actually get used vs. immediately calling MPI_Comm_split or someday calling MPI_Comm_restore.  I'd be willing to drop them if it allowed us to move forward.  ]]
> 
> Thanks,
> Dave
> 
> -----Original Message-----
> From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Graham, Richard L.
> Sent: Thursday, July 21, 2011 12:46 PM
> To: 'MPI 3.0 Fault Tolerance and Dynamic Process Control working Group'
> Subject: Re: [Mpi3-ft] radical idea?
> 
> Since I missed the meeting - what were the objections people had to the proc null semantics?  Is the suggestion that the app explicitly deal with the missing rank all the time, or something else?  What was the motivation?  What were the problems they saw?  What were the alternative suggestions?
> 
> Rich
> 
> -----Original Message-----
> From: 	Howard Pritchard [mailto:howardp at cray.com]
> Sent:	Thursday, July 21, 2011 01:08 PM Eastern Standard Time
> To:	MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
> Subject:	Re: [Mpi3-ft] radical idea?
> 
> Hi Darius,
> 
> If we want to get something about RTS into MPI 3.0, I don't
> think we have time to manage it as a set of smaller proposals.
> 
> If we can eliminate the state problem that bothered some at
> the last forum meeting, that would be a good start.  Also,
> if we could simplify the proposal some by removing
> the PROC_NULL semantics, I would be in favor of that.
> 
> If we want to limit the use of RTS to a small number of use
> cases (like the NOAA example), then I could see deferring
> "repairable" communicators to 3.1.
> 
> Howard
> 
> Darius Buntinas wrote:
>> We could break the rts proposal into smaller ones:
>> 
>>  point-to-point:  local up/down checks; errors on sending to failed processes
>>  recognition/PROC_NULLification:  Add a function to set a rank in a communicator to 
>>        MPI_PROC_NULL
>>  fault-aware collectives:  collectives don't hang, but they're permanently broken once a 
>>        proc in the communicator fails
>>  "repairable" collectives:  validate_all; collectives can be reactivated after failure
>> 
>> I don't think anyone really objected to "point-to-point" or "fault-aware collectives".  We'll have to work on the others.
>> 
>> -d
>> 
>> 
>> On Jul 20, 2011, at 9:13 AM, Joshua Hursey wrote:
>> 
>>> I'll have to think a bit more and come back to this thread. But I wanted to interject something I was thinking about on the plane ride back. What if we removed the notion of recognized failures?
>>> 
>>> This was a point that was mentioned a couple of times in discussion - that we have a bunch of functions and extra state on each communicator because we want to allow the application to recognize failures to get PROC_NULL semantics.  If we remove the notion of recognized failures, then the up/down state on the group would be enough to track.  So operations involving failed processes will always return an error, regardless of whether the failure has been 'seen' by the application before or not.
>>> 
>>> The state of a process would be able to change as MPI finds out about new failures. But we can provide a 'state snapshot' object (which was mentioned in discussion, and I think is what Darius is getting at below) to allow for more consistent lookups if the application so desires. This removes the local/global list tracking on each handle, and moves it to a separate object that the user is in control of. The user can still reference the best known state if they are not concerned about consistency (e.g., MPI_Comm_validate_get_state(comm, ...) vs MPI_Snapshot_validate_get_state(snapshot_handle, ...)).
>>> 
>>> Some applications would like the PROC_NULL semantics. But if we can convince ourselves that a library on top of MPI could provide those (by adding the proc_null check in the PMPI layer), then we might be able to reduce the complexity of the proposal by pushing some of the state tracking responsibility above MPI.
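>>> 
>>> A minimal sketch of such a layer (only the PMPI interception itself is standard MPI; the nullified[] table and its bound are hypothetical bookkeeping, and a real library would key it off the communicator rather than a single global):
>>> 
>>>     #include <mpi.h>
>>> 
>>>     #define MAX_RANKS 4096                /* hypothetical bound */
>>>     static int nullified[MAX_RANKS];      /* set by the library's nullify call */
>>> 
>>>     int MPI_Send(void *buf, int count, MPI_Datatype type,
>>>                  int dest, int tag, MPI_Comm comm)
>>>     {
>>>         /* PROC_NULL semantics: a send to a nullified rank succeeds
>>>          * immediately without communicating. */
>>>         if (dest >= 0 && dest < MAX_RANKS && nullified[dest])
>>>             return MPI_SUCCESS;
>>>         return PMPI_Send(buf, count, type, dest, tag, comm);
>>>     }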
>>> 
>>> I still have not figured out the implications for an application using collective operations if we remove the NULL state, but it is something to think about.
>>> 
>>> -- Josh
>>> 
>>> On Jul 19, 2011, at 6:11 PM, Darius Buntinas wrote:
>>> 
>>>> The MPI_COMM_NULLIFY() function would effectively set the process to MPI_PROC_STATE_NULL.
>>>> 
>>>> In the proposal we had MPI_PROC_STATE_NULL, _FAILED and _OK.  I'm proposing separating NULL from FAILED and OK.  So the MPI_COMM_GET_STATE() function (and friends) would let you query the (locally known) FAILED/OK state of a process, while MPI_COMM_NULLIFY() (and friends) would let you set the process to NULL.  There would essentially be two state variables associated with each process: one indicating whether it has failed or not (let's call it LIVENESS), and the other whether it has PROC_NULL semantics (call it NULLIFICATION).  The LIVENESS state is controlled by the MPI library, while the NULLIFICATION state is controlled by the user.  The table below shows how these states would match up with the current proposal:
>>>> 
>>>> Current proposal state   LIVENESS   NULLIFICATION
>>>> -----------------------+----------+---------------
>>>> MPI_PROC_STATE_OK        OK         NORMAL
>>>> MPI_PROC_STATE_FAILED    FAILED     NORMAL
>>>> MPI_PROC_STATE_NULL      FAILED     NULL
>>>> <UNDEFINED>              OK         NULL
>>>> 
>>>> Notice that there's a combination possible that's not covered by the current proposal.  I'm not sure whether that's a useful state (or if we should disallow it).
>>>> 
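>>>> One way to picture the two orthogonal variables (all names hypothetical):
>>>> 
>>>>     typedef enum { LIVENESS_OK, LIVENESS_FAILED } liveness_t;
>>>>     typedef enum { NULLIFICATION_NORMAL, NULLIFICATION_NULL } nullification_t;
>>>> 
>>>>     typedef struct {
>>>>         liveness_t      liveness;       /* owned by the MPI library */
>>>>         nullification_t nullification;  /* owned by the user        */
>>>>     } proc_state_t;
>>>> 
>>>> MPI_COMM_NULLIFY(comm, rank) would then set only the nullification field, independently of liveness.
>>>> 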
>>>> We could add a function to set the NULLIFICATION state from NULL back to NORMAL, for completeness.
>>>> 
>>>> -d
>>>> 
>>>> 
>>>> On Jul 19, 2011, at 4:32 PM, Solt, David George wrote:
>>>> 
>>>>> This works for "reading" state, but has no way to set a process's state.  (Not sure how radical you're trying to go here... is part of the change that there would no longer be an MPI_PROC_STATE_NULL state?)
>>>>> Dave
>>>>> 
>>>>> -----Original Message-----
>>>>> From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Darius Buntinas
>>>>> Sent: Tuesday, July 19, 2011 3:17 PM
>>>>> To: Darius Buntinas
>>>>> Cc: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
>>>>> Subject: Re: [Mpi3-ft] radical idea?
>>>>> 
>>>>> 
>>>>> Howard pointed out that I forgot to add a FREE operation:
>>>>> 
>>>>> MPI_PROC_STATE_FREE(state_handle)
>>>>>     INOUT: MPI_PROC_STATE state_handle
>>>>> 
>>>>> -d
>>>>> 
>>>>> On Jul 19, 2011, at 3:07 PM, Darius Buntinas wrote:
>>>>> 
>>>>>> MPI_COMM_GET_STATE(comm, state_handle)
>>>>>>    IN:  MPI_COMM comm
>>>>>>    OUT: MPI_PROC_STATE state_handle
>>>>>> and ditto for GROUP, FILE, WIN as necessary
>>>>>> 
>>>>>> MPI_GET_PROC_STATE_SIZE(state_handle, mask, size)
>>>>>>    IN:  MPI_PROC_STATE state_handle
>>>>>>    IN:  int mask
>>>>>>    OUT: int size
>>>>>> 
>>>>>> MPI_GET_PROC_STATE_LIST(state_handle, mask, list)
>>>>>>    IN:  MPI_PROC_STATE state_handle
>>>>>>    IN:  int mask
>>>>>>    OUT: int list[]
>>>>>> 
>>>>>> MPI_GET_PROC_STATE_NEW(state_handle1, state_handle2, state_handle_new)
>>>>>>    IN:  MPI_PROC_STATE state_handle1
>>>>>>    IN:  MPI_PROC_STATE state_handle2
>>>>>>    OUT: MPI_PROC_STATE state_handle_new
>>>>>> This gives newly failed processes in state_handle2 since state_handle1.
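>>>>>> 
>>>>>> A possible usage pattern (hypothetical C bindings; the mask value is a guess):
>>>>>> 
>>>>>>     MPI_PROC_STATE before, after, diff;
>>>>>>     int nnew;
>>>>>> 
>>>>>>     MPI_Comm_get_state(comm, &before);
>>>>>>     /* ... compute; more processes may fail ... */
>>>>>>     MPI_Comm_get_state(comm, &after);
>>>>>> 
>>>>>>     /* diff holds the processes that failed between the two snapshots. */
>>>>>>     MPI_Get_proc_state_new(before, after, &diff);
>>>>>>     MPI_Get_proc_state_size(diff, MPI_PROC_FAILED, &nnew);
>>>>>> 
>>>>>>     MPI_Proc_state_free(&before);
>>>>>>     MPI_Proc_state_free(&after);
>>>>>>     MPI_Proc_state_free(&diff);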
> 
> 
> -- 
> Howard Pritchard
> Software Engineering
> Cray, Inc.


_______________________________________________
mpi3-ft mailing list
mpi3-ft at lists.mpi-forum.org
http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft