[Mpi3-ft] radical idea?

Graham, Richard L. rlgraham at ornl.gov
Thu Jul 21 13:51:31 CDT 2011


Comments inline

-----Original Message-----
From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Solt, David George
Sent: Thursday, July 21, 2011 1:58 PM
To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
Subject: Re: [Mpi3-ft] radical idea?

I think the main objection was not to the MPI_PROC_NULL semantics themselves, but to the fact that querying and setting the state of a communicator is not thread-safe.   One thread can ask how many ranks are in a given state on comm A, and another thread can then change the state of the comm by setting a rank to MPI_PROC_NULL, invalidating the results of the first thread's query.   If we remove the ability to explicitly change the state of a rank, then notification can be handled in a more thread-safe way.   

[rich] So, again missing the context of the meeting, the functionality that I believe is needed is for an app to be able to explicitly recognize that a failure has occurred, so that it can respond appropriately.  For point-to-point communications, I suppose it really does not matter whether the library always returns an error, or returns success when proc-null semantics are used.  When we move on to recovery, the story will change - the library needs to know whether it needs to restart communications - and this is intended to lead into that scenario.  For collectives, I believe that forcing the app to recognize the failure before proceeding is essential - the app explicitly acknowledges that collective ops will run in the new configuration, and can decide whether to proceed based on the new state of the communicator.

Our argument was that separate threads should not be working with the same comm at the same time, as this doesn't work well currently (e.g., two threads trying to call collectives on the same comm at the same time).   

[rich] As for thread safety, this is no different from any other atomic operation - if you do things non-atomically, you get what you get.  So here, what is atomic is the "state data".
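A minimal sketch in C of the interleaving being discussed.  The two calls below are hypothetical stand-ins for the draft proposal's state-query and nullify operations, not agreed-upon bindings:

    /* Thread 1: ask how many ranks on commA are in the FAILED state.   */
    int nfailed;
    MPI_Comm_validate_get_num_failed(commA, &nfailed);  /* hypothetical */

    /* Thread 2: concurrently "recognize" a failed rank, giving it      */
    /* MPI_PROC_NULL semantics.  This changes the communicator's state, */
    /* so nfailed in thread 1 may no longer describe commA.             */
    MPI_Comm_nullify(commA, failed_rank);                /* hypothetical */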

In general, I think that if there were a way to do what we want without introducing ~40 state querying/setting functions, we would get less pushback.   Getting rid of the MPI_PROC_NULL state opens up several options for such a simplification.
[rich] Again, if the app does not mind ignoring errors once a remote process has entered the error state, this is OK. 

[[ personal opinion: I think MPI_PROC_NULL semantics is a nice touch, though it is a guess how much it would actually get used vs. immediately calling MPI_Comm_split or someday calling MPI_Comm_restore. I'd be willing to drop it if it allowed us to move forward.  ]]

[rich] I believe there are folks who will do both.  With comm split, everyone gets a new rank (well, at least some do; it depends on who fails).  Some will want this, and others don't need it.  So giving a broad set of choices is good.

Rich

Thanks,
Dave

-----Original Message-----
From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Graham, Richard L.
Sent: Thursday, July 21, 2011 12:46 PM
To: 'MPI 3.0 Fault Tolerance and Dynamic Process Control working Group'
Subject: Re: [Mpi3-ft] radical idea?

Since I missed the meeting - what were the objections people had to the proc null semantics?  Is the suggestion that the app explicitly deal with the missing rank all the time, or something else?  What was the motivation?  What were the problems they saw?  What were the alternative suggestions?

Rich





 -----Original Message-----
From: 	Howard Pritchard [mailto:howardp at cray.com]
Sent:	Thursday, July 21, 2011 01:08 PM Eastern Standard Time
To:	MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
Subject:	Re: [Mpi3-ft] radical idea?

Hi Darius,

If we want to get something about RTS into MPI 3.0, I don't
think we have time to manage it as a set of smaller proposals.

If we can eliminate the state problem that bothered some at
the last forum meeting, that would be a good start.  Also,
if we could simplify the proposal somewhat by removing
the PROC_NULL semantics, I would be in favor of that.

If we want to limit the use of RTS to a small number of use
cases (like the NOAA example), then I could see deferring
"repairable" communicators to 3.1.

Howard

Darius Buntinas wrote:
> We could break the rts proposal into smaller ones:
> 
>   point-to-point:  local up/down checks; errors on sending to failed processes
>   recognition/PROC_NULLification:  Add a function to set a rank in a communicator to 
>         MPI_PROC_NULL
>   fault-aware collectives:  collectives don't hang, but they're permanently broken once a 
>         proc in the communicator fails
>   "repairable" collectives:  validate_all; collectives can be reactivated after failure
> 
> I don't think anyone really objected to "point-to-point" or "fault-aware collectives".  We'll have to work on the others.
> 
> -d
> 
> 
> On Jul 20, 2011, at 9:13 AM, Joshua Hursey wrote:
> 
>> I'll have to think a bit more and come back to this thread. But I wanted to interject something I was thinking about on the plane ride back. What if we removed the notion of recognized failures?
>>
>> This was a point that was mentioned a couple of times in discussion - that we have a bunch of functions and extra state on each communicator because we want to allow the application to recognize failures in order to get PROC_NULL semantics. If we remove the notion of recognized failures, then the up/down state on the group would be enough to track. So failed processes will always return an error regardless of whether the failure has been 'seen' by the application before or not.
>>
>> The state of a process would be able to change as MPI finds out about new failures. But we can provide a 'state snapshot' object (which was mentioned in discussion, and I think is what Darius is getting at below) to allow for more consistent lookups if the application so desires. This removes the local/global list tracking on each handle, and moves it to a separate object that the user is in control of. The user can still reference the best known state if they are not concerned about consistency (e.g., MPI_Comm_validate_get_state(comm, ...) vs MPI_Snapshot_validate_get_state(snapshot_handle, ...)).
>>
>> Some applications would like the PROC_NULL semantics. But if we can convince ourselves that a library on top of MPI could provide those (by adding the proc_null check in the PMPI layer), then we might be able to reduce the complexity of the proposal by pushing some of the state tracking responsibility above MPI.
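As a rough sketch of what such a layer might look like for one call: an MPI_Send wrapper in the profiling (PMPI) interface that redirects sends targeting a rank the layer has marked as nullified.  The lookup layer_rank_is_nullified() is a hypothetical helper the layer itself would maintain:

    /* Sketch of a PMPI-layer wrapper providing PROC_NULL semantics     */
    /* above MPI.  layer_rank_is_nullified() is a hypothetical helper   */
    /* recording which ranks the application has "recognized".          */
    int MPI_Send(void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm)
    {
        if (layer_rank_is_nullified(comm, dest))
            dest = MPI_PROC_NULL;   /* the send becomes a no-op */
        return PMPI_Send(buf, count, datatype, dest, tag, comm);
    }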
>>
>> I still have not figured out the implications on an application using collective operations if we remove the NULL state, but it is something to think about.
>>
>> -- Josh
>>
>> On Jul 19, 2011, at 6:11 PM, Darius Buntinas wrote:
>>
>>> The MPI_COMM_NULLIFY() function would effectively set the process to MPI_PROC_STATE_NULL.
>>>
>>> In the proposal we had MPI_PROC_STATE_NULL, _FAILED and _OK.  I'm proposing separating NULL from FAILED and OK.  So the MPI_COMM_GET_STATE() function (and friends) would let you query the (locally known) FAILED/OK state of the process, while MPI_COMM_NULLIFY() (and friends) would let you set the process to NULL.  There would be essentially two state variables associated with each process: one indicating whether it's failed or not (let's call it LIVENESS), and the other indicating whether it has PROC_NULL semantics (call it NULLIFICATION).  The LIVENESS state is controlled by the MPI library, while the NULLIFICATION state is controlled by the user.  The table below shows how these states would match up with the current proposal:
>>>
>>> Current proposal state   LIVENESS   NULLIFICATION
>>> -----------------------+----------+---------------
>>> MPI_PROC_STATE_OK        OK         NORMAL
>>> MPI_PROC_STATE_FAILED    FAILED     NORMAL
>>> MPI_PROC_STATE_NULL      FAILED     NULL
>>> <UNDEFINED>              OK         NULL
>>>
>>> Notice that there's a combination possible that's not covered by the current proposal.  I'm not sure whether that's a useful state (or if we should disallow it).
>>>
>>> We could add a function to set the NULLIFICATION state from NULL back to NORMAL for completeness.
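One way to picture the two independent per-process variables described above (names are illustrative only, not proposed bindings):

    /* Two independent per-rank state variables (illustrative names):   */
    typedef enum { LIVENESS_OK, LIVENESS_FAILED } liveness_t;       /* set by MPI  */
    typedef enum { NULLIF_NORMAL, NULLIF_NULL }   nullification_t;  /* set by user */

    /* LIVENESS only ever moves OK -> FAILED as the library detects     */
    /* failures; NULLIFICATION changes only through explicit user calls */
    /* such as MPI_COMM_NULLIFY(comm, rank).                            */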
>>>
>>> -d
>>>
>>>
>>> On Jul 19, 2011, at 4:32 PM, Solt, David George wrote:
>>>
>>>> This works for "reading" state, but has no way to set a process's state.  (Not sure how radical you're trying to go here... is part of the change here that there would no longer be an MPI_PROC_STATE_NULL state?)
>>>> Dave
>>>>
>>>> -----Original Message-----
>>>> From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Darius Buntinas
>>>> Sent: Tuesday, July 19, 2011 3:17 PM
>>>> To: Darius Buntinas
>>>> Cc: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
>>>> Subject: Re: [Mpi3-ft] radical idea?
>>>>
>>>>
>>>> Howard pointed out that I forgot to add a FREE operation:
>>>>
>>>>  MPI_PROC_STATE_FREE(state_handle)
>>>>      INOUT: MPI_PROC_STATE state_handle
>>>>
>>>> -d
>>>>
>>>> On Jul 19, 2011, at 3:07 PM, Darius Buntinas wrote:
>>>>
>>>>> MPI_COMM_GET_STATE(comm, state_handle)
>>>>>     IN:  MPI_COMM comm
>>>>>     OUT: MPI_PROC_STATE state_handle
>>>>> and ditto for GROUP, FILE, WIN as necessary
>>>>>
>>>>> MPI_GET_PROC_STATE_SIZE(state_handle, mask, size)
>>>>>     IN:  MPI_PROC_STATE state_handle
>>>>>     IN:  int mask
>>>>>     OUT: int size
>>>>>
>>>>> MPI_GET_PROC_STATE_LIST(state_handle, mask, list)
>>>>>     IN:  MPI_PROC_STATE state_handle
>>>>>     IN:  int mask
>>>>>     OUT: int list[]
>>>>>
>>>>> MPI_GET_PROC_STATE_NEW(state_handle1, state_handle2, state_handle_new)
>>>>>     IN:  MPI_PROC_STATE state_handle1
>>>>>     IN:  MPI_PROC_STATE state_handle2
>>>>>     OUT: MPI_PROC_STATE state_handle_new
>>>>> This gives newly failed processes in state_handle2 since state_handle1.
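A rough sketch of how these calls might compose, treating the language-independent names above as C bindings and assuming a hypothetical MPI_PROC_STATE_FAILED mask value:

    /* Take a snapshot, list the currently failed ranks, then later     */
    /* compute only the ranks that have failed since the first snapshot. */
    MPI_PROC_STATE snap, snap2, diff;
    int nfailed;

    MPI_COMM_GET_STATE(comm, &snap);
    MPI_GET_PROC_STATE_SIZE(snap, MPI_PROC_STATE_FAILED, &nfailed);
    int *failed = malloc(nfailed * sizeof *failed);
    MPI_GET_PROC_STATE_LIST(snap, MPI_PROC_STATE_FAILED, failed);

    /* ... some time later ... */
    MPI_COMM_GET_STATE(comm, &snap2);
    MPI_GET_PROC_STATE_NEW(snap, snap2, &diff);   /* newly failed ranks */

    MPI_PROC_STATE_FREE(&snap);
    MPI_PROC_STATE_FREE(&snap2);
    MPI_PROC_STATE_FREE(&diff);
    free(failed);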
>>>
>>
> 
> 


-- 
Howard Pritchard
Software Engineering
Cray, Inc.



_______________________________________________
mpi3-ft mailing list
mpi3-ft at lists.mpi-forum.org
http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft




