[Mpi3-ft] New FT API

Darius Buntinas buntinas at mcs.anl.gov
Thu Aug 11 14:48:13 CDT 2011

Hi Josh,

Thanks for your comments; mine are inline below.

On Aug 10, 2011, at 4:29 PM, Josh Hursey wrote:

> Thanks for getting that new interface going. I have some notes below
> for discussion.
> -- Josh
> - Should we rename MPI_COMM_GET_FAILED to MPI_COMM_GROUP_FAILED() to
> better match the MPI_Comm_group() signature?

I'd be OK with that.

> - Should we add an optional MPI_Info parameter to the
> MPI_Comm_get_failed() operation to allow for implementation specific
> optimizations - similar to what we had with the 'mask' in the previous
> proposal.

I looked through both the ticket276 document and the wiki but couldn't find a list of valid masks, so I'm not sure how it would be used.  The one mask I did see was "new", but that should be determined using MPI_GROUP_DIFFERENCE to avoid a race.  What were the other mask values we were considering?
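For what it's worth, the race-free "new failures" check I have in mind would look something like this (MPI_Comm_group_failed is the name proposed on the wiki, not standard MPI, so this is only a sketch):

```c
/* Sketch only: MPI_Comm_group_failed is the proposed query from the
 * wiki; everything else below is standard MPI. */
MPI_Group prev_failed, now_failed, new_failed;

/* 'prev_failed' was obtained from an earlier query. */
MPI_Comm_group_failed(comm, &now_failed);

/* new failures = now_failed \ prev_failed -- no "new" mask needed */
MPI_Group_difference(now_failed, prev_failed, &new_failed);
```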

> - What should we do for intercommunicators?
> A) Should we expand the signature of MPI_Comm_get_failed to return
> both the failed set for both the local and remote groups?
> MPI_Comm_get_failed(comm, local_grp, remote_grp)
> B) Should MPI_Comm_get_failed only return the remote group for
> intercommunicators, and force the user to MPI_Comm_group() to get the
> local group then call a MPI_Group_get_failed() to get the subset of
> failures?
> C) Add a MPI_Comm_get_failed_remote(comm, grp) that would return the
> failures in the remote group, and MPI_Comm_get_failed() would only
> ever return the failures in the local list.
> D) Something else?

I think we want C.  If we want to parallel MPI_Comm_group, we can have MPI_Comm_group_failed return the failed processes in the local group, and then have MPI_Comm_remote_group_failed return the failed processes in the remote group (paralleling MPI_Comm_remote_group).
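Concretely, the parallel pair of signatures would be something like this (both names are hypothetical, mirroring MPI_Comm_group / MPI_Comm_remote_group):

```c
/* Hypothetical signatures from the proposal, not standard MPI: */
int MPI_Comm_group_failed(MPI_Comm comm, MPI_Group *failed);
        /* failed processes in the local group */
int MPI_Comm_remote_group_failed(MPI_Comm comm, MPI_Group *failed);
        /* failed processes in the remote group; intercommunicators only */
```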

> - MPI_ANY_SOURCE: I think the user should have to pass in a group
> containing the list of failed ranks that it is allowing to participate
> even though they are failed. If there are other failures on the
> communicator that are not contained in this list then the
> MPI_ANY_SOURCE will fail as before. This protects the user from
> acknowledging more processes than it expects to, and avoids the thread
> safety issue mentioned on the wiki.

The proposed function returns the group of processes that were acknowledged.  The idea is that the user can compare the returned group with a group that it obtained previously to see whether any new processes were detected (and acknowledged).

We could have the user pass in a group of processes to be acknowledged, but then the implementation would need to track which failures are acknowledged and which aren't, as opposed to keeping a single flag: anysource enabled/disabled.  And it still doesn't address the thread safety problem: a thread can still come in and acknowledge failures that occurred between the time another thread checked/acknowledged failures and the time it called a blocking wildcard receive.
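To make the intended usage concrete (MPI_Comm_enable_anysource and its returned group are from the proposal, so again only a sketch):

```c
/* Sketch: MPI_Comm_enable_anysource is proposed, not standard MPI.
 * 'prev_acked' holds the group returned by an earlier call. */
MPI_Group acked, newly_acked;
int nnew;

MPI_Comm_enable_anysource(comm, &acked);  /* acked = all acknowledged failures */
MPI_Group_difference(acked, prev_acked, &newly_acked);
MPI_Group_size(newly_acked, &nnew);
if (nnew > 0) {
    /* new failures were detected (and acknowledged) since the last check */
}
```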

> - MPI_ANY_SOURCE: I do not have a preference on MPI_Comm_recognize
> versus MPI_Comm_enable_any_source. A 'recognized' rank can be defined
> as a rank that the application has acknowledged as failed to MPI, and
> understands that it will not participate in any group operations like
> MPI_ANY_SOURCE or collectives (though a special recognition operation
> is provided for collectives).

I think I'm now leaning towards MPI_Comm_enable_anysource (see above).

> - Nullify: I like keeping this separate since there is a question of
> whether or not it is useful to provide MPI_PROC_NULL semantics for P2P
> operations to failed peers. I think the signatures are fine. Maybe
> change them to MPI_Comm_group_nullified/MPI_Comm_group_nullify to line
> up with MPI_Comm_group - though I could see users getting those mixed
> up pretty easily.


> - MPI_Comm_validate: For this operation, are the processes identified
> in the group 'recognized' for MPI_ANY_SOURCE? In the previous proposal
> the collective validate would 'recognize' the failed processes
> automatically. I would say that in this version we should -not- do
> this. I do not see much benefit in this given the flexibility of the
> new interface. Just a point of discussion, since it would be different
> semantics than those in the previous proposal.

Right, we can see validate as a collective function operating on a collective state (collectives enabled) and the recognize/anysource_enable function as a local function operating on local state (anysource enabled).

> - Thread safety: If we force the user to specify a group of processes
> that it wants to recognize [MPI_Comm_recognize(comm, input_group,
> output_group)]. The input_group would allow the user to specify those
> failed processes that it is wanting to acknowledge. The output_group
> would represent the full set of recognized ranks for this
> communicator. By requiring the user to specify an input_group this
> prevents the MPI implementation from adding more failed processes to
> the recognized group without the users knowledge. The user would then
> need to make sure that the threads know about how they are each
> managing the recognition status.

Yes, but it does not prevent another thread from recognizing failures before the thread can call receive(anysource):

Thread A                      Thread B
========                      ========
Recognize(comm, groupA)
----------- PROC X FAILS ----------------
                              Recognize(comm, groupB) // groupB contains proc X
Recv(comm, anysource)

Thread A may hang if the receive can only match a message that would have been sent by process X.

> - Thread safety: There is another question of what happens if:
>  ThreadA: MPI_Recv(comm, MPI_ANY_SOURCE)
>  --- Rank X fails ---
>  ThreadB: Notice a failure of Rank X
>  ThreadB: MPI_Comm_recognize(comm, {rankX})
> There is a race between when the error of Rank X failure is reported
> to ThreadA, and when ThreadB recognizes the failure. If ThreadB
> recognizes the failure before ThreadA is put on the run queue, should
> ThreadA return an error? or should it keep processing? I think it
> should return an error, and we should discourage the users from such
> constructs, but I could be convinced otherwise.

I think failure detection should be atomic with respect to the threads of a process.  As soon as a failure is detected, all anysource requests should be completed with an error, regardless of which thread detected it or which thread is waiting on the request.  Now it's possible that ThreadB recognizes the new failure before ThreadA returns from its receive.  But ThreadA can still check for new failures by calling MPI_Comm_group_failed and then MPI_Group_difference against the group of failed processes it retrieved earlier.
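That is, ThreadA's recovery path after its wildcard receive completes in error would be something like (MPI_Comm_group_failed is the proposed query, so sketch only):

```c
/* ThreadA, after its MPI_ANY_SOURCE receive returns an error
 * (MPI_Comm_group_failed is proposed, not standard MPI): */
MPI_Group now_failed, new_failed;
int nnew;

MPI_Comm_group_failed(comm, &now_failed);
MPI_Group_difference(now_failed, prev_failed, &new_failed);
MPI_Group_size(new_failed, &nnew);
if (nnew > 0) {
    /* failures occurred that ThreadA has not yet seen; decide whether
     * to re-acknowledge them and re-post the receive */
}
```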

> - 'notion of thread-specific state in the MPI standard?' From what I
> could find, I do not think there is a notion of thread specific state
> in the MPI standard. There is a concept of the 'main thread', but I
> think that is as far as the standard goes in this regard.

Yeah, the main thread was the only instance we could come up with here.  

I think what we really need to make this thread safe is another object that keeps track of whether the communicator is anysource-enabled, or of which processes have been recognized.  This object would then need to be passed in to all receive operations.  Each thread could manage its own object and enable/disable anysource receives as necessary.  Of course this means adding a new parameter to the receive operations, which I don't think would make the forum happy.

Using thread-local storage instead makes this object implicit.  If we want to make it implicit (to avoid adding parameters to receive) without using thread-local storage, we run into the thread safety issues described above.

So the options I see are:
    1. Change the API and add a parameter to the receive functions
    2. Make the state implicit, don't use thread-local storage, and live with the thread safety issues
    3. Make the state implicit and use thread-local storage

I think 3 is the least evil of the three.
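For reference, option 1 would look roughly like this (every name here is hypothetical; none of it exists in MPI today):

```c
/* Hypothetical option-1 API: an explicit per-thread FT-state object
 * passed to receive operations. */
MPI_FT_state state;
MPI_FT_state_create(&state);
MPI_FT_state_ack(state, failed_group);   /* acknowledge failures locally */
MPI_Recv_ft(buf, count, type, MPI_ANY_SOURCE, tag, comm, state, &status);
```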


> On Wed, Aug 10, 2011 at 4:42 PM, Darius Buntinas <buntinas at mcs.anl.gov> wrote:
>> I've started a wiki page describing a new API based on feedback from the forum and comments during our last meeting.  It's still a work in progress, but please look over it and send me your comments, specifically on the "thread safety" section.
>> https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/run_through_stabilization_2
>> Thanks,
>> -d
>> _______________________________________________
>> mpi3-ft mailing list
>> mpi3-ft at lists.mpi-forum.org
>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
> -- 
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey
