[mpiwg-ft] FTWG Action Items

Thu Dec 19 11:47:14 CST 2013

Wesley, 

These are all good ideas.
I encourage you to create branches for new features such as the query function, new error codes, etc. (anything that’s not a wording fix for existing functions), so that it is easier to track which changes pertain to what, and to merge them (or not) when we make the final decision.

I have added some more comments inline. 

Aurelien 

On Dec 19, 2013, at 11:06, Wesley Bland <wbland at mcs.anl.gov> wrote:

> Hi all,
> 
> Since we’re not going to have a call next week due to the break, I thought it would be good to try to get the action items from the previous meeting and surrounding discussions going. The first two are things that we should all discuss, the next two are more minor things that I’ve been taking care of, and the last one is important but will require more work on our part. Please take a look and provide feedback.
> 
> Thanks,
> Wesley
> 
> ---
> 
> * Add another error code to replace the use of MPI_ERR_PENDING (MPI_ERR_FAILURE_PENDING?) to prevent the requirement of scanning all return codes to find out if there has been a failure that left a request pending.
> 	* This is probably a good idea as it’s become clear that overloading MPI_ERR_PENDING is more confusing and adds overhead to what the user has to do to determine if their request is still valid or not. Coming up with the name is the only challenge here.
> 	* For now, I’ll change the text to say MPI_ERR_FAILURE_PENDING (23 characters < 30 character limit), but if we come up with something better, we can change it again.
Let’s think about this a bit more. I’m for it in principle, but only after all side effects have been considered carefully (in particular with respect to multiple any-source requests pending in the same wait). I feel this will not be an issue, but better safe than sorry.
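
To make the trade-off concrete, here is a minimal sketch of the two schemes as I read the item above. It does not use mpi.h: the SK_* constants are hypothetical stand-ins for MPI_ERR_PENDING, a process-failure error code, and the proposed MPI_ERR_FAILURE_PENDING, and the array models the per-request error codes a wait over several requests would report. The exact semantics in the draft text may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in values for illustration only; real codes would come from mpi.h. */
enum {
    SK_SUCCESS = 0,
    SK_ERR_PENDING = 1,         /* models the overloaded MPI_ERR_PENDING */
    SK_ERR_PROC_FAILED = 2,     /* models a process-failure error code */
    SK_ERR_FAILURE_PENDING = 3  /* models the proposed MPI_ERR_FAILURE_PENDING */
};

/* Overloaded scheme (assumed reading): a pending code at index i only says
 * "not completed". To decide whether a process failure is involved, the
 * caller has to scan every per-request code. */
int pending_due_to_failure_overloaded(const int *codes, size_t n, size_t i)
{
    assert(codes != NULL && i < n);
    if (codes[i] != SK_ERR_PENDING)
        return 0;
    for (size_t k = 0; k < n; k++) {
        if (codes[k] == SK_ERR_PROC_FAILED)
            return 1;
    }
    return 0;
}

/* Dedicated scheme: the entry is self-describing, so no scan is needed. */
int pending_due_to_failure_dedicated(const int *codes, size_t n, size_t i)
{
    assert(codes != NULL && i < n);
    return codes[i] == SK_ERR_FAILURE_PENDING;
}
```

The first function is linear in the number of requests for every pending entry, while the second is a single comparison. The sketch does not model the case of several any-source requests pending in the same wait, which still needs to be checked against the text.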

> 
> * Function to query the status of a communicator (is it revoked or not).
> 	* If this were done via a communicator attribute, it would be the only one I’m aware of that provides information to the user that is set by the library. All others are stored by the user and retrieved by the user.
> 	* Do we want to add a new API to take care of this, or can we confirm that it’s ok to do this via an attribute?
> 
It really depends on what the status quo is with respect to attributes. I’d rather avoid adding an extra API just for that.

> * I believe the intention of the description of MPI_COMM_AGREE is to ensure that after the agreement, all failures are propagated to all processes in the communicator, which will result in the next MPI_COMM_FAILURE_ACK/GET_ACKED returning a complete group of failed processes as of the last call to AGREE. However, the text specifically says that all failures have been detected. I think this might be the wrong choice of words since we don’t talk about detection anywhere else in the chapter other than to say that we don’t specify anything about failure detection. The text here should be tweaked.
> 	* I’ve made changes to the text that I think make this clearer.
> 
Good, I’ll take a look. 

> * We should add an example of a way to get a uniform group of failed processes via MPI_COMM_AGREE and FAILURE_ACK/GET_ACKED. I think this would be a good example of the relationship between the functions and it’s been a common question that we’ve received from users.
> 	* I’ll work on drafting one of these.
> 
I have one already, I’ll dump it in there. 

> * We need to make sure that we are compatible with the endpoints and init/finalize proposals.
> 	* The next step for this is to probably read through their proposals and come up with a rough outline of what our text would need to add to cover these new concepts (if anything).
> 	* The endpoints text is here: https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/380
We may need to clarify what revoke does when called on one of the shadow handles. My take is that it should revoke all communicators sharing that same cid, so that the revocation remains global for the communicator at all ranks/endpoints. It doesn’t seem excruciatingly difficult to come up with something simple and workable on that front.

> 	* I don’t think we have anything to look at for init/finalize. That proposal seems to have morphed a few times, but the current version is not about re-entrant MPI, but about a thread-safe Init/Finalize, which is completely orthogonal to our work. Probably nothing to do here.
I think this is a correct assessment. 
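
On the shadow-handle question above, the behavior I have in mind can be sketched in a few lines. This is a toy model only, not MPI code: sketch_comm and sketch_revoke are hypothetical names, the cid field stands in for the context id that all endpoint handles of one communicator would share, and the array stands in for the handles visible in one process.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a communicator handle; not an MPI type. */
typedef struct {
    int cid;     /* context id shared by all endpoint handles of one communicator */
    int revoked; /* nonzero once the communicator has been revoked */
} sketch_comm;

/* Revoke through any one handle: mark every handle that shares its cid.
 * Returns the number of handles newly marked as revoked. */
size_t sketch_revoke(sketch_comm *handles, size_t n, size_t via)
{
    assert(handles != NULL && via < n);
    const int cid = handles[via].cid;
    size_t marked = 0;
    for (size_t i = 0; i < n; i++) {
        if (handles[i].cid == cid && !handles[i].revoked) {
            handles[i].revoked = 1;
            marked++;
        }
    }
    return marked;
}
```

Calling sketch_revoke through either of two handles with the same cid marks both, and leaves handles with a different cid untouched, which is the "global at all ranks/endpoints" property. How the revocation is propagated between processes is outside this sketch.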

Aurelien 

--
* Dr. Aurélien Bouteiller
* Researcher at Innovative Computing Laboratory
* University of Tennessee
* 1122 Volunteer Boulevard, suite 309b
* Knoxville, TN 37996
* 865 974 9375